ACCELERATING MACHINE LEARNING VIA MULTI-OBJECTIVE OPTIMIZATION

by

ROBERT LIM

A DISSERTATION

Presented to the Department of Computer and Information Science and the Division of Graduate Studies of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

September 2021

DISSERTATION APPROVAL PAGE

Student: Robert Lim
Title: Accelerating Machine Learning via Multi-Objective Optimization

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Computer and Information Science by:

Allen Malony, Chair
Boyana Norris, Core Member
Dejing Dou, Core Member
Camille Coti, Core Member
William Cresko, Institutional Representative

and

Andy Karduna, Interim Vice Provost for Graduate Studies

Original approval signatures are on file with the University of Oregon Division of Graduate Studies.

Degree awarded September 2021

© 2021 Robert Lim
All rights reserved.

DISSERTATION ABSTRACT

Robert Lim
Doctor of Philosophy
Department of Computer and Information Science
September 2021
Title: Accelerating Machine Learning via Multi-Objective Optimization

This dissertation presents approaches toward accelerating the training of deep neural networks with high-performance computing resources, while balancing learning and systems utilization objectives. Acceleration of machine learning is formulated as a multi-objective optimization problem that seeks to satisfy multiple objectives, subject to their respective constraints. In machine learning, the objective is a model with high accuracy that avoids false positives and generalizes beyond the training set. For systems execution performance, the objective is to maximize utilization of the underlying hardware resources within compute and power budgets, constraints that bound the problem. In both scenarios, the search space is combinatorial and contains multiple local minima that in many cases satisfy the global optimum. This dissertation addresses the search for solutions in both performance tuning and neural network training. Specifically, subgraph matching is proposed to bound the search problem and provide heuristics that guide the solver toward the optimal solution. Mixed precision operations are also proposed for solving systems of linear equations and for training neural networks for image classification, in order to evaluate the stability and robustness of the operations. Use cases are presented with CUDA performance tuning and neural network training, demonstrating the effectiveness of the proposed techniques. The experiments were carried out on single- and multi-node GPU clusters, and reveal opportunities for further exploration in this critical hardware/software co-design space of accelerated machine learning.

CURRICULUM VITAE

NAME OF AUTHOR: Robert Lim

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene, OR, USA
University of California, Irvine, Irvine, CA, USA
University of California, Los Angeles, Los Angeles, CA, USA

DEGREES AWARDED:
Doctor of Philosophy, Computer and Information Science, 2021, University of Oregon
Master of Science, Computer Science, 2014, University of California, Irvine
Bachelor of Science, Cognitive Science, Specialization in Computing, 2005, University of California, Los Angeles

AREAS OF SPECIAL INTEREST:
Numerical Analysis
Automatic Performance Tuning for GPUs
Multi-Objective Optimization
High Performance Computing

PROFESSIONAL EXPERIENCE:
Intern, U.S.
Army Engineer Research and Development Center, Geospatial Research Lab, Alexandria, VA Sept – Dec, 2019 Intern, Université de Versailles, Versailles, France Mar – Sept, 2019 Intern, Defence Science & Technology Lab, Salisbury, U.K. Aug – Sept, 2017 Intern, The Alan Turing Institute, London, U.K. June – Aug, 2017 Intern, U.S. Army Engineer Research and Development Center, Geospatial Research Lab, Alexandria, VA June – Sept, 2016 ASTRO Intern, Oak Ridge National Lab, Oak Ridge, TN June – Sept, 2014 Extreme Blue Technical Intern, IBM, Austin, TX May – Aug, 2013 vi GRANTS, AWARDS AND HONORS: Awards Graduate Research Fellowship, University of Oregon 2014 SMART Scholarship Award, U.S. Department of Defense 2015 Chateaubriand Fellowship, Embassy of France in U.S. 2017 Gurdeep Pall Fellowship, University of Oregon 2016, 2017 Student Travel Award, Ph.D. Forum Supercomputing Conference, Denver, CO Nov 2019 International Conference on Parallel Processing, Eugene, OR Aug 2018 IEEE Cluster, Chicago, IL Aug 2015 ACM Object-Oriented Programming, Systems, Languages & Applications, Portland, OR Sept 2014 PUBLICATIONS: Lim, R., Oliveira, P., Coti, C., Jalby, W., and Malony, A. “Reduced Precision Computation for Accurate and Robust Learning Systems.” 5th Workshop on Naval Applications of Machine Learning, 2021 (poster) Lim, R., Norris, B., and Malony, A. “A Similarity Measure for GPU Kernel Subgraph Matching.” 31st International Workshop on Languages and Compilers for Parallel Computing, 2019 Lim, R., Heafield, K., Hoang, H., Briers, M., and Malony, A. “Exploring Hyper-Parameter Optimization for Neural Machine Translation on GPU Architectures.” 2nd Workshop on Naval Applications of Machine Learning, 2018 Lim, R., Norris, B., and Malony, A. “Autotuning GPU Kernels via Static and Predictive Analysis.” 46th International Conference on Parallel Processing, 2017 Lim, R., Norris, B., and Malony, A. “Tuning Heterogeneous Computing Architectures through Integrated Performance Tools.” GPU Technology Conference, 2016 (poster) vii Lim, R., Malony, A., Norris, B., and Chaimov, N. “Identifying Optimization Opportunities within Kernel Execution in GPU Codes.” International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms, 2015 Sreepathi, S., Grodowitz, M., Lim, R., Taffet, P., Roth, P., Meredith, J., Lee, S., Li, D., and Vetter, J. “Application Characterization using Oxbow Toolkit and Pads Infrastructure.” International Workshop on Hardware-Software Co-Design for High Performance Computing, 2014 Lim, R., Carrillo-Cisneros, D., Alkowaileet, W., and Scherson, I. “Computationally Efficient Multiplexing of Events on Hardware Counters.” Linux Symposium, 2014 Alkowaileet, W., Carrillo-Cisneros, D., Lim, R., and Scherson, I. “NUMA- aware Multicore Matrix Multiplication.” Parallel Processing Letters, 2013 viii ACKNOWLEDGEMENTS I want to send my sincerest gratitude to the following individuals who have facilitated in the embarkment of this endeavour. Dissertation committee. Profs. Allen Malony, Boyana Norris, Dejing Dou, Camille Coti, Bill Cresko. Collaborators. Oak Ridge National Lab: Drs. Jeff Vetter, Megan Grodowitz, Sarat Sreepathi. The Alan Turing Institue: Drs. Kenneth Heafield, Hieu Hoang, Mark Briers. University of Versailles: Drs. William Jalby, Pablo Oliveira. Compute resources. Argonne National Lab: Drs. Kevin Harms, Phil Carns. NVIDIA: J-C Vasnier, Duncan Poole, Barton Fiske. University of Reims, Champagne Ardenne: Drs. Michael Krajecki, Arnaud Renaud. Colleagues. 
TAU Developers: Sameer Shende, Kevin Huck, Wyatt Spear, Nicholas Chaimov, Aurele Maheo, Josefina Lenis, Alister Johnson, Srinivasan Ramesh. Office mates: Chad Wood, Daniel Ellsworth, David Ozog. SMART: Arnold Boedihardjo, Alan Van Nevel, Marisa Garcia, Jessica Holland, Jess Molina, Brandon Cochenour, Karrin Felton. University of Oregon. CIS Department: Drs. Joe Sventek, Hank Childs, Reza Rejaie. CIS Staff: Cheri Smith, Jan Saunders, Rob Yelle, Charlotte Wise. Neuroinformatics Center: Dr. Don Tucker, Erik Keever. Division of Graduate Studies: Jered Nagel, Lesley Yates-Pollard, Andy Karduna. Family members. Mom, Dad, Susa, Melody, Tabitha, Lucas, Chad. I want to acknowledge my advisor, Prof. Allen Malony, who has undoubtedly created an environment for me to thrive in. That car ride to Falling Sky during my recruitment and the conversation we had about GPUs and harsh ix journal reviewers was beyond convincing for me to attend the University of Oregon, not to mention the Eugene mist and Ducks Football. I feel very fortunate for the unconditional advisement I received, having a channel to brainstorm ideas, and plowing through the countless deadlines. I also want to acknowledge my dissertation committee, especially Profs. Boyana Norris and Camille Coti, for their unequivocal support throughout various phases in this process. Many thanks! x To my dad, whose work ethic has inspired me in many ways. xi TABLE OF CONTENTS Chapter Page I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . 1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Organization of Dissertation . . . . . . . . . . . . . . . . . . . 2 Background Information . . . . . . . . . . . . . . . . . . . 2 Optimizing Code Generation . . . . . . . . . . . . . . . . . 2 GPU Subgraph Matching . . . . . . . . . . . . . . . . . . . 3 Hyper-Parameter Optimization . . . . . . . . . . . . . . . . 3 Numerical Representation . . . . . . . . . . . . . . . . . . . 4 Summary and Future Work . . . . . . . . . . . . . . . . . . 4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 II. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . 5 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Deep Learning Application Domains . . . . . . . . . . . . . . 5 ML HPC Architectures . . . . . . . . . . . . . . . . . . . . 5 Graphic Processing Units . . . . . . . . . . . . . . . . . 6 Tensor Processing Unit . . . . . . . . . . . . . . . . . . 7 Scalable CPUs . . . . . . . . . . . . . . . . . . . . . . 8 Mixed Precision Numerical Methods . . . . . . . . . . . . . . 8 AI and HPC . . . . . . . . . . . . . . . . . . . . . . . . 9 Background Information . . . . . . . . . . . . . . . . . . . . . 9 xii Chapter Page Multi-Objective Optimization . . . . . . . . . . . . . . . . . 10 ML Terminology . . . . . . . . . . . . . . . . . . . . . . . 11 Optimization in Machine Learning . . . . . . . . . . . . . . . 11 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . 13 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 III. OPTIMIZING CODE GENERATION . . . . . . . . . . . . . . . 15 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 18 CUDA Programming Model and Control Flow Divergence . . . . 18 GPU Performance Tools . . . . . . . . . . . . . . . . . . . 19 Autotuning . . . . . . . . . . . . . . . . . . . . . . . . . 19 Methodology . . . . . 
. . . . . . . . . . . . . . . . . . . . . 21 Occupancy . . . . . . . . . . . . . . . . . . . . . . . . . 22 Occupancy Calculation . . . . . . . . . . . . . . . . . . 22 Definition of Occupancy . . . . . . . . . . . . . . . . . . 23 Theoretical Occupancy . . . . . . . . . . . . . . . . . . 23 Instruction Mix Metrics . . . . . . . . . . . . . . . . . . 25 Pipeline Utilization . . . . . . . . . . . . . . . . . . . . 26 Infer Kernel Execution Time . . . . . . . . . . . . . . . . 27 ORIO Code Generation . . . . . . . . . . . . . . . . . . . 28 OCC Results . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Environment . . . . . . . . . . . . . . . . . . . . . . . . 28 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 xiii Chapter Page Improved Autotuning with Static Analyzer . . . . . . . . . . . 32 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 35 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 IV. CONTROL FLOW SUBGRAPH MATCHING . . . . . . . . . . . . 41 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Kernel Control Flow Graphs . . . . . . . . . . . . . . . . 46 Transition Probability . . . . . . . . . . . . . . . . . . . 48 Hybrid Static and Dynamic Analysis . . . . . . . . . . . . 49 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Bilinear Interpolation . . . . . . . . . . . . . . . . . . . . 50 Pairwise Comparison . . . . . . . . . . . . . . . . . . . . . 50 CFG Results . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Applications . . . . . . . . . . . . . . . . . . . . . . . . 51 Rodinia . . . . . . . . . . . . . . . . . . . . . . . . . 51 SHOC Benchmark Suite . . . . . . . . . . . . . . . . . . 51 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Application Level . . . . . . . . . . . . . . . . . . . . . . 51 CFG Subgraph Matching . . . . . . . . . . . . . . . . . . . 53 Distribution of Matched Pairs . . . . . . . . . . . . . . . 53 xiv Chapter Page Error Rates from Instruction Mixes . . . . . . . . . . . . . 54 Pairwise Matching of Kernels . . . . . . . . . . . . . . . . 56 Clustering of Kernels . . . . . . . . . . . . . . . . . . . 57 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 58 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 V. OPTIMIZING HYPER-PARAMETERS FOR NEURAL NETWORKS . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 64 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Machine Translation . . . . . . . . . . . . . . . . . . . . . 66 Recurrent Neural Networks . . . . . . . . . . . . . . . . 66 RNN Encoder-Decoder . . . . . . . . . . . . . . . . . . 67 Neural Machine Translation . . . . . . . . . . . . . . . . 67 Optimization Objectives . . . . . . . . . . . . . . . . . . . 68 SGD Optimizers . . . . . . . . . . . . . . . . . . . . . 68 Activation Functions . . . . . . . . . . . . . . . . . . . 69 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . 
70 Combination of Optimizers . . . . . . . . . . . . . . . . . . 71 Marian NMT . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Translation Quality . . . . . . . . . . . . . . . . . . . . . 74 Training Stability . . . . . . . . . . . . . . . . . . . . . . 77 xv Chapter Page Convergence Speed . . . . . . . . . . . . . . . . . . . . . . 81 Cost of Tuning a Hyper-Parameter . . . . . . . . . . . . . . . 83 Summarize Findings . . . . . . . . . . . . . . . . . . . . . . . 84 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 VI. NUMERICAL REPRESENTATION . . . . . . . . . . . . . . . . 88 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Floating-Point Numbers in Machine Learning . . . . . . . . . . . . 90 Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Verificarlo Modes . . . . . . . . . . . . . . . . . . . . . . 92 Primitive Types from Composite Types . . . . . . . . . . . 94 Rounding . . . . . . . . . . . . . . . . . . . . . . . . 95 Multi-Objective Optimization . . . . . . . . . . . . . . . . . 96 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 97 Applications and Execution Environment . . . . . . . . . . . . 98 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 VII. SUMMARY AND FUTURE DIRECTIONS . . . . . . . . . . . . . 109 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 109 xvi Chapter Page Optimizing Code Generation . . . . . . . . . . . . . . . . . 109 GPU Subgraph Matching . . . . . . . . . . . . . . . . . . . 110 Hyper-Parameter Optimization . . . . . . . . . . . . . . . . 110 Numerical Representation . . . . . . . . . . . . . . . . . . . 110 APPENDICES A. THE FIRST APPENDIX . . . . . . . . . . . . . . . . . . . . 112 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . 112 B. THE SECOND APPENDIX . . . . . . . . . . . . . . . . . . . 114 Bilinear Interpolation . . . . . . . . . . . . . . . . . . . . . . 114 Efficiency and Goodness . . . . . . . . . . . . . . . . . . . . . 115 REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . 116 xvii LIST OF FIGURES Figure Page 1. Tensor processing unit (image source Cloud TPU (2019)). . . . . . . . 7 2. Quantizing multiply-add-accumulate operation. . . . . . . . . . . . 9 3. AI training runs, showing 3-4 month doubling time of petaflops. Image source AI and Compute (2018). . . . . . . . . . . . 10 4. Branch divergence problem and performance loss incurred. . . . . . . 18 5. Optimization framework for GPU kernels incorporating static and dynamic analysis, with autotuning and code transformation. . . . . . . . . . . . . . . . . . . . . . . . . . 20 6. Performance autotuning specification. . . . . . . . . . . . . . . . . 27 7. Thread counts for Orio autotuning exhaustive search, comparing architectures and kernels. . . . . . . . . . . . . . . . . 29 8. Time-to-instruction mix ratio, comparing architectures and kernels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 9. 
Improved search time over exhaustive autotuning, comparing static and rule-based approaches. . . . . . . . . . . . . . 32 10. Occupancy calculator displaying thread, register and shared memory impact for current (top) and potential (bottom) thread optimizations for the purposes of increasing occupancy. . . . . . 36 11. Overview of our proposed CUDAflow methodology. . . . . . . . . . . 45 12. Control flow graphs generated for each CUDA kernel, comparing architecture families (Kepler, Maxwell, Pascal). . . . . . . . 46 13. Transition probability matrices for Pathfinder (dynproc kernel) application, comparing Kepler (left) and Maxwell (right) versions. . . . . . . . . . . . . . . . . . . . 48 xviii Figure Page 14. Left: The static goodness metric (Eq. B.2) is positively correlated with the dynamic efficiency metric (Eq. B.1). The color represents the architecture and the size of bubbles represents the number of operations. Right: Differences in vertices between two graphs, as a function of Euclidean metric for all GPU kernel combinations. Color represents intensity. . . . 53 15. Error rates when estimating instruction mixes statically from runtime observations for selected matched kernels (x-axis), with IsoRank scores near 1.30. . . . . . . . . . . . . . . . 54 16. Similarity measures for Euclidean, IsoRank and Cosine distances for 12 arbitarily selected kernels. . . . . . . . . . . . . . . 55 17. Similarity measures for Jaccard, Minkowski and Manhattan distances for 12 arbitarily selected kernels. . . . . . . . . . . . . . . 56 18. Dendrogram of clusters for 26 kernels, comparing Maxwell (left) and Pascal (right) GPUs. . . . . . . . . . . . . . . . . . . . 57 19. Dendrogram of clusters for pairwise comparison for 3 kernels across 3 GPUs (9 total). . . . . . . . . . . . . . . . . . . . . . . 59 20. Dendrogram of clusters for pairwise comparison for 3 kernels across 3 GPUs (27 total). . . . . . . . . . . . . . . . . . . . . . 59 21. RNN encoder-decoder, illustrating a sentence translation from English to French. The architecture includes a word embedding space, a 1-of-K coding and a recurrent state on both ends.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 22. BLEU scores as a function of training time (seconds), comparing GPUs (color), activation units (sub-columns), learning rates and translation directions. . . . . . . . . . . . . . . 78 23. BLEU scores as a function of training time (seconds), comparing GPUs (color), activation units (sub-columns), learning rates and translation directions. . . . . . . . . . . . . . . 79 24. Cross entropy over the number of epochs for RO → EN and EN → RO, comparing activation functions and GPUs. . . . . . . . . 80 25. Cross-entropy over the number of epochs for DE → EN and EN → DE, comparing activation functions and GPUs. . . . . . . . . 80 xix Figure Page 26. Average words-per-second for the RO → EN translation task, comparing systems. . . . . . . . . . . . . . . . . . . . . . 81 27. Comparison of floating point representations (image source TF32 (2019a)). . . . . . . . . . . . . . . . . . . . . . . . 89 28. Verificarlo workflow. . . . . . . . . . . . . . . . . . . . . . . . 92 29. Virtual precision in Verificarlo, showing r = 5 and p = 10, simulating a binary16 embedded inside a binary 32. . . . . . . . . . . 94 30. Search for precision and range settings during training. . . . . . . . . 96 31. Accuracy per epoch, comparing vision models. . . . . . . . . . . . . 100 32. Rounding errors for various mantissa sizes, comparing vision models. . . 
102 33. Accuracy over time (top) and loss over time (bottom), comparing vision models. . . . . . . . . . . . . . . . . . . . . . 104 A.1. Gradient descent for different learning rates. . . . . . . . . . . . . . 113 xx LIST OF TABLES Table Page 1. Comparing ML computer processors and accelerators. . . . . . . . . . 6 2. Selected optimization methods targeting machine learning and high performance computing. . . . . . . . . . . . . . . . . . . 12 3. Instruction throughput per number of cycles. . . . . . . . . . . . . . 26 4. A subset of features used for thread block classification. . . . . . . . . 27 5. Kernel specifications. . . . . . . . . . . . . . . . . . . . . . . . 29 6. Statistics for autotuned kernels for top performers (top half) and poor performers (bottom half), comparing GPU architecture generations. . . . . . . . . . . . . . . . . . . . . . . 30 7. Error rates when estimating dynamic instruction mixes from static mixes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 8. Suggested parameters to achieve theoretical occupancy. . . . . . . . . 33 9. Distance measures considered in this paper. . . . . . . . . . . . . . 50 10. Description of SHOC (top) and Rodinia (bottom) benchmarks studied. . . . . . . . . . . . . . . . . . . . . . . . . 52 11. Stochastic gradient descent and its variants. . . . . . . . . . . . . . 68 12. Activation units for RNN. . . . . . . . . . . . . . . . . . . . . . 69 13. Dropout versus a standard update function. . . . . . . . . . . . . . 70 14. Marian hyper-parameters, with options in brackets. . . . . . . . . . . 73 15. Datasets used in experiments. . . . . . . . . . . . . . . . . . . . 74 16. Graphical processors used in this experiment. . . . . . . . . . . . . . 75 17. Hardware and execution environment information. . . . . . . . . . . 75 18. BLEU scores for validation (top) and test (bottom) datasets. . . . . . . 76 xxi Table Page 19. Dropout rates, BLEU scores and total training time for test set, comparing systems. . . . . . . . . . . . . . . . . . . . . . . 76 20. Words-per-second (average) and number of epochs, comparing activation units, learning rates and GPUs. . . . . . . . . . 82 21. Total training time for four translation directions, comparing systems. . . 83 22. Average time spent per iteration for RO → EN and EN → RO translation directions, comparing systems, with standard deviation in parenthesis and epochs in brackets. . . . . . . . . . . . . 84 23. Average time spent per iteration for DE → EN and EN → DE translation directions, comparing systems, with standard deviation in parenthesis and epochs in brackets. . . . . . . . . . . . . 85 24. IEEE-754 Numbers and exceptions. . . . . . . . . . . . . . . . . . 91 25. Verificarlo backends and options. . . . . . . . . . . . . . . . . . . 93 26. Neural networks for image classification evaluated in this study. . . . . . 98 27. Intel Xeon Platinum hardware and execution environment information. . 99 28. Statistics for first ten epochs of training, comparing precision sizes and models. . . . . . . . . . . . . . . . . . . . . . . . . . 101 29. Displaying multi-objective results when accounting for accuracy and average time spent per epoch. . . . . . . . . . . . . . 103 xxii CHAPTER I INTRODUCTION Overview This dissertation work presents various approaches toward accelerating training of deep neural networks with the use of high-performance computing (HPC) resources, while balancing learning and systems utilization objectives. 
Acceleration of machine learning (ML) is formulated as a multi-objective optimization problem that seeks to jointly optimize performance and learning objectives, subject to their respective constraints. Within each scope, the solver seeks an optimal solution. In machine learning, the solver optimizes a prediction function h : X → Y from an input space X to an output space Y, for x ∈ X, such that ŷ = F_h(x) accurately predicts the output and minimizes its empirical risk, defined as the expectation

R(F_h) = E[L(F_h(x), y)] = ∫ L(F_h(x), y) dP(x, y),

for some loss function L(ŷ, y) estimating ŷ from y; the goal is to find F_h* = arg min_{F_h ∈ F} R(F_h), that is, a prediction function F_h ∈ F for which R(F_h) is minimal. When optimizing for execution performance, the solver seeks to minimize a cost function g : X → Y by selecting the combination of compute resources a ∈ A and transformation options o ∈ O, X(A, O), which yields minimal execution time while maximizing utilization of hardware resources, or F_g* = arg min_{F_g ∈ F} R(F_g), where R(F_g) = E[L(F_g(x), y)] = ∫ L(F_g(x), y) dP(x, y). Joint optimization of machine learning and high-performance computing, discussed in Chapter II, is defined as min_{F_g ∈ F} min_{F_h ∈ F} R(F), for F_g, F_h ∈ F, where F represents a vector of objective functions satisfying constraints for their respective domains.
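As a minimal, self-contained illustration of the empirical-risk view above (not code from this dissertation), the expectation R(F_h) can be approximated on a finite sample by the mean loss, and the solver simply returns the candidate predictor with the smallest empirical risk. The loss function and the toy threshold hypotheses below are assumptions made only for this sketch.

    import numpy as np

    def squared_loss(y_hat, y):
        return (y_hat - y) ** 2

    def empirical_risk(h, X, y, loss=squared_loss):
        # Finite-sample estimate of R(F_h) = E[L(F_h(x), y)].
        return float(np.mean([loss(h(x), yi) for x, yi in zip(X, y)]))

    # Toy hypothesis class: threshold predictors h_t(x) = 1 if x > t else 0.
    X = np.array([0.1, 0.4, 0.6, 0.9])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    candidates = [lambda x, t=t: float(x > t) for t in (0.25, 0.5, 0.75)]

    # "arg min" over the (finite) hypothesis set F.
    best = min(candidates, key=lambda h: empirical_risk(h, X, y))

The execution-performance objective F_g is treated the same way, with the loss replaced by a measured or modeled execution cost.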
Organization of Dissertation

This section outlines the dissertation format and provides an overview of each chapter.

Background Information. Accelerating machine learning and multi-objective optimization are presented in Chapter II, which motivates the discussion and provides the background information for the dissertation. The subject matter is the intersection of high-performance computing and machine learning, specifically how the innovations of heterogeneous parallel programming and methods for analyzing massive amounts of data have transformed industries and society, making this an important field to investigate. Performance optimization depends on numerous elements involved in the computation, including both hardware and software. Thus, the computer architectures, the algorithms, and the numerical methods related to machine learning are briefly covered in this chapter. In the chapters that follow, several approaches are presented for optimizing performance and learning by accelerating the computation kernels used by the machine learning algorithms, by tuning the hyper-parameters of these algorithms, and by understanding the numerical representation of the data handled by these algorithms. The next subsections provide a brief overview and the research questions raised for each chapter.

Optimizing Code Generation. Chapter III discusses the work where we proposed metrics for automatic performance tuning of GPU applications. This work seeks to address the following questions:
1. Given the difficult requirements placed on the user in writing CUDA code, where the user is forced to set threads, blocks, and shared memory, and to write efficient parallel programs, could we automatically come up with ways to discover optimal parameter settings?
2. What metrics can we define statically, such as occupancy, that capture the performance requirements of a computational kernel, and can we use those to help improve our search during automatic performance tuning?

GPU Subgraph Matching. Chapter IV proposes several techniques for performing subgraph matching with GPU kernels. The proposed techniques incorporate the shape and traversal of the graph, its transition probabilities, and hardware information such as the GPU the graph was generated on and instruction mixes. The following questions are proposed:
1. Can we come up with compact ways of representing execution performance information of GPU kernels that capture the essence of runtime information but, at the same time, enable us to reason about an unseen kernel's behavior?
2. Can we define a similarity metric that enables us to match graphs with one another? This similarity metric needs to be GPU architecture independent, provide a correct measure when a graph is measured against itself, and be able to match graphs of arbitrary shapes and sizes.

Hyper-Parameter Optimization. Chapter V optimizes hyper-parameters for a neural machine translation system. The hyper-parameters explored vary, and the study accounts for training stability, trajectories and speed. Other statistics include words processed per second and time to convergence. In this work, we address the following questions:
1. Can we identify which hyper-parameters contribute the most to a model's learning trajectory, while accounting for stability, quality, and speed?
2. Can we identify which hyper-parameters matter most in terms of system execution performance?

Numerical Representation. Chapter VI examines the numerical requirements when training deep neural networks. In particular, mixed precision operations are proposed that enable users to set the precision and range sizes during the training run. The questions raised in this research are the following:
1. What are the precision requirements during the various phases of training deep neural networks?
2. Can we propose a dynamic mixed-precision approach toward training neural networks, where the precision sizes can be set during the phases of the training run?

Summary and Future Work. Chapter VII closes by summarizing the work discussed in the previous chapters and presents areas to pursue for future work.

Conclusion

An overview of multi-objective optimization and the dissertation format was presented. Next, motivation and background information relating to accelerated machine learning are presented.

CHAPTER II
BACKGROUND

This chapter provides motivation for the dissertation work and covers the basic concepts needed for the discussion in the rest of the dissertation.

Motivation

Deep Learning Application Domains. The U.S. Department of Energy has outlined its artificial intelligence (AI) objectives to couple machine learning methods with HPC workloads in its concurrent quest to build the first exaflop supercomputing machine Stevens et al. (2020). The report identifies areas where ML could augment existing scientific workflows, including chemistry, materials and nanoscience; earth and environmental sciences; biology and life sciences; and high energy physics and nuclear physics, amongst others. The social media industry, which provides free services to millions of users, has made billions of dollars deploying deep neural networks to analyze petabytes of user data on-line and off-line for its own use in recommendation systems, computer vision, and language comprehension Park et al. (2018). What makes this innovation possible is both the advancement in AI and the compute infrastructure that delivers instantaneous results to users on their end devices.

ML HPC Architectures.
This subsection discusses various computer architectures and accelerators that have been developed for machine learning purposes. Because the majority of operations during machine learning are matrix-vector products, much effort has been made to fabricate parallel matrix multipliers to accelerate machine learning. Also referred to as neural processors Neural Processor (2020), these architectures include GPUs, CPUs, and custom ASICs, such as tensor processing units (TPU). Not discussed in this dissertation but also part of neural processors are neuromorphic chips, such as IBM True North Modha (2017) and Intel Loihi Davies et al. (2018), and field programmable gate arrays (FPGA). Table 1, drawn from Wang, Wei, and Brooks (2019); Intel Sky Lake (2017); Intel Cascade Lake (2019), displays a general comparison of CPUs, GPUs and TPUs.

Table 1. Comparing ML computer processors and accelerators.

                   CPU                    GPU                 TPU
                   Skylake  Cascade Lake  Volta     Ampere    v2            v3
Processor cores    28       56            80        96        512           2048
Freq (MHz)         2500     2600          1328      1328      700           940
Peak perf          2T       3.2T          125T      312T      180T          420T
Memory type        DDR4     DDR4          HBM2      HBM2      HBM           HBM
Off-chip (GB)      120      140           16        40        32            40
BW (GB/s)          16.6     23.4          900       1555      2400          3600
TDP (W)            205      400           300       450       280           450
MXU                n/a      n/a           4×4 (8)   4×4 (4)   128×128 (1)   128×128 (2)

Graphics Processing Units. NVIDIA GPUs, which dominate the market share of accelerators for machine learning due to the proven use of GPUs and the CUDA Deep Neural Network (cuDNN) library, have been aggressively introducing new hardware capabilities for fused operations and reduced precision. GPUs were designed originally for real-time graphics rendering using fixed-function pipelines, with each pixel performing independent operations in parallel; CUDA debuted as a parallel computing platform in 2007 that enabled user-defined programs, called shaders, that combined the vertex and fragment operations, which is the foundation of the data-parallel programming units that run on GPUs and currently power some of the world's fastest supercomputers. The NVIDIA Volta V100 marks the release of hardware support for tensor cores in 2017, which are capable of executing 4×4 matrices in 16-bit and use warps of 32 parallel threads. The NVIDIA Ampere A100, released in 2020, adds the TensorFloat-32 instruction set TF32 (2019a), which utilizes 8 bits for the exponent and 10 bits for the mantissa, and is essentially a 19-bit format in a 32-bit register. Software frameworks such as PyTorch, TensorFlow and MXNet rely on the cuDNN library for acceleration. The library can be downloaded for free with a registered NVIDIA account.

Tensor Processing Unit. The tensor processing unit (TPU), created by Google in 2016, is a high-performance, application-specific integrated circuit (ASIC) designed for neural network training with 64 GB of high-bandwidth memory and 180 teraflops (TFLOPS) peak performance Cloud TPU (2019). Figure 1 displays a schematic representation of the TPUs for v1 and v2.

Figure 1. Tensor processing unit (image source Cloud TPU (2019)).

Organized as a systolic array of 65,536 8-bit matrix multiplier units (MXU) with hardwired activation units and a 24 MB unified buffer, inputs are read once and reused, in contrast to the conventional approach that loads and stores each value TF32 (2019b). Values are further quantized to 8 bits to increase the throughput of the instructions. The TPU performs well for training deep neural networks with large batch sizes and is capable of low-latency inferencing.
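The 8-bit quantization described above can be sketched in a few lines: activations and weights are mapped to int8, the multiply-accumulate is carried out in 32-bit integers, and the result is rescaled. This is an illustrative NumPy sketch, not the TPU's actual quantization scheme; the symmetric scale factors are an assumption made for the example.

    import numpy as np

    def quantize_int8(x):
        # Symmetric linear quantization: map the largest magnitude to 127.
        scale = max(np.max(np.abs(x)), 1e-12) / 127.0
        q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
        return q, scale

    def quantized_matmul(a, b):
        qa, sa = quantize_int8(a)
        qb, sb = quantize_int8(b)
        # 8-bit multiplies accumulated in 32-bit integers, then rescaled.
        acc = qa.astype(np.int32) @ qb.astype(np.int32)
        return acc.astype(np.float32) * (sa * sb)

    a = np.random.randn(4, 8).astype(np.float32)
    b = np.random.randn(8, 3).astype(np.float32)
    max_err = np.max(np.abs(quantized_matmul(a, b) - a @ b))

The rescaled product differs from the float32 reference only by quantization error, which is the trade-off that buys the higher instruction throughput discussed in the following subsections.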
Scalable CPUs. Intel, known traditionally as a CPU manufacturer, has also been aggressively wedging into the market for high-performance computing and machine learning. The CPU's ability for multi-processing with mixed-mode accelerators can be advantageous for AI models, such as reinforcement learning and agent-based modeling. Intel also open sourced the Math Kernel Library for Deep Neural Networks (MKL-DNN) in 2019 as part of the oneAPI initiative, a unified programming model that targets all Intel devices, including CPUs, GPUs, and FPGAs. The Intel Xeon Scalable processors enable multi-node, multi-socket connections, with 8-bit multiplies and 32-bit accumulates provided by instructions such as 8-bit operations accumulated in 16-bit registers (VPMADDUBSW), 16-bit multiplies accumulated in 32-bit registers (VPMADDWD), and neural network instructions (AVX512 VNNI), which are 512-bit vectorized intrinsics that perform 8-bit integer operations and accumulate the results in 32-bit registers. Other product lines that Intel provides for AI include Xe GPU accelerators, Arria FPGAs, and high-performance memory subsystems such as 3D XPoint.

Mixed Precision Numerical Methods. Mixed precision algorithms have been successful in providing execution performance benefits, such as increased instruction throughput, a smaller memory footprint and savings in energy. Figure 2 illustrates how the precision of the output of the activation is reduced to N bits (image source: Sze, Chen, Yang, and Emer (2017)).

Figure 2. Quantizing multiply-add-accumulate operation.

Reduced precision is commonplace on modern microprocessors and accelerators, such as TPUs that have 8-bit integer arithmetic and GPUs that have mixed precision execution modes, including the Pascal (8-, 16-bit to 32-bit) and the Turing (1-, 4-, 8-, 16-bit to 32-bit) models.
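Reduced-precision formats such as the 19-bit TensorFloat-32 layout mentioned earlier differ from binary32 mainly in how many significand bits they keep. A reduced format can be simulated in software by rounding a value to p significand bits; the helper below is a minimal sketch of that idea (the function name is ours, and it ignores exponent-range limits and hardware rounding modes).

    import numpy as np

    def round_to_precision(x, p):
        # Keep p significand bits (including the implicit leading bit) by
        # splitting x = m * 2**e with 0.5 <= |m| < 1, rounding m, and rescaling.
        m, e = np.frexp(np.asarray(x, dtype=np.float64))
        m = np.round(m * 2.0 ** p) / 2.0 ** p
        return np.ldexp(m, e)

    x = np.float64(np.pi)
    fp32_like = round_to_precision(x, 24)   # binary32 keeps 24 significand bits
    tf32_like = round_to_precision(x, 11)   # TF32/binary16 keep an 11-bit significand
    error = abs(tf32_like - x)

Chapter VI returns to this kind of mantissa truncation when exploring the precision requirements of neural network training.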
AI and HPC. The advancement of AI has been driven by algorithmic innovations, massive training sets, and compute capabilities AI and Compute (2018). Figure 3 shows the total amount of compute, in petaflop/s-days, used to train selected AI models. Although one may criticize that the shortcomings of current AI approaches are exposed by their massive compute requirements, the availability of compute often facilitates algorithmic advances, as seen in the evolution of AI approaches, from neural networks to reinforcement learning, and in the types of problems being solved, including image classification and competitive gaming.

Figure 3. AI training runs, showing 3-4 month doubling time of petaflops. Image source AI and Compute (2018).

Background Information

This section covers the prerequisite information needed for discussing the dissertation topics. The topics include multi-objective optimization, machine learning terminology, optimization in machine learning, and stochastic gradient descent.

Multi-Objective Optimization. Multi-objective optimization seeks to optimize multiple criteria, leveraging overlapping objectives while balancing the trade-offs associated with such choices. For accelerating machine learning, multi-objective optimization seeks to achieve quality model learning while leveraging high-performance computing resources. Multi-objective optimization is formulated as follows. Taken from Miettinen (2012), let k represent the number of objective functions, with m inequality constraints and e equality constraints. Let x ∈ E^n be a vector of decision variables with n independent variables x_i, and F(x) ∈ E^k be a vector of objective functions F_i(x) : E^n → E^1, where E^1 represents the criterion, or the cost, of the objective. Multi-objective optimization is defined as

min_x F(x) = [F_1(x), F_2(x), ..., F_k(x)]^T
subject to
  g_j(x) ≤ 0,  j = 1, 2, ..., m,
  h_l(x) = 0,  l = 1, 2, ..., e.        (2.1)

The gradient of F_i(x) is ∇_x F_i(x) ∈ E^n, and x_i* is a point that minimizes the objective function F_i(x). The feasible design space X, or the decision space, is defined as the set {x | g_j(x) ≤ 0, j = 1, 2, ..., m; h_l(x) = 0, l = 1, 2, ..., e}, whereas the feasible criterion space Z, or the attainable set, is defined as {F(x) | x ∈ X}. Feasibility implies that no constraints are violated, whereas attainability implies that a point in the criterion space Z maps to a point in the decision space X. A point x* ∈ X is Pareto optimal if and only if there does not exist another point x ∈ X such that F(x) ≤ F(x*) and F_i(x) < F_i(x*) for at least one criterion.
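The Pareto-optimality condition above is straightforward to check on a finite set of candidate points in the criterion space. The sketch below is illustrative only: the two criteria and their values are invented, and every objective is assumed to be minimized, as in Eq. 2.1.

    import numpy as np

    def dominates(fa, fb):
        # fa dominates fb if it is no worse in every criterion and strictly better in one.
        return bool(np.all(fa <= fb) and np.any(fa < fb))

    def pareto_front(F):
        # F: (n_points, k) array of objective values; returns indices of non-dominated points.
        return [i for i, fi in enumerate(F)
                if not any(dominates(fj, fi) for j, fj in enumerate(F) if j != i)]

    # Toy criterion space: (validation error, training time in seconds) per configuration.
    F = np.array([[0.10, 500.0],
                  [0.12, 300.0],
                  [0.09, 900.0],
                  [0.12, 800.0]])    # the last point is dominated by the second
    front = pareto_front(F)          # -> [0, 1, 2]

For the accelerated-ML setting, the criteria would pair a learning objective (for example, validation loss) with a systems objective (for example, time or energy per epoch), and the solver's task is to explore the decision space X so that the attained points land on, or near, this front.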
ML Terminology. Machine learning is a subcategory of artificial intelligence, and neural networks are a subcategory of machine learning that takes a brain-inspired approach toward learning.

Optimization in Machine Learning. Optimization is a mathematical procedure for finding a maximum or a minimum value of a function of several variables, subject to a set of constraints, as in linear programming or systems analysis Chong and Zak (2013). Decision making, where optimization plays a central role, entails selecting the best option amongst a set of alternatives. An objective function, or performance index, provides a "goodness" measure, where the optimization procedure selects the best alternative in light of the given objective function.

Table 2. Selected optimization methods targeting machine learning and high performance computing.

Algorithmic              Solver    Search      Compilation       Node               Distributed
SVM                      SGD       Random      Loop transforms   GPU thread block   Multi-GPU
Neural nets              Adam      Grid        Vector/SIMD       Memory hierarchy   Cluster parallelism
Least-squares            Adagrad   Bayesian    IR                Intrinsics         Block partition
KNN                      FFT       MCTS        Mixed precision   Operator fusion    Lazy/eager
Reinforcement learning   Newton    Heuristic   Kernel fusion     Rounding           BFGS, Downpour

Table 2 lists the objectives to optimize for accelerating machine learning. The methods are categorized according to the level of the software stack. Note that this is not an exhaustive list, but a subset that incorporates both ML and HPC. Note, also, that there may be overlaps and that each optimization target may fall under several categories. This dissertation attempts to address the areas that are highlighted.

Algorithmic optimization involves selecting the ML classifier for the task at hand, whether supervised or unsupervised, and its complexity, such as the number of learned parameters. Examples of machine learning algorithms include support vector machines (SVM), neural networks, least-squares methods, K-nearest neighbors (KNN), and reinforcement learning. The solver is the iterative method that provides a performance index during the learning process; solvers include stochastic gradient descent (SGD) and its variants, such as AdaGrad and Adam, second-order methods such as Newton's method, and transformative approaches such as the fast Fourier transform (FFT). Search optimization is concerned with the identification of maximum or minimum values. Techniques to perform search include random, grid, Bayesian, Monte Carlo tree search, and heuristics-based search. Once the weights have been trained for the model, compilation attempts to optimize the code for execution performance. Code transformations that exploit data locality include loop transformations, vectorization and SIMD approaches, source-to-source translation, reduced precision, and kernel fusion. Single-node optimization targets features available at the computer architecture level, and includes the memory hierarchy, intrinsics, and stochastic rounding. Distributed optimization makes use of multiple clusters and multiple accelerators for machine learning, accounting for compute availability, scheduling, checkpointing and problem partitioning. Collectively, this illustrates the complexity and trade-offs of the landscape when accounting for all factors in optimizing the multiple objectives in machine learning and high-performance computing.

Stochastic Gradient Descent. Stochastic gradient descent is an iterative method that minimizes an objective function F by estimating the parameter w from individual terms F_i(w), where i indexes the i-th observation Stochastic Gradient Descent (2021). For training neural networks, the weights are learned with each batch of data and updated iteratively. Refer to Appendix A for the derivation of stochastic gradient descent. Algorithm 1 lists the stochastic gradient method, which performs the following steps. A random variable E_k is generated via a Taylor expansion series, with {E_k} representing a sequence of jointly independent random variables. Given an iterate w_k ∈ R^d and the realization of E_k, a stochastic vector g(w_k, E_k) ∈ R^d is computed. Then, given an iteration number k ∈ N, a scalar stepsize α_k > 0 is chosen.

Algorithm 1 Stochastic gradient method.
1: Choose an initial iterate w_1
2: for k = 1, 2, ... do
3:   Generate a realization of the random variable E_k
4:   Compute a stochastic vector g(w_k, E_k)
5:   Choose a step size α_k > 0
6:   Set the new iterate as w_{k+1} ← w_k − α_k g(w_k, E_k)

The stochastic gradient estimate of g over a sample set S_k is defined as

∇f_{S_k}(w_k; E_k) = (1 / |S_k|) Σ_{i ∈ S_k} ∇f(w_k; E_{k,i}).        (2.2)
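A concrete instance of Algorithm 1 and Eq. 2.2 is the minibatch SGD loop below, written for a synthetic least-squares problem. It is a sketch for illustration only: the data, the fixed step size, and the batch size are assumptions, and the realization of E_k is taken to be a random draw of sample indices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic least-squares problem: f_i(w) = 0.5 * (x_i @ w - y_i) ** 2
    X = rng.normal(size=(1000, 5))
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.1 * rng.normal(size=1000)

    def minibatch_gradient(w, idx):
        # Eq. (2.2): average of per-sample gradients over the sampled set S_k.
        Xb, yb = X[idx], y[idx]
        return Xb.T @ (Xb @ w - yb) / len(idx)

    w = np.zeros(5)                          # step 1: initial iterate w_1
    alpha, batch = 0.05, 32                  # step 5: fixed step size alpha_k
    for k in range(500):                     # step 2
        idx = rng.choice(len(y), size=batch, replace=False)   # step 3: realization of E_k
        g = minibatch_gradient(w, idx)       # step 4: stochastic vector g(w_k, E_k)
        w = w - alpha * g                    # step 6: w_{k+1} <- w_k - alpha_k g(w_k, E_k)

After a few hundred iterations w approaches w_true; the variants listed in Table 11 (AdaGrad, Adam, and others) differ only in how the step direction and size are adapted from these stochastic gradients.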
Conclusion

This chapter covered the background information needed for the rest of the dissertation. Next, we discuss the core areas of the dissertation: optimizing code generation, control flow subgraph matching, optimizing hyper-parameters, and numerical representation.

CHAPTER III
OPTIMIZING CODE GENERATION

This chapter includes previously published co-authored material that appeared at the 46th International Conference on Parallel Processing Lim, Norris, and Malony (2017). I was the primary contributor to this work in developing the algorithm, writing the new code, and writing the paper. Dr. Boyana Norris initially identified the need for this work and provided the application in which this work was performed. Dr. Allen Malony assisted in editing the paper.

Abstract

Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition or experience to make a good choice. Similarly, compilers can generate working code, but may miss tuning opportunities by not targeting GPU models or performing code transformations. Although empirical autotuning addresses some of these challenges, it requires extensive experimentation and search for optimal code variants. This research presents an approach for tuning CUDA kernels based on static analysis that considers fine-grained code structure and specific GPU architectural features. Notably, unlike most autotuning systems, our approach does not require any program runs in order to discover near-optimal parameter settings. We demonstrate the applicability of our approach in enabling code autotuners such as Orio to produce competitive code variants, comparable with those found by empirical methods, without the high cost of experiments.

Motivation

Heterogeneous computing poses several challenges to the application developer. Identifying which parts of an application are parallelizable on a SIMD accelerator and writing efficient data-parallel code are the most difficult tasks. For instance, CUDA programmers must set block and thread sizes for application kernels, but might not have the intuition to make a good choice. With NVIDIA GPUs, each streaming multiprocessor (SM) has a finite number of registers, limited shared memory, a maximum number of allowed active blocks, and a maximum number of allowed active threads. Variation in block and thread sizes results in different utilization of these hardware resources. A small block size may not provide enough warps for the scheduler to fully utilize the GPU, whereas a large block size may lead to more threads competing for registers and shared memory. Writing kernel functions requires setting block and thread parameters, and the difficulty is in deciding which settings will yield the best performance. One procedure entails testing the kernel with block sizes suggested by the CUDA Occupancy Calculator (OCC) CUDA Occupancy Calculator (2016). Although the OCC takes into account the compute capability (NVIDIA virtual architecture) when calculating block sizes and thread counts, inaccuracies may arise because variations in runtime behavior may not be considered, which can potentially result in suboptimal suggested hardware parameters.

How do variations in runtime behavior arise? Accelerator architectures specialize in executing SIMD in lock-step. When branches occur, threads that do not satisfy the branch condition are masked out. If the kernel programmer is unaware of the code structure or the hardware underneath, it will be difficult for them to make an effective decision about thread and block parameters.

CUDA developers face two main challenges, which we aim to alleviate with the approach described in this paper. First, developers must correctly select runtime parameters as discussed above. A developer or user may not have the expertise to decide on parameter settings that will deliver high performance. In this case, one can seek guidance from an optimization advisor. The advisor could consult a performance model based on static analysis of the kernel properties, or possibly use dynamic analysis to investigate alternative configurations. A second concern is that the kernel implementation itself may not yet be optimized. In this case, advice on parameter settings could still be insufficient, because what is really required is a transformation of the kernel code itself to improve performance. For both concerns, static and dynamic analysis techniques are applicable. However, to address the second concern, an autotuning framework based on code transformation is required.

This research presents our static analyzer, which can be used by developers, autotuners, and compilers for heterogeneous computing applications. Unlike most existing analysis techniques, our approach does not require any program runs to discover optimal parameter settings. The specific contributions described in this paper include the following.

– A static analyzer for CUDA programs.
– Predictive modeling based on static data.
– Example use cases of the new methodology in an autotuning context.

Figure 4. Branch divergence problem and performance loss incurred.

Background

This section briefly discusses the background for our research contributions, including the CUDA programming model, performance measurement approaches, and autotuning.

CUDA Programming Model and Control Flow Divergence. In CUDA kernels, threads are organized in groups called blocks, which consist of one or more warps (each of which has 32 threads). Each block is assigned to one of the GPU's streaming multiprocessors, and each SM is composed of multiple streaming processors, or multiprocessors (MP), that execute individual threads in SIMD. In a given execution cycle, an SM executes instructions from one of the thread block's warps, where threads within a warp are executed together. However, if threads within a warp take different paths on conditional branches, execution of those paths becomes serialized. In the worst case, only 1 of the 32 threads within a warp will make progress in a cycle. Figure 4 shows how performance is affected when branches diverge. Measuring the occupancy of a kernel execution can determine whether branch divergence exists and suggest parameter adjustments to the program, a subject of this current work.

GPU Performance Tools. To date, GPU performance tools have mainly focused on the measurement and analysis of kernel execution, reporting time and counters associated with kernel execution. For instance, the TAU Performance System provides scalable profile and trace measurement and analysis for high-performance parallel applications Shende and Malony (2006), including support for CUDA and OpenCL codes Malony et al. (2011). Even though profile measurements can help answer certain types of questions (e.g., how long did foo() take?), improving performance requires more detailed information about the program structure. While TAU and other profiling tools provide performance measurement Adhianto et al. (2010); ddt (2016); nvprof (2016), they do not shed much light on divergent branch behavior and its effects on making good decisions about thread and block sizes. Our work introduces several static analysis techniques that deliver fine-grained information that can be used for predictive modeling. These techniques include the ability to analyze instruction mixes and occupancy for estimating thread and register settings. In a complementary approach (not discussed in this paper), we have also developed dynamic analysis techniques to compute instruction execution frequencies and control flow information Lim, Norris, and Malony (2016). In the remainder of this section, we discuss how we model different performance-relevant metrics by using primarily static analysis of CUDA binaries.

Autotuning. By themselves, performance models can produce adequate predictions of parameter settings, but cannot change the kernel to improve performance. Autotuning systems have been important in exploring alternative parameter choices by providing a kernel experimentation and optimization framework.
For example, the open-source Orio autotuning framework Hartono, Norris, and Sadayappan (2009) generates many code variants for each kernel computation. The objective of the GPU portions of Orio is to accelerate loops Chaimov, Norris, and Malony (2014); Mametjanov, Lowell, C.C. Ma, and Norris (2012), since loops consume a large portion of program execution time. We use the term kernels to refer to deeply nested loops that arise frequently in a number of scientific application codes. Existing C loops are annotated with transformation and tuning specifications. Transformations are parameterized with respect to various performance constraints, such as block sizes, thread counts, preferred L1 sizes and loop unroll factors. Each kernel specification generates a family of variant translations for each parameter, and each variant is measured for its overall execution time, with the fastest chosen as the top-performing autotuned translation. The main challenge in the optimization space search is the costly empirical measurement of many code variants in autotuning systems. The main contribution of our work is to demonstrate the use of static predictive models in autotuning, reducing the need for experimental performance testing.

Figure 5. Optimization framework for GPU kernels incorporating static and dynamic analysis, with autotuning and code transformation. (Legend: IM = instruction mix, OC = occupancy, CF = control flow from static analysis; IC = instruction count, BF = branch frequency, MR = memory distance from dynamic analysis; both feed performance models that predict performance and suggest thread, block and register parameters for code variants.)

Methodology

Figure 5 is a high-level depiction of our framework, which illustrates not only the different processes involved, but also the analysis support and tradeoffs inherent in them. For instance, providing a user with runtime parameters for kernel launch could engage static and/or dynamic analysis, but not necessarily code transformation. Dynamic analysis would be expected to be more costly because experiments would be involved. Transforming the implementation allows new variants to be explored, but these could be analyzed either statically or dynamically, or both. However, it is in the integration of these models with an autotuning system that can transform the kernel code where the greatest power for delivering optimizations is found.

Static Analysis

Our static analysis approach consists of the following steps:
1. Extract kernel compilation information with nvcc's --ptxas-options=-v flag.
2. Disassemble the CUDA binary with nvdisasm for the instruction operations executed.

The subsequent sections define metrics resulting from our static analysis approach, including occupancy and instruction mixes. These metrics are then used to significantly reduce or even eliminate the empirical tests in autotuning several kernels.
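Step 1 above can be scripted. The sketch below shows one way to collect per-kernel register and shared-memory usage from nvcc's verbose ptxas output; it is illustrative only — the exact log format varies across CUDA toolkit versions, and the file name, target architecture, and regular expressions are assumptions rather than part of our tool.

    import re
    import subprocess

    def ptxas_resource_usage(cu_file, arch="sm_60"):
        # Compile with verbose ptxas output and parse per-kernel resource usage.
        proc = subprocess.run(
            ["nvcc", "-arch=" + arch, "--ptxas-options=-v", "-c", cu_file, "-o", "/dev/null"],
            capture_output=True, text=True)
        usage, kernel = {}, None
        for line in proc.stderr.splitlines():
            entry = re.search(r"Compiling entry function '([^']+)'", line)
            if entry:
                kernel = entry.group(1)
            res = re.search(r"Used (\d+) registers(?:, (\d+) bytes smem)?", line)
            if res and kernel:
                usage[kernel] = {"registers": int(res.group(1)),
                                 "smem_bytes": int(res.group(2) or 0)}
        return usage

    # usage = ptxas_resource_usage("kernel.cu")
    # e.g. {"_Z9my_kernelPfS_i": {"registers": 32, "smem_bytes": 4096}}

The register and shared-memory counts gathered this way feed the occupancy calculation described in the next subsection.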
Occupancy. Threads, registers and shared memory are factors that influence a CUDA kernel's ability to achieve high occupancy. In this section, we group threads, warps, and blocks into one category to simplify the discussion, although each term has its own distinct meaning. Threads (T) are the work units performing the computation, whereas warps (W) are the schedulable units for the streaming multiprocessor, and blocks (B) consist of groups of warps. Each has memory local to its level. For instance, threads access private registers (R), warps and blocks use shared memory (S), and grids utilize global memory. The following subsections define the factors that contribute to a kernel's GPU occupancy. Table 16 lists the GPUs used in this research, along with hardware features and associated notation. We adopt the naming convention where superscripts denote the source of the variable, with subscripts as constraints of the variable. Compute capability (cc) represents the GPU architecture family (also listed in Tab. 16), meaning nvcc will target an architecture based on the assigned compute capability flag (e.g. -arch=sm_xx). User input (u) includes the thread, register and shared memory parameters given at compile time. Active (*) represents the results provided by our static analyzer tool. Occupancy is the metric we are calculating and is defined in the next subsections.

Occupancy Calculation. The objective of the occupancy calculation is to minimize the number of active thread blocks per multiprocessor constrained by hardware resource ψ:

B*_mp = min { G_ψ(u) },        (3.1)

where G(·) calculates the maximum allocable blocks for each SM, and ψ = {ψ_W, ψ_R, ψ_S} denotes warps, registers, and shared memory. Each G_ψ will be defined in Eqs. 3.3, 3.4, and 3.5.

Definition of Occupancy. Occupancy is defined as the ratio of active warps on an SM to the maximum number of active warps supported for each SM:

occ_mp = W*_mp / W^cc_mp,        (3.2)

where W*_mp = B*_mp × W_B, with B*_mp as defined in Eq. 3.1 and W_B = 32 for all GPUs (Tab. 16). Note that in an ideal world, occ_mp = 1. However, in practice, occupancy rates are on average at 65-75%, and should not be used in isolation for setting CUDA parameters Volkov (2010). Occupancy is one of several metrics we incorporated in our static analyzer.

Theoretical Occupancy. The number of blocks which can execute concurrently on an SM is limited by either warps, registers, or shared memory.

Warps per SM. The SM has a maximum number of warps that can be active at once. To calculate the maximum number of blocks constrained by warps, G_ψ_W, take the minimum of the blocks supported per multiprocessor and the ratio of warps per SM to warps per block:

G_ψ_W(T^u) = min { B^cc_mp, ⌊ W_sm / W_B ⌋ },        (3.3)

where W_sm = W^cc_mp and W_B = ⌈ T^u / T^cc_W ⌉, with variables as defined in Table 16.

Registers per SM. The SM has a set of registers shared by all active threads. Deciding whether registers limit occupancy, G_ψ_R, is described by the following cases:

G_ψ_R(R^u) =
  0,                                          if R^u > R^cc_W,
  ⌈ R_sm / R_B ⌉ × ⌊ R^cc_fs / R^cc_B ⌋,      if R^u > 0,        (3.4)
  B^cc_mp,                                    otherwise,

where R_sm = ⌈ R^cc_B / (R^u × T^cc_W) ⌉ and R_B = ⌈ T^u / T^cc_W ⌉. Case 1 represents when the user declares a register value beyond the maximum allowable per thread that is supported for the cc, an illegal operation. Case 2 describes when the user provides a valid register value, where we take the product of the number of registers per SM supported over the number of registers per block and the register file size per MP over the maximum register block supported in this architecture. Case 3 is when the user does not provide a value, where the value is set to the thread blocks per multiprocessor supported by the cc.

Shared memory per SM. Shared memory per thread is defined as the sum of static shared memory, the total size needed for all shared variables, and dynamic shared memory. If active blocks are constrained by shared memory, reducing S per T could increase occupancy. To compute G_ψ_S, take the ceiling of the shared memory per multiprocessor provided by its compute capability over the shared memory per block:

G_ψ_S(S^u) =
  0,                       if S^u > S^cc_B,
  ⌈ S^cc_mp / S_B ⌉,       if S^u > 0,        (3.5)
  B^cc_mp,                 otherwise,

where the shared memory per block is S_B = ⌊ S^u ⌋ and the shared memory per SM is S_sm = S^cc_B, with cases following similarly to Eq. 3.4.
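A simplified version of Eqs. 3.1-3.2, in the spirit of the CUDA Occupancy Calculator, is sketched below. The per-SM limits are illustrative placeholders (real values come from the compute capability, as in Table 16), and the register and shared-memory cases are simplified: actual devices allocate these resources in fixed-size granules, which this sketch ignores.

    import math

    # Illustrative per-SM hardware limits for one hypothetical compute capability.
    LIMITS = dict(max_warps=64, max_blocks=32, regs_per_sm=65536,
                  smem_per_sm=65536, warp_size=32)

    def active_blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block, hw=LIMITS):
        warps_per_block = math.ceil(threads_per_block / hw["warp_size"])
        by_warps = hw["max_warps"] // warps_per_block                     # warp-limited blocks
        by_regs = (hw["regs_per_sm"] // (regs_per_thread * warps_per_block * hw["warp_size"])
                   if regs_per_thread else hw["max_blocks"])              # register-limited (simplified)
        by_smem = (hw["smem_per_sm"] // smem_per_block
                   if smem_per_block else hw["max_blocks"])               # shared-memory-limited
        return min(hw["max_blocks"], by_warps, by_regs, by_smem)          # Eq. (3.1)

    def occupancy(threads_per_block, regs_per_thread, smem_per_block, hw=LIMITS):
        warps_per_block = math.ceil(threads_per_block / hw["warp_size"])
        blocks = active_blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block, hw)
        return blocks * warps_per_block / hw["max_warps"]                 # Eq. (3.2)

    # occupancy(256, 32, 4096) -> 1.0    occupancy(256, 64, 0) -> 0.5

As noted above, a computed occupancy of 1.0 does not by itself guarantee the best-performing thread and block configuration.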
Instruction Mix Metrics. Instruction mix is defined as the number of specific operations that a processor executes. Instruction mix-based characterizations have been used in a variety of contexts, including the selection of loop unrolling factors Monsifrot, Bodin, and Quiniou (2002); Stephenson and Amarasinghe (2005); unlike hardware counters, they are not prone to miscounting events Lim, Carrillo-Cisneros, Alkowaileet, and Scherson (2014). In this work, we use instruction mixes to characterize whether a kernel is memory-bound, compute-bound, or relatively balanced. Refer to Lim, Malony, Norris, and Chaimov (2015) for definitions of O_fl, O_mem, O_ctrl, and O_reg according to category type.

The intensity (magnitude) of a particular metric can suggest optimal block and thread sizes for a kernel. Memory-intensive kernels require a high number of registers, and a larger block size provides more registers per block. The tradeoff with big block sizes is that fewer blocks can be scheduled on the SM. Small block sizes will constrain the number of blocks running on the SM by the physical limit of blocks allowed per SM. Compute-intensive kernels perform well with larger block sizes because the threads will be using GPU cores with fewer memory latencies. Small block sizes will result in many active blocks running on the SM in a time-shared manner, where unnecessary switching of blocks may degrade performance. For control-related synchronization barriers, smaller block sizes are preferred because many active blocks can run simultaneously on the SM to hide memory latency.

Table 3. Instruction throughput per number of cycles.

Op                                    Category  SM20  SM35  SM52  SM60
FPIns32                               FLOPS       32   192   128    64
FPIns64                               FLOPS       16    64     4    32
CompMinMax                            FLOPS       32   160    64    32
Shift, Extract, Shuffle, SumAbsDiff   FLOPS       16    32    64    32
Conv64                                FLOPS       16     8     4    16
Conv32                                FLOPS       16   128    32    16
LogSinCos                             FLOPS        4    32    32    16
IntAdd32                              FLOPS       32   160    64    32
TexIns, LdStIns, SurfIns              MEM         16    32    64    16
PredIns, CtrlIns                      CTRL        16    32    64    16
MoveIns                               CTRL        32    32    32    32
Regs                                  REG         16    32    32    16

Pipeline Utilization. Each streaming multiprocessor (SM) consists of numerous hardware units that are specialized in performing a specific task. At the chip level, those units provide execution pipelines to which the warp schedulers dispatch instructions. For example, texture units provide the ability to execute texture fetches and perform texture filtering, whereas load/store units fetch and save data to memory. Understanding the utilization of pipelines and their relation to peak performance on target devices helps identify performance bottlenecks in terms of oversubscription of pipelines based on instruction type. The NVIDIA Kepler GK100 report objdump (2012) lists instruction operations and corresponding pipeline throughputs per cycle. Pipeline utilization describes the observed utilization of each pipeline at runtime. High pipeline utilization would indicate that the corresponding compute resources were used heavily and kept busy often during the execution of the kernel.

/*@ begin PerfTuning (
  def performance_params {
    param TC[] = range(32,1025,32);
    param BC[] = range(24,193,24);
    param UIF[] = range(1,6);
    param PL[] = [16,48];
    param SC[] = range(1,6);
    param CFLAGS[] = ['', '-use_fast_math'];
  }
  ...
) @*/

Figure 6. Performance autotuning specification.

Infer Kernel Execution Time. Because the majority of CUDA applications are accelerated loops, we hypothesize that the execution time of a CUDA program is proportional to the input problem size N.
Hence,

    f(N) = c_f \cdot O_{fl} + c_m \cdot O_{mem} + c_b \cdot O_{ctrl} + c_r \cdot O_{reg},    (3.6)

where c_f, c_m, c_b, and c_r are coefficients that represent the reciprocal of the number of instructions that can execute in a cycle (cycles per instruction, or CPI). Equation 3.6 represents how a program will perform for input size N without running the application.

Table 4. A subset of features used for thread block classification.

Feature                 Size
Thread Count            32 - 1024 (with 32 increments)
Block size¹             24 - 192 (with 16 increments)
Unroll loop factor      {1 - 6}
Compiler flags          {'', '-use_fast_math'}
Instructions            {FLOPS, memory, control}
Occupancy calculation   {registers, threads, OCC rate, etc.}
¹Block sizes are compute capability specific.

ORIO Code Generation.

OCC Results

This section reports on the autotuning execution environment for the CUDA kernels listed in Table 5. Results comparing our static analyzer approach with the existing methods are also reported.

Environment. The open-source Orio autotuner was used to generate and autotune CUDA implementations by varying the feature space listed in Table 4. The details of CUDA code generation and autotuning with Orio are described in Mametjanov et al. (2012). The TC parameter specifies the number of simultaneously executing threads. BC is the number of thread blocks (independently executing groups of threads that can be scheduled in any order across any number of SMs) and is hardware-specific. UIF specifies how many times loops should be unrolled; PL is the L1 cache size preference in KB. For each code variant, the experiment was repeated ten times, and the fifth overall trial time was selected to be displayed. The execution times were sorted in ascending order and the ranks were split along the 50th percentile. Rank 1 represents the upper half of the 50th percentile (good performers), while Rank 2 represents the lower portion (poor performers). On average, the combination of parameter settings generated 5,120 code variants.

The GPUs used in this work are listed in Table 16 and include the Fermi M2050, Kepler K20, Maxwell M40, and Pascal P100. Subsequently we will refer to the GPUs by the architecture family name (Fermi, Kepler, Maxwell, Pascal). CUDA nvcc v7.0.28 was used as the compiler. Each of the benchmarks executed with five different input sizes, where all benchmarks consisted of inputs {32, 64, 128, 256, 512}, except ex14FJ, which had inputs {8, 16, 32, 64, 128}.

Figure 7. Thread counts for Orio autotuning exhaustive search, comparing architectures and kernels.

To demonstrate our approach, we considered the kernels described in Table 5. Because the chosen kernels (except ex14FJ, which is application-specific) contribute significantly to the overall execution time of many different applications, tuning these kernels can result in significant overall application performance improvements.
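Before turning to the kernels themselves, the static cost model of Eq. 3.6 can be illustrated with a short sketch. The per-cycle throughputs below are taken from the SM35 column of Table 3, so the CPI coefficients are their reciprocals; the per-variant instruction counts are hypothetical and stand in for statically extracted mixes.

    # Per-cycle instruction throughputs (SM35 column of Table 3); the Eq. 3.6
    # coefficients c_f, c_m, c_b, c_r are their reciprocals (CPI).
    THROUGHPUT = {"flops": 192.0, "mem": 32.0, "ctrl": 32.0, "reg": 32.0}

    def predicted_cost(mix):
        """Eq. 3.6: f(N) = c_f*O_fl + c_m*O_mem + c_b*O_ctrl + c_r*O_reg, where the
        O_* values are statically extracted instruction counts for input size N."""
        return sum(mix[k] / THROUGHPUT[k] for k in THROUGHPUT)

    # Two hypothetical code variants of the same kernel; the cheaper estimate is
    # preferred without executing either variant.
    variant_a = {"flops": 1.2e6, "mem": 4.0e5, "ctrl": 8.0e4, "reg": 6.0e5}
    variant_b = {"flops": 1.2e6, "mem": 9.0e5, "ctrl": 8.0e4, "reg": 6.0e5}
    print(predicted_cost(variant_a) < predicted_cost(variant_b))   # True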
Table 5. Kernel specifications.

Kernel     Category                    Description                                              Operations
atax       Elementary linear algebra   Matrix transpose, vector multiplication                  y = A^T(Ax)
BiCG       Linear solvers              Matrix transpose, subkernel of BiCGStab linear solver    q = Ap, s = A^T r
ex14FJ     3-D Jacobi computation      Stencil code kernels                                     F(x) = A(x)x - b = 0, A(u)v ≃ -∇·(κ(u)∇v)
MatVec2D   Elementary linear algebra   Matrix vector multiplication                             y = Ax

Discussion

We empirically autotuned the kernels listed in Table 5 using exhaustive search and uncovered distinct ranges for block and thread sizes, based on ranking. The dynamic analysis of autotuning is displayed in Figure 7, projecting thread settings and frequency for each kernel and comparing the various architectures.

Figure 8. Time-to-instruction mix ratio, comparing architectures and kernels.

Table 6. Statistics for autotuned kernels for top performers (top half) and poor performers (bottom half), comparing GPU architecture generations.

                      Occupancy                    Register Instructions             Threads
Kernel      Arch    Mean    Std Dev   Mode       Mean         Std Dev    Allocated   25th   50th   75th
ATAX        Fer     77.46   24.18     100.00     39613.1      35673.2    21          152    272    416
            Kep     85.21   19.03     93.75      34833.1      30954.5    27          160    288    416
            Max     90.59   11.87     93.75      104285.9     85207.1    30          160    320    448
            Pas     90.86   12.24     93.75      227899.7     202120.2   30          152    272    392
BiCG        Fer     60.55   15.54     75.00      35321.3      32136.6    27          160    288    416
            Kep     85.14   19.05     98.44      35485.7      31535.9    28          160    288    416
            Max     89.09   11.50     98.44      158963.8     135681.2   32          224    448    736
            Pas     90.93   12.19     93.75      228350.6     201865.8   30          152    272    392
ex14FJ      Fer     53.69   8.83      62.50      98418.5      45166.64   30          608    768    896
            Kep     88.44   9.98      93.75      54345.4      47526.8    31          288    512    768
            Max     89.23   9.61      98.44      4141130.6    158537.4   28          320    608    832
            Pas     89.04   11.10     98.44      4335986.6    409162.6   32          192    480    768
matVec2D    Fer     72.21   14.17     87.50      307425.50    69330.06   23          448    640    864
            Kep     89.29   8.17      96.88      274359.93    65373.88   23          416    640    864
            Max     89.53   9.22      98.43      693752.81    146799.80  18          288    576    800
            Pas     88.42   9.08      90.63      1264815.81   316252.38  18          480    672    864
ATAX        Fer     74.23   15.98     100.00     102946.9     58009.0    21          640    768    896
            Kep     86.27   10.97     93.75      89906.9      51102.5    27          640    768    896
            Max     87.04   10.09     87.50      253714.1     151973.5   30          608    736    896
            Pas     86.77   9.54      87.50      605300.3     337615.5   30          640    768    896
BiCG        Fer     56.12   10.73     66.67      35321.3      32136.6    27          608    768    896
            Kep     86.34   10.93     93.75      89254.3      51141.2    28          608    768    896
            Max     88.55   10.80     93.75      199036.3     149373.1   32          352    608    832
            Pas     86.70   9.57      87.5       605169.4     338092.4   30          640    768    896
ex14FJ      Fer     55.55   14.03     62.50      26321.5      21137.2    30          152    288    448
            Kep     83.05   19.21     93.75      70394.6      51953.5    31          256    544    800
            Max     88.40   12.50     93.75      4079589.4    120401.0   28          224    480    704
            Pas     88.59   11.22     93.75      4359934.4    241618.2   32          352    544    800
matVec2D    Fer     68.93   21.93     87.50      210334.50    47850.90   23          160    352    672
            Kep     82.19   19.54     93.75      219636.56    57185.33   23          160    384    704
            Max     88.09   12.77     93.75      645687.18    137182.93  18          224    480    736
            Pas     89.22   12.89     93.75      877505.0     225900.05  18          160    320    576
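The summary statistics in Table 6 can be reproduced from the raw autotuning results with a short aggregation along the lines of the sketch below. The log file and column names are hypothetical; each row is assumed to describe one autotuned code variant with its rank, occupancy, register-instruction count, and thread count.

    import pandas as pd

    df = pd.read_csv("autotune_results.csv")   # assumed log of the exhaustive runs

    summary = (df.groupby(["kernel", "arch", "rank"])
                 .agg(occ_mean=("occupancy", "mean"),
                      occ_std=("occupancy", "std"),
                      occ_mode=("occupancy", lambda s: s.mode().iat[0]),
                      reg_mean=("reg_instructions", "mean"),
                      reg_std=("reg_instructions", "std"),
                      threads_25th=("threads", lambda s: s.quantile(0.25)),
                      threads_50th=("threads", "median"),
                      threads_75th=("threads", lambda s: s.quantile(0.75))))
    print(summary)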
Table 7. Error rates when estimating dynamic instruction mixes from static mixes.

Kernel      Arch   FLOPS   MEM    CTRL   Itns
ATAX        Fer    0.07    1.69   2.01    3.4
            Kep    0.11    1.75   2.20    3.4
            Max    0.23    0.06   0.12    1.8
BiCG        Fer    0.03    3.68   2.40    1.8
            Kep    0.02    3.80   2.67    1.8
            Max    0.57    1.30   0.06    1.3
ex14FJ      Fer    0.20    0.14   0.00   12.7
            Kep    1.01    0.18   0.21   12.7
            Max    1.97    0.14   0.89   16.3
matVec2D    Fer    0.04    0.92   0.80    4.6
            Kep    0.07    0.97   0.99    4.6
            Max    0.29    0.06   0.36    7.2

In general, the ATAX and BiCG kernels performed well in lower thread-range settings, whereas matVec2D performed better with higher thread settings. The ex14FJ kernel is more complex², and thread behavior patterns for Rank 1 were less apparent.

Table 6 reports statistics on occupancy, registers, and threads for all benchmarks and architectures. The top half represents good performers (Rank 1), whereas the bottom half represents poor performers (Rank 2). In general, occupancy did not seem to matter much, since the reported means were somewhat similar for both ranks, with Fermi achieving low occupancy for all kernels. However, register instructions varied considerably, with Rank 1 exhibiting lower means and standard deviations than Rank 2. Thread behavior patterns were apparent when comparing Rank 1 and Rank 2. For instance, one could conclude that ATAX and BiCG prefer smaller thread ranges, whereas ex14FJ prefers higher ranges.

²The ex14FJ kernel is the Jacobian computation for a solid fuel ignition simulation in a 3-D rectangular domain.

Figure 8 illustrates the use of static instruction mixes to predict execution time. Execution time was normalized and sorted in ascending order (x-axis). The mean absolute error was used to estimate execution time based on static instruction mixes. Equation 3.6 was used to calculate the instruction mix ratio, which consisted of weighting instructions according to the number of achievable executed instructions per clock cycle. In general, our model was able to estimate the execution time within a reasonable margin of error, including ex14FJ with MAE near 1.00, which validates instruction mixes as good indicators of kernel execution performance.

Table 7 reports the error rates, calculated using the sum of squares, when estimating the dynamic behavior of the kernel from static analysis of the instruction mix. Intensity is also displayed in the last column and is defined as the ratio of floating-point operations to memory operations. Although our static estimator performed poorly for BiCG (memory, control ops), our static analysis, driven by Equation 3.6, closely matches the observed dynamic behavior for the other kernels.

Figure 9. Improved search time over exhaustive autotuning, comparing static and rule-based approaches.

Improved Autotuning with Static Analyzer. Finally, we wanted to determine whether our static analyzer tool could be used to improve the efficiency and effectiveness of Orio. We use the exhaustive empirical autotuning results from Sec. III as the baseline for validating whether our search approach could find the optimal solution.
Table 8. Suggested parameters to achieve theoretical occupancy.

Kernel  Arch  T*                         [Ru : R*]   S*      occ*
AT      Fer   192, 256, 384, 512, 768    [21 : 0]    6144    1
        Kep   128, 256, 512, 1024        [27 : 5]    3072    1
        Max   64, 128, 256, 512, 1024    [30 : 2]    1536    1
        Pas   64, 128, 256, 512, 1024    [30 : 2]    1536    1
Bi      Fer   192, 256, 384, 512, 768    [27 : 0]    8192    .75
        Kep   128, 256, 512, 1024        [28 : 4]    3072    1
        Max   64, 128, 256, 512, 1024    [32 : 0]    12288   .71
        Pas   64, 128, 256, 512, 1024    [30 : 2]    1536    1
ex      Fer   192, 256, 384, 512, 768    [30 : 0]    24576   .71
        Kep   128, 256, 512, 1024        [31 : 1]    3072    1
        Max   64, 128, 256, 512, 1024    [28 : 4]    1536    1
        Pas   64, 128, 256, 512, 1024    [32 : 0]    1536    1
ma      Fer   192, 256, 384, 512, 768    [20 : 1]    12288   .92
        Kep   128, 256, 512, 1024        [20 : 11]   3072    1
        Max   64, 128, 256, 512, 1024    [13 : 18]   1536    1
        Pas   64, 128, 256, 512, 1024    [15 : 17]   1536    1

Table 8 reports static information for register usage and intensity for each kernel, as well as the thread parameters suggested by our static analyzer, comparing different architectures. T* displays the suggested thread ranges for the kernel that would yield occ*. [Ru : R*] displays the number of registers used and its increase potential. S* displays (in KB) the amount of shared memory that could be increased to achieve theoretical occupancy.

The basis of our contribution is that the instruction mix and occupancy metrics from our static analyzer get fed into the autotuner. In general, exhaustive autotuning consists of \prod_{i=1}^{m} X_i trials, where X_i is the number of options for parameter i and m is the number of parameters. In the case of ATAX, five thread settings were suggested for Fermi and Maxwell, which represents an 84% improvement, and four for Kepler, representing an 87.5% improvement, with the search space reduced from 5,120 to 640. The search space could be reduced further by invoking our rule-based heuristic. Figure 9 displays the overall results of the improved search module. The first set displays how the static-based method improves near 87.5%. When combined with the rule-based heuristic, the search space is further reduced, which results in a 93.8% overall improvement. Figure 10 displays the occupancy calculator for the ATAX kernel, comparing the current kernel and the potentially optimized version.

The model-based search space reduction does involve generating and compiling the code versions, but it does not require executing them. Note that empirical testing typically involves multiple repeated executions of the same code version; hence the time saved over exhaustive search is approximately t × r, where t is the average trial time and r is the number of repetitions. Even when not using exhaustive search, our new technique can be used as the first stage of the regular empirical-based autotuning process to dramatically reduce the search space, significantly speeding up the entire process and increasing the likelihood of finding a global optimum. Unlike runtime measurement, which requires several runs of each test, static analysis does not suffer from the effects of noise and hence only has to be performed once on each code version. The search space reduced through static binary analysis can then be explored using one of the existing search methods. If it is feasible and desirable to determine the optimal value, then exhaustive search is appropriate; otherwise, one of the other methods, such as Nelder-Mead simplex or random search, can be used to strictly control the time spent autotuning.
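A small arithmetic sketch of this search-space reduction follows. The parameter cardinalities are read off Table 4 and Figure 6; the particular decomposition (thread counts, block counts, unroll factors, L1 preference, compiler flags) is an assumption that is consistent with the 5,120-variant figure quoted above.

    from math import prod

    space = {"TC": 32, "BC": 8, "UIF": 5, "PL": 2, "CFLAGS": 2}
    exhaustive = prod(space.values())               # 5,120 trials

    # Restricting TC to the four thread settings suggested for Kepler in Table 8:
    reduced = prod({**space, "TC": 4}.values())     # 640 trials
    print(exhaustive, reduced, f"{1 - reduced / exhaustive:.1%}")   # 5120 640 87.5%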
Related Work

Several prior efforts have attempted to discover optimal code forms and runtime parameter settings for accelerator-based programming models, typically by taking a domain-specific approach. For instance, Nukada and Matsuoka demonstrated automated tuning for a CUDA-based 3-D FFT library based on selection of the optimal number of threads Nukada and Matsuoka (2015). Tomov et al. developed the MAGMA system for dense linear algebra solvers on GPU architectures, which incorporates a DAG representation and an empirical-based search process for modeling and optimization Tomov, Nath, Ltaief, and Dongarra (2010). Autotuning systems based on program transformations, such as Orio Hartono et al. (2009) and CHiLL CHiLL: A Framework for Composing High-Level Loop Transformations (2008), enable optimization exploration on more general application code and across accelerator architectures Chaimov et al. (2014). However, the complexity of the optimization space and the cost of empirical search are high. A recent work on autotuning GPU kernels focuses on loop scheduling and is based on the OpenUH compiler Xu, Chandrasekaran, Tian, and Chapman (2016). Our approach attempts to leverage more static code analysis to better inform an autotuning process, thereby reducing the dependence on pure dynamic measurement and analysis to generate performance guidance.

Figure 10. Occupancy calculator displaying thread, register and shared memory impact for current (top) and potential (bottom) thread optimizations for the purposes of increasing occupancy.

The NVIDIA CUDA Toolkit NVIDIA (n.d.) includes occupancy calculation functions in the runtime API that return occupancy estimates for a given kernel. In addition, there are occupancy-based launch configuration functions that can advise on grid and block sizes that are expected to achieve the estimated maximum potential occupancy for the kernel. Because these functions take as input the intended per-block dynamic shared memory usage and maximum block size (in addition to knowing user-defined registers per thread), it is possible to retrieve a set of configuration choices. It is important to note that the CUDA Occupancy Calculator/API takes into account the GPU architecture being used. Thus, we can integrate the estimates it generates over the full range of options (e.g., letting registers per thread be variable) with the other static models.

A project closely related to ours is STATuner R. Gupta et al. (2015), which identifies a feature set of static metrics that characterize a CUDA kernel code and uses machine learning to build a classifier model trained on a CUDA benchmark suite. Kernel codes are compiled in LLVM, and static analysis of the generated binary code and IR provides metrics for instruction mix, loops, register usage, shared memory per block, and thread synchronization. The classifier model inputs these metric features for a new kernel to predict which block size would give the best performance. STATuner is shown to give smaller average error compared to NVIDIA's CUDA Occupancy Calculator/API. Only a single block size is predicted by STATuner, whereas the Occupancy Calculator/API offers block size choices given user input about registers per thread and per-block shared memory. Our approach differs in several respects. First, static analysis is done on the PTX code generated by the NVIDIA nvcc compiler, rather than on the upper-level source code (as seen in LLVM).
While there are some benefits in incorporating higher-level code information, nvcc produces different PTX code for different GPU architectures, allowing hardware-specific code effects to be seen. Furthermore, our static analysis extracts metrics similar to STATuner's, but also builds a CFG to help understand flow divergence Lim et al. (2016). Second, our prediction models are based on estimating performance given the instruction mix, control flow, and problem size. They are not based on learned classifiers. Third, the objective of our work is to integrate predictive models in an autotuning framework, beyond just giving a single block size result to the user.

Milepost GCC Fursin (2011) is a publicly available open-source machine learning-based compiler for C (but not CUDA) that extracts program features and exchanges optimization data with the cTuning.org open public repository. It automatically adapts the internal optimization heuristic at function-level granularity to improve execution time, code size, and compilation time of a new program on a given architecture. The Oxbow toolkit Sreepathi et al. (2014) is a collection of tools to empirically characterize (primarily CPU) application behaviors, including computation, communication, memory capacity, and access patterns. The eventual goal is to build a repository to which users can upload their datasets and from which they can obtain analysis, plots, suggested parameters, and so on.

Discussion

Getting the most performance out of applications is important for code generators and end users, but the process of determining the best settings is often convoluted (for humans) and time-consuming (for empirical autotuners). With our static analyzer tool, we show its accuracy in estimating the runtime behavior of a kernel without the high costs of running experiments. Using our tool, we have identified the computational intensity of a kernel, constructed a control flow graph, estimated the occupancy of the multiprocessors, and suggested optimizations in terms of threads and register usage. Finally, we have shown how the integration of our static analyzer in the Orio autotuning framework improved performance by narrowing the search space for exploring parameter settings.

The field of heterogeneous accelerated computing is rapidly changing, and we expect several disruptions to take place with the introduction of 3D-memory subsystems, point-to-point communication, and more registers per computational core. Traditional approaches to measuring performance may no longer be sufficient to understand the behavior of the underlying system. Our static analyzer approach can facilitate optimizations in a variety of contexts through the automatic discovery of parameter settings that improve performance.

Future Work

The optimization spectrum is a continuum from purely static-based methods to ones that incorporate empirical search across an optimization landscape. In general, the objective of our work is to explore the tradeoffs involving optimization accuracy and cost over this spectrum, with a specific focus on how well purely static methods perform as a guide for autotuning. While static analysis side-steps the need for empirical testing, this is not to say that static models cannot be informed by prior benchmarking and knowledge discovery. We will investigate several avenues for enhancing our static models, including algorithm-specific optimizations and machine learning for code classification.
Furthermore, we regard the methodology we have developed as a knowledge discovery framework where the degree of empirical testing can be "dialed in" during the autotuning process, depending on what the user accepts. By recording the decisions and code variants at each step, it is also possible to replay tuning with empirical testing for purposes of validation. In this way, the framework can continually evaluate the static models and refine their predictive power. We will further develop this capability.

While our static analysis tools will work with any CUDA kernel code, the real power of our approach is in the ability to transform the code in Orio. However, this requires the source to be in a particular input form. We are exploring source analysis technology de Oliveira Castro, Akel, Petit, Popov, and Jalby (2015) to translate kernel code to the input required by Orio, thereby allowing any kernel to be a candidate for CUDA autotuning.

Conclusion

This chapter defined the metrics necessary for optimizing the performance of GPU kernels. Specifically, threads, registers and shared memory, as well as architectural factors, were included in the metrics definition. This research revealed that certain computation patterns, whether memory, compute or control bound, have an influence on the parameter settings of a CUDA application. A static model was proposed, based on the instruction mixes, that was able to predict the performance of an execution kernel with a mean absolute error near 1.00. The next chapter builds on these approaches and defines a similarity measure for matching control flow graphs.

CHAPTER IV

CONTROL FLOW SUBGRAPH MATCHING

This chapter includes previously published co-authored material from an NVIDIA GPU Technology Conference poster Lim et al. (2016) and a workshop paper at the 31st International Workshop on Languages and Compilers for Parallel Computing Lim, Norris, and Malony (2019). I was the primary contributor to this work in developing the algorithm, writing the new code, and writing the paper. Dr. Boyana Norris initially identified the need for this work and provided the application in which this work was performed. Dr. Allen Malony assisted in editing the paper.

Abstract

Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically measures instruction and basic block frequencies. CUDAflow captures this information in a control flow graph (CFG) and performs subgraph matching across various kernels' CFGs to gain insights into an application's resource requirements, based on the shape and traversal of the graph, instruction operations executed and registers allocated, among other information. The utility of CUDAflow is demonstrated with SHOC and Rodinia application case studies on a variety of GPU architectures, revealing novel control flow characteristics that facilitate end users, autotuners, and compilers in generating high-performing code.

Motivation

Structured programming consists of base constructs that represent how programs are written Böhm and Jacopini (1966); Williams and Ossher (1978).
When optimizing programs, compilers typically operate on the intermediate representation (IR) of a control flow graph (CFG), which is derived from program source code analysis and represents basic blocks of instructions (nodes) and control flow paths (edges) in the graph. Thus, the overall program structure is captured in the CFG, and the IR abstracts machine-specific intrinsics that the compiler ultimately translates to machine code. The IR/CFG allows the compiler to reason more efficiently about optimization opportunities and apply transformations. In particular, compilers can benefit from prior knowledge of optimizations that may be effective for specific CFG structures.

In the case of accelerated architectures that are programmed for SIMD parallelism, control divergence encountered by threads of execution presents a major challenge for applications because it can severely reduce SIMD computational efficiency. It stands to reason that by identifying the structural patterns of a CFG from an accelerator (SIMD) program, insight on the branch divergence problem Sabne, Sakdhnagool, and Eigenmann (2016) might be gained to help in its optimization. Current profiling approaches to understanding thread divergence behavior (e.g., ddt (2016); nvprof (2016); Shende and Malony (2006)) do not map performance information to critical execution paths in the CFG. While accelerator devices (e.g., GPUs) offer hardware performance counters for measuring computational performance, it is more difficult to apply them to capture divergence behavior Lim et al. (2015).

Our research focuses on improving the detail and accuracy of control flow graph information in accelerator (GPU) programs. We study the extent to which CFG data can provide sufficient context for understanding a GPU kernel's execution performance. Furthermore, we want to investigate how effective knowledge of CFG shapes (patterns) could be in enabling optimizing compilers and autotuners to infer execution characteristics without having to resort to running execution experiments.

To this end, we present CUDAflow, a scalable toolkit for heterogeneous computing applications. Specifically, CUDAflow provides a new methodology for characterizing CUDA kernels using control flow graphs and instruction operations executed. It performs novel kernel subgraph matching to gain insights into an application's resource requirements. To the knowledge of the authors, this work is a first attempt at employing subgraph matching for revealing control flow behavior and generating efficient code. Contributions described in this paper include the following.

– Systematic process to construct control flow graphs for GPU kernels.
– Techniques to perform subgraph matching on various kernel CFGs and GPUs.
– Approaches to reveal control flow behavior based on CFG properties.

Prior Work

Control flow divergence in heterogeneous computing applications is a well-known and difficult problem, due to the lockstep nature of the GPU execution paradigm. Current efforts to address branch divergence in GPUs draw from several fields, including profiling techniques for CPUs, and software and hardware architectural support in GPUs. For instance, Sarkar demonstrated that the overall execution time of a program can be estimated by deriving the variances of basic block regions Sarkar (1989).
Control flow graphs for flow- and context-sensitive profiling were discussed in Ammons, Ball, and Larus (1997); Ball and Larus (1994), where instrumentation probes were inserted at selected edges in the CFG, which reduced the overall profiling overhead with minimal loss of information. Hammock graphs were constructed Zhang and D'Hollander (2004) that mapped unstructured control flow on a GPU Diamos et al. (2011); H. Wu, Diamos, Li, and Yalamanchili (2011). By creating thread frontiers to identify early thread reconvergence opportunities, dynamic instruction counts were reduced by as much as 633.2%. Lynx Farooqui, Kerr, Eisenhauer, Schwan, and Yalamanchili (2012) creates an internal representation of a program based on PTX and then emulates it, which determines the memory, control flow and parallelism of the application. This work closely resembles ours but differs in that we perform workload characterization on actual hardware during execution. Other performance measurement tools, such as HPCToolkit Adhianto et al. (2010) and DynInst Miller et al. (1995), provide a way for users to construct control flow graphs from CUDA binaries, but do not analyze the results further. The MIAMI toolkit Marin, Dongarra, and Terpstra (2014) is an instrumentation framework for studying an application's dynamic instruction mix and control flow, but does not support GPUs.

Subgraph matching has been explored in a variety of contexts. For instance, the DeltaCon framework matched arbitrary subgraphs based on similarity scores Koutra, Vogelstein, and Faloutsos (n.d.), which exploited the properties of the graph (e.g., clique, cycle, star, barbell) to support the graph matching. Similarly, frequent subgraph mining was performed on molecular fragments for drug discovery Borgelt and Berthold (2002), whereas document clustering was formalized in a graph database context Huan, Wang, and Prins (2003). The IsoRank authors consider the problem of matching protein-protein interaction networks between distinct species Singh, Xu, and Berger (2007). The goal was to leverage knowledge about the proteins from an extensively studied species, such as a mouse, which when combined with a matching between mouse proteins and human proteins can be used to hypothesize about possible functions of proteins in humans. However, none of these approaches apply frequent subgraph matching for understanding the performance behavior of GPU applications.

Background

Figure 11. Overview of our proposed CUDAflow methodology.

Our CUDAflow approach, shown in Figure 11, works in association with the current nvcc toolchain. Control flow graphs are constructed from static code analysis, and program execution statistics are gathered dynamically through program counter sampling. This measurement collects counts of executed instructions and corresponding source code locations, among other information. In this way, the CUDAflow methodology provides a more accurate characterization of the application kernel, versus hardware performance counters alone, which lack the ability to correlate performance with source line information and are prone to miscounting events Lim et al. (2014). In particular, it gives a way to understand the control flow behavior during execution.
In this way, the CUDAflow methodology provides a more accurate characterization 45 Kepler Maxwell Pascal Kepler Maxwell Pascal Kepler Maxwell Pascal START START START START START START START START root_5 root_8 root_3 root_2 root_6 root_6 root_1 L_27 L_59 L_10 root_2 START L_19 L_26 L_68 L_42 L_9 L_58 L_2 L_6 L_18 L_25 L_67 root_2 L_41 L_8 L_57 L_29 L_17 L_1 L_5 L_11 L_60 L_66 L_40 L_28 STOP L_12 L_20L_7 L_56 L_3 L_30 L_65 L_39 L_13 L_55 L_16 STOP L_24 STOP STOP STOP STOP STOP STOP STOP BFS kernel warp Reduction reduce SPMV csr scalar START START START START START START START START START root_4 root_1 root_1 root_1 root_4 root_2 root_1 root_2 L_73 L_1 L_1 L_1 root_1 L_104 L_9 L_72 L_4 L_2 L_3 L_4 L_4 L_74 L_3 L_1 L_103 L_8 L_3 L_3 L_75 L_2 L_1 L_2 L_105 L_7 STOP L_71 L_5 L_2 L_2 STOP STOP STOP STOP L_102 STOP STOP STOP Hotspot calc temp Particlefilter sum kernel Pathfinder dynproc kernel Figure 12. Control flow graphs generated for each CUDA kernel, comparing architecture families (Kepler, Maxwell, Pascal). of the application kernel, versus hardware performance counters alone, which lack the ability to correlate performance with source line information and are prone to miscounting events Lim et al. (2014). In particular, it gives a way to understand the control flow behavior during execution. Kernel Control Flow Graphs. One of the more complex parameters used to characterize SIMD thread divergence is by using a control flow graph (CFG) representation of the computation. A CFG is constructed for each GPU kernel computation in program order and can be represented as a directed acyclic graph G = (N,E, s), where (N,E) is a finite directed graph, and a path exists from the START node s ∈ N to every other node. A unique STOP node is also assumed in the CFG. A node in the graph represents a basic block (a straight line of code without jumps or jump targets), whereas directed edges represent jumps in the control flow. 46 Each basic block region is incremented with the number of times the node is visited. Upon sampling the program counter, the PC address is referenced internally to determine to which basic block region the instruction corresponds to. .L_41: /*04a0*/ DSETP.LE.AND P0,PT,|R6|,+INF,PT; /*04a8*/ @P0 BRA ‘(.L_43); /*04b0*/ LOP32I.OR R5, R7, 0x80000; /*04b8*/ MOV R4, R6; /*04c8*/ BRA ‘(.L_42); Example control flow graphs for selected SHOC (top) Danalis et al. (2010) and Rodinia (bottom) Che et al. (2009) GPU benchmarks are displayed in Figure 12. Different GPU architecture types will result in the nvcc compiler producing different code and possibly control flow, as seen in the CFGs from Figure 12 for Kepler, Maxwell and Pascal architectures. Section III discusses the differences in GPU architectures. The CFG differences for each architecture are due in part to the architecture layout of the GPU and its compute capability (NVIDIA virtual architecture). The Maxwell generally uses fewer nodes for its CFGs, as evident in kernel warp. Our approach can expose these important architecture-specific effects on the CFGs. Also, note that similarities in structure exist with several CFGs, including csr scalar and sum kernel. Part of the goal of this research is to predict the required resources for the application by inferring performance through CFG subgraph matching, with the subgraphs serving as building blocks for more nested and complex GPU kernels. For this purpose, we introduce several metrics that build on this CFG representation. 
Transition Probability. Transition probabilities represent frequencies of an edge to a vertex, or branches to code regions, which describes the application in a way that a flat profile misconstrues. A stochastic matrix could also facilitate eliminating dead code, where states with 0 transition probabilities represent node regions that will never be visited. Kernels employing structures like loops and control flow increase the complexity of the analysis, and knowledge of the transition probabilities of kernels could help during code generation.

A canonical adjacency matrix M represents a graph G such that every diagonal entry of M is filled with the label of the corresponding node and every off-diagonal entry is filled with the label of the corresponding edge, or zero if no edge exists Yan and Han (2002). The adjacency matrix describes the transition from N_i to N_j. If the probability of moving from i to j in one time step is Pr(j|i) = m_{i,j}, the adjacency matrix is given by m_{i,j} as the i-th row and j-th column element. Since the total transition probability from a state i to all other states must be 1, this matrix is a right stochastic matrix, so that \sum_j P_{i,j} = 1.

Figure 13. Transition probability matrices for the Pathfinder (dynproc kernel) application, comparing Kepler (left) and Maxwell (right) versions.

Figure 13 illustrates transition probability matrices for a kernel from the Pathfinder application (Tab. 12, bottom-rt.), comparing Kepler (left) and Maxwell (right) versions. Note that the Pascal version was the same as Maxwell, as evident in Fig. 16, lower-right, and was left out intentionally. The entries of the transition probability matrix were calculated by normalizing over the total number of observations for each observed node transition i to j. Although the matrices differ in size, observe that a majority of the transitions take place in the upper-left triangle, with a few transitions in the bottom-right, for all matrices. The task is to match graphs of arbitrary sizes based on their transition probability matrices and instruction operations executed, among other information.

Hybrid Static and Dynamic Analysis. We statically collect instruction mixes and source code locations from the generated code and map the instruction mixes to the source locator activity as the program is being run Lim et al. (2015). The static analysis of CUDA binaries produces an objdump file, which provides assembly information, including instructions, program counter offsets, and source line information. The CFG structure is stored in iGraph format Csardi and Nepusz (n.d.). We attribute the static analysis from the objdump file to the profiles collected from the source code activity to provide a runtime characterization of the GPU kernel as it executes on the architecture. This mapping of static and dynamic profiles provides a rich understanding of the behavior of the kernel application with respect to the underlying architecture.
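A minimal sketch of the row-stochastic transition matrix described above follows; it normalizes observed edge counts over each source node, leaving rows with no outgoing transitions at zero. The edge counts and block names continue the hypothetical example from the previous sketch.

    import numpy as np

    def transition_matrix(edges, nodes):
        """Build the right-stochastic matrix of Fig. 13: P[i, j] = Pr(j | i),
        estimated by normalizing observed edge counts over each source node."""
        index = {n: k for k, n in enumerate(nodes)}
        P = np.zeros((len(nodes), len(nodes)))
        for (src, dst), count in edges.items():
            P[index[src], index[dst]] = count
        totals = P.sum(axis=1, keepdims=True)
        np.divide(P, totals, out=P, where=totals > 0)   # rows sum to 1 (or stay 0)
        return P

    P = transition_matrix({("L_41", "L_43"): 3, ("L_41", "L_42"): 1,
                           ("L_43", "L_42"): 2}, ["L_41", "L_42", "L_43"])
    print(P)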
Methodology

Based on the kernel CFG and transition probability analysis, the core of the CUDAflow methodology focuses on the problem of subgraph matching. In order to perform subgraph matching, we first scale the matrices to the same size: for graphs G_1 and G_2 we take the maximal proper submatrix, constructed by B(G_i) = max(|V_1|, |V_2|) for a given G_i = min(|V_1|, |V_2|), using spline interpolation. The similarities in the shapes of the control flow graphs, the variants generated for each GPU (Table 12), and the activity regions in the transition probability matrices (Fig. 13) provided the motivation for this approach. In our case, the dense hotspots in the transition matrix should align with their counterparts if the matrices are similar enough.

Table 9. Distance measures considered in this paper.

Abbrev   Name        Result
Euc      Euclidean   \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}
Iso      IsoRank     (I - \alpha Q \times P)x
Man      Manhattan   \sum_{i=1}^{n} |x_i - y_i|
Min      Minkowski   \sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}
Jac      Jaccard     \frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i y_i}
Cos      Cosine      1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}

Bilinear Interpolation. To scale the transition matrix before performing the pairwise comparison, we employ a spline interpolation procedure. Spline interpolation is a general form of linear interpolation for functions of n-th order polynomials, such as bilinear and cubic. For instance, a spline on a two-order polynomial performs bilinear interpolation on a rectilinear 2D grid (e.g. x and y) Gonzales and Woods (1993). The idea is to perform linear interpolation in both the vertical and horizontal directions. Interpolation works by using known data to estimate values at unknown points. Refer to Appendix B for the derivation of bilinear interpolation.

Pairwise Comparison. Once the matrix is interpolated, the affinity scores (S_1 and S_2 for graphs G'_1 and G'_2, respectively) are matched via a distance measure, which includes the Euclidean distance, the IsoRank solution Singh et al. (2007), the Manhattan distance, the Minkowski metric, Jaccard similarity, and Cosine similarity. The distance measures considered in this work are listed in Table 9. By definition, sim(G_i, G_j) = 0 when i = j, with the similarity measure placing progressively higher scores on objects that are further apart.

CFG Results

Applications. The Rodinia and SHOC application suites are a class of GPU applications that cover a wide range of computational patterns typically seen in parallel computing. Table 10 describes the applications used in this experiment along with source code statistics, including the number of kernel functions, the number of associated files and the total lines of code.

Rodinia. Rodinia is a benchmark suite for heterogeneous computing which includes applications and kernels that target multi-core CPU and GPU platforms Che et al. (2009). Rodinia covers a wide range of parallel communication patterns, synchronization techniques, and power consumption, and has led to architectural insights such as memory-bandwidth limitations and the consequent importance of data layout.

SHOC Benchmark Suite. The Scalable HeterOgeneous Computing (SHOC) application suite is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing Danalis et al. (2010). SHOC provides implementations for CUDA, OpenCL, and Intel MIC, and supports both sequential and MPI-parallel execution.

Analysis

To illustrate our new methodology, we analyzed the SHOC and Rodinia applications at different granularities.
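Before looking at the results, the scale-then-compare step described under Pairwise Comparison can be sketched as follows: the smaller transition matrix is resized to the larger one with a first-order (bilinear) spline and the flattened matrices are compared with a distance measure. This is a sketch only; the IsoRank and Jaccard variants from Table 9 are omitted, and the matrices shown are hypothetical.

    import numpy as np
    from scipy.ndimage import zoom
    from scipy.spatial.distance import cosine, euclidean  # cityblock, minkowski also apply

    def match_score(P1, P2, metric=euclidean):
        """Resize the smaller matrix with bilinear interpolation (order=1),
        then compare the flattened matrices; identical graphs score 0."""
        n = max(P1.shape[0], P2.shape[0])
        A = zoom(P1, n / P1.shape[0], order=1)
        B = zoom(P2, n / P2.shape[0], order=1)
        return metric(A.ravel(), B.ravel())

    P_kepler = np.array([[0.0, 0.7, 0.3], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
    P_maxwell = np.array([[0.0, 1.0], [0.0, 0.0]])
    print(match_score(P_kepler, P_maxwell))
    print(match_score(P_kepler, P_kepler, metric=cosine))   # 0.0 for a self-match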
Application Level. Figure 14 projects goodness as a function of efficiency, which displays the similarities and differences of the benchmark applications. The size of a bubble represents the number of operations executed, whereas the shade represents the GPU type. Efficiency describes how gainfully employed the GPU floating-point units remained, or FLOPs per second:

    efficiency = \frac{(op_{fp} + op_{int} + op_{simd} + op_{conv}) \cdot calls_n}{time_{exec}}    (4.1)

The goodness metric describes the intensity of the floating-point and memory operation arithmetic:

    goodness = \sum_{j \in J} op_j \cdot calls_n    (4.2)

Table 10. Description of SHOC (top) and Rodinia (bottom) benchmarks studied.

Name               Ker  File  Ln    Description
FFT                9    4     970   Forward and reverse 1D fast Fourier transform.
MD                 2    2     717   Compute the Lennard-Jones potential from molecular dynamics.
MD5Hash            1    1     720   Compute many small MD5 digests, heavily dependent on bitwise operations.
Reduction          2    5     785   Reduction operation on an array of single or double precision floating point values.
Scan               6    6     1035  Scan (parallel prefix sum) on an array of single or double precision floating point values.
SPMV               8    2     830   Sparse matrix-vector multiplication.
Stencil2D          2    12    1487  A 9-point stencil operation applied to a 2D dataset.
Backprop           2    7     945   Trains weights of connecting nodes on a layered neural network.
BFS                2    3     971   Breadth-first search, a common graph traversal.
Gaussian           2    1     1564  Gaussian elimination for a system of linear equations.
Heartwall          1    4     6017  Tracks changing shape of walls of a mouse heart over a sequence of ultrasound images.
Hotspot            1    1     1199  Estimate processor temperature based on floor plan and simulated power measurements.
Nearest Neighbor   1    2     385   Finds k-nearest neighbors from an unstructured data set using Euclidean distance.
Needleman-Wunsch   2    3     1878  Global optimization method for DNA sequence alignment.
Particle Filter    4    2     7211  Estimate location of a target object given noisy measurements in a Bayesian framework.
Pathfinder         1    1     707   Scan (parallel prefix sum) on an array of single or double precision floating point values.
SRAD v1            6    12    3691  Diffusion method for ultrasonic and radar imaging applications based on PDEs.
SRAD v2            2    3     2021  ...

Figure 14. Left: The static goodness metric (Eq. B.2) is positively correlated with the dynamic efficiency metric (Eq. B.1). The color represents the architecture and the size of bubbles represents the number of operations. Right: Differences in vertices between two graphs, as a function of the Euclidean metric for all GPU kernel combinations. Color represents intensity.

Note that efficiency is measured via runtime, whereas goodness is measured statically. Figure 14 (left) shows a positive correlation between the two measures, where the efficiency of an application increases along with its goodness. Static metrics, such as goodness, can be used to derive the dynamic behavior of an application. This figure also demonstrates that merely counting the number of executed operations is not sufficient to characterize applications, because operation counts do not fully reveal control flow, which is a source of bottlenecks in large-scale programs.
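The two metrics in Eqs. 4.1 and 4.2 are simple to compute once the operation counts are in hand; the sketch below shows both, with hypothetical counts and timings standing in for measured values.

    def goodness(op_counts, calls):
        """Eq. 4.2: weighted floating-point and memory operation count, computed
        purely from static instruction mixes (op_counts is a dict of counts)."""
        return sum(op_counts.values()) * calls

    def efficiency(op_fp, op_int, op_simd, op_conv, calls, time_exec):
        """Eq. 4.1: useful operations retired per second of measured kernel time."""
        return (op_fp + op_int + op_simd + op_conv) * calls / time_exec

    # Hypothetical values for one kernel profile:
    print(goodness({"flops": 5.0e6, "mem": 2.0e6}, calls=10))
    print(efficiency(5.0e6, 1.0e6, 0.0, 2.0e5, calls=10, time_exec=0.012))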
CFG Subgraph Matching.

Distribution of Matched Pairs. Figure 14 (right) projects the distribution of differences in vertices |V| for all 162 CFG kernel pairs (Table 10, 2nd col. + 3 GPUs) as a function of the Euclidean measure (application, architecture, kernel), with shade representing the frequency of the score. Note that most matched CFGs had a similarity score of 1.5 to 2.2 and had size differences under 10 vertices. Figure 14 (right) also shows that as the differences in vertices increase, similarity matching becomes degraded due to the loss of quality when interpolating missing information, which is expected. Another observation is that strong similarity results when the node differences of the matched kernel pairs were at a minimum, between 0 and 8 nodes.

Figure 15. Error rates when estimating instruction mixes statically from runtime observations for selected matched kernels (x-axis), with IsoRank scores near 1.30.

Error Rates from Instruction Mixes. Here, we wanted to see how far off our instruction mix estimations were from our matched subgraphs. Figure 15 displays instruction mix estimation error rates, calculated using the mean squared error, for the MD, Backprop, and SPMV kernels as a function of matched kernels (x-axis) with IsoRank scores between 1.00 and 1.30. The naming convention for each kernel is 〈gpu arch.suite.app.kernel〉. In general, CUDAflow is able to provide subgraph matching for arbitrary kernels through the IsoRank score, in addition to instruction mixes within an 8% margin of error. Note that since relative dynamic performance is being estimated from static information, the error rates will always be high.
Note that since relative dynamic 54 Mean Absolute Error (%) Mean Absolute Error (%) IsoRank Mean Absolute Error (%) IsoRank IsoRank Similarity Measures for Kernels and Architectures k80.euc m40.euc p100.euc R.bfs.Z6Ker.1 R.hot.Z14ca.2 R.par.Z10su.3 R.pat.Z14dy.4 R.sra.Z6red.5 R.sra.Z4sra.6 R.sra.Z11sr.7 R.sra.Z11sr.8 S.sca.Z11bo.9 S.spm.Z22sp.10 S.spm.Z22sp.11 S.ste.Z13St.12 k80.iso m40.iso p100.iso 1.0 R.bfs.Z6Ker.1 R.hot.Z14ca.2 R.par.Z10su.3 R.pat.Z14dy.4 0.8 R.sra.Z6red.5 R.sra.Z4sra.6 R.sra.Z11sr.7 R.sra.Z11sr.8 S.sca.Z11bo.9 0.6 S.spm.Z22sp.10 S.spm.Z22sp.11 S.ste.Z13St.12 k80.cos m40.cos p100.cos 0.4 R.bfs.Z6Ker.1 R.hot.Z14ca.2 R.par.Z10su.3 R.pat.Z14dy.4 0.2 R.sra.Z6red.5 R.sra.Z4sra.6 R.sra.Z11sr.7 R.sra.Z11sr.8 S.sca.Z11bo.9 0.0 S.spm.Z22sp.10 S.spm.Z22sp.11 S.ste.Z13St.12 r.1a.2u.3 y.4d.5 .6 r.7 r.8 .9 10 11 12 r.1 .2 .3 .4 .5 .6 r.7 r.8 .9 10 11 12 r.1 .2 .3 .4 .5 .6 r.7 r.8 .9 0 1 2Ke 4c 0s6 4d 1 1 1 1 1 1 6r e4s ra1s 1s o . . . a u y d1b sp sp St Ke 4c 0s 4d re sr a 1 1 3 6 6 4 11 s s o 11 1b sp . p. St. Ke ca su dy ed ras s s3 6 b o p. p. t. .Z .Z .Z .Z .Z .Z .Z .Z .Z1 22 22 1 .Z .Z1.Z1.Z1 .Z .Z .Z .Z .Z1 22 22 1 .Z .Z1 4 10 14 6r 4s 11 1111 2s 2s 3S bfs ot ar at sra srasra sra a .Z .Z . Z fs t r t ra a a a Z Z Z Z 1 . h p p . . . . r r r a . . Z .Z fs t r.Z t.Z a. Z ra.r ra. a. . Z Z2 Z2 .Z R R. R. R. R R R R S.s c pm m st e .b .ho .pa.pap . R .s R.sR R R R R. s R.sS.s c e o a apmpm.st R.b .h s s sR R.pR.p R. R. R. R. sr a . ..sc m m te S.s S.s S S.s S.s S SS.s p .sp S.sS Kernels Figure 16. Similarity measures for Euclidean, IsoRank and Cosine distances for 12 arbitarily selected kernels. 55 Similarity Measures for Kernels and Architectures k80.jac m40.jac p100.jac R.bfs.Z6Ker.1 R.hot.Z14ca.2 R.par.Z10su.3 R.pat.Z14dy.4 R.sra.Z6red.5 R.sra.Z4sra.6 R.sra.Z11sr.7 R.sra.Z11sr.8 S.sca.Z11bo.9 S.spm.Z22sp.10 S.spm.Z22sp.11 S.ste.Z13St.12 k80.min m40.min p100.min 1.0 R.bfs.Z6Ker.1 R.hot.Z14ca.2 R.par.Z10su.3 R.pat.Z14dy.4 0.8 R.sra.Z6red.5 R.sra.Z4sra.6 R.sra.Z11sr.7 R.sra.Z11sr.8 S.sca.Z11bo.9 0.6 S.spm.Z22sp.10 S.spm.Z22sp.11 S.ste.Z13St.12 k80.man m40.man p100.man 0.4 R.bfs.Z6Ker.1 R.hot.Z14ca.2 R.par.Z10su.3 R.pat.Z14dy.4 0.2 R.sra.Z6red.5 R.sra.Z4sra.6 R.sra.Z11sr.7 R.sra.Z11sr.8 S.sca.Z11bo.9 0.0 S.spm.Z22sp.10 S.spm.Z22sp.11 S.ste.Z13St.12 r.1a.2u.3 y.4d.5a.6 r.7 r.8 .9 0 1 2 .1 .2e r s s . 3 .4 .5 .6 .7 .8 .9 0 1 2 c s d e bo p.1p.1 t.1 er ca su dy ed ra sr sr bo p.1p.1 t.1 er .1 K r s 1 1 S K r s 1 1 S K ca .2u.3 y.4d.5a.6 r.7 r.8o.9.10.11.12 6 14 10 14 6 Z4 Z1 Z1 11 2s 2s 3 6 14 10 14 6 Z4 Z1 Z1 11 2s s 4 0 s 4d re sr 1s 1s 1b sp sp St s.Zf t.Z r.Z t.Z a. Z ra. a. a. a.Z .Z 2.Z2 .Z 1 s.Zf t.Z r.Z t.Z a. Z ra. a. . .Z 2 22 3 6 1a a .Z .Z .Z 1 s.Zf t.Z r.Z 1 Z1 .Z6 .Z4.Z1.Z1Z1 22 22 13 .b .ho a a sr s s r sr c te b o a a . sr s r r t a a a a . Z Z Z R R R.pR.p R. R. R. R. S.s pmpm.s R. R.hR.pR.p R. R. R. s R.s cS.s pmpm r . .st e R.bR.h o R.p a R.p a r s R.s R. R.s r .sr . . s s s s R S.s cam m ste S. S. S . . S .s p S S S S.s p S. Kernels Figure 17. Similarity measures for Jaccard, Minkowski and Manhattan distances for 12 arbitarily selected kernels. performance is being estimated from static information, the error rates will always be high. Pairwise Matching of Kernels. Figure 16 shows pairwise comparisons for 12 arbitrary selected kernels, comparing Euclidean (top), IsoRank (middle), and Cosine distance (bottom) matching strategies, and GPU architectures (rows). 
Figure 17 shows comparisons for the Jaccard measure and the Minkowski and Manhattan distances for the same 12 kernels. Note that the distance scores were scaled to between 0 and 1, where 0 indicates strong similarity and 1 denotes weak similarity. In general, all similarity measures, with the exception of IsoRank, are able to match a kernel against itself, as evident in the dark diagonal entries in the plots. However, this demonstrates that using similarity measures in isolation is not sufficient for performing subgraph matching for CUDA kernels.

Figure 18. Dendrogram of clusters for 26 kernels, comparing Maxwell (left) and Pascal (right) GPUs.

Clustering of Kernels. We wanted to identify classes of kernels based on characteristics such as instruction mixes, graph structures and distance measures. The Ward variance minimization algorithm minimizes the total within-cluster variance by finding the pair of clusters that leads to the minimum increase in weighted squared distances. The initial cluster distances in Ward's minimum variance method are defined as the squared Euclidean distance between points: d_{ij} = d(\{X_i\}, \{X_j\}) = \|X_i - X_j\|^2. Figure 18 shows a dendrogram of clusters for 26 kernels calculated with Ward's method, all matched against the Rodinia Particlefilter sum kernel, comparing the Maxwell (left) and Pascal (right) GPUs, which both have 4 edges and 2 vertices in their CFGs. sum kernel performs a scan operation and is slightly memory intensive (∼26% on GPUs). As shown, our tool is able to categorize kernels by grouping features, such as instruction mixes, graph structures, and distance measures, that show strong similarity. This figure also demonstrates that different clusters can be formed on different GPUs for the same kernel, where the hardware architecture may result in a different clustering of kernel classes.
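A dendrogram of this kind can be produced with an off-the-shelf Ward linkage, as sketched below. The feature matrix here is random placeholder data and the kernel names are hypothetical; in CUDAflow the rows would hold instruction-mix fractions, graph properties, and distance scores against a reference kernel.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    rng = np.random.default_rng(0)
    features = rng.random((26, 8))                 # placeholder kernel features
    labels = [f"kernel_{i}" for i in range(26)]    # hypothetical names

    Z = linkage(features, method="ward")           # Ward's minimum-variance criterion
    dendrogram(Z, labels=labels)
    plt.ylabel("distance (Ward)")
    plt.tight_layout()
    plt.show()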
As shown in Figure 20, our tool not only is able to group the same kernel that was executed on different GPUs, as evident in the three general categories of clusters, but also kernels that exhibited similar characteristics when running on a particular architecture, such as instructions executed, graph structures, and distance measures. Discussion. These metrics can be used both for guiding manual optimizations and by compilers or autotuners. For example, human optimization effort can focus on the code fragments that are ranked high by kernel impact, but low by the goodness metric. An autotuner can also use metrics such as the goodness metric to explore the space of optimization parameters more efficiently, such as by excluding cases where we can predict a low value of the goodness metric without having to execute and time the actual generated code. A benefit to end users (not included in paper, due to space purposes) would be providing the ability to compare an 58 Similarity Measures for Kernels across Architectures Euclidean IsoRank Cosine k80.R.par.Z10su m40.R.par.Z10su p100.R.par.Z10su k80.R.pat.Z14dy m40.R.pat.Z14dy p100.R.pat.Z14dy k80.S.sca.Z6red 1.0 m40.S.sca.Z6red p100.S.sca.Z6red 0.8 Jaccard Minkowski Manhattan 0.6 k80.R.par.Z10su m40.R.par.Z10su p100.R.par.Z10su 0.4 k80.R.pat.Z14dy m40.R.pat.Z14dy 0.2 p100.R.pat.Z14dy k80.S.sca.Z6red m40.S.sca.Z6red 0.0 p100.S.sca.Z6red su su su dy dy dy d d d0 0 0 4 4 4 re s u su su y y y d d d su su su y y y d 6 6re 6re 0 0 0 4d 4d 4d 6re 6re 6re 0 0 0 4d 4d 4d 6re 6re d ed .Z1 .Z1 .Z1 .Z1 .Z1 6r ar ar ar at at at. Z1 a.Z a.Z a.Z .Z 1 .Z1 1 1 1 1 1 1 1 1 1 1r r r.Z t.Z t.Z t.Z a.Z .Z .Z Z Z Z Z Z Z .Z .Z .Z .p .p .p .p .p .p .sc .sc .sc .pa .pa .pa .pa .pa .pa .sc .sc a a r. r. r. t. t. t. a a a R R .R R R R .S .S .S R R .R R R R .S .S .S. sc pa pa pa pa pa pa . . . . .R. .R. .R. R. R. R. .S. sc S.s c c S.s 80 40 00 80 . 40 . 0.0 k8 0 40 010 80 40 00 80 . 0. 0. 0 0 0 . . . 0 0. 0. k m p1 k m p1 m p k m p1 k m 4 0 p1 k 8 m4 10p k8 0 m4 0 00 80 40 001 k 1 k8 m4 10p m p p Kernels Figure 19. Dendrogram of clusters for pairwise comparison for 3 kernels across 3 GPUs (9 total). Dendrogram of Kernels with GPUs 160 140 120 100 80 60 40 20 0 Kernels Figure 20. Dendrogram of clusters for pairwise comparison for 3 kernels across 3 GPUs (27 total). 
By making use of a subgraph matching strategy as well as the instruction operations executed, CUDAflow is able to provide a mechanism to characterize unseen kernels. We have presented CUDAflow, a control-flow-based methodology for analyzing the performance of CUDA applications. We combined static binary analysis with dynamic profiling to produce a set of metrics that not only characterizes the kernel by its computation requirements (memory or compute bound), but also provides detailed insights into application performance. Specifically, we provide an intuitive visualization and metrics display, and correlate performance hotspots with source line and file information, effectively guiding the end user to locations of interest and revealing potentially effective optimizations by identifying similarities of new implementations to known, autotuned computations through subgraph matching. We implemented this new methodology and demonstrated its capabilities on SHOC and Rodinia applications.
Future work includes incorporating memory reuse distance statistics of a kernel to characterize and help optimize the memory subsystem and compute/memory overlaps on the GPU. In addition, we want to generate robust models that will discover optimal block and thread sizes for CUDA kernels for specific input sizes without executing the application Lim et al. (2017). Last, we are in the process of developing an online web portal cknowledge (n.d.); Sreepathi et al. (2014) that will archive a collection of control flow graphs for all known GPU applications. For instance, the web portal would be able to make on-the-fly comparisons across various hardware resources, as well as other GPU kernels, without burdening the end user with hardware requirements or software 60 package installations, and will enable more feature rich capabilities when reporting performance metrics. Conclusion This chapter developed pattern matching techniques that captured runtime information of a GPU program that could be used by compilers and autotuners for optimization, which eliminates unnecessary experimentation runs. The next chapter is in the domain where the optimization landscape is vast. In particular we survey how to tune hyper-parameters for neural networks for a machine translation system. We identify which hyper-parameters matter most, in terms of systems execution performance, and what the cost of tuning that hyper-parameter is. 61 CHAPTER V OPTIMIZING HYPER-PARAMETERS FOR NEURAL NETWORKS This chapter includes previously published co-authored material from a presentation at the 2nd Workshop on Naval Applications of Machine Learning Lim, Heafield, Hoang, Briers, and Malony (2018). This work was performed while I was an intern at The Alan Turing Institute. Dr. Kenneth Heafield supervised me and provided the initial problem to solve. Dr. Hieu Hoang assisted in software install issues, and preprocessing of the datasets, as well as answering questions related to machine translation. Dr. Mark Briers partly supervised me as an intern. Dr. Allen Malony assisted in editing the paper. Abstract Neural machine translation (NMT) has been accelerated by deep learning neural networks over statistical-based approaches, due to the plethora and programmability of commodity heterogeneous computing architectures such as FPGAs and GPUs and the massive amount of training corpuses generated from news outlets, government agencies and social media. Training a learning classifier for neural networks entails tuning hyper-parameters that would yield the best performance. Unfortunately, the number of parameters for machine translation include discrete categories as well as continuous options, which makes for a combinatorial explosive problem. This research explores optimizing hyper- parameters when training deep learning neural networks for machine translation. Specifically, our work investigates training a language model with Marian NMT. Results compare NMT under various hyper-parameter settings across a variety of modern GPU architecture generations in single node and multi-node settings, 62 revealing insights on which hyper-parameters matter most in terms of performance, such as words processed per second, convergence rates, and translation accuracy, and provides insights on how to best achieve high-performing NMT systems. 
Motivation The rapid adoption of neural network (NN) based approaches to machine translation (MT) has been attributed to the massive amounts of datasets, the affordability of high-performing commodity computers, and the accelerated progress in fields such as image recognition, computational systems biology and unmanned vehicles. Research activity in NN-based machine translation has been taking place since the 1990s, but statistical machine translation (SMT) soared along with the successes of machine learning. SMT incorporates a rule-based, data driven approach, and includes language models such as word based (n-grams), phrased- based, syntax-based and hierarchical based approaches. Neural machine translation (NMT), on the other hand, does not require predefined rules, but learns lingusitic rules from statistical models, sequences and occurences from large corpuses. Models trained using NNs produce even higher accuracy than existing SMT approaches, but training time can take anywhere from days to weeks to complete. Suboptimal strategies are often difficult to find, given the dimensionality and its effect on parameter exploration. One of the main difficulties of training neural networks is the millions of parameters that need to be estimated. These parameters are estimated by optimization methods, such as stochastic gradient descent, where the solver seeks to identify the global optima. Due to the combinatorial search space, local optimization in many cases is sufficent to generalize beyond the training 63 set Goodfellow, Bengio, and Courville (2016) (Ch. 8). Thus, the tuning of hyper- parameters is paramount in accelerating training of neural networks. In neural machine translation, modeling and training are crucial in achieving high performing systems. A combination of hyper-parameter optimization methods to train a NMT system is investigated in this work. Specifically, this work examines the stability of different optimization parameters in discovering local minima, and how a combination of hyper-parameters can lead to faster convergence. The following contributions are made in this work: – We identify which hyper-parameters matter most in contributing to the learning trajectory of NMT systems. – We analyze our findings for translation performance, training stability, convergence speed, and tuning cost. – We tie in systems execution performance with hyper-parameters. Related Work Hyper-parameter optimization has been an unsolved problem since the inception of machine learning, and becomes even more crucial in training the millions of parameters in neural networks. The past work has investigated techniques for hyper-parameter tuning and search strategies, such as Bergstra, et. al., concluding that random search outperforms grid search Bergstra, Bardenet, Bengio, and Kégl (2011). Likewise, the authors in Shahriari, Swersky, Wang, Adams, and De Freitas (2016); Snoek, Larochelle, and Adams (2012) take a Bayesian approach toward parameter estimation and optimization. However, these efforts apply their strategies on image classfication tasks. 64 In relation to NMT, Britz, et. al. massively analyze neural network architectures and its variants Britz, Goldie, Luong, and Le (2017). Their approach incorporates a 2-layer bidirectional encoder/decoder with a multiplicative attention mechanism as a baseline architecture, with a 512-unit GRU and a dropout of 0.2 probability. 
Their model parameters remained fixed and the studies varied the architecture, including the layer depth, unidirectional vs. bidirectional encoder/decoder, attention mechanism size, and beam search strategies. Likewise, Bahar et al. compare various optimization strategies for NMT by switching to a different optimizer after 10k iterations, and found that Adam combined with other optimizers, such as SGD or annealing, increased the BLEU score by 2.4 Bahar, Alkhouli, Peter, Brix, and Ney (2017). However, these approaches study a standard NMT system. In addition, Y. Wu et al. (2016) utilized the combination of Adam and SGD, where Adam ran for a fixed number of iterations with a 0.0002 learning rate, then switched to SGD with a 0.5 learning decay rate to slow down training, but did not perform hyper-parameter optimization. To the best of our knowledge, there has not been any work comparing different hyper-parameter optimization strategies for NMT. Moreover, our optimization strategies are demonstrated on a production-ready NMT system and explore parameter selection tradeoffs, in terms of performance and stability.

Background

Machine translation involves model design and model training. In general, learning algorithms are viewed as a combination of selecting a model criterion, defined as a family of functions, training, defined as parameterization, and a procedure for appropriately optimizing this criterion. The next subsections discuss how sentences are represented with a neural network and the optimization objectives used for training a model for a translation system.

Figure 21. RNN encoder-decoder, illustrating a sentence translation from English to French. The architecture includes a word embedding space, a 1-of-K coding and a recurrent state on both ends.¹

¹https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-2/

Machine Translation. This subsection discusses how neural networks can model language translation from a source to a target sequence.

Recurrent Neural Networks. Recurrent neural networks (RNNs) are typically employed for neural machine translation because of their ability to handle variable-length sequences. RNNs capture unbounded context dependencies typical in natural language comprehension and speech recognition systems. For inputs xt and yt, connection weight matrices Wih, Whh, Who, indicating input-to-hidden, hidden-to-hidden and hidden-to-output, respectively, and activation function f, the recurrent neural network can be described as follows:

ht = fH(Wih xt + Whh ht−1),  (5.1)
yt = fO(Who ht).  (5.2)

RNNs learn a probability distribution over a sequence by being trained to predict the next symbol in a sequence. The output at each timestep t is the conditional probability distribution p(xt | xt−1, ..., x1).

RNN Encoder-Decoder. An RNN encoder-decoder (pictured in Fig. 21) encodes a variable-length sequence into a fixed vector representation, and decodes the fixed vector representation into a variable-length sequence Cho et al. (2014). The RNN encoder-decoder consists of two separate neural networks that are jointly trained to maximize the conditional log-likelihood, defined as

arg max_θ (1/N) ∑_{n=1}^{N} log p_θ(t_n | s_n),  (5.3)

where θ represents the set of model parameters, each (s_n, t_n) is a pair of input and output sequences from a parallel text corpus training set, and the output of the decoder from the encoder is differentiable. A trained RNN encoder-decoder can generate a target sequence given an input sequence.

Neural Machine Translation.
Neural machine translation is defined as maximizing the conditional probability, arg maxt p(t|s) ∝ p(s|t)p(t), for a source s and target t sequence, where p(s|t) represents the translation model, and p(t) represents the language model Bahdanau, Cho, and Bengio (2014); Sutskever, Vinyals, and Le (2014). Taking the l∑og linear of p(t|s) yields,N log p(t|s) = wntn(t, s) + logλ(s), (5.4) n=1 67 Table 11. Stochastic gradient descent and its variants. Optimizer Operations Description gt ←5θtJ(θt)SGD ← − gt - gradient cost function,θt+1 θt ηgt η - learning rate, θ parameters gt ←5θtJ(θt) AdaGrad ηt ← η 2t−1 + gt Divides η by previous θt+1 ← θt − √ η gη + t gradients, handles sparset data well gt ←5θtJ(θt) ηt ← γη 2t−1 + (1− γ)gt η̂ ← ηt Adam 1−γ t ← m - decay mean of pastmt µmt−1 + (1− µ)g tt m̂← mt gradients, m̂t, n̂t - biased 1−µt ← − η corrected terms that avoidsθt+1 θt √ m̂η̂ tt+ zero initialization, γ = 0.9, µ = 0.999,  = 108 where t and w are the nthn n feature and weight, and λ(s) is a normalization constant. The BLEU score provides a measure for optimizing weights during training. Optimization Objectives. The following subsections describe the tuning of hyper-parameters that affect the performance of training a NMT system. In particular, this work focuses on the optimizers, activation functions, and dropout. SGD Optimizers. Stochastic gradient descent (SGD), commonly used to train neural networks, updates a set of parameters θ, where η is the learning rate and gt represents the gradient cost function, J(·). Adagrad is an adaptive-based gradient method, where η is divided by the square of all previous gradients, ηt, plus , a smoothing term to avoid dividing by zero. As a result, larger gradients have less frequent updates, whereas smaller gradients have more frequent updates. Adagrad handles sparse data well and does not require manual 68 Table 12. Activation units for RNN. Activation Operations Description tanh st ← (ex − e−x)/(ex + e−x) hyperbolic tangent i← σ(x itU + s it−1W ) f ← σ(xtUf + s ft−1W ) o← σ(x otU + st−1Wo)LSTM g← tanh(x Ug g 3 gates, c - internalt + st−1W ) ← ◦ ◦ memory, o output, 2ct ct−1 f + g i ← ◦ tanhst tanh(ct) o z← σ(x z ztU + st−1W ) r← σ(x Ur + s r GRU t t−1 W ) h← tanh(x Uht + (st−1 ◦ 2 gates, no internal r)Wh) st ← (1− z) ◦ memory, no output h + z ◦ st−1 gates, 1 tanh tuning of η. Adaptive moment estimation (Adam) accumulates the decaying mean of past gradients, mt, and the decaying average of past squared gradients, ηt, referred to as the first and second moments, respectively. The moments, m̂t, n̂t are biased corrected terms that avoids initializing to zero. γ is usually set to 0.9, with µ = 0.999, and  = 108. Table 11 displays SGD, AdaGrad and Adam optimizers. Activation Functions. Activation functions serve as logic gates for recurrent neural networks that computes the hidden states, and include the hyperbolic tangent, long short term memory (LSTM) Hochreiter and Schmidhuber (1997), and gated recurrent unit (GRU) Cho et al. (2014). Table 12 displays the hyperbolic tangent, LSTM and GRU activation functions. To address the vanishing gradients problem associated with learning long- term dependencies in RNNs, LSTMs and GRUs employ a gating mechanism when computing the hidden states. For LSTM s, note that the input i, forget f and hidden h gates are the same equations except with different parameter matrices. g is a hidden state, based on the current input and previous hidden state. 
ct serves as the internal memory, which is a combination of the previous memory, ct−1, 69 Table 13. Dropout versus a standard update function. Optimizer Operations Description (l+1) ← (l+1) (l+1)z w yl + b Update i i i(l+1) (l+1) Standard update yi ← f(zi ) (l) rj ∼ Bernoulli(p) ŷ(l) ← r(l) ∗ y(l) Dropout (l+1) ← (l+1) l (l+1) r - Bernoulli rv, p - zi wi ŷ + bi dropout param (l+1) ← (l+1)yi f(zi ) multiplied by the input gate. The hidden state, st, is calculated by multiplying ct and the output gate. On the other hand, a GRU employs a reset gate r and an update gate u. The reset gate r determines how to combine the new input with the previous memory, whereas the update gate u defines how much of the previous memory to retain. If the reset gates were set to 1’s and the update gates to 0’s, this would result in a vanilla RNN. The differences between the two approaches to compute hidden units are that GRUs have 2 gates, whereas LSTMs have 3 gates. GRUs do not have an internal memory and output gates, compared with LSTM which uses c as its internal memory and o as an output gate. The GRU input and forget gates are coupled by an update gate z, and the reset gate r is applied directly to the previous hidden state. Also, GRUs do not have a 2nd non-linearity operation, compared to LSTMs, which uses two hyperbolic tangents. Dropout. In a fully-connected, feed-forward neural network, dropout randomly retains connections within hidden layers while discarding others Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov (2014). Table 13 displays a standard hidden update function on the top, whereas a version that decides whether to retain a connection is displayed on the bottom. ŷ(l) is the 70 thinned output layer, and retaining a network connection is decided by a Bernoulli random variable r(l) with probability p(·) = 1. Combination of Optimizers. Since the learning trajectory significantly affects the training process, it is required to select and tune the proper types of hyper-parameters to yield good performance. The construction of the RNN cell with activation functions, the optimizer and its learning rate, and the dropout rates all have an affect on how the training progresses, and whether good accuracy can be achieved. Marian NMT Marian Junczys-Dowmunt et al. (2018) is an efficient NMT framework written in C++, with support for multi-node and multi-GPU training and CPU/GPU translation capabilities. Marian is currently being developed and deployed by the Microsoft Translator team. Table 14 displays parameters involved with tuning a neural machine translation system, categorized by model, training and validation, with values and types in brackets, and its default value, if any. The types of models in Marian include RNNs and Transformers Vaswani et al. (2017). The translation system evaluated in this study is a sequence-to-sequence model with single layer RNNs for both the encoder and decoder. The RNN in the encoder is bi-directional and the decoder is sequence-to-sequence. Depth, also referred to as deep transitions Koehn (2017), is achieved by stacking activation blocks, resulting in tall RNN cells for every recurrent step. The encoder consists of four activation blocks per cell, whereas the decoder consists of eight activation blocks, with an attention mechanism placed between the first and second block. Word embedding sizes were set at 512, the RNN state size was set to 1024, and 71 layer normalization was applied inside the activation blocks and the attention mechanism. 
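For orientation, a sequence-to-sequence RNN of roughly this shape can be put together in a few lines of PyTorch. The sketch below is an illustrative stand-in rather than Marian's implementation: it keeps the 512-dimensional embeddings, 1024-dimensional recurrent state, and bi-directional GRU encoder, but omits the deep-transition cells, attention mechanism, and layer normalization described above.

```python
# A rough, self-contained PyTorch sketch of a GRU encoder-decoder; all names
# and shapes here are assumptions for illustration, not Marian's code.
import torch
import torch.nn as nn

class ToySeq2Seq(nn.Module):
    def __init__(self, src_vocab=66000, trg_vocab=50000, emb=512, hid=1024):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.trg_emb = nn.Embedding(trg_vocab, emb)
        # Bi-directional encoder; halve the hidden size so the concatenated
        # forward/backward states match the decoder state size.
        self.encoder = nn.GRU(emb, hid // 2, bidirectional=True, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, trg_vocab)

    def forward(self, src_ids, trg_ids):
        _, h = self.encoder(self.src_emb(src_ids))        # h: (2, B, hid//2)
        h0 = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)  # (1, B, hid)
        dec_out, _ = self.decoder(self.trg_emb(trg_ids), h0)
        return self.out(dec_out)                           # (B, T, trg_vocab)

model = ToySeq2Seq()
logits = model(torch.randint(0, 66000, (2, 7)), torch.randint(0, 50000, (2, 5)))
```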
Experiments The experiments were carried out on the WMT 2016 Junczys-Dowmunt and Grundkiewicz (2016) translation tasks for the Romanian and German languages in four directions: EN → RO, RO → EN, EN → DE, and DE → EN. The datasets and its characteristics used in the experiments are listed in Table 15, with number of sentence examples in parenthesis. Table 15 shows that for WMT 2016 EN → RO and RO → EN, the training data consisted of 2.6M English and Romanian sentence pairs, whereas for WMT 2016 EN → DE and DE → EN, the training corpus consisted of approximately 4.5M German and English sentence pairs. Validation was performed on 1000 sentences of the newsdev2016 corpus for RO, and on the newstest2014 corpus for DE. The newstest2016 corpus consisted of 1999 sentences for RO and 2999 sentences for DE, and was used as the test set. We evaluated and saved the models every 10K iterations and stopped training after 500K iterations. All experiments used bilingual data without additional monolingual data. We used the joint byte precision encoding (BPE) approach Sennrich, Haddow, and Birch (2015) in both the source and target sets, which converts words to a sequence of subwords. For all four tasks, the number of joint-BPE operations were 20K. All words were projected on a 512-dimensional embedding space, with vocabulary dimensions of 66000 × 50000. The mini-batch size was determined automatically based on the sentence length that was able to fit in GPU global memory, set at 13000 MB for each GPU. 72 Table 14. Marian hyper-parameters, with options in brackets. dimensions vocab [vect] embed [int] RNN dim [int] type [bi-dir, bi-unidir, s2s] cell: type [gru, lstm, tanh], depth [1, 2, ...], transition cells [1, 2, ...] skip [bool] layer norm [bool] tied embeddings [src, trg, all] dropout [float] transformer heads [int] no projection tied layers [vector] guided alignment layer preproc, postproc, post-emb [dr, add, norm] dropout [float] cost [ce-mean, ce-mean, words, ce-sum, perplexity] after-epochs [∞] max length [int=50] system GPUs, threads mini-batch size, words, fit, fit-step [int, int, bool, uint] optimizer [sgd, adgrad, adam] learn rate decay: strategy [epochs, stalled, epoch + batches, ep+stalled], start, frequency, repeat warmup, inverse sqrt, warmup label smoothing [bool] clip norm [float=1] exponential [float=0] smoothing guided alignment cost [ce, mean, mult], weight [float=0.1] data weighting type [sentence, word] embedding vectors, norm, fix-src, fix-trg frequency metrics [ce, ce-words, perplexity, valid-script, translation, bleu, bleu-detok] early stopping [int=10] beam size [int=12] normalize [float=0] max-length-factor [float=3] word penalty [float] mini-batch [int=32] max length [int=1000] 73 Validation Training Model Table 15. Datasets used in experiments. RO→EN, EN→RO DE→EN, EN→DE Train corpus.bpe (2603030) corpus.bpe (4497879) Valid newsdev2016.bpe (1999) newstest2014.bpe (3003) Test newstest2016.bpe (1999) newstest2016.bpe (2999) Beam search was used for decoding, with the beam size set to 12. The translation portion consisted of recasing and detokenizing the translated BPE chunks. The trained models compared different hyper-parameter strategies, including the type of optimizer, the activation function, and the amount of dropout applied, as discussed in Section V. The number of parameters were initialized with the same random seed. The systems were evaluated using the case-sensitive BLEU score computed by Moses SMT Koehn et al. (2007). 
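As a point of reference, corpus-level BLEU of the kind reported here can be computed in a few lines of Python, assuming the sacrebleu package; the experiments themselves used the Moses scoring scripts, and the toy hypothesis and reference below are illustrative only.

```python
# Corpus-level, case-sensitive BLEU on detokenized text (sacrebleu's default).
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```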
We compared models trained on two different types of GPUs (P100 Pascal, V100 Volta), listed on Table 16. The corresponding CPUs are listed on Table 17. Each ran with four GPUs. The dataset was partitioned across 4 GPUs, and a copy of the model was executed on each GPU. Analysis This section analyzes the results of the evaluated NMT systems in terms of translation quality, training stability, convergence speed and tuning cost. Translation Quality. Table 18 shows BLEU scores calculated for four translation directions for the validation sets (top) and the test sets (bottom), comparing learning rates, activation functions and GPUs. Note that entries with n/a means that no results were available, whereas entries with dnf indicates training time that did not complete within 24 hours. For the validation sets, LSTMs were able to achieve higher accuracy rates, whereas in the test set GRUs and LSTMs were about the same. Also, note that the best performing learning 74 Table 16. Graphical processors used in this experiment. P100 V100 CUDA capability 6.0 7.0 Global memory (MB) 16276 16152 Multiprocessors (MP) 56 80 CUDA cores per MP 64 64 CUDA cores 3584 5120 GPU clock rate (MHz) 405 1380 Memory clock rate (MHz) 715 877 L2 cache size (MB) 4.194 6.291 Constant memory (bytes) 65536 65536 Shared mem blk (bytes) 49152 49152 Registers per block 65536 65536 Warp size 32 32 Max threads per MP 2048 2048 Max threads per block 1024 1024 CPU (Intel) Ivy Bridge Haswell Architecture family Pascal Volta rates were usually at a lower value (e.g. 1e-3). The type of hidden unit mechanism (e.g LSTM vs GRU) and the learning rate can affect the overall accuracy achieved, as demonstrated by Table 18. Table 19 displays various dropout rates applied for two translation directions RO → EN and DE → EN, comparing hidden units, GPUs and overall training time. The learning rate was evaluated at 0.001, the rate that achieved the highest Table 17. Hardware and execution environment information. Architecture Haswell Ivy Bridge Model E5-2698 v3 Xeon X5650 Clock speed 2.30 GHz 2.67 GHz Node count 4, 14 6 GPUs 4 × V100 4 × P100 Memory 256 GB 50 GB Linux kernel 3.10.0-229.14.1 2.6.32-642.4.2 Compiler CUDA v9.0.67 Flags {‘g’, ‘lineinfo’, ‘arch=sm cc’} 75 Table 18. BLEU scores for validation (top) and test (bottom) datasets. ro→en en→ro de→en en→de cell learn-rt P100 V100 P100 V100 P100 V100 P100 V100 GRU 1e-3 35.53 35.43 19.19 19.28 28.00 27.84 20.43 20.61 5e-3 34.37 34.05 19.07 19.16 26.05 22.16 n/a 19.01 1e-4 35.47 35.46 19.45 19.49 27.37 27.81 dnf 21.41 LSTM 1e-3 34.27 35.61 19.29 19.64 28.62 28.83 21.70 21.69 5e-3 35.05 34.99 19.48 19.43 n/a 24.36 18.53 18.01 1e-4 35.41 35.28 19.43 19.48 n/a 28.50 dnf dnf GRU 1e-3 34.22 34.17 19.42 19.43 33.03 32.55 26.55 26.85 5e-3 33.13 32.74 19.31 18.97 31.04 26.76 n/a 26.02 1e-4 33.67 34.44 18.98 19.69 33.15 33.12 dnf 28.43 LSTM 1e-3 33.10 33.95 19.56 19.08 33.10 33.89 28.79 28.84 5e-3 33.10 33.52 19.13 19.51 n/a 29.16 24.12 24.12 1e-4 33.29 32.92 19.14 19.23 n/a 33.44 dnf dnf Table 19. Dropout rates, BLEU scores and total training time for test set, comparing systems. 
            ro→en                                   de→en
cell   dropout   P100      t      V100      t      P100      t      V100      t
GRU    0.0       34.47   6:29     34.47   4:43     32.29   9:48     31.61   6:15
       0.2       35.53   8:48     35.43   6:21     33.03  18:47     32.55  19:40
       0.3       35.36  12:21     35.15   7:28     31.36  10:14     31.50   9:33
       0.5       34.50  12:20     34.67  17:18     29.64  11:09     30.21  11:09
LSTM   0.0       34.84   6:29     34.65   4:46     32.84  12:17     32.88   7:37
       0.2       34.27   8:10     35.61   6:34     33.10  16:33     33.89  13:39
       0.3       35.67   9:56     35.37  11:29     33.45  20:02     33.51  15:51
       0.5       34.50  15:13     34.33  12:45     32.67  20:02     32.20  13:03

BLEU score, as evident in Table 18. Generally speaking, increasing the dropout rate also increased training time. This may be the result of losing network connections when applying the dropout mechanism, but with the added benefit of avoiding overfitting. This is evident in Table 19, where applying some form of dropout results in a trained model achieving higher accuracies. The best performance can be seen when the dropout rate was set at 0.2 to 0.3. This confirms that some mechanism for dropping connections is necessary to prevent the overfitting of models under training.

Figures 22 and 23 show BLEU scores as a function of training time, comparing GPUs, activation units, learning rates and translation directions. Note that a learning rate of 0.001 achieves the highest accuracy in most cases, at the cost of longer training time. Also note the correlation between longer training time and higher BLEU scores in most cases. In some cases, the models were able to converge at a faster rate (e.g., Fig. 22 upper left, RO→EN, GRU with a learning rate of 0.005 vs. 0.001).

Training Stability. Figure 24 shows the cross-entropy scores for the RO → EN and EN → RO translation tasks, comparing different activation functions (GRU vs. LSTM), with learning rates at 0.001. Note the training stability patterns that emerge from this plot, which are highly correlated with the translation direction. The activation function (GRU vs. LSTM) during validation also performed similarly across GPUs and was likewise highly correlated with the translation direction. Cross-entropy scores for the EN → RO translation direction were more or less the same. However, for RO → EN, an LSTM that executed on a P100 converged the earliest by one iteration.

Figure 22. BLEU scores as a function of training time (seconds) for RO→EN and EN→RO, comparing GPUs (color), activation units (sub-columns), learning rates and translation directions.

Figure 23. BLEU scores as a function of training time (seconds) for DE→EN and EN→DE, comparing GPUs (color), activation units (sub-columns), learning rates and translation directions.
Figure 24. Cross-entropy over the number of epochs for RO → EN and EN → RO, comparing activation functions and GPUs.

Figure 25. Cross-entropy over the number of epochs for DE → EN and EN → DE, comparing activation functions and GPUs.

Figure 26. Average words-per-second for the RO → EN translation task, comparing systems.

Figure 25 shows the same comparison of cross-entropy scores over epochs for the DE → EN and EN → DE translation tasks. Note that the behavior for this translation task was wildly different for all systems. Not only did it take more epochs to converge compared to Fig. 24, but how well each system progressed also varied, as evident in the cross-entropy scores during validation. When comparing hidden units, LSTMs outperformed GRUs in all cases. When comparing GPUs, the V100 performed better than the P100 in terms of cross-entropy, but took longer to converge in some cases (e.g., v100-deen-lstm, v100-ende-lstm). Also, note that the behavior of the EN → DE translation task for a GRU hidden unit never stabilized, as evident in both the high cross-entropy scores and the peaks toward the end. The LSTM was able to achieve a better cross-entropy score overall, with nearly an 8-point difference for DE → EN, compared with the GRU.

Convergence Speed. Figure 26 shows the average words-per-second for the RO → EN translation task, comparing systems. The average words-per-second

Table 20. Words-per-second (average) and number of epochs, comparing activation units, learning rates and GPUs.
words-per-sec validation words-per-sec validation cell learn-rt P100 V100 P100 V100 P100 V100 P100 V100 ro→en en→ro GRU 1e-3 33009.23 45762.54 18000 18000 29969.14 42746.15 15000 15000 5e-3 32965.23 24253.14 19000 8000 30223.89 23144.62 17000 10000 1e-4 32828.61 24341.96 44000 16000 29959.34 23277.51 25000 14000 LSTM 1e-3 29412.87 40534.06 15000 16000 27282.54 38131.13 14000 14000 5e-3 29536.65 40598.24 16000 16000 27245.42 37384.46 19000 21000 1e-4 29478.51 41441.37 40000 35000 27002.60 38118.79 25000 25000 de→en en→de GRU 1e-3 28279.53 38026.87 20000 28000 28367.91 39995.48 10000 10000 5e-3 28215.40 19819.59 25000 4000 n/a 39944.10 n/a 16000 1e-4 28367.54 33218.70 26000 32000 dnf 39993.89 dnf 36000 LSTM 1e-3 24995.64 33507.31 16000 17000 25245.67 35122.54 13000 17000 5e-3 25210.15 33740.92 14000 7000 25049.21 33649.20 9000 6000 1e-4 dnf 34529.58 dnf 31000 dnf dnf dnf dnf executed remained consistent across epochs. The system that was able to achieve the most words-per-second was v100-roen-gru-0.001, whereas the one that achieved the least words-per-second was the v100-roen-gru-0.005. Surprisingly, the best and worst performer was the v100-roen-gru, depending on its learning rate, with the sweet spot at 0.001. This confirms 0.001 as the best learn rate that can execute a decent number of words-per-second and achieve a fairly high accuracy, as evident in previous studies, across all systems. Table 20 also displays words-per-second and validation, comparing activation units, learning rates and GPUs. When fixing learning rate, the V100 was able to execute more words-per-second than the P100, and was able to converge at an earlier iteration. When comparing hidden units, GRUs were able to execute higher words per second on a GPU and converge at a reasonable rate (at 18000 iterations) for most learning rates, except for 5e-3. When looking at LSTMs, words-per-second executed on a V100 was similar, although at a higher learning rate it was able to converge at 42000 iterations, but at the cost of longer training time and slower convergence (35000 iterations). 82 Table 21. Total training time for four translation directions, comparing systems. ro→en en→ro de→en en→de cell learn-rt P100 V100 P100 V100 P100 V100 P100 V100 GRU 1e-3 8:48 6:21 7:47 5:26 18:47 19:40 9:26 6:41 5e-3 9:41 4:52 8:38 6:02 23:57 4:36 n/a 10:56 1e-4 21:58 9:43 12:33 8:59 23:50 21:09 dnf 23:58 LSTM 1e-3 8:10 6:34 7:49 5:36 16:33 13:39 13:50 12:24 5e-3 9:02 6:34 10:44 8:32 n/a 5:12 9:37 4:35 1e-4 22:29 14:05 13:46 9:45 n/a 23:57 dnf dnf Table 21 shows the corresponding total training time for the four translation directions, comparing GPUs, activation units, and learning rates. The dropout rate was set at 0.2, which was the best performer in most cases (Tab 19). Table 21 shows that the training time increased as the learning rates were decreased. In general, Romanian took a fraction of the time to complete training (usually under 10 hours), whereas German took 18-22 hours to complete training. Cost of Tuning a Hyper-Parameter. Table 22 displays the average time spent per epoch for the Romanian ↔ English translation task, and Table 23 displays the average time spent per epoch for the German ↔ English translation task, comparing learning rates, activation cells, and GPUs. The mean is displayed in each cell, with the standard deviation in parenthesis and the number of epochs executed in brackets. For both tasks, dropout was set to 0.2. 
Surprisingly, GRUs take longer on the V100 on average with larger learning rates (5e-3, 1e-4) vs the P100, whereas for LSTMs, the V100s clearly speeds up execution per epoch. Note also that the learning rate does not have a significant change in the average time spent per epoch, except for the case with GRUs executing on the V100 with large learning rates. The learning rate does have an effect on the number of epochs executed, as seen in brackets as the learning rate increases. Table 23 reports on the German ↔ English translation tasks. The same observation can be made for 83 Table 22. Average time spent per iteration for RO → EN and EN → RO translation directions, comparing systems, with standard deviation in parenthesis and epochs in brackets. ro→en en→ro cell learn-rt P100 V100 P100 V100 GRU 1e-3 1807.362941 1304.076471 1829.790714 1278.770714 (142.43) [17] (102.67) [17] (166.06) [14] (117.63) [14] 5e-3 1814.640556 2472.531429 1816.642500 2385.243333 (140.01) [18] (11.16) [7] (165.40) [16] (15.08) [9] 1e-4 1823.828837 2466.306429 1839.624583 2369.436923 (129.08) [43] (11.29) [14] (167.28) [24] (13.79) [23] LSTM 1e-3 2032.362857 1470.278 2010.199231 1438.945385 (155.58) [14] (108.79) [15] (146.74) [13] (107.76) [13] 5e-3 2018.048 1469.054 2014.716667 1474.787500 (148.21) [15] (110.05) [15] (144.41) [18] (100.57) [20] 1e-4 2026.976154 1445.585882 2037.517083 1443.758333 (147.46) [39] (106.30) [34] (140.28) [24] (99.68) [24] this task, where GRUs spend less time per epoch compared to LSTMs, and that the average time spent per epoch remains fixed as the learnignrate increases. Summarize Findings This work reveals the following, with respect to tuning hyper-parameters: – Dropout is neccessary to avoid overfitting. The recommended probability rate is 0.2 to 0.3. – LSTMs take longer than GRUs per epoch, but achieves better accuracy. – Although the average time spent per epoch remains fixed as learning rates increase, the total number of epochs executed per training run increases as the learning rates increase. – Tensor core GPUs, particularly the V100, provide more words that can be processed per second, compared to non-tensor core GPUs, such as the Pascal P100. 84 Table 23. Average time spent per iteration for DE → EN and EN → DE translation directions, comparing systems, with standard deviation in parenthesis and epochs in brackets. de→en en→de cell learn-rt P100 V100 P100 V100 GRU 1e-3 3430.330526 2555.738148 3432.534444 2535.11 (124.58) [19] (95.76) [27] (128.70) [9] (88.61) [9] 5e-3 3450.174167 4898.036667 n/a 2432.112000 (133.13) [24] (47.79) [3] (87.91) [15] 1e-4 3425.231600 4907.070667 n/a 2434.452000 (129.98) [25] (51.24) [15] (90.02) [35] LSTM 1e-3 3887.889333 2898.554375 3840.552500 2761.088125 (164.183) [15] (129.37) [16] (162.85) [12] (116.41) [16] 5e-3 3855.21 2852.335 3859.903750 2886.194 (162.27) [13] (121.95) [6] (167.48) [8] (122.26) [5] 1e-4 n/a 2814.689000 n/a n/a (118.66) [30] Discussion The variation in the results, in terms of language translation, hyper- parameters, words-per-second executed and BLEU scores, in addition to the hardware the training was executed on demonstrates the complexity in learning the grammatical structure between the two languages. In particular, the learning rate set for training, the hidden unit selected for the activation function, the optimization criterion and the amount of dropout applied to the hidden connections all have a drastic effect on overall accuracy and training time. 
Specifically, we found that a lower learning rate achieved the best performance in terms of convergence speed and BLEU score. Also, we found that the V100 was able to execute more words-per-second than the P100 in all cases. When looking at accuracy as a whole, LSTM hidden units outperformed GRUs in all cases. Lastly, the amount of dropout applied on a network in all cases prevented the model from overfitting and allowed it to achieve higher accuracy.

The multidimensionality of hyper-parameter optimization poses a challenge in selecting the architecture design for training NN models, as illustrated by the varying degrees of behavior across systems and their performance outcomes. This work investigated how the various design decisions affect training outcome and provides neural network designers with guidance on which parameters most affect performance, whether accuracy, words processed per second, or convergence behavior. Coupled with massive datasets of parallel text corpuses and commodity heterogeneous GPU architectures, the models trained were able to achieve WMT-grade accuracy with the proper selection of hyper-parameter tuning. We analyzed the performance of various hyper-parameters for training an NMT system, including the optimization strategy, the learning rate, the activation cell, and the GPU, across various systems for the WMT 2016 translation task in four translation directions. Results demonstrated that a proper learning rate and a minimal amount of dropout can prevent overfitting as well as achieve high training accuracy. Future work includes developing optimization methods to evaluate how to best select hyper-parameters. By statically analyzing the computational graph that represents a NN in terms of instruction operations executed and resource allocation constraints, one could derive execution performance for a given dataset without running experiments.

Conclusion

This chapter addressed questions that arise when training neural networks. Specifically, neural machine translation was evaluated during training for stability, convergence, speed, and cost. Questions that were addressed include how much a hyper-parameter update cost, as well as which hyper-parameters contributed to learning. The next chapter attempts to combine techniques from the previous chapters for identifying the precision requirements for classifying image applications.

CHAPTER VI

NUMERICAL REPRESENTATION

This chapter includes both previously published and unpublished co-authored material. The work includes a poster presentation that was accepted at the 5th Workshop on Naval Applications of Machine Learning Lim, Castro, Coti, Jalby, and Malony (2021), and work in progress involving Dr. Camille Coti, Dr. William Jalby, Dr. Allen Malony and Dr. Pablo Oliveira that started when I was a Chateaubriand Fellow at the University of Versailles. I am the primary contributor to this work in developing the algorithm, writing the new code, and writing the paper. Dr. Coti initially identified the need for this work and provided the application on which this work was performed. Dr. Jalby supervised me while I was interning in Versailles. Dr. Allen Malony assisted in editing the paper. Dr. Oliveira introduced the foundation of numerical representation.

Abstract

This paper investigates training deep neural networks with varying precision lengths while providing the user with guidance on how to best set the precision requirements of a floating point operation.
Our approach intercepts floating point operations at the LLVM intermediate representation layer and applies rounding at varying precision lengths. We demonstrate our approach with PyTorch C++ Vision models on the CalTech 101 dataset. Our results are presented in the following manner. We break down the precision requirements per iteration and overall for the training session and compare across various hyper-parameters, including learning rates, mini-batch sizes, and convolution filters. Our results demonstrate that mixed precision is stable in earlier parts of the training phase, whereas reduced precision 88 Figure 27. Comparison of floating point representations (image source TF32 (2019a)). becomes unstable near convergence. Our approach is novel in that it ties in the network architecture and hyper-parameters with variable length floating point precision and enables exploration of precision bounds of an operation. Motivation Due to the lengthy amount of time it takes to train machine learning models, increasing floating-point operations per clock cycle can be attained with reduced precision operations, which trades off accuracy with instruction throughput and low latency. Since machine learning involves repeated matrix-vector operations during the forward, backward and update passes, counting multiply-add-accumulate (MAC) operations provide a way to estimate performance of a training run for a given model. Figure 27 displays a comparison of floating point representations, including tensorfloat32, bfloat16, float and half types. This motivates the discussion to investigate whether more operations can be executed per clock cycle, while maintaining accuracy and correctness of a program using reduced precision. 89 Modern microprocessors, accelerators and embedded devices provide hardware units for executing reduced precision operations. For instance, NVIDIA Volta and Turing GPUs have hardware capability for executing mixed precision operations since CUDA 8 Mixed Precision Programming (2016), where Volta tensor cores have FP16/FP16 and FP16/FP32 modes, and Turing tensor cores have INT8/INT32, INT4/INT32 and INT1/INT32 execution modes Tensor Core Performance (2019). Intel Cascade Lake provides vectorized intrinsics, AVX512 VNNI, for accelerating convolutional neural networks, which performs 8- bit multiplies (VPMADDWD) and 32-bit accumulates (VPADDD) in one clock cycle using Port 0 and Port 5 simultaneously for a theoretical 4× increase in instruction throughput Intel Cascade Lake (2019). Floating-Point Numbers in Machine Learning Numerical portions of machine learning include inputs, model, gradients, activation units and weights De Sa, Feldman, Ré, and Olukotun (2017). Weights and activation units, which represent signals of the gradient when computing backpropagation with SGD, are typical candidates for quantization Intel Low Precision (2018), Mixed Precision Programming (2016), Micikevicius et al. (2017). Floating point representation includes fixed-point computation that truncates the floating-point into a fix sized format and custom quantization points that varies the length of the precision and range of a floating-point number. Mixed precision operations incur accuracy loss, depending on the method, and the solution can be improved with iterative refinement. 
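The idea of refining a low-precision solve with residuals computed in higher precision can be illustrated with a small numpy sketch; the toy system and variable names below are ours, and this is not the GMRES-based scheme cited next.

```python
# A toy illustration of mixed-precision iterative refinement: solve in float32,
# compute residuals in float64, and correct the solution iteratively.
import numpy as np

rng = np.random.default_rng(0)
n = 200
A64 = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned system
b64 = rng.standard_normal(n)

A32, b32 = A64.astype(np.float32), b64.astype(np.float32)
x = np.linalg.solve(A32, b32).astype(np.float64)     # low-precision solve

for _ in range(3):                                    # refinement sweeps
    r = b64 - A64 @ x                                 # high-precision residual
    x += np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)

print(np.linalg.norm(b64 - A64 @ x))                  # residual shrinks per sweep
```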
For instance, the authors demonstrated that the inner generalized minimal residual method (GMRES) loop, an iterative method for solving a system of non-linear equations, was computed in 32-bit and the outer GMRES loop was computed in 64-bit with 90 Table 24. IEEE-754 Numbers and exceptions. Outcome Description Zeroes +0 and −0 Sign determines behavior when dividing by nonzero (e.g. −∞ or +∞) Infinities +∞ or −∞ Div-by-zero, overflow√ NaNs Not a Number (+∞)− (+∞), 0/0, −1 Normal Normalized Most common nonzero representable reals Subnormal Denormalized Values very close to zero, issues regarding rounding errors Invalid operation NaN produced NaN conditions, as above Overflow Operation result Number too large in magnitude to be represented in data type Division by zero x±0 , x 6= 0 Produce ±∞ depending on sign of x and ±0 Underflow Result too small Generally harmless, but error bounds will differ from normal computations Inexact Real result can’t be Rounding (default), care needed represented for sound analysis minimal loss in accuracy Baboulin et al. (2009), and more recently with GPUs in 8-bit / 32-bit mixed precision modes Haidar, Wu, Tomov, and Dongarra (2017). Concerns relating to deep learning with limited numerical precision include overflow and underflow of values, and not-a-number (NaN) resulting from undefined operations, such as adding or subtracting infinite variables, or √ −1 Goodfellow et al. (2016) (Chapter 4). Table 24 displays floating point numbers and exceptions defined by the IEEE-754 standard. A numerical value can result in zeroes, infinities, NaN, normal, or subnormal. Representing numbers in fixed-point registers require handing exceptional cases. Exceptions result in undefined behavior, either due to the resulting procedure or invalid mathematical definitions, and includes overflow, division by zero, underflow, and inexact. 91 Exceptions Numbers Figure 28. Verificarlo workflow. Workflow This subsection covers the workflow of instrumenting deep learning applications with reduced precision. Our methodology of instrumenting floating point operations at reduced precision was implemented at the LLVM intermediate representation (IR) level. Verificarlo Modes. Verificarlo is a toolkit for assessing and reproducing floating point operations Denis, Castro, and Petit (2015). The reduced precision is emulated in the IEEE-754 format for 32-bit and 64-bit operations, where the exponent and mantissa sizes can be set at arbitrary lengths for each floating point operation or at the function level. The error bounds are where the precision is truncated and can also be set. Figure 28 illustrates the overall Verificarlo workflow. Verificarlo takes in C or C++ code, performs source-to- source translation, where the operations are replaced with its reduced floating point counterpart in the LLVM IR level, and produces a binary executable. The control flow is the same as the original program with the ability of assessing operations as they take place in the program. The code is instrumented and program results are available (both the original and the reduced FP precision) and allows further analysis to take place. 92 Table 25. Verificarlo backends and options. 
Backend Description Options IEEE No effect on output debug, debug-binary, print-new- line, print-subnormal-normalized, no-backend-name Monte Carlo (MCA) MCA on variable, mode {ieee, mca, pb †, rr ‡}, quad type on doubles, precision-binary32, precision- double type on floats binary64, error-mode {rel§, abs ‖, all}, max-abs-error-exponent, daz *, ftz #, seed MCA-MPFR MCA using GNU same as MCA MPFR library Bitmask First order model of mode {ieee, mca, pb, rr}, operator noise, sets t mantissa {zero, one, rand}, precision- bits to 0, 1, rand binary32, precision-binary64, daz, ftz, seed Cancellation Cancels out bits seed, tolerance, warning arbitrarly Virtual precision Sets length for both mode {ieee, mca, pb, rr}, precision- exponent (range) and binary32, precision-binary64, range- mantissa (precision) binary32, range-binary64, error- mode {rel, abs, all}, daz, ftz * denormals are zero, [†] precision bound, [‡] random round, [§] relative, [‖] absolute, [#] flush to zero Table 25 lists the various backend modes supported, which include Monte Carlo arithmetic (MCA), bitmask, cancellation, and virtual precision. The backend modes can be applied at the variable level for inputs, outputs or both, at the operand level, or both the variable and operand levels. MCA utilizes quadruple types (128 bits) for double variables and double types on floating-point variables. The bitmask backend sets t mantissa bits to 0, 1 or MCA randomization. The cancellation backend automatically detects a cancellation, and if detected, noise is applied on the cancelled part with MCA. Virtual precision, the study of this work, enables setting of mantissa and exponent at arbitrary lengths, and other options can be applied such as stochastic rounding, denormalization, and flush to zero. 93 Figure 29. Virtual precision in Verificarlo, showing r = 5 and p = 10, simulating a binary16 embedded inside a binary 32. Algorithm 2 Instrumenting functions with Verificarlo. 1: Data: X inputs, Y outputs, F : X → Y 2: Result: F ′ instrumented 3: procedure func inst(F,X, Y ) 4: for f i ∈ F do 5: Count Nfloat, Ndouble 6: Allocate Nfloat +Ndouble space in heap 7: for xi ∈ X do 8: Add xi itype, xsize, x i name, x i address for vfc enter 9: Create callback to vfc enter 10: Load xi rounded values 11: Call hooked function with xi 12: for yj ∈ Y do 13: Add yjtype, y j , yj jsize name, yaddress for vfc exit 14: Call vfc exit 15: Load yj rounded values 16: Return yj, if needed Figure 29 illustrates how the virtual precision mode is applied on a floating-point variable, with r = 5 and p = 10 (image source: Chatelain, Petit, de Oliveira Castro, Lartigue, and Defour (2019)). Primitive Types from Composite Types. To instrument derived types, a mechanism to infer primitive types, such as floats and doubles, from composite types was added in Verificarlo. Algorithm 2 describes the procedure for instrumenting functions with Verificarlo, whereas Algorithm 3 describes the steps to infer floats from composite types. The routine takes as input a struct type. 94 Algorithm 3 Infer floating-point types from derived types in Verificarlo. 1: Input: Type is a Struct type 2: Output: Number of floats and doubles for derived type T 3: procedure DERIVED TYPE(Type T) 4: S ← T . type cast to struct 5: for ∀s ∈ S do . s are members of S 6: T ← getType(s) 7: if Tptr then . T is a pointer 8: T ← getElemPointedTo(Tptr) . get callee until primitive type reached 9: if Tfl then float++ . 
float 10: else if Tdbl then double++ 11: else if Tarr or Tvec then 12: if Tdbl then double+= Narr 13: else if Tfl then float+= Narr 14: else if Tstruct then 15: DERIVED TYPE(Type T) 16: else if Tfl then float++ 17: else if Tdbl then double++ 18: else if Tarr or Tvec then 19: if Tdbl then double+= Narr 20: else if Tfl then float+= Narr 21: else if Tstruct then 22: DERIVED TYPE(Type T)

For each member of the struct, the type is checked. If it is a float, a double, or a pointer to such a type, the address is logged and the algorithm continues until all members have been examined. Algorithm 3 takes place during Algorithm 2, specifically in Lines 4 to 6, 7 to 8, and 12 to 13.

Rounding. When each float or double is intercepted in Verificarlo, stochastic rounding is applied, which returns a modified version of that value. For instance, D. Stott Parker defines stochastic rounding as rounding x to ⌊x⌋ with probability proportional to the proximity of x to ⌊x⌋ Parker (1997):

Round(x) = ⌊x⌋ with probability 1 − (x − ⌊x⌋)/E, or ⌊x⌋ + E with probability (x − ⌊x⌋)/E,  (6.1)

where E ∈ R represents a random error, uniformly distributed on (−1/2, 1/2).

Figure 30. Search for precision and range settings during training: an agent samples precision and range settings, trains under those settings, and feeds the measured accuracy and compute time back into the multi-objective search.

Multi-Objective Optimization. We pose the problem of training deep neural networks with reduced precision as a multi-objective optimization problem that seeks to satisfy accuracy and execution speed objectives. Our approach is novel in that it performs training in real time to measure the speed to convergence, as opposed to counting operations. Our approach also provides the ability for models to dynamically set precision and range sizes during a training run. The approach is formulated as follows. Given a precision p, let Facc(p) denote the achieved accuracy of a training run, let Fexec(p) denote the measured execution speed of model m on hardware h, and let T be the bounded execution time. The multi-objective optimization is formulated as

max_p Facc(p)  subject to  Fexec(p) ≤ T.  (6.2)

Given this formulation, we are interested in the Pareto optimal, defined in Sec. II as a point in the criterion space that satisfies multiple objectives. In this scenario, the Pareto optimal is a model that achieves high accuracy with minimal execution time, or a model that does not decrease its accuracy while maintaining minimal execution time. Figure 30 displays an overview of the proposed search framework, which consists of an agent that interacts with its environment in a feedback manner. The agent performs the search by sampling the precision and range sizes of a floating point operation, evaluating the performance of the model under that setting, and updating the model parameters accordingly. The environment includes the training process, accounting for compute resources, and measures the time spent per epoch. For a value function Vπ(S), with states S following a sequence of actions a = 1 : π, where the actions represent a model's accuracy under a specified precision setting, and θ represents the learned model, the goal is to maximize the expected criterion:

V_{a;θ}(s) = E[R],  (6.3)

where R is the reward function, as defined in Equation 6.2. This process of sampling, evaluating and updating is repeated until θ is reached within a desired number of steps.

Experimental Results

This section reports on instrumenting various neural network models in PyTorch C++ with Verificarlo.

Table 26.
Table 26. Neural networks for image classification evaluated in this study.

Properties    AlexNet    VGG        ResNet    SqueezeNet    ShuffleNet   MNASNet
Top-5         79.6       90.4       93.5      80.6          81.7         91.5
Input Size    227×227    224×224    224×224   224×224       224×224      224×224
Conv Layers   5          13         53        8(×3)+1       13(×3)       4+6(×3)
Filter Size   3, 5, 11   3          1, 3, 7   1, 3, 7, 13   1, 3         1, 3, 5
# Channels    3-256      3-512      3-2048    3-512         24-1024      32-320
# Filters     96-384     64-512     64-2048   16-96         12-512       32-1600
FC Layers     3          3          1         1             1            1
# Channels    256-4096   512-4096   2048      512           24-1024      1280
# Filters     1K-4096    1K-4096    1000      1             12-512       1280
Weights       61M        138M       25.5M     1.24M         7.39M        6.28M
MACs          724M       15.5G      3.9G      0.35G         0.60G        0.53G

Applications and Execution Environment. CalTech 101 is a standard image dataset that includes 101 image categories, ranging from helicopter and chair to camera. There are about 40 to 800 images per category, with most categories having about 50 images. The applications that were instrumented were PyTorch Vision image classification models, including AlexNet Krizhevsky (2014), MNASNet Tan et al. (2019), ResNet He, Zhang, Ren, and Sun (2016), ShuffleNet Ma, Zhang, Zheng, and Sun (2018), SqueezeNet Iandola et al. (2016) and VGG Simonyan and Zisserman (2014). Table 26, adapted from Sze et al. (2017), displays a summary of the neural network architectures evaluated in this study. Note the trend toward compressing models, as evident in the decreased number of weights and MACs in more recent models. We evaluated each of the models in Torchvision C++. Note that the AlexNet, ResNet and VGG models follow the standard convolutional architecture setup, but vary in the number of layers, filter sizes, and channels. SqueezeNet consists of 8 fire modules, each of which consists of 3 convolutional layers (1×1, 3×3, 7×7, 13×13). MNASNet consists of 6 inverted residuals, each of which consists of 3 convolutional layers (1×1, 3×3, 5×5). ShuffleNet includes 3 stages, each of which consists of 3 (1×1, 3×3) convolutional layers.

The experiments ran on an Intel Skylake system with an S2600WF motherboard, which consists of 24 Xeon Platinum 8180M CPUs @ 2.50 GHz and 2 Xeon Platinum 8176 CPUs @ 2.10 GHz. The memory size was 768 GB RAM, with EDR InfiniBand, an Intel P4500 1 TB NVMe SSD, and an Intel P4800X 375 GB NVMe Optane SSD. Table 27 lists the hardware specifications for the Xeon Platinum 8180M and 8176 CPUs.

Table 27. Intel Xeon Platinum hardware and execution environment information.

Model         8176 (×2)   8180M (×24)
Clock Speed   2.50 GHz    2.10 GHz
Cores         28          28
Threads       56          56
TDP           165W        205W
L2            28 MiB      28 MiB
L3            38.5 MiB    38.5 MiB
Max Mem       768 GiB     1536 GiB

Two sets of experiments were carried out. One set trained the neural networks for 10 epochs, and the second set trained the networks for 30 epochs. For all networks, the mini-batch size for the training set was set to 64 and the mini-batch size for the test set was set to 1000. The learning rate was set to 0.001. The data set was cross-validated with an 80/20 split of training and test data. The total training set size was 7281, and the total test set size was 1865.
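For concreteness, the configuration above corresponds to a training loop along the following lines in the PyTorch C++ (libtorch) API. This is a schematic sketch, not the exact harness used in this study: the dataset and model are passed in as placeholders, and the model is assumed to be a torch::nn module holder exposing parameters() and forward().

#include <torch/torch.h>

// Schematic libtorch training loop matching the reported settings:
// mini-batch size 64, learning rate 0.001, cross-entropy loss.
template <typename Dataset, typename Model>
void train(Dataset train_set, Model model, int epochs) {
  auto loader = torch::data::make_data_loader(
      std::move(train_set).map(torch::data::transforms::Stack<>()),
      torch::data::DataLoaderOptions().batch_size(64));
  torch::optim::SGD optimizer(model->parameters(),
                              torch::optim::SGDOptions(0.001));
  for (int epoch = 0; epoch < epochs; ++epoch) {
    for (auto &batch : *loader) {
      optimizer.zero_grad();
      auto output = model->forward(batch.data);          // class logits
      auto loss = torch::nn::functional::cross_entropy(output, batch.target);
      loss.backward();
      optimizer.step();
    }
  }
}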
Results. This subsection discusses the results from this work. First, we analyze the learning trajectories with varying precision lengths. Figure 31 plots the trajectory of accuracy at each epoch, comparing vision models and the precision sizes. The top row represents inbound, or inputs, and the bottom row represents outbound.

Figure 31. Accuracy per epoch, comparing vision models. (Panels for prec64 = 1, 26, and 51; accuracy versus epoch for AlexNet, MNASNet, ResNet18, ResNet34, ShuffleNet, SqueezeNet, and VGG; inbound and outbound modes.)

In general, lowering the precision to 1 bit, compared to the full precision run, affected accuracy. What is notable is that lowering the precision to the half type had the same effect as running in full precision mode. This could be related to the recent push toward training neural networks in half precision, which demonstrates the stability of neural networks at lower precision. Table 28 breaks down statistics for the first ten epochs, comparing precision sizes, models, and the resulting accuracy and loss. The time reported is in seconds. Note that in some cases the total time executed in the lower-bit precision is faster than the full precision version, such as for AlexNet and ResNet34. Note also that in most cases the half precision types match the accuracy and loss of the full precision types with lower execution times.

Figure 32 plots the change in difference in values, comparing the 1-bit and 26-bit mantissa with the full precision as the baseline for AlexNet, MNASNet and VGG. The same observation can be made for the half type, which maintains near-zero change in difference in the values, whereas in the 1-bit case the change varies drastically.

Table 28. Statistics for first ten epochs of training, comparing precision sizes and models (time in seconds).

                         Inbound                       Outbound
             Precision   Time    Loss     Accuracy     Time    Loss     Accuracy
AlexNet      1           801     0.2500   0.5000       762     0.2500   0.500
             26          744     1.9195   0.5100       797     1.9195   0.5100
             51          793     1.9195   0.5100       796     1.9195   0.5100
MNASNet      1           1650    0.1250   0.7500       1605    0.2500   0.7500
             26          1745    0.7954   0.7613       1602    0.7954   0.7613
             51          1744    0.7954   0.7613       1608    0.7954   0.7613
ResNet18     1           1119    0.1250   0.7500       1020    0.1250   0.7500
             26          990     0.6752   0.8193       1014    0.6752   0.8193
             51          1017    0.6752   0.8193       1049    0.6752   0.8193
ResNet34     1           1536    0.1250   0.7500       1405    0.1875   0.7500
             26          1457    0.9195   0.7363       1534    0.9195   0.7363
             51          1452    0.9195   0.7363       1415    0.9195   0.7363
ShuffleNet   1           1258    0.0625   0.7500       1343    0.1250   0.7500
             26          1278    0.4439   0.8662       1210    0.4439   0.8662
             51          1237    0.4439   0.8662       1309    0.4439   0.8662
SqueezeNet   1           896     0.5000   0.2500       823     0.5000   0.3750
             26          793     2.9584   0.3365       809     2.9584   0.3365
             51          828     2.9540   0.3365       798     2.9584   0.3365
VGG          1           3536    0.2500   0.5000       3667    0.2500   0.5000
             26          3424    1.4977   0.6161       3461    1.4977   0.6161
             51          3588    1.4977   0.6161       3566    1.4977   0.6161

Figure 32. Rounding errors for various mantissa sizes, comparing vision models. (Panels for AlexNet, MNASNet, and VGG, inbound (ib) and outbound (ob); c_difference versus instruction number for mantissa sizes 1 and 26.)

Table 29. Multi-objective results when accounting for accuracy and average time spent per epoch.

               Accuracy   Convergence
AlexNet-1      51.0       801
AlexNet-half   51.0       744
ResNet18-1     75.0       1119
ResNet18-half  81.9       990
ResNet34-1     75.0       1536
ResNet34-half  73.6       1457

Figure 33, top, displays accuracy as a function of execution time. Observe that in some networks, such as SqueezeNet, ResNet18 and ResNet34, the accuracy of 1 bit is actually on par with the full precision version.
Figure 33, bottom, displays loss as a function of time. Loss is defined as

\[
-\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}),
\]

where M is the number of classes, y is the binary indicator of whether class label c is the correct classification for observation o, and p is the predicted probability that observation o is of class c. In general, the loss for the 1-bit precision is lower than that of the 26-bit and the full-bit precision.

Figure 33. Accuracy over time (top) and loss over time (bottom), comparing vision models (inbound and outbound panels).

Table 29 displays a summarized version of the multi-objective optimization when accounting for accuracy and average time spent per epoch. As seen, the precision settings vary depending on the model and size.

Discussion

The techniques evaluated image classification on a variety of neural network models on the CalTech 101 data set. It would be interesting to make a wider comparison with other data sets, such as ImageNet, as well as with neural machine translation and other scientific applications. The study was contained to image classification with PyTorch C++ models. A limitation of this study is that the precision settings were made globally for the whole program. This study did not evaluate mixed precision at the variable level. We leave that as future work. Another limitation of this work is that we did not exhaustively explore the precision settings between half and 1 bit. The purpose of this work was to present a proof of concept with the tool capabilities that were added to Verificarlo and the types of analysis that can be performed with reduced precision during machine learning.

Prior Work

Techniques to reduce numerical precision have been used for compression, custom quantization points, computing convolutional operations in the logarithmic domain, and stochastic rounding. Deep compression Han, Mao, and Dally (2015) proposes a three-stage compression pipeline that prunes redundant weight connections, quantizes multiple connections to share the same weight, and applies Huffman coding in the fully connected layers to the biased effective weights. For the AlexNet neural network, the 256 shared weights were quantized to 8 bits for each convolution layer, and the 32 shared weights were quantized to 5 bits in the fully connected layers without loss in accuracy. They observed that for the last fully-connected layer, most quantized weights were distributed around two peaks. Thus, Huffman coding was used to non-uniformly distribute values, which saved 20-30% in network storage space.

Alistarh, Grubic, Li, Tomioka, and Vojnovic (2017) propose an approach for quantizing gradients in distributed training with SGD, particularly for neural networks. The approach partitions the problem among the available processors, where each processor broadcasts its unquantized gradients. Then, each processor aggregates the gradients, performs local training with quantized gradients and, with a uniformly random integer, broadcasts the quantized update vector.
Their approach compared various neural network models on 1-16 K80 GPUs and found that parallelization decreased epoch time but led to more communication. Their results also compared 4-bit and 8-bit quantization and found that 8-bit quantization was able to maintain the accuracy of the gradients compared to the full precision version, whereas 4 bits loses 0.57% in Top-5 accuracy and 0.68% in Top-1 accuracy.

Miyashita, Lee, and Murmann (2016) encode the weights and activations in a base-2 logarithmic representation, since weights and activations have a non-uniform distribution. Their approach proposes computing the convolution operation in the logarithmic domain, where either the individual operands or the operation is converted to the log domain and quantized. Their experiments compared quantization in the linear and log domains. They trained CIFAR-10 with 5-bit weights and 4-bit activations, resulting in minimal performance degradation. They also noted that for 3 bits, the log domain tolerated a larger dynamic full-scale range, where AlexNet performed 0.2% worse in the log domain compared to the linear domain, but VGG, a higher-capacity network, performed 6.2% better in the log domain and maintained its Top-5 accuracy.

S. Gupta, Agrawal, Gopalakrishnan, and Narayanan (2015) evaluated a 16-bit fixed-point representation with round-to-nearest and stochastic rounding modes for training neural networks. The weights W^l and biases B^l were quantized to 16 bits and compared under round-to-nearest and stochastic rounding. Aggressively reduced precision may result in loss of gradient information if updates are significantly smaller than ε, the smallest representable increment. In round-to-nearest, any parameter update in the range (−ε/2, ε/2) is always rounded to zero, whereas stochastic rounding maintains a non-zero probability of rounding small parameter updates to ±ε. Experiments compared the MNIST and CIFAR10 datasets and found that, in general, stochastic rounding maintained accuracy compared to the round-to-nearest mode.

Automatic mixed precision (AMP) in PyTorch consists of autocast and GradScaler as modules for executing in low precision PyTorch Mixed Precision (2019). autocast automatically typecasts certain operations, such as convolution and matrix multiplication, to half precision, whereas other operations, such as softmax and point-wise operations, are executed in float32, based on their predefined operation eligibility. The model is converted to float16 where possible, and a copy of the master weights is kept in float32 to accumulate per-iteration weight updates. Loss scaling is applied, using the master weights, to preserve small gradient values. TensorFlow mixed precision, provided by the Keras high-level API TensorFlow Mixed Precision (2021), takes a similar approach toward quantizing variables. However, these approaches perform mixed precision automatically with predefined rules and do not provide a mechanism for the user to specify the precision requirements.
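To make the master-weight and loss-scaling mechanism described above concrete, the following libtorch fragment is a rough sketch of one update step. It is an illustration of the idea only, not PyTorch's AMP implementation: the toy linear model, the fixed scale factor, and the function name scaled_step are ours, and a production implementation would adjust the scale dynamically and skip steps on overflow.

#include <torch/torch.h>

// One illustrative training step with float32 master weights, a float16
// working copy, and static loss scaling.
void scaled_step(torch::Tensor &master_w,      // float32 master weights
                 const torch::Tensor &x,       // float16 inputs
                 const torch::Tensor &target,  // float32 targets
                 double lr, double scale) {
  // Half-precision working copy that participates in autograd.
  auto w16 = master_w.detach().to(torch::kHalf).requires_grad_(true);

  // Forward pass and loss on the reduced-precision copy (toy linear model).
  auto pred = torch::matmul(x, w16);
  auto loss = torch::mse_loss(pred.to(torch::kFloat), target);

  // Scale the loss so that small gradients survive the float16 backward pass.
  (loss * scale).backward();

  // Unscale the gradient in float32 and update the master weights.
  auto grad32 = w16.grad().to(torch::kFloat) / scale;
  torch::NoGradGuard no_grad;
  master_w -= lr * grad32;
}

The sketch shows why a float32 master copy is kept: the update lr·grad may be too small to be representable in float16, even though the forward and backward passes run in reduced precision.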
Conclusion

This chapter discusses the precision requirements when training neural networks. Specifically, the chapter seeks to understand how stable the numerical representation is when changing the precision and range sizes. We implemented our work at the LLVM intermediate representation layer and evaluated it on various PyTorch C++ image applications. We demonstrate the tool's capabilities and were able to identify where in the training phase the precision is stable, and where it becomes unstable.

The follow-up work that builds upon this work can lead in several directions. For instance, mixed precision currently remains fixed throughout the program run. What would be interesting is whether the precision can change throughout the duration of the training run. One approach would be to utilize just-in-time (JIT) compilation of various precision sizes during program execution. This would enable a more dynamic treatment of numerical representation and is not limited to machine learning. Another area of exploration is error analysis of accuracy in fault-tolerant settings. For instance, the resiliency of classification models becomes important, especially as they are being incorporated into our daily lives. The security of the models and the elimination of false positives become even more urgent.

CHAPTER VII
SUMMARY AND FUTURE DIRECTIONS

The findings of this dissertation are summarized in this chapter. In addition, directions for future work are also discussed.

Summary

In order to optimize for performance and accuracy, a clear understanding of the optimization landscape is needed. This dissertation work outlines where the potential opportunities are for optimizing performance while maintaining the learning trajectory curves. Specifically, we evaluated automatic performance tuning for GPUs and developed search heuristics that worked in the static analysis case, which improved our search space by 92%. Next, we evaluated subgraph matching for representing performance profiles of GPU execution. In that study, we were able to define an architecture-independent way of matching control flow graphs, and demonstrated that capability with various CUDA programs. Next, we evaluated machine translation and the hyper-parameters that are entailed in tuning a translation system. In that work, we noted that certain hyper-parameters took longer to tune than others. Finally, we investigated reduced precision for increasing execution performance for image classification.

Future Work

This section discusses several research directions that can be pursued for the various topics that were covered in this dissertation.

Optimizing Code Generation. In optimizing CUDA code generation, this work Lim et al. (2017) optimized performance and accounted for threads, blocks, and shared memory. The work did not account for memory behavior on GPUs, specifically where communication-versus-computation optimization opportunities may lie. A more in-depth analysis of loop transformations, such as tiling, fusion, and mixed precision, could be pursued. Also, pattern matching could be directly employed during Orio automatic performance tuning.

GPU Subgraph Matching. This work Lim et al. (2019) defined the necessary decision support boundaries that characterize the runtime behavior of a GPU application in the form of a control flow graph, which aids in matching with other unseen GPU kernels. Some areas that need further work include formally providing provable guarantees that pattern matching will always lead the solver to an optimum. In addition, subgraph matching could also be utilized in optimizing hyper-parameters.

Hyper-Parameter Optimization. This work Lim et al. (2018) explored optimizing hyper-parameters for neural machine translation. This work performed grid search when setting hyper-parameters. An area of exploration would be to incorporate a cost metric associated with search methods.
Since it is known that machine learning training can take hours to weeks to complete, are there methods that can formally model the whole hyper-parameter training end-to-end, with adaptive model tuning and checkpointing in between, without training the network? Also, an area worth investigating is the cost of a hyper-parameter update in relation to hardware counter metrics.

Numerical Representation. Since the recent work on numerical representation Lim et al. (2021) revealed that precision may matter more in certain phases of the training run than in others, this warrants more investigative work on whether dynamic mixed precision could be employed during machine learning training. One way of doing this would be to JIT-compile code for several precision sizes and run the appropriate one at each phase. This would provide a more dynamic approach toward numerical representation, versus the approach of fixed-size precision throughout the training run. Another area worth exploring is evaluating the resiliency of applications with mixed precision via fault injection methods. For example, could machine learning models withstand noise, and if so, how much noise and at what phases, and how much noise before the overall application is affected?

APPENDIX A
THE FIRST APPENDIX

Stochastic Gradient Descent

A one-dimensional gradient descent update takes the form

\[
W(t+1) = W(t) - \eta\,\frac{dE(W)}{dW}.
\tag{A.1}
\]

The optimal learning rate η_opt, or the one that gives the fastest convergence, can be derived from a Taylor series expansion of E about the current weight W_c:

\[
E(W) = E(W_c) + \frac{dE(W_c)}{dW}(W - W_c) + \frac{1}{2}\,\frac{d^2E(W_c)}{dW^2}(W - W_c)^2,
\tag{A.2}
\]

with dE(W_c)/dW ≡ dE/dW evaluated at W = W_c. Differentiating both sides with respect to W gives

\[
\frac{dE(W)}{dW} = \frac{dE(W_c)}{dW} + \frac{d^2E(W_c)}{dW^2}(W - W_c).
\tag{A.3}
\]

Setting W = W_min and noting that dE(W_min)/dW = 0, rearranging yields

\[
W_{\min} = W_c - \left(\frac{d^2E(W_c)}{dW^2}\right)^{-1}\frac{dE(W_c)}{dW}.
\tag{A.4}
\]

Comparing with the W(t+1) update rule, the minimum can be reached in one step if

\[
\eta_{\mathrm{opt}} = \left(\frac{d^2E(W_c)}{dW^2}\right)^{-1}.
\]

Figure A.1 plots the gradient of E as a function of W. When E is quadratic, the gradient is simply a straight line with value zero at the minimum and ∂E(W_c)/∂W at the current weight W_c. The slope of this line is ∂²E/∂W², which can be solved for as

\[
\frac{\partial^2 E}{\partial W^2} = \frac{\partial E(W_c)/\partial W - 0}{W_c - W_{\min}}.
\]

Figure A.1. Gradient descent for different learning rates.

APPENDIX B
THE SECOND APPENDIX

Bilinear Interpolation

Let A represent the canonical adjacency matrix for G1 and B for G2, with A = N × N, B = M × M, and M ≥ N. In other words, we want to scale up A from N to M, which requires constructing interpolated points for every B_ij. To find the coordinates x and y at which to interpolate for a given B_(i,j)-th element, we calculate

\[
x = i \times \frac{N-1}{M-1}, \qquad y = j \times \frac{N-1}{M-1},
\]

where the x + 1 and y + 1 components are given by the floor and ceiling as appropriate, yielding four components: {x1, y1, x2, y2}. Note that upon calculating the components, A_{x1,y1}, A_{x1,y2}, A_{x2,y1}, A_{x2,y2} are known points, referenced by the original matrix A. The solution to the interpolation problem is f(x, y) ≈ ω0 + ω1 x + ω2 y + ω3 x·y.
Solving the linear system gives thecoefficients: 1 x1 y1 x1y1 ω0 A{x1,y1}1 x1 y2 x1y2       ω   1 A{x1,y2}=  , 1 x2 y1 x2y1ω2 A{x2,y1} 1 x2 y2 x2y2 ω3 A{x2,y2} 114 yielding the result: A{x1,y1} · x2 · y2 A{x1,y2} · x2 · y1ω0 = + (x1 − x2)(y1 − y2) (x1 − x2)(y2 − y1) A{x2,y1} · x1 · y2 A{x2,y2} · x1 · y1+ + , (x1 − x2)(y2 − y1) (x1 − x2)(y1 − y2) A{x1,y1} · y2 A{x1,y2} · y1ω1 = + (x1 − x2)(y2 − y1) (x1 − x2)(y1 − y2) A{x2,y1} · y2 A{x2,y2} · y1+ + , (x1 − x2)(y1 − y2) (x1 − x2)(y2 − y1) A{x ,y } · x2 A{x ,y } · x2 ω2 = 1 1 + 1 2 (x1 − x2)(y2 − y1) (x1 − x2)(y1 − y2) A{x2,y1} · y2 A{x2,y2} · x1+ + , (x1 − x2)(y1 − y2) (x1 − x2)(y2 − y1) A{x1,y1} A{xω = + 1 ,y2} 3 (x1 − x2)(y1 − y2) (x1 − x2)(y2 − y1) A{x + 2 ,y1} A{x2,y+ 2 } . (x1 − x2)(y2 − y1) (x1 − x2)(y1 − y2) This step is carried out for every {i, j}th element of B, where 0 ≤ i, j < M . Efficiency and Goodness Efficiency describes how gainfully employed the GPU floating-point units remained, or FLOPs per second: opfp+opint + opsimd + opconv efficiency = · callsn (B.1) timeexec The goodness metric describes the intensity of the floating-point and memory operation arithmetic intensity: ∑ goodness = opj · callsn (B.2) j∈J 115 REFERENCES CITED [1] Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey, J., & Tallent, N. R. (2010). Hpctoolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience, 22 (6), 685–701. [2] AI and Compute. (2018). (https://openai.com/blog/ai-and-compute/) [3] Alistarh, D., Grubic, D., Li, J., Tomioka, R., & Vojnovic, M. (2017). QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in neural information processing systems (nips) (pp. 1709–1720). [4] Allinea DDT. (2016). (http://www.allinea.com/products/ddt) [5] Ammons, G., Ball, T., & Larus, J. R. (1997). Exploiting hardware performance counters with flow and context sensitive profiling. ACM Sigplan Notices , 32 (5), 85–96. [6] Baboulin, M., Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Langou, J., . . . Tomov, S. (2009). Accelerating scientific computations with mixed precision algorithms. Computer Physics Communications , 180 (12), 2526–2533. [7] Bahar, P., Alkhouli, T., Peter, J.-T., Brix, C. J.-S., & Ney, H. (2017). Empirical investigation of optimization algorithms in neural machine translation. The Prague Bulletin of Mathematical Linguistics , 108 (1), 13–25. [8] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 . [9] Ball, T., & Larus, J. R. (1994). Optimally profiling and tracing programs. ACM Transactions on Programming Languages and Systems (TOPLAS), 16 (4), 1319–1360. [10] Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in neural information processing systems (nips) (pp. 2546–2554). [11] Böhm, C., & Jacopini, G. (1966). Flow diagrams, turing machines and languages with only two formation rules. Communications of the ACM , 9 (5), 366–371. 116 [12] Borgelt, C., & Berthold, M. R. (2002). Mining molecular fragments: Finding relevant substructures of molecules. In Data mining, 2002. icdm 2003. proceedings. 2002 ieee international conference on (pp. 51–58). [13] Britz, D., Goldie, A., Luong, T., & Le, Q. (2017). Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906 . 
[14] Chaimov, N., Norris, B., & Malony, A. (2014, Dec). Toward multi-target autotuning for accelerators. In Parallel and distributed systems (icpads), 2014 20th ieee international conference on (p. 534-541). doi: 10.1109/PADSW.2014.7097851 [15] Chatelain, Y., Petit, E., de Oliveira Castro, P., Lartigue, G., & Defour, D. (2019). Automatic exploration of reduced floating-point representations in iterative methods. In European conference on parallel processing (pp. 481–494). [16] Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J. W., Lee, S.-H., & Skadron, K. (2009). Rodinia: A benchmark suite for heterogeneous computing. In Workload characterization, 2009. iiswc 2009. ieee international symposium on (pp. 44–54). [17] CHiLL: A Framework for Composing High-Level Loop Transformations (Tech. Rep.). (2008, June). USC Department of Computer Science. [18] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 . [19] Chong, E. K. P., & Zak, S. H. (2013). An Introduction to Optimization. Wiley. [20] Cloud TPU. (2019). (https://cloud.google.com/tpu/docs/tpus) [21] Collective Knowledge (CK). (n.d.). (http://cknowledge.org) [22] Csardi, G., & Nepusz, T. (n.d.). The iGraph software package for complex network research. [23] CUDA occupancy calculator. (2016). http://developer.download.nvidia.com/compute/cuda/ CUDA Occupancy calculator.xls. [24] Danalis, A., Marin, G., McCurdy, C., Meredith, J. S., Roth, P. C., Spafford, K., . . . Vetter, J. S. (2010). The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd workshop on general-purpose computation on graphics processing units (pp. 63–74). 117 [25] Davies, M., Srinivasa, N., Lin, T.-H., Chinya, G., Cao, Y., Choday, S. H., . . . others (2018). Loihi: A neuromorphic manycore processor with on-chip learning. Ieee Micro, 38 (1), 82–99. [26] Denis, C., Castro, P. D. O., & Petit, E. (2015). Verificarlo: checking floating point accuracy through monte carlo arithmetic. arXiv preprint arXiv:1509.01347 . [27] de Oliveira Castro, P., Akel, C., Petit, E., Popov, M., & Jalby, W. (2015). CERE: LLVM Based Codelet Extractor and REplayer for Piecewise Benchmarking and Optimization. ACM Transactions on Architecture and Code Optimization (TACO), 12 (1), 6. doi: 10.1145/2724717 [28] De Sa, C., Feldman, M., Ré, C., & Olukotun, K. (2017). Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th annual international symposium on computer architecture (pp. 561–574). [29] Diamos, G., Ashbaugh, B., Maiyuran, S., Kerr, A., Wu, H., & Yalamanchili, S. (2011). SIMD re-convergence at thread frontiers. In Proceedings of the 44th annual ieee/acm international symposium on microarchitecture (pp. 477–488). [30] Farooqui, N., Kerr, A., Eisenhauer, G., Schwan, K., & Yalamanchili, S. (2012). Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures. In Performance analysis of systems and software (ispass), 2012 ieee international symposium on (pp. 58–67). [31] Fursin, G. e. (2011). Milepost gcc: Machine learning enabled self-tuning compiler. International journal of parallel programming , 39 (3), 296–327. [32] Gonzales, R. C., & Woods, R. E. (1993). Digital Image Processing. Addison-Wesley. [33] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. 
(http://www.deeplearningbook.org) [34] Gupta, R., Laguna, I., Ahn, D., Gamblin, T., Bagchi, S., & Lin, F. (2015). STATuner: Efficient Tuning of CUDA Kernels Parameters. In Supercomputing conference (sc 2015) (p. poster). [35] Gupta, S., Agrawal, A., Gopalakrishnan, K., & Narayanan, P. (2015). Deep learning with limited numerical precision. In International conference on machine learning (pp. 1737–1746). 118 [36] Haidar, A., Wu, P., Tomov, S., & Dongarra, J. (2017). Investigating half precision arithmetic to accelerate dense linear system solvers. In Proceedings of the 8th workshop on latest advances in scalable algorithms for large-scale systems (p. 10). [37] Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 . [38] Hartono, A., Norris, B., & Sadayappan, P. (2009). Annotation-based empirical performance tuning using orio. In 2009 ieee international symposium on parallel & distributed processing (pp. 1–11). [39] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 770–778). [40] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9 (8), 1735–1780. [41] Huan, J., Wang, W., & Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphism. In Data mining, 2003. icdm 2003. third ieee international conference on (pp. 549–552). [42] Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). Squeezenet: Alexnet-level accuracy with 50x fewer parameters and¡ 0.5 mb model size. arXiv preprint arXiv:1602.07360 . [43] Junczys-Dowmunt, M., & Grundkiewicz, R. (2016). Log-linear combinations of monolingual and bilingual neural machine translation models for automatic post-editing. In Acl (pp. 751–758). Berlin, Germany: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W16-2378 [44] Junczys-Dowmunt, M., Grundkiewicz, R., Grundkiewicz, T., Hoang, H., Heafield, K., Neckermann, T., . . . others (2018). Marian: Fast Neural Machine Translation in C++. arXiv preprint arXiv:1804.00344 . [45] Koehn, P. (2017). Neural machine translation. arXiv preprint arXiv:1709.07809 . [46] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., . . . others (2007). Moses: Open source toolkit for statistical machine translation. In Acl (pp. 177–180). [47] Koutra, D., Vogelstein, J. T., & Faloutsos, C. (n.d.). DeltaCon: A principled massive-graph similarity function.. 119 [48] Krizhevsky, A. (2014). One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997 . [49] Lim, R., Carrillo-Cisneros, D., Alkowaileet, W., & Scherson, I. (2014). Computationally efficient multiplexing of events on hardware counters. In Linux symposium. [50] Lim, R., Castro, P. d. O., Coti, C., Jalby, W., & Malony, A. (2021). Reduced Precision Computation for Accurate and Robust Learning Systems. In 5th workshop on naval applications of machine learning (p. poster). [51] Lim, R., Heafield, K., Hoang, H., Briers, M., & Malony, A. (2018). Exploring hyper-parameter optimization for neural machine translation on gpu architectures. arXiv preprint arXiv:1805.02094 . [52] Lim, R., Malony, A., Norris, B., & Chaimov, N. (2015). Identifying optimization opportunities within kernel execution in gpu codes. 
In European conference on parallel processing (pp. 185–196). [53] Lim, R., Norris, B., & Malony, A. (2016). Tuning heterogeneous computing architectures through integrated performance tools. In Gpu technology conference (p. poster). [54] Lim, R., Norris, B., & Malony, A. (2017). Autotuning gpu kernels via static and predictive analysis. In 2017 46th international conference on parallel processing (icpp) (pp. 523–532). [55] Lim, R., Norris, B., & Malony, A. (2019). A similarity measure for gpu kernel subgraph matching. In Languages and compilers for parallel computing (lcpc) (pp. 37–53). Cham: Springer International Publishing. [56] Lower Numerical Precision Deep Learning Inference and Training . (2018). (https://software.intel.com/content/www/us/en/develop/ articles/lower-numerical-precision-deep-learning -inference-and-training.html) [57] Ma, N., Zhang, X., Zheng, H.-T., & Sun, J. (2018). Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the european conference on computer vision (eccv) (pp. 116–131). [58] Malony, A. D., Biersdorff, S., Shende, S., Jagode, H., Tomov, S., Juckeland, G., . . . Lamb, C. (2011). Parallel performance measurement of heterogeneous parallel systems with GPUs. In 2011 international conference on parallel processing (pp. 176–185). 120 [59] Mametjanov, A., Lowell, D., C.C. Ma, C.-C., & Norris, B. (2012). Autotuning stencil-based computations on GPUs. In Cluster computing (cluster), 2012 ieee international conference on (pp. 266–274). [60] Marin, G., Dongarra, J., & Terpstra, D. (2014). Miami: A framework for application performance diagnosis. In Performance analysis of systems and software (ispass), 2014 ieee international symposium on (pp. 158–168). [61] Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., . . . others (2017). Mixed precision training. arXiv preprint arXiv:1710.03740 . [62] Miettinen, K. (2012). Nonlinear multiobjective optimization (Vol. 12). Springer Science & Business Media. [63] Miller, B. P., Callaghan, M. D., Cargille, J. M., Hollingsworth, J. K., Irvin, R. B., Karavanic, K. L., . . . Newhall, T. (1995). The paradyn parallel performance measurement tool. Computer , 28 (11), 37–46. [64] Mixed Precision Programming. (2016). (https://developer.nvidia.com/blog/ mixed-precision-programming-cuda-8/) [65] Miyashita, D., Lee, E. H., & Murmann, B. (2016). Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025 . [66] Modha, D. S. (2017). Introducing a brain-inspired computer. (https://www.research.ibm.com/articles/ brain-chip.shtml) [67] Monsifrot, A., Bodin, F., & Quiniou, R. (2002). A machine learning approach to automatic production of compiler heuristics. In Artificial intelligence: Methodology, systems, and applications (pp. 41–50). Springer. [68] Neural Processor. (2020). (https://en.wikichip.org/wiki/neural processor) [69] Nukada, A., & Matsuoka, S. (2015). Auto-tuning 3-D FFT library for CUDA GPUs. In Supercomputing conference (sc 2009). [70] NVIDIA. (n.d.). CUDA Toolkit. https://developer.nvidia.com/cuda-toolkit. [71] NVIDIA GeForce GTX 680 Whitepaper (Tech. Rep.). (2012). (http://bit.ly/2jzzX13) 121 [72] Nvidia Visual Profiler. (2016). (https://developer.nvidia.com/nvidia-visual-profiler) [73] Park, J., Naumov, M., Basu, P., Deng, S., Kalaiah, A., Khudia, D., . . . others (2018). Deep learning inference in facebook data centers: Characterization, performance optimizations and hardware implications. arXiv preprint arXiv:1811.09886 . 
[74] Parker, D. S. (1997). Monte carlo arithmetic: exploiting randomness in floating-point arithmetic. [75] PyTorch. (2019). Automatic Mixed Precision Package. (https://pytorch.org/docs/master/amp.html) [76] Sabne, A., Sakdhnagool, P., & Eigenmann, R. (2016). Formalizing structured control flow graphs. In Languages and compilers for parallel computing (lcpc) (Vol. 10136). Lecture Notes in Computer Science. [77] Sarkar, V. (1989). Determining average program execution times and their variance. In Acm sigplan notices (Vol. 24, pp. 298–312). [78] Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 . [79] Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & De Freitas, N. (2016). Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE , 104 (1), 148–175. [80] Shende, S. S., & Malony, A. D. (2006). The TAU parallel performance system. International Journal of High Performance Computing Applications , 20 (2), 287–311. [81] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 . [82] Singh, R., Xu, J., & Berger, B. (2007). Pairwise global alignment of protein interaction networks by matching neighborhood topology. In Research in computational molecular biology (pp. 16–31). [83] Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems (pp. 2951–2959). [84] Sreepathi, S., Grodowitz, M., Lim, R., Taffet, P., Roth, P. C., Meredith, J., . . . Vetter, J. (2014). Application characterization using Oxbow toolkit and PADS infrastructure. In Proceedings of the 1st international workshop on hardware-software co-design for high performance computing (pp. 55–63). 122 [85] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15 (1), 1929–1958. [86] Stephenson, M., & Amarasinghe, S. (2005). Predicting unroll factors using supervised classification. In Code generation and optimization, 2005. cgo 2005. international symposium on (pp. 123–134). [87] Stevens, R., Taylor, V., Nichols, J., Maccabe, A. B., Yelick, K., & Brown, D. (2020). AI for Science (Tech. Rep.). Argonne National Lab.(ANL), Argonne, IL (United States). (https://anl.app.box.com/s/ f7m53y8beml6hs270h4yzh9l6cnmukph) [88] Stochastic Gradient Descent. (2021). (https://en.wikipedia.org/wiki/ Stochastic gradient descent) [89] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104–3112). [90] Sze, V., Chen, Y.-H., Yang, T.-J., & Emer, J. S. (2017). Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE , 105 (12), 2295–2329. [91] Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., & Le, Q. V. (2019). Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 2820–2828). [92] Tensor Core Performance. (2019). (https://developer.download.nvidia.com/video/ gputechconf/gtc/2019/presentation/ s9926-tensor-core-performance-the-ultimate-guide.pdf) [93] Tensor Float 32. (2019a). 
(https://blogs.nvidia.com/blog/2020/ 05/14/tensorfloat-32-precision-format) [94] Tensor Float 32. (2019b). (https://en.wikipedia.org/wiki/Tensor Processing Unit) [95] TensorFlow. (2021). TensorFlow Mixed Precision. (https://www.tensorflow.org/guide/mixed precision) [96] Tomov, S., Nath, R., Ltaief, H., & Dongarra, J. (2010, January). Dense linear algebra solvers for multicore with GPU accelerators. In International parallel and distributed processing symposium (ipdps 2010). 123 [97] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008). [98] Volkov, V. (2010). Better performance at lower occupancy.. [99] Wang, Y. E., Wei, G.-Y., & Brooks, D. (2019). Benchmarking tpu, gpu, and cpu platforms for deep learning. arXiv preprint arXiv:1907.10701 . [100] WikiChip. (2017). Intel Sky Lake. (https://en.wikichip.org/wiki/ intel/microarchitectures/skylake (server)) [101] WikiChip. (2019). Intel Cascade Lake. (https://en.wikichip.org/ wiki/intel/microarchitectures/cascade lake) [102] Williams, M. H., & Ossher, H. (1978). Conversion of unstructured flow diagrams to structured form. The Computer Journal , 21 (2), 161–167. [103] Wu, H., Diamos, G., Li, S., & Yalamanchili, S. (2011). Characterization and transformation of unstructured control flow in GPU applications. In 1st international workshop on characterizing applications for heterogeneous exascale systems. [104] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., . . . others (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 . [105] Xu, R., Chandrasekaran, S., Tian, X., & Chapman, B. (2016). An analytical model-based auto-tuning framework for locality-aware loop scheduling. In International conference on high performance computing (pp. 3–20). [106] Yan, X., & Han, J. (2002). gspan: Graph-based substructure pattern mining. In Data mining, 2002. icdm 2003. proceedings. 2002 ieee international conference on (pp. 721–724). [107] Zhang, F., & D’Hollander, E. H. (2004). Using hammock graphs to structure programs. IEEE Transactions on Software Engineering , 30 (4), 231–245. 124