PERFORMANCE ANALYSIS AND MODELING OF PARALLEL APPLICATIONS IN THE CONTEXT OF ARCHITECTURAL ROOFLINES by NASHID SHAILA A THESIS Presented to the Department of Computer and Information Science and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Master of Science June 2016 THESIS APPROVAL PAGE Student: Nashid Shaila Title: Performance Analysis and Modeling of Parallel Applications in the Context of Architectural Rooflines This thesis has been accepted and approved in partial fulfillment of the requirements for the Master of Science degree in the Department of Computer and Information Science by: Boyana Norris Chairperson Stephen Fickas Member and Scott L. Pratt Dean of the Graduate School Original approval signatures are on file with the University of Oregon Graduate School. Degree awarded June 2016 ii c© 2016 Nashid Shaila iii THESIS ABSTRACT Nashid Shaila Master of Science Department of Computer and Information Science June 2016 Title: Performance Analysis and Modeling of Parallel Applications in the Context of Architectural Rooflines Understanding the performance of applications on modern multi- and manycore platforms is a difficult task and involves complex measurement, analysis, and modeling. The Roofline model is used to assess an application’s performance on a given architecture. Not much work has been done with the Roofline model using real measurements. Because it can be a very useful tool for understanding application performance on a given architecture, in this thesis we demonstrate the use of architectural roofline data with measured data for analyzing the performance of different benchmarks. We first explain how to use different toolkits to measure the performance of a program. Next, these data are used to generate the roofline plots, based on which we can decide how can we make the application more efficient and remove bottlenecks. Our results show that this can be a powerful tool for analyzing performance of applications over different architectures and different code versions. iv CURRICULUM VITAE NAME OF AUTHOR: Nashid Shaila GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene, OR Bangladesh University of Engineering and Technology, Dhaka, Bangladesh DEGREES AWARDED: Master of Science, Computer and Information Science, 2016, University of Oregon Bachelor of Science, Computer Science and Engineering, 2011, Bangladesh University of Engineering and Technology AREAS OF SPECIAL INTEREST: High Performance Computing Performance Optimization PROFESSIONAL EXPERIENCE: Graduate Research Fellow, University of Oregon, Eugene, 2014 Software Engineer, Dohatec New Media, Bangladesh, 2012 Software Developer, Pubali Bank Limited, Bangladesh, 2011-2012 v ACKNOWLEDGEMENTS I’d like to thank my supervisor, Professor Boyana Norris. Her expertise on the subject matter was valuable during the early phase of this thesis. Her assistance in the preparation of this manuscript was valuable. I’ll be grateful to her for helping me with useful insights and guiding me to the right direction. vi TABLE OF CONTENTS Chapter Page I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 II. BACKGROUND: DEFINITIONS, TOOLKITS, AND TECHNIQUES . . . . . 4 Roofline Performance Model . . . . . . . . . . . . . . . . . . . . . . 4 Performance Counters . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Performance Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 8 MPI/OpenMP . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . 10 Tuning and Analysis Utilities: TAU . . . . . . . . . . . . . . . . . . 11 PerfExplorer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Performance Application Programming Interface: PAPI . . . . . . . 14 VTune . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 III. RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures . . . . . . . . . 16 Applying the Roofline Model . . . . . . . . . . . . . . . . . . . . . . 17 PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Online Performance Analysis by Statistical Sampling of Microprocessor Performance Counters . . . . . . . . . . . . . . . . . 19 MuMMI: Multiple Metrics Modeling Infrastructure for Exploring Performance and Power Modeling . . . . . . . . . . . . . . . . . . . . 20 vii Chapter Page Prophesy: An Infrastructure for Performance Analysis and Modeling of Parallel and Grid Applications . . . . . . . . . . . . . . . . . . . . 21 Design and Implementation of Prophesy Automatic Instrumentation and Data Entry System . . . . . . . . . . . . . . . . . . . . . . . . . 21 Performance Analysis Using the MIPS R10000 Performance Counters 22 Modeling Performance of Parallel Programs . . . . . . . . . . . . . . 22 Performance Measurement and Modeling of Component Applications in a High Performance Computing Environment . . . . 23 IV. METHODOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Environment and Specs: Edison . . . . . . . . . . . . . . . . . . . . 24 Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Measuring Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Measuring Memory Traffic . . . . . . . . . . . . . . . . . . . . . . . 26 Steps of the Experiments . . . . . . . . . . . . . . . . . . . . . . . . 27 V. RESULTS AND ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Benchmarks Used in the Experiments . . . . . . . . . . . . . . . . . 35 Analysis of the Results . . . . . . . . . . . . . . . . . . . . . . . . . 36 A Study of Load Balancing in the Multithreaded Versions . . . . . . 37 A Study of Total Execution Time . . . . . . . . . . . . . . . . . . . 38 A Study of L3 misses . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Application Performance in the Context of the Roofline Model . . . 41 VI. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 viii LIST OF FIGURES Figure Page 1. Runtime vs arithmetic intensity. . . . . . . . . . . . . . . . . . . . . . . . 5 2. Instruction-level parallelism . . . . . . . . . . . . . . . . . . . . . . . . . 7 3. Partial output of papi avail (on Edison). . . . . . . . . . . . . . . . . . . . 29 4. Output of papi event chooser (when metrics are not compatible). . . . . . 30 5. Output of papi event chooser (when metrics are compatible). . . . . . . . 31 6. Partial output of pprof (metric is TIME ). . . . . . . . . . . . . . . . . . . 32 7. VTune flags inside a program. . . . . . . . . . . . . . . . . . . . . . . . . 34 8. Load balance among different DGEMM threads on Edison. . . . 
. . . . . 36 9. Perfectly balanced load among different Picsar threads. . . . . . . . . . . 37 10. Load imbalance among different Stream threads. . . . . . . . . . . . . . . 38 11. Total execution time of Picsar. . . . . . . . . . . . . . . . . . . . . . . . . 39 12. Total execution time of DGEMM. . . . . . . . . . . . . . . . . . . . . . . 39 13. Total execution time of Stream. . . . . . . . . . . . . . . . . . . . . . . . . 40 14. Cache misses vs threads of execution in the dominant loop particle push (Picsar). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 15. Cache misses vs MPI /thread combination in the application (Stream). . . 41 16. Cache misses vs MPI /thread combination in the application (DGEMM ). . 42 17. SDE data for a single run of DGEMM. . . . . . . . . . . . . . . . . . . . 42 18. Arithmetic intensity of three versions of Picsar on Edison. . . . . . . . . . 44 19. Arithmetic intensity of three versions of DGEMM kernel on Edison. . . . 45 ix LIST OF TABLES Table Page 1. Counters for measuring floating-point operations and instructions on Ivy Bridge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2. Counters for measuring memory operations on Ivy Bridge. . . . . . . . . . . 27 3. List of the counters measured during the experiments on Ivy Bridge. . . . . 28 x CHAPTER I INTRODUCTION High Performance Computing (HPC ) systems are well known for their complexity, as the architecture makes it very challenging to achieve consistent performance on different platforms. For this reason, theoretical performance of these applications often significantly exceeds the actual performance. There will always be a gap between theoretical performance and measurement, as there is no standard way of evaluating the performance on manycore and multicore platforms. It makes performance optimization more difficult as well. This is especially true for mathematical or numerical functions because their performance can be hard to understand and estimate on cache-based architectures even for sequential runs. Hence, performance models are used to help scientists understand and improve performance. Measurement tools and performance models that rely on hardware counters enable us to achieve better understanding of application performance at the microprocessor level. The goal of a model should be to indicate any bottlenecks present in the application and point out the things that can be altered to achieve better performance on a target architecture. Tools like Intel’s VTune and TAU can be used to measure the performance counters and bytes of data transferred to and from the memory. The counters measure events such as different cache misses, TLB misses, floating-point instructions, total instruction, and time. But these are all data based, meaning they will often generate the reports and for analysis there are no well known visualization tool. As we know, visualization can play a very important part in finding new insights from a simple 1 text based data. Thus, if a model can be used to visualize all the data gathered from VTune and TAU, that would make the analysis easier. One such model is Roofline Performance Model. This model is represented with a 2D-graph with arithmetic intensity (FLOPS/Byte) on the x-axis and performance (expressed as number of floating-point operations per second) on the y-axis. Thus it defines a performance envelope that helps reveal new information. 
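Written out, the bound that this envelope encodes is the standard Roofline expression (stated here in generic notation for reference, not quoted from a specific source): for a kernel with arithmetic intensity I running on a machine with peak floating-point rate P_peak and peak memory bandwidth B_peak, the attainable performance is

    P(I) = \min\bigl(P_{\mathrm{peak}},\; B_{\mathrm{peak}} \times I\bigr)

Kernels whose intensity falls below the machine balance point P_peak / B_peak are limited by the bandwidth term of this minimum.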
The memory bandwidth creates an additional upper bound for computations with low intensity as the plot clearly distinguishes memory- and compute-bound computations. It also shows the behavior of intensity across sizes. But the greatest downside is that getting reliable data for this model is incredibly difficult. The papers that have used the model to date have done so through primarily back-of-the-envelope manual calculations rather than measurements Lo et al. [2014]. Here, in this paper, we use real application data, gathered through TAU and VTune to generate the plots. Our goal is to analyze the performance of an HPC application in the context of the Roofline model. We used in the experiment three different benchmarks to generate the data used for this roofline based analysis. The first benchmark is HPC Interpolation/Particle Pusher/Maxwell solver. The second one is the Stream benchmark. It is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels. The third one is DGEMM, which demonstrates the processor’s and coprocessor’s ability to do a multiplication of two matrices using double precision calculations. The main challenge is to identify the functions (hotspots) inside the code that could limit the performances of the code (bottlenecks). We evaluate these three benchmarks on a single HPC platform for uniformity. The platform we use is Edison, 2 which is a Cray XC30 supercomputer with Intel “Ivy Bridge” processor on the compute nodes. The contribution of this thesis is twofold. First we devise a strategy to generate the roofline plots with real data. These data are generated using real applications, on real platforms and measured carefully with TAU and VTune. After a lot of trial and error we devised a technique to get the data reliably using the correct set of hardware counters in a multi-threaded environment. The second contribution is the performance analysis of numerical kernels such as Picsar, and DGEMM with the roofline model. For the first time, we generate roofline plots for these benchmarks in a multithreaded environment. The plots prove that execution time and other metrics, such as cache utilization is sensitive to the parallel configuration of the problem. For example, dramatic differences in total time is observed for varying the MPI task/OpenMP thread combinations. Also, the Roofline Toolkit visualization enabled us to quickly evaluate differences between multiple versions of the same code components (or the entire application) in the context of architectural roofline. In the second chapter, we provide background on the roofline model and arithmetic intensity along with brief description of the toolkits. Chapter III describes the related work. In Chapter IV we introduce the measurement methodology. We explain in detail how the performance data is collected using the hardware counters and profiling tools. Finally, in Chapter V we show a set of experiments on various benchmarks and how it can be used to provide useful information in analyzing and tuning applications. After a discussion of results we conclude. 3 CHAPTER II BACKGROUND: DEFINITIONS, TOOLKITS, AND TECHNIQUES In this chapter, we overview the concepts and tools on which this thesis relies. 
Roofline Performance Model The Roofline model is a visually intuitive performance model used to bound the performance of various numerical methods and operations running on multicore, many-core, or accelerator processor architectures Williams et al. [2009]. The model measures the performance of an application by combining locality, bandwidth, and different parallelization paradigms instead of using percent-of-peak estimates. This model can also be used for determining the inherent implementation and performance limitations. Arithmetic Intensity The Roofline’s core parameter is Arithmetic Intensity. It is defined by the ratio of total floating-point operations to total bytes transferred (FLOPS/byte). A BLAS-1 vector-vector increment (x[i] += y[i]), where x and y are arrays with N -elements, would have a very low arithmetic intensity of 0.0417 (N FLOPS/24N byte) and would be independent of the vector size. Conversely, FFTs perform 5∗N ∗ logN FLOPS for a N -point double complex transform. On a write allocate cache architecture, the transform would move at least 48N bytes Lab [2008]. 4 Roofline Model The basic Roofline model is used to bound floating-point performance as a function of machine peak performance, machine peak bandwidth, and arithmetic intensity. It can then be visualized by plotting the performance bound (GFLOP/s) as a function of arithmetic intensity. The resultant curve is the performance envelope under which the kernel or application performance exists. Runtime vs. Arithmetic Intensity The Roofline expresses the relationship between run time and arithmetic intensity. The example in Figure 1, shows that the runtime is independent of the degree of the polynomial until the machine balance is achieved. After that, runtime increases linearly with the degree of the polynomial Lab [2008]. FIGURE 1. Runtime vs arithmetic intensity. 5 Effects of Cache Behavior on Arithmetic Intensity The Roofline model requires an estimate of total data movement between CPU and main memory. Conflict misses and capacity of the cache increases data movement and as a result arithmetic intensity is reduced. Also, if the number of cache write- allocations is large, it can result in huge data movement. For example, the vector initialization operation x[i] = 0.0 allocates one write and write back per cache line. Here, write allocation is huge as every element of the cache line gets overwritten. But, it’s very difficult to quantify how much compulsory data movement actually occurred during an execution Lab [2008]. Instruction-Level Parallelism and Performance Deeply pipelined processor architectures can increase frequency and peak performance as well as they increase the latency of different instructions. To achieve peak performance the programmer has to make sure that independent instructions are issued in sequence. It’s called instruction-level parallelism (ILP). Absence of this can decreases performance on compute-intensive kernels. On the other hand, it does not effect memory-intensive operations. The example in Figure 2, demonstrates good ILP by constructing partial sums and summing them at the end of the loop Lab [2008]. Data-Level Parallelism and Performance Data-level parallelism such as vectorization (or SIMDization) is used widely to maximizing performance and energy efficiency Lab [2008]. However, obtaining peak performance depends greatly on the compiler/programmer’s ability to insert these instructions in the right places. In cost of high arithmetic intensity, absence of 6 FIGURE 2. 
Instruction-level parallelism SIMDization significantly affects performance. For example, if a high arithmetic intensity application such as matrix multiplication is done without vectorization, it can take a long time to run even for a moderately sized matrices. In the presence of vectorization, computation is fast as data is localized and number of misses becomes significantly lower. However, for low arithmetic intensities, the impact is negligible. Performance Counters In computing, hardware performance counters (or hardware counters) are a set of special purpose registers built into modern microprocessors to store the counts of hardware activities. Advanced users often rely on those counters to conduct low-level performance analysis or tuning of an application. Many different sets of performance counters are available for different CPUs. They might have the same or different names across these platforms. In the same processor family, different models can differ considerably in the specific performance counters available. In general, these counters measure similar types of things, such 7 as registering the absolute number of cache misses, number of instructions issued, number of floating-point instructions executed, number of vector instructions, etc. Nearly every processor commonly used these days is a cache-based machine. Caches offer high-speed access of instructions and data compared to the main memory of the system. They work on the principle of spatial and temporal locality. It is designed in a way that it can take advantage of the applications’ tendency to frequently reuse blocks of data, which is denoted by the term temporal locality. Also applications tend to access data items near those already been used, which is denoted by spatial locality. An application’s chance of achieving high performance is greater if it follows the above patterns on a cache-based processor. If the application is not performing well, it is the developer’s job to find out why the processor is stalling instead of doing meaningful work. This is where performance counters can be useful. Most of the time, an application is devised explicitly to run on specific computer hardware. The numbers generated by the performance counters are used to measure the exact performance of the application and point out the cases where it might gain additional performance. Performance Modeling Performance modeling is a structured approach to understanding the performance of an application. It typically begins during the early phases of application design and continues throughout its lifecycle. In earlier days, performance of an application was generally ignored until there was a problem. There are several problems with this casual approach: 1. In the design phase, performance problems are frequently encountered. 2. Issues faced in design phase cannot always be fixed by tuning or efficient coding. 8 3. Fixing design issues later in the development cycle is not always possible. It is usually inefficient and can be very expensive. The main aim of creating a performance model is to eliminate all those problems. When creating a performance model, performance objectives, such as response time, throughput, and resource utilization (CPU, memory, disk I/O, and network I/O) are set. Different application scenarios are identified as well. Performance scenarios are broken down into steps. Also performance budgets are assigned. This budget defines the resources and constraints across the performance objectives. 
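As an illustration of how such counters are read in practice, the following C sketch uses the low-level PAPI API (introduced later in this chapter) to count floating-point operations and L3 cache misses around a region of interest. The kernel_to_measure() routine and the particular event pair are placeholders; the chosen events must be compatible on the target processor, which the error check below detects.

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    /* Placeholder for the code region being measured. */
    static void kernel_to_measure(void) {
        volatile double s = 0.0;
        for (int i = 0; i < 1000000; i++) s += i * 0.5;
    }

    int main(void) {
        int events[2] = { PAPI_FP_OPS, PAPI_L3_TCM };
        long long values[2];
        int eventset = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI initialization failed\n");
            return EXIT_FAILURE;
        }
        PAPI_create_eventset(&eventset);
        /* Adding events fails if the two cannot share the counter hardware. */
        if (PAPI_add_events(eventset, events, 2) != PAPI_OK) {
            fprintf(stderr, "Requested events are unavailable or incompatible\n");
            return EXIT_FAILURE;
        }

        PAPI_start(eventset);
        kernel_to_measure();
        PAPI_stop(eventset, values);

        printf("FP operations: %lld\nL3 cache misses: %lld\n", values[0], values[1]);
        return EXIT_SUCCESS;
    }

The program is compiled and linked against the PAPI library (e.g., with -lpapi); tools such as TAU perform essentially the same start/stop/read cycle on the user's behalf.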
The most significant benefits of performance modeling are: 1. Performance of the application becomes an important part of the design process. 2. By building and analyzing models, it becomes easier to evaluate the tradeoffs before the actual solution is built. 3. The design decisions that are influenced by performance can be identified. If these decisions are unidentified, it can lead to additional maintenance efforts. 4. Surprises in terms of performance can be avoided when the application is released into production. 5. It helps to see quickly what is important. That translates to where to instrument, what to test for, and how to know whether the application is on or off track for meeting the performance goal. 9 MPI/OpenMP MPI MPI stands for the M essage Passing I nterface. MPI addresses primarily the message-passing parallel programming model, in which data is moved from the address space of one process to that of another process through cooperative operations on each process Forum [2015]. This standardized API (application program interface) is mainly used in the fields of parallel and distributed computing. It is an specification that defines the syntax of a library routine, not the implementation. MPI operations are expressed as functions, subroutines, or methods, according to the appropriate language bindings that for C and Fortran, are part of the MPI standard Forum [2015]. There are several stable and efficient implementations of MPI, many of which are open-source. MPI-1.0 (released in 1994) is the first installment. MPI- 3.1 is the latest stable version. The main advantage of using MPI is portability and ease of use. Furthermore it enhances scalability. The main goals of MPI include ensuring efficient communication, providing portable implementation, convenient C and Fortran binding for the interfaces, being language independent, allowing thread safety, and building reliable communication interfaces. OpenMP OpenMP stands for Open M ulti-Processing. At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and separately, C and C++) to express shared-memory parallelism Dagum and Menon [1998]. The base language is unspecified and vendors can implement it in any language. It works on most platforms, processor architectures, and operating systems, 10 including Solaris, AIX, HP-UX, LINUX, OS X, and Windows Dagum and Menon [1998]. It consists of a set of compiler directives, library routines, and environment variables that influences the run-time behavior of an application. OpenMP is a flexible standard and can easily be implemented in different platforms. It has four parts: control structure, data environment, synchronization, and runtime library. It is frequently used in conjunction with MPI to exploit both shared- and distributed- memory parallelism. Tuning and Analysis Utilities: TAU TAU is a performance analysis framework developed at the Oregon Performance Research Lab. It consists of a suite of static and dynamic tools that provide graphical user interaction and interoperation to form an integrated analysis environment for parallel Fortran, C++, C, Java, and Python applications of Oregon Performance Research Lab. The two major features of the TAU framework are described below. TAU Portable Profiling Package The model that TAU uses to profile parallel, multi-threaded programs maintain performance data for each thread, context, and node in use by an application. 
The profiling instrumentation needed to implement the model captures data for functions, methods, basic blocks, and statement execution at these levels of Oregon Performance Research Lab. In the TAU profiling instrumentation, for C /C++ for example, all the C++ features are supported. It is available through an API at the application level. This API also provides selection of profiling groups for organizing and controlling instrumentation. 11 From the collected profile data, the profile analyzer of TAU can generate a range of performance information. For an application, it can show the inclusive and exclusive time spent in each function with resolution of a nanosecond. The breakup time for each installation can be shown for templated entities. Other measures include how many times each function was called, how many profiled functions each function invoked, and what the mean inclusive time per call was. Time information can also be displayed relative to nodes, contexts, and threads. Instead of time, hardware performance data can be shown. Also, user-level profiling is possible of Oregon Performance Research Lab. TAU’s profile visualization tool, paraprof, provides graphical displays of all the performance analysis results, in aggregate and per node/context/thread form of Oregon Performance Research Lab. This graphical interface is very useful in detecting the sources of performance bottlenecks in an application. TAU can also produce event traces. The Vampir trace visualization tool is used to display the traces. TAU Code Analysis Package The TAU static analysis tools are based on PDT (Program Database T oolkit) that produces an intermediate language (IL) representation. With various parsing tools such as GNU gfortran and EDG TAU supports sophisticated views of program structure, incorporating the latest C++ language features such as templates, namespaces, and exceptions. Currently, the code analysis systems have been used to analyze C, C++, and Fortran source to automatically generate TAU profiling instrumentation of Oregon Performance Research Lab. 12 PerfExplorer PerfExplorer is a framework for parallel performance data mining and knowledge discovery. The framework enables the development and integration of data mining operations that will be applied to large-scale parallel performance profiles Oregon Performance Research Lab [2005]. The goal of this framework is to use data mining techniques to analyze parallel performance data. PerfExplorer supports different data mining approaches, such as clustering, association, regression, correlation, and summarization. Organizing data points into logical groups (or clusters) is defined as cluster analysis. Association is the technique of looking for relations in these clusters. The method of finding dependent/correlated and independent variables in the performance data is called regression. And finally, the process of finding out the similarities and dissimilarities between clusters is called summarization. In addition, comparative analysis can be done by using PerfExplorer. The types of charts available include time steps per second, relative efficiency and speedup of the entire application, relative efficiency and speedup of one event, relative efficiency and speedup for all events, relative efficiency and speedup for all phases and runtime breakdown of the application by event or by phase Oregon Performance Research Lab [2005]. 
Moreover, events can be grouped together and the percentage of total runtime of the group can be displayed using another chart. All these analyses can be done within different parallel performance profiles and across different phases of execution. 13 Performance Application Programming Interface: PAPI The PAPI project specifies a standard Application Programming I nterface (API ) for accessing hardware performance counters available on most modern microprocessors ICL. The measurement of these counters provides important insights of the correlation between the structure of the source code and the efficiency of mapping this code to the underlying architecture.. These correlations can be used in various cases, such as hand tuning of the code, compiler optimization, benchmarking, monitoring, debugging, and most importantly performance modeling. Understanding the mapping of the object code with the help of PAPI to the underlying architecture reduces the commonly occurring bottlenecks in high performance computing. Description PAPI provides two interfaces to the underlying counter hardware: a simple, high level interface for the acquisition of simple measurements and a fully programmable low level interface directed towards users with more sophisticated needs. The low level interface deals with hardware events in groups that are called EventSets. It gives a glimpse of how the counters are used in a system. The high level interface, on the other hand, can start, stop, and read specific events, one at a time. PAPI can be divided into two layers of software. The upper layer contains the API and all machine independent functions. The lower layer translates this upper layer to machine dependent functions. These functions access a subtree that may contain assembly functions, operating system, or a kernel extension. Depending on availability, PAPI chooses among most efficient of the above three options. The 14 functionality of the upper level layer depends on the subtree immensely. TAU relies on PAPI for collecting performance measurements. VTune VTune Intel, is an integrated tool that automatically profiles the execution of an application on an Intel platform using performance counters. It generates a report detailing the breakdown of execution cycles. It is widely used for code profiling on Intel architectures. Examples include stack sampling, thread profiling, hardware event sampling etc. The report contains measures such as time spent in each subroutine that can be further break down to the instruction level. So, if there are any stalls in the pipeline during the execution, the time taken reflects that. VTune also facilitates thread performance analysis. 15 CHAPTER III RELATED WORK The Roofline Paper Williams et al. [2009] first introduced the the roofline plots with real data. In this chapter, we put together some papers that are closely related to this thesis and discuss about them. Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis In this paper, the authors describe a prototype architecture characterization engine for the Roofline Toolkit that quantifies the bandwidth and compute characteristics of multicore, manycore, and accelerated systems Lo et al. [2014]. The tool was used to benchmark four leading HPC systems: Edison, Mira, Babbage, and Titan. The paper describes the ability of each architecture to reach its peak bandwidth and the sensitivity to changes in arithmetic intensity or parallelism. 
They also propose a new benchmark to measure the performance of CUDA codes that is superior in performance compared to Zero Copy alternatives. Last, they evaluate three other HPC benchmarks on Mira by plotting their performance on an Roofline model to get a clear view of the application’s performance. Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures This paper proposes an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations Williams et al. [2009]. The model is simple and 16 visual, and can be used to understand which systems would be a good match to important kernels, or to see how to change kernel code or hardware to achieve better performance. For floating-point kernels that do not fit completely in caches, this paper shows how operational intensity is an important parameter for both the kernels and the multicore computers. The Roofline model offers insights into the difficulty of achieving the peak performance of a computer by making it obvious when a computer is imbalanced. Since nodes have different processing speeds, more jobs should be assigned to nodes with higher processing speed. If this is not present in a system, there will be an imbalance. Finding an optimal load distribution that uses dynamic load balancing is a challenging task. This paper investigates the existence of a synergistic relationship between performance counters and the Roofline model. The requirements for automatic creation of a Roofline model could guide the designer as to which metrics should be collected when faced with literally hundreds of candidates but a limited hardware budget. Applying the Roofline Model The Roofline model makes precise notions of memory and compute-bound applications and, can provide an insightful visualization of bottlenecks. To date the model has been used almost exclusively with back-of-the-envelope calculations and not with measured data. In this paper the authors show how to produce roofline plots with measured data on recent generations of Intel platforms Ofenbeck et al. [2014]. They also show how to accurately measure the necessary quantities for a given program using performance counters, including threaded and vectorized code, and for warm and cold cache scenarios. Later on, they explain in detail the approach to measure 17 and validate the results. They generate a set of roofline plots with measured data for some common benchmarks on different platforms. The goal of the paper is twofold. First, it shows that the roofline model can be used with measurements rather than back-of-the-envelope calculations, but it proved to be very difficult because of the pitfalls arising from lower level system details. The second goal was to show that with measurements, roofline plots can be a valuable tool in performance analysis. This paper focuses on floating-point operations but the original model can easily be instantiated for integer computations. The work described in this thesis aims to address some of these difficulties. PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications The authors of this paper have developed PerfExpert, a tool that combines a simple user interface with a sophisticated analysis engine to detect probable core, socket, and node-level performance bottlenecks in each important procedure and loop of an application Burtscher et al. [2010]. 
As HPC applications take only a small fraction of peak performance to operate, the available performance evaluation tools need a lot of effort to learn this patterns. Performance optimization can be further complicated by the ongoing migration to multi-core and multi-socket compute nodes. As most of the available performance evaluation tools are not very accurate application writers don’t use them. PerfExpert provides an alternate performance assessment that is concise. It also suggests steps to improve performance that can guide application developers. These steps include compiler switches and optimization strategies with code examples. Many experiments have been done using PerfExpert and in all cases, it has correctly 18 identified the critical code sections and provided accurate assessments of their performance. Online Performance Analysis by Statistical Sampling of Microprocessor Performance Counters Hardware performance counters (HPC s) are used to analyze performance and identify the causes behind performance bottlenecks. But these are difficult to use as microprocessors almost always don’t provide enough counters to monitor and collect data of these types of events that are important to understand the performance of a specific application. In this paper Azimi et al. [2005], describe two techniques that help overcome these difficulties, allowing HPC s to be used in dynamic realtime optimizers. First, statistical sampling is used to dynamically multiplex HPC s and make a larger set of logical HPC s available. Using real programs, they show experimentally that it is possible through this sampling to obtain counts of hardware events that are statistically similar (within 15%) to complete non-sampled counts, thus allowing us to provide a much larger set of logical HPC s. Second, they observe that stall cycles are a primary source of inefficiencies, and hence they should be major targets for software optimization. A simple model is built based on these observations. This model associates every stall cycle to a processor to find out what is causing the stall in realtime. This model is generated using the data collected form the HPC s multiplexing facility that monitors a large number of hardware components simultaneously. This analysis approach achieves an accurate model in an out-of-order superscalar microprocessor. 19 The results obtained in the study show that effective analysis of online performance of application and system code running at full speed. The stall analysis shows where performance is being lost on a given processor. MuMMI: Multiple Metrics Modeling Infrastructure for Exploring Performance and Power Modeling In this paper, the authors present the MuMMI framework, which consists of an Instrumentor, Databases, and Analyzer. Its an infrastructure that facilitates systematic measurement, modeling, and prediction of performance, power consumption and performance-power tradeoffs for parallel systems Wu et al. [2013]. The MuMMI system is based on three existing frameworks: Prophesy for performance modeling and prediction of parallel applications, PAPI for hardware performance counter monitoring, and PowerPack for power measurement and profiling. The MuMMI Instrumentor provides low overhead automatic performance and power data collection and storage. The database part extends the database of Prophesy for storing power and energy consumption. Also, hardware performance counter data is stored for different frequency settings. 
The analyzer part extends the data analysis component of Prophesy to support power consumption and hardware performance counters, and it supports performance and power modeling, performance-power trade-off and optimizations, and Web-based automated modeling system. The MuMMI online system currently supports four modeling techniques: curve fitting, parameterization, kernel coupling, and performance-counters-based performance and power models for scientific applications online. It can also help 20 performance and power data measurement, storage, modeling and prediction of scientific applications on XSEDE website resources. Prophesy: An Infrastructure for Performance Analysis and Modeling of Parallel and Grid Applications This paper presents the Prophesy system and the use of its coupling parameter (i.e., a metric that attempts to quantify the interaction between kernels that compose an application) to develop application models Taylor et al. [2003]. In grid applications, like other applications, performance is an important issue. Understanding how the system features impact the performance of the applications is essential for faster execution. This knowledge is gathered by analysis and development of performance models. This paper discusses how the modeling techniques can be used in grid application analysis. Design and Implementation of Prophesy Automatic Instrumentation and Data Entry System In this paper, the authors present the Prophesy Automatic I nstrumentation and Data Entry (PAIDE ) system and describe its design framework and implementation supporting C, Fortran77 and Fortran90 programs on diverse systems Taylor and Stevens [2001]. The benchmarks used in this study are NAS C onjugate Gradient (CG) and I nteger Sort (IS ). They are used to analyze the PAIDE’s instrumentation overheads for two granularities of instrumentation when problem size and number of processors increase. The experimental results on the SGI Origin2000 show that the instrumentation overhead for CG benchmarks is less than 3.4%, while that for parallel 21 IS benchmarks is less than 1%. The result also shows that the PAIDE system does not affect the performance data at all. It nicely summarizes performance information of repeated events as sequential events. The PAIDE system is appropriate for parallel and distributed applications, which take a long time to execute. Performance Analysis Using the MIPS R10000 Performance Counters In this paper, the authors describe support in the MIPS R10000 for non- intrusively monitoring a variety of processor events – support that is particularly useful for characterizing the dynamic behavior of multi-level memory hierarchies, hardware-based cache coherence, and speculative execution Zagha et al. [1996]. Performance tuning of supercomputer application needs proper analysis of the interaction of that application and the underlying architecture. This paper first explains the performance data collection approach. An integrated set of hardware mechanisms, operating system abstractions, and performance tools are used for that. Next, some scientific applications are used as examples to illustrate how the counters and profiling information can help the developers to analyze and tune the application. Modeling Performance of Parallel Programs The authors of this paper discuss performance modeling, an approach to understanding the performance of parallel systems. 
They present a survey of current approaches to modeling (both analytical modeling based on system parameters, and structural modeling based on the structure of the program) and propose a combination of these two approaches as a promising direction for new work Meira [1995]. If the actual performance of parallel programs is compared to the highest performance given by the underlying hardware, it is often disappointing. 22 Understanding the source of this difference is necessary for improving the performance of the application. The combination proposed in this paper is evaluated and later improvements are proposed that combines measurement and modeling. Performance Measurement and Modeling of Component Applications in a High Performance Computing Environment Parallel component environments place various constraints on performance measurement. Some applications require reusable performance interfaces for component interface monitoring. Moreover, observing component operation without access to the source code must be possible. This paper describes a non-intrusive, coarse-grained performance measurement framework that allows the user to gather performance data through the use of proxies that conform to these constraints Trebon [2005]. Performance models for individual components and also for the entire application can be generated from this data. They also describe a framework for validation that is used to validate the behavior and measurement methodologies of the framework by using known performance models of simple component based applications. Finally, modeling and measurement of a real application are studied and presented in the paper. 23 CHAPTER IV METHODOLOGY In this section, we discuss the approach taken for measuring the code-specific quantities and constructing the roofline plots. First, we describe the environment we used for experiments. Next, we explain the general way of using the performance counters to obtain the measurements and derived metrics. Finally, we illustrate how the experiments are performed, data is exported into the database, and plots are generated. Environment and Specs: Edison The N ational Energy Research Scientific C omputing C enter (NERSC ) is the primary scientific computing facility that we are using for our experiments. It operates under the Office of Science in the U.S. Department of Energy. NERSC is a division of the Lawrence Berkeley National Laboratory, located in Berkeley, CA. It is also one of the three divisions in the Berkeley Lab Computing Sciences area. We used the machine Edison that is named after scientist Thomas Alva Edison. It is a Cray XC30 machine, with a peak performance of 2.57 Peta-FLOPS/sec. It has 133, 824 compute cores, 357 terabytes of memory, and 7.56 petabytes of disk. Edison has a total of 24 cores on each compute node. It has two sockets on each compute node. All the processors have Intel H yper-T hreading (HT ) enabled, which means user can run with 48 logical cores per node. At runtime the user can decide to run with 24 cores per node (which is the default setting) or 48 logical cores per node. Edison uses Cray Aries interconnects for inter-node communication. Aries provides a higher bandwidth, lower latency interconnect than Gemini, and should 24 exhibit reduced network congestion. Edison’s Aries network is connected through a Dragonfly topology. Edison has 12 login nodes that are external to the main compute portion of the system. 
Since the login nodes are external, users can log in, access file systems, and submit jobs even when the main compute portion of the system is down for maintenance. The login nodes on Edison have 512 GB of memory.

Measurement

Edison has two sockets per node, and each socket is populated with a 12-core Intel "Ivy Bridge" processor. There are 24 cores per node and 2 threads per core, so we experimented with up to 48 threads in various MPI process/thread combinations. We made sure to include 32 (e.g., 4 MPI tasks with 8 threads each) because that is the configuration with which the architectural rooflines were generated. The architectural roofline corresponds to the best-performing MPI process/OpenMP thread configuration for a specific kernel. In general, the number of MPI processes and threads can be mixed and matched depending on the architecture. We did not use hyperthreading because the floating-point measurements could not be collected reliably otherwise.

Counters for Measuring Floating-Point Operations

Performance counters exist on most modern microprocessors. They count hardware performance events such as cache misses, floating-point operations, etc. The floating-point counters are incremented only when arithmetic or comparison instructions are issued. We used the Performance Application Programming Interface (PAPI) for measuring the counters; it provides a uniform interface for accessing these performance counters. Table 1 lists the counters used for measuring floating-point operations.

TABLE 1. Counters for measuring floating-point operations and instructions on Ivy Bridge.

    Event Mask Mnemonic    Events
    PAPI_FP_OPS            Floating-point operations
    PAPI_FP_INS            Floating-point instructions

Measuring Runtime

For measuring the performance of the application we focused only on the total runtime. To measure it we used the TIME metric, which is based on the UNIX gettimeofday() function. The regular Intel timestamp counter is not used in our experiments because it is a system-wide counter and therefore also measures the effects of the operating system. Moreover, in the parallel scenario, the reference cycles report how many cycles are spent per CPU, but reconstructing their sum back into wall-clock time is not trivial due to partially overlapping regions. Hence, we used the TIME metric.

Measuring Memory Traffic

Memory usage and leaks are measured with performance counters, derived metrics, and VTune during the experiments.

Counters for Memory Measurement

TAU provides memory utilization options that examine how much heap memory is currently in use and how well the caches are being utilized. We used the memory measurement counters to collect these measurements. Memory instructions and shuffles are also recorded by the counters while the application executes on the processor. Getting this measurement right is the most challenging task, as the set of available counters differs between microarchitectures. The counters we used for the Ivy Bridge microarchitecture are listed in Table 2.

TABLE 2. Counters for measuring memory operations on Ivy Bridge.

    Event Mask Mnemonic    Events
    PAPI_L3_TCM            Level 3 cache misses
    PAPI_L2_TCM            Level 2 cache misses
    PAPI_L3_TCA            Level 3 total cache accesses
    PAPI_L2_TCA            Level 2 total cache accesses

Caveats

L3 cache misses are not a good way of estimating intensity, as the caches are all write-allocate. Moreover, other events may cause data transfers, such as the prefetcher, page-table loading, and streaming memory operations. An alternative approach is to measure the raw traffic on the memory controller.
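For reference, if one nevertheless estimates DRAM traffic from the counters in Table 2, a common back-of-the-envelope conversion (stated here as an approximation for context, not a quantity measured in this thesis) multiplies the last-level cache misses by the 64-byte cache line size of the Ivy Bridge processor:

    \text{bytes}_{\mathrm{est}} \approx 64 \times \mathrm{PAPI\_L3\_TCM},
    \qquad
    I_{\mathrm{est}} = \frac{\text{total FLOPs}}{\text{bytes}_{\mathrm{est}}}

Because prefetched lines, write-allocations, and write-backs are not reflected in this count, the estimate tends to understate the true traffic and thus overstate intensity, which is why measuring the actual memory-controller traffic is preferable.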
We used VTune measurements for getting the number of bytes transferred to and from memory. Figure 3 shows the list of counters we have measured on Edison.

Steps of the Experiments

Here we list all the necessary steps to generate and store results in the database.

Sampling with TAU

For this thesis, we used TAU Event-Based Sampling (EBS) to integrate sample information into TAU profile measurements at runtime. To enable sampling we used the "-ebs" flag with the tau_exec command. This takes an un-instrumented binary as input and generates a flat profile with the sample data.

TABLE 3. List of the counters measured during the experiments on Ivy Bridge.

    Event Mask Mnemonic    Events
    PAPI_TLB_IM            Instruction translation lookaside buffer misses
    PAPI_DP_OPS            Floating-point operations; optimized to count scaled double-precision vector operations
    PAPI_TOT_INS           Instructions completed
    PAPI_TOT_CYC           Total cycles
    PAPI_SR_INS            Store instructions
    PAPI_BR_INS            Branch instructions
    PAPI_LD_INS            Load instructions
    PAPI_REF_CYC           Reference clock cycles

While sampling, the application is periodically interrupted and the running state of the program is examined. The samples are aggregated and a histogram of where the program spends its time is built. We used sampling because our codes have many lightweight functions, and if we introduced instrumentation, too much overhead would be added. We sampled the codes by running them with the tau_exec command, which tracks the memory events and creates separate profiles with the data.

Sampling MPI Applications

The tau_exec command allows us to sample an MPI application at runtime. Because the instrumentation is not inserted at compile time, the linking is all done dynamically. To use the MPI option, we simply place the tau_exec command before the application's executable when executing mpirun, and add the mpi flag:

    $ mpirun -n 4 tau_exec -T mpi,papi -ebs ./a.out

TAU Configuration

We configure TAU to use PAPI for accessing the hardware performance counters. To do so, we downloaded and installed PAPI on Edison. Then, we built TAU using the -papi= configuration option specifying the location of PAPI. TAU can be configured to record more than one hardware performance counter, along with time, for each timer and routine. To enable this feature, we added the "-MULTIPLECOUNTERS" option while building TAU.

PAPI Events

Before measuring the events, we have to find out which PAPI events Edison actually supports. The papi_avail command produces the whole list of hardware counters supported on Edison, as shown in Figure 3.

FIGURE 3. Partial output of papi_avail (on Edison).

Next, to find out the compatibility between the metrics that PAPI is going to profile, we used the papi_event_chooser command, which produces a list of the events that can be measured together. For example, in the following case the event chooser tells us that there is an incompatibility in the choice of these four metrics: PAPI_LD_INS, PAPI_SR_INS, PAPI_STL_ICY, PAPI_VEC_SP. They cannot be profiled together, as shown in Figure 4. After we remove PAPI_VEC_SP, the event chooser verifies that PAPI_LD_INS, PAPI_SR_INS, and PAPI_STL_ICY are compatible metrics and generates a list of compatible events that can be measured with this combination. This scenario is depicted by Figure 5.

FIGURE 4. Output of papi_event_chooser (when metrics are not compatible).

FIGURE 5. Output of papi_event_chooser (when metrics are compatible).
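The same compatibility check can also be performed programmatically. The short C sketch below is a hypothetical helper, not part of the thesis workflow: it tries to add each requested preset to a single event set and reports the first event that the hardware cannot schedule alongside the others, which mirrors what papi_event_chooser does on the command line.

    #include <stdio.h>
    #include <papi.h>

    /* Try to add each event to one event set; report the first failure. */
    static int check_compatibility(const int *events, int n) {
        int eventset = PAPI_NULL;
        char name[PAPI_MAX_STR_LEN];

        PAPI_create_eventset(&eventset);
        for (int i = 0; i < n; i++) {
            if (PAPI_add_event(eventset, events[i]) != PAPI_OK) {
                PAPI_event_code_to_name(events[i], name);
                printf("%s cannot be counted with the events already chosen\n", name);
                PAPI_cleanup_eventset(eventset);
                PAPI_destroy_eventset(&eventset);
                return 0;
            }
        }
        printf("All %d events are compatible\n", n);
        PAPI_cleanup_eventset(eventset);
        PAPI_destroy_eventset(&eventset);
        return 1;
    }

    int main(void) {
        /* The combination discussed above, minus PAPI_VEC_SP. */
        int events[] = { PAPI_LD_INS, PAPI_SR_INS, PAPI_STL_ICY };

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        return check_compatibility(events, 3) ? 0 : 1;
    }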
Environment Variables Setup

After the PAPI events are chosen, we have to set environment variables such as TAU_METRICS, PATH, TAU_DIR, and TAU_MAKEFILE. To set these, we export the variable values and list the PAPI metrics we would like to use in a comma-separated list (along with the TIME counter). Also, for some MPI applications, we need to set the OMP_NUM_THREADS environment variable to the number of threads to be used.

Generating Event Traces/Profiles

Profiling shows the distribution of execution time across routines. It also shows the code locations associated with specific measurements. Tracing the execution of a parallel program shows when and where an event occurred, in terms of the process that executed it and the location in the source code. To obtain these profiles we first compile the source code of the application with the appropriate flags and compiler optimization (described above). Next, we get the profiles by running the following command:

    $ srun -n #procs tau_exec -T ompt,mpi,papi -ompt -ebs ./picsar

This generates the profile.*.* files inside the MULTI_TIME and MULTI_PAPI_* directories. The * in the profile file name denotes the thread number, and the * in MULTI_PAPI_* denotes the name of the PAPI metric. All the MULTI_PAPI_* directories are generated inside the main TAU directory. We wrote a batch script for setting all the environment variables and submitting the batch jobs to Edison's scheduler using the sbatch command. This is more efficient than setting variables and executing applications by hand.

Getting a Text Summary of the Profiles

For a quick summary of the TAU performance data, we use the pprof command. It reads the profiles present inside the current directory and prints a summary of the TAU data in the terminal window. For performance data with multiple metrics (MULTI_PAPI_* and MULTI_TIME profiles), we have to move into one of the directories to get the information about that metric. This is shown in Figure 6.

FIGURE 6. Partial output of pprof (metric is TIME).

Uploading Profiles to TAUdb

TAUdb (TAU Database), previously known as PerfDMF (Performance Data Management Framework), is an API/toolkit and a DBMS for managing and analyzing performance data. The API is available in Java and C. We upload the profiles of an application, with a specific MPI rank and thread number, to TAUdb using the taudb_loadtrial command. The command takes several parameters to determine the application and experiment name:

    $ taudb_loadtrial -a -x -n

We can upload the multiple-metric profiles directly, or we can pack them in *.ppk format for the upload. For viewing the data uploaded to the database, the paraprof command is used.

Generating VTune Traces

For the VTune measurement, it is necessary to instrument the code with the __itt_resume() and SSC_MARK(0x111) calls to start the VTune/SDE profiling. To stop the profiling, the __itt_pause() and SSC_MARK(0x222) calls are used. VTune then collects the data for that portion of the code and writes the result to a text file. It is very important to use the "-littnotify" flag when compiling the code, as it makes sure that VTune actually collects the data. This memory data is then parsed into a CSV (Comma-Separated Values) file by another script and imported into the database.

FIGURE 7. VTune flags inside a program.
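For completeness, the kind of instrumentation that Figure 7 refers to looks roughly like the C fragment below. The SSC_MARK macro shown is the commonly used Intel SDE marker idiom, and region_of_interest() is a placeholder, so this should be read as a sketch rather than the exact code used in the experiments.

    #include <ittnotify.h>   /* provides __itt_resume() / __itt_pause(); link with -littnotify */

    /* No-op marker recognized by Intel SDE; 0x111 starts and 0x222 stops a region. */
    #define SSC_MARK(tag) __asm__ __volatile__ ("movl %0, %%ebx; .byte 0x64, 0x67, 0x90" \
                                                : : "i"(tag) : "%ebx")

    extern void region_of_interest(void);  /* placeholder for the measured kernel */

    void measured_section(void) {
        __itt_resume();     /* VTune starts collecting here */
        SSC_MARK(0x111);    /* SDE starts counting here */

        region_of_interest();

        SSC_MARK(0x222);    /* SDE stops counting */
        __itt_pause();      /* VTune stops collecting */
    }

Bracketing only the region of interest this way keeps initialization and I/O out of the byte and instruction counts that feed the roofline coordinates.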
Generating Derived Metrics For generating the derived metrics we write a script that takes the ∗.ppk file of a trial and generates different derived metrics. This script uses the perfexplorer command to execute and outputs the data into a text file. Another script takes this file as input and generates CSV files. These derived metrics are then uploaded into the database using a CSV file loader. During the experiments some of the above mentioned steps were executed manually. In future, all the steps will be partially or fully automated. The time limitation restricted our ability to do the automation at this time. However, we plan on starting the automation as soon as possible. 34 CHAPTER V RESULTS AND ANALYSIS In this section, we discuss the results of experiments with three codes. This discussion includes description of the benchmarks, problems faced for getting the correct results, analyzing performance, and studying the results within the roofline framework. Benchmarks Used in the Experiments We ran all our experiments on the Intel -based platform Edison (described in the Methodology section) of NERSC, running the LINUX operating system. For our experiments, we considered the following numerical kernels: 1. Stream is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels. We used the latest Stream OpenMP version for the experiments. The source code has a hardcoded parameter, N , that is the length of the three arrays (A, B, and C) used in the tests. All of them are of type DOUBLE PRECISION. Hence, for 8 byte doubles 3 ∗ 8 ∗N bytes are required. N should be big enough so that the tests touch a large portion of the memory. 2. DGEMM, the Intel MKL routine for multiplying matrices. It is the most widely used matrix multiplication routine of Intel. It calculates the product of DOUBLE PRECISION matrices. This benchmark is designed to measure the sustained, floating-point computational rate of a single node. However, we used 35 the OpenMP version of it to run it with different process/thread combination. It also uses the Intel M ath K ernel Library (MKL) routines. 3. Picsar is a Particle-I n-C ell (PIC ) code where the main operations include particle pushing and field deposition and gathering. It uses Hybrid MPI /OpenMP scheme. The HPC kernel does interpolation, and simulates a particle pusher as well as a Maxwell solver. All of these are implemented in Fortran90 /Python and parallelized with MPI and OpenMP. Parallelization on distributed memory architectures is very efficient in the implementation. The Picsar benchmark is compiled using ftn (Intel Fortran compiler version 15.0.120141023). On the other hand, the DGEMM and Stream benchmarks are compiled with cc (Intel C compiler verison 15.0.120141023) compiler. All of the benchmarks uses flags “-g -dynamic -openmp” with different optimization flags. For VTune and SDE calculations, the “-littnotify” library flag is used during compilation for all the kernels. FIGURE 8. Load balance among different DGEMM threads on Edison. Analysis of the Results On a multicore environment, computation efficiency depends largely on load balancing, parallelization, code optimization, and proper utilization of memory 36 hierarchy. Hence, the data we collected on Edison included CPU, cache, memory measurements for each thread. 
The next few sections highlight the findings of the thesis and point out the limitations that prevented some cases from being analyzed properly.

A Study of Load Balancing in the Multithreaded Versions

First, we look at how well the multithreaded versions of the different kernels were balanced among threads. Figure 8 shows the wall-clock time metric for the different threads of node 0 for a DGEMM run. The runtime across the threads is slightly unbalanced. The pthread overhead and the OpenMP overhead are the two main overheads found on each thread.

FIGURE 9. Perfectly balanced load among different Picsar threads.

Figure 9 shows the time metric calculated on the different threads of a node for Picsar. From the figure, it is evident that the threads are almost perfectly balanced: all the threads have the same amount of pthread and OpenMP overhead and the same computation time. Hence, additional load balancing for Picsar is not necessary on this architecture.

FIGURE 10. Load imbalance among different Stream threads.

Figure 10 shows the wall-clock time metric for the Stream benchmark. Looking at the figure, it can easily be seen that the threads are not at all balanced: the pthread overhead, OpenMP overhead, and computation time differ among the threads on different nodes. Moreover, a closer look makes it clear that the first thread of each node is not doing any numeric computation; rather, it acts as a master thread and distributes the workload among the other threads. In addition, the worker threads perform different amounts of computation. Load balancing measures could therefore be taken to remove this imbalance and make the threads more balanced.

A Study of Total Execution Time

In this section, we discuss the sensitivity of the execution time to the parallel configuration of the problem. For example, Figure 11 shows sizable differences in total time for different MPI task/OpenMP thread combinations (x-axis) for Picsar. As the number of threads grows beyond 8, the runtime of the program jumps significantly because additional threads incur more OpenMP and pthread overhead.

FIGURE 11. Total execution time of Picsar.

Figure 12 shows the same effect on total time for different MPI task/OpenMP thread combinations (y-axis), now for DGEMM. Here, too, the runtime grows significantly as the MPI task/thread combination increases. The correlation is subtler than for Picsar, but with enough data points it can be demonstrated.

FIGURE 12. Total execution time of DGEMM.

Finally, for Stream, Figure 13 also shows that the total runtime increases with the different MPI task/OpenMP thread combinations (y-axis). The runtime of the program grows proportionally with the number of threads per MPI task. This occurs because Stream is limited by the available memory bandwidth, so adding more threads beyond a certain number decreases performance due to contention.

FIGURE 13. Total execution time of Stream.

A Study of L3 Misses

We observed during the experiments that cache utilization is sensitive to the parallel configuration of the problem. Figure 14 shows the mean number of L3 cache misses over varying numbers of threads for the same problem size for the dominant loop (the particle push of Picsar). From Figure 15, it is clear that cache performance is sensitive to the MPI task/thread combination of the problem; it shows the number of L3 cache misses over varying numbers of MPI processes and threads for the same problem size for the whole application (Stream).
FIGURE 14. Cache misses vs. threads of execution in the dominant loop particle push (Picsar).

FIGURE 15. Cache misses vs. MPI task/thread combination in the application (Stream).

Figure 16 shows the number of L3 cache misses over varying numbers of MPI processes and threads for the same problem size for the whole application (DGEMM). From the figure, it is clear that the number of cache misses is smaller when the number of MPI processes is small, and vice versa: as the process count increases, the number of misses grows as well. Hence, using fewer processes is preferable if the goal is to minimize the number of cache misses.

FIGURE 16. Cache misses vs. MPI task/thread combination in the application (DGEMM).

Application Performance in the Context of the Roofline Model

In addition to the hardware counters, another very effective method of evaluating how well a code is performing is to measure its operational intensity. In this section, we describe how we used the RooflineToolkit to visualize the data we gathered for the kernels and how it can indicate the performance of an application on a specific platform.

We modified and used the RooflineToolkit visualization front end (developed at the University of Oregon) to enable easy computation and visualization of the widely accepted arithmetic intensity metric. The x and y coordinates of the plots are generated using the SDE and PAPI measurements.

FIGURE 17. SDE data for a single run of DGEMM.

Figure 17 shows the SDE data for a single run of DGEMM. From this text data we generate a CSV file and use it to compute the operational intensity and GFLOP/s. To compute the operational intensity, we take the total FLOPs from the CSV file and divide them by the total bytes, which gives the x coordinate:

x = total FLOPs / total bytes

To calculate GFLOP/s, we convert the total FLOPs to gigaFLOPs (by dividing by 10^9). The total runtime is fetched from the TAU measurements and is the inclusive runtime of the application. Dividing the GFLOPs by the inclusive runtime gives the y coordinate:

y = (total FLOPs / 10^9) / total runtime

This is how we calculate the (x, y) coordinates for each trial of an application and insert them into the Roofline visualization tool to generate the plots (a minimal sketch of this computation is given after the DGEMM example below).

Figure 18 gives a general idea of the tool for the dominant loop in the Picsar kernel. Here, the performance of three versions of the Picsar kernel is shown along with Edison's roofline plot. The highest attainable performance on this machine is denoted by the roofline envelope. Compared to the peak performance of the system, Picsar performs moderately well. The plot takes into consideration the fact that each node on Edison can run up to 48 threads, so it shows three different MPI task/thread combinations. Among them, 48/1 attains the highest performance (GFLOP/s), 16/3 comes second, and 8/6 comes third. From this, we can conclude that the Roofline visualization makes it easier to choose among different versions of the computation.

FIGURE 18. Arithmetic intensity of three versions of Picsar on Edison.

Figure 19 shows the same experiment for the DGEMM kernel. The figure shows the plot for three different MPI task/thread combinations. It is evident from this plot that DGEMM performs very well on Edison: the 48/1 MPI task/thread combination attains the highest performance, which is almost equal to the highest performance possible. Here, too, 16/3 comes second and 8/6 comes third.
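For reference, here is the minimal sketch of the coordinate computation mentioned above. The numeric inputs are placeholders rather than measured values; in practice, the total FLOPs and bytes come from the SDE-derived CSV file and the inclusive runtime comes from TAU.

/* Minimal sketch of the roofline (x, y) coordinate computation.
   The three inputs are placeholders; real values come from the
   SDE-derived CSV file (FLOPs, bytes) and from TAU (inclusive runtime). */
#include <stdio.h>

int main(void) {
    double total_flops = 4.0e12;  /* placeholder: total FLOPs from SDE        */
    double total_bytes = 8.0e11;  /* placeholder: total bytes moved, from SDE */
    double runtime_sec = 120.0;   /* placeholder: inclusive runtime from TAU  */

    /* x: operational (arithmetic) intensity in FLOPs per byte */
    double x = total_flops / total_bytes;

    /* y: achieved performance in GFLOP/s = (FLOPs / 10^9) / runtime */
    double y = (total_flops / 1.0e9) / runtime_sec;

    printf("operational intensity (FLOPs/byte): %.2f\n", x);  /* 5.00  */
    printf("performance (GFLOP/s): %.2f\n", y);               /* 33.33 */
    return 0;
}

Each trial of an application yields one such (x, y) pair, which is then plotted against Edison's roofline.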
From the DGEMM example, it is clear that as the number of MPI tasks increases and the number of threads per task decreases, the performance increases.

FIGURE 19. Arithmetic intensity of three versions of the DGEMM kernel on Edison.

One important thing to note from the plots is that the operational intensity remains (mostly) unchanged across the trials. This is intuitive, as all the trials perform the same numeric calculation; only their performance changes with the change in MPI task/thread combination.

The biggest advantage of using the Roofline visualization is that it validates the results and gives a clear idea of how a specific benchmark will perform on a specific architecture. If the points are scattered all over the plot, most likely the kernel data was not captured correctly, which can happen when using hardware counters. Also, just by looking at the plots, we can tell what the best MPI task/thread combination for a benchmark on a specific machine will be.

From the beginning, our goal was to enable quick evaluation of differences between multiple versions of the same code components (or the entire application) in the context of the architectural roofline, which shows the upper bound and can also point to specific deficiencies (e.g., insufficient vectorization). The space being sampled includes varying numbers of MPI tasks, varying numbers of OpenMP threads, and different compiler optimization options; as we create different code versions, they will also form one of the dimensions that must be tested to determine the best level and type of parallelism for a particular problem size (number of nodes, MPI and OpenMP configuration), with the objective of minimizing execution time. The outcome of the experiments shows that our approach is helpful in evaluating the performance of applications on different architectures.

CHAPTER VI

CONCLUSION

Visualizing application performance within the Roofline model can give a useful indication of how an application will run on a specific architecture with specific MPI process/thread counts. This can be very helpful in predicting the performance of the same application on a different machine with similar specifications. The plots also give an idea of which MPI process/thread counts work best for a given architecture. Moreover, calculating the cache misses of complex benchmarks such as Picsar is almost impossible to do by hand because of the complexity of runtime cache utilization on multi- and manycore architectures; hence, plotting non-measured data in these cases can be misleading.

There were two main goals of this thesis. The first goal was to generate the roofline plots with measured data. However, it was very difficult to obtain the data, as there were numerous obstacles. Compatibility issues between the toolkits and the compilers and modules on Edison, default features, compiler optimization flags, etc., were some of the major hindrances that we overcame. Moreover, the data generated by PAPI was not at all reliable; for example, cache misses measured by PAPI counters were far off from the actual values of bytes transferred to and from main memory as measured by SDE. All of the above-mentioned obstructions made the seemingly straightforward data generation immensely difficult and often impossible. After much trial and error, we chose a correct set of hardware counters, modules, and optimization flags to be used in a multithreaded environment.
The second goal was to analyze the performance of numerical kernels within the roofline plots to determine the efficiency of these kernels on different architectures. We focused mostly on floating-point computations, as we used FLOPs/bytes as the computational intensity measure. Our experiments yielded various performance insights. The roofline plots enabled a quick comparison of the performance of multiple versions of the same code components on a specific architecture. Also, by analyzing the hardware counter values, we can estimate how an application characterized by a particular computational intensity will run for a specific MPI process/thread combination. We generated a number of plots to demonstrate how to use all these tools efficiently to analyze an application's performance.

In our experiments, some of the measurement and analysis work was done manually. In the future, we hope to automate all the steps fully or partially.

REFERENCES CITED

Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki, Matthew J. Cordery, Nicholas J. Wright, Mary W. Hall, and Leonid Oliker. Roofline model toolkit: A practical tool for architectural and program analysis. In High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, pages 129–148. Springer, 2014.

Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.

Berkeley Lab. Performance and algorithm research, 2008. URL http://crd.lbl.gov/departments/computer-science/PAR/research/roofline.

Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Version 3.1. High Performance Computing Center Stuttgart (HLRS), 2015.

Leonardo Dagum and Ramesh Menon. OpenMP: An industry-standard API for shared-memory programming. IEEE Computational Science and Engineering, 5(1):46–55, January 1998. ISSN 1070-9924. doi: 10.1109/99.660313. URL http://dx.doi.org/10.1109/99.660313.

University of Oregon Performance Research Lab. TAU homepage. URL https://www.cs.uoregon.edu/research/tau/home.php.

University of Oregon Performance Research Lab. PerfExplorer user's manual, 2005. URL https://www.cs.uoregon.edu/research/tau/docs/perfexplorer.

ICL. PAPI website. URL http://icl.cs.utk.edu/papi.

Intel. VTune website. URL https://software.intel.com/en-us/intel-vtune-amplifier-xe.

Georg Ofenbeck, Ruedi Steinmann, Victoria Caparros, Daniele G. Spampinato, and Markus Püschel. Applying the roofline model. In Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on, pages 76–85. IEEE, 2014.

Martin Burtscher, Byoung-Do Kim, Jeff Diamond, John McCalpin, Lars Koesterke, and James Browne. PerfExpert: An easy-to-use performance diagnosis tool for HPC applications. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE Computer Society, 2010.

Reza Azimi, Michael Stumm, and Robert W. Wisniewski. Online performance analysis by statistical sampling of microprocessor performance counters. In Proceedings of the 19th Annual International Conference on Supercomputing, pages 101–110. ACM, 2005.

Xingfu Wu, Hung-Ching Chang, Shirley Moore, Valerie Taylor, Chun-Yi Su, Dan Terpstra, Charles Lively, Kirk Cameron, and Chee Wai Lee. MuMMI: Multiple metrics modeling infrastructure for exploring performance and power modeling.
In Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery, page 36. ACM, 2013.

XSEDE website. URL https://www.xsede.org.

Valerie Taylor, Xingfu Wu, and Rick Stevens. Prophesy: An infrastructure for performance analysis and modeling of parallel and grid applications. ACM SIGMETRICS Performance Evaluation Review, 30(4):13–18, 2003.

Xingfu Wu, Valerie Taylor, and Rick Stevens. Design and implementation of Prophesy automatic instrumentation and data entry system. In Proc. 13th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2001), 2001.

Marco Zagha, Brond Larson, Steve Turner, and Marty Itzkowitz. Performance analysis using the MIPS R10000 performance counters. In Supercomputing, 1996. Proceedings of the 1996 ACM/IEEE Conference on, pages 16–16. IEEE, 1996.

Wagner Meira Jr. Modeling performance of parallel programs. Technical Report TR859, Computer Science Department, University of Rochester, 1995.

Nicholas Dale Trebon. Performance measurement and modeling of component applications in a high performance computing environment. PhD thesis, University of Oregon, 2005.