MPI Application Tune-Up: Four Steps to Performance


Abstract

Cluster systems continue to grow in complexity and capability, and getting optimal performance out of them can be challenging. Making sense of the MPI communications, whether determining load balance or finding platform bandwidth limitations across processes/ranks, can be daunting. A four-step process is outlined using a Poisson solver implemented as an MPI application; the tools used for the profiling and analysis are Intel Trace Analyzer and Collector and Intel VTune Amplifier XE. This paper is a first introduction and overview of the methodology, with emphasis on the most important features that Intel Trace Analyzer and Collector (1) (2) and VTune Amplifier XE (3) offer for the analysis of pure MPI applications on HPC clusters, using the Poisson solver as an illustrative example. We concentrate on a fixed-size workload to perform a strong scaling analysis by varying the number of MPI ranks, each of them mapped to a single physical core. Besides showing a sample analysis, the reader will learn detailed command lines and GUI usage information demonstrating how to use the above-mentioned tools effectively in a cluster environment. We used the Intel Endeavor cluster, housed at Intel's Customer Response Team (CRT) datacenter in New Mexico. Each node of this cluster comprises two 12-core Intel Xeon E5 v2 processors (Intel microarchitecture code name Ivy Town (4)) and is connected through Mellanox* InfiniBand.

Methodology

Parallel High Performance Computing (HPC) applications often rely on the multi-node architectures of modern clusters. Performance tuning of such applications must involve analysis of cross-node application behavior as well as single-node performance analysis. Two performance analysis tools, Intel Trace Analyzer and Collector and VTune Amplifier, can provide important insights to help in this analysis. For example, message passing interface (MPI) communication hotspots, synchronization bottlenecks, load balancing, and other complex issues can be investigated using Intel Trace Analyzer and Collector. At the same time, VTune Amplifier can be used to understand intra-node performance issues. We will apply both of these tools to the pure MPI version of the Poisson solver; the more complex case of hybrid applications is left for later studies. The methodology presented below represents broad recommendations to combine global application performance metrics, such as speedup and parallel efficiency, with more detailed measurements such as message passing rates and memory bandwidth. For detailed analysis beyond simple scaling profiling, software tools such as Intel Trace Analyzer and Collector and VTune Amplifier are necessary. Our goal is not to achieve better performance per se but to show how to use Intel Trace Analyzer and Collector and VTune Amplifier to better understand performance problems on a specific hardware platform as a prerequisite for subsequent tuning. We focus on cluster-level performance optimizations, assuming that the application has already gone through single-core performance tuning and hence achieved some level of maturity. Performance issues related to scalability have to be evaluated at the level of concurrency at which the application is used in real-life situations. For example, by performing analysis using realistic workloads running at scale on the cluster, we investigate the application under relevant memory-footprint conditions.

The methodology consists of four phases, depicted in Figure 1 (a more detailed version is shown in Appendix C):

1. Global Analysis: characteristic values for the whole program (timing, speedup, efficiencies) plus the imbalance diagram. Central decision: is the imbalance time greater than the interconnect time?
2. Algorithmic Investigations: change the algorithm (some workloads provide alternatives); analyze and change the MPI pattern; analyze the load imbalance of the computation.
3. MPI Runtime Tuning: set Intel MPI environment variables; use a faster communication network; apply an optimized rank-to-node mapping.
4. Single Node/Process Tuning: hotspot analysis for MPI routines; tuning of routines showing a load imbalance; bandwidth analysis for single-node scalability.
After phases 2, 3, and 4, check whether the runtime was reduced and iterate.

Figure 1: Flow chart of the tuning methodology. The central decision is taken based on the relation of imbalance time vs. interconnect time; these times can be determined by ITAC (see below). A large imbalance time means that there is much waiting time in MPI routines and we cannot get much faster even with an ideal communication network. A large interconnect time means that any improvement of the network performance speeds up the program significantly.

Scaling using different MPI rank distributions on the computational grid

The experiments were conducted on the Intel Endeavor cluster with two sockets of Xeon E5 v2 processors (Intel microarchitecture code name Ivy Town) per node, providing 24 physical cores per node. Intel Hyper-Threading Technology was enabled on the cluster nodes; however, in our case of pure MPI runs, we used no more than 24 ranks per node, mapping one MPI rank to one physical core. We investigate our Poisson solver on a square 3200x3200 computational grid that is large enough to run into bandwidth limitations. The bandwidth limitations can be remedied by undersubscribing computational nodes, by tuning hardware and software prefetchers, by algorithmic changes, and by other means. The 3200x3200 grid points can be distributed to MPI ranks using a 2D process grid, e.g., in the case of 4 ranks, 2 rows x 2 columns of processes, or a 1D distribution with 4 rows x 1 column or 1 row x 4 columns (Figure 2).

Figure 2: Mapping of the computational grid onto MPI processes using a 2D (2x2) process grid (1600x1600 local grid points per MPI rank) and a 1D (1x4) process grid (3200x800 local grid points per MPI rank).

According to the proposed methodology, the first stage is Global Analysis. We start it with a scaling investigation by running the application with different numbers of processes (p), recording the timings T[p], and then measuring speedup, defined as S[p] = T[1]/T[p]. The complementary metric, parallel efficiency, defined as E[p] = S[p]/p, will be used later as well. The speedup curves for the 2D quadratic and 1D (1xN and Nx1) process grids show some differences in scaling (Figure 3). Indeed, the 48x32 = 1536 process grid delivers a speedup of 284 and the 1536x1 process grid gives a speedup of 288, but the 1x1536 process decomposition produces only a 144x scaling (see Appendix A for the cluster configuration used in these experiments).

Figure 3: Speedup for 2D and 1D process grids; note the logarithmic Y axis. A single node contains 24 cores (IVT). The Ideal curve is simply speedup == # ranks; a small additional dent in it results from the fact that the rank counts are not powers of 2. The 2D distribution uses a square NxM distribution with N == M; if this is not possible, the nearest approximation with N > M is chosen, e.g., for 384 ranks, NxM == 24x16. Benchmark configuration in Appendix A.

From the timing data it is not obvious whether the scaling degradation is due to MPI performance or to single-core compute performance. We can use the Intel Trace Analyzer and Collector Function Profile functionality (ITAC: Charts -> Function Profile) to separate and further analyze the impact of different rank placements on inter-node and intra-node performance.

Before we do that, we take a look at the message passing performance under the different process mappings.

Message Passing Profile

The Message Passing Profile displays various characteristics of message passing in a sender/receiver matrix and can be obtained through Charts -> Message Profile Chart. Because dealing with 1536 ranks generates a huge matrix, we may fuse all ranks for each node: Advanced -> Process Aggregation -> All Nodes. The diagonal then shows the intra-node performance characteristics while the off-diagonal entries show the inter-node statistics. Without process aggregation, the diagonal is only filled if we send messages from rank n to the same rank n, which is usually not a good idea. Several attributes may be displayed via right click -> Attribute to show. Most interesting is the attribute Average Transfer Rate, which displays the message passing rate including all waiting times, but other attributes such as total volume (in MB) and the number of messages may also be of interest.

Figure 4: The Message Passing Profile for the 1536 = 48x32 rank process placement on 64 nodes. The rows and columns represent senders and receivers, respectively, with squares representing the messages between different sender/receiver pairs; a color-coded legend is shown for the chosen attribute. In this case we fused all ranks inside each node, and the attribute is Average Transfer Rate [MB/s]. The rates are quite low, which gives a first indication that we should investigate the message passing in more detail.

Figure 5: The Message Passing Profile for the 1536x1 rank 1D process placement on 64 nodes. The rates are much better compared to the 2D case; in particular, the intra-node communication of the 1D placement has about 46x better throughput (roughly 1.38 GB/s vs. 30 MB/s) than 2D.

By looking at other attributes offered by the charts in Figure 4 and Figure 5, one can see that we have fewer but larger messages in the 1D case, which in turn leads to a higher average transfer rate; however, in the 1D case we also transfer a larger total amount of data. These two MPI rank placements, 2D (48x32) and 1D (1536x1), with message profiles illustrated in Figure 4 and Figure 5, result in similar performance. While the simple 1D message passing pattern offers little potential for optimization, we may apply some optimizations to the 2D pattern, such as reordering the MPI_Isend, MPI_Irecv, and MPI_Waitall calls for several messages. There are other optimization ideas that can be worked out using Intel Trace Analyzer and Collector; they are left for future work.

Application and compute-part parallel efficiency

To shed light on the issues related to scalability at the cluster level, we first look at the MPI communication and compute performance breakdown of the total runtime, T[p] = T_comp[p] + T_mpi[p], which can be accessed through the Trace Analyzer's Function Profile (Intel Trace Analyzer displays the Function Profile Chart when opening a trace file) (Figure 6). The trace file for Trace Analyzer and Collector can be generated by adding the flag -trace to the Intel MPI mpirun or mpiexec.hydra command. The Trace Analyzer API was used to time just 100 of 1653 iterations.

Figure 6: Intel Trace Analyzer and Collector Function Profile Chart. This snapshot shows the output for a 768-process run. VT_API is paused time; timing is accumulated over ranks; Application time is T_comp. The average-time-per-process column can be added via right click and Function Profile Settings. A breakdown of MPI functions can be seen by right-clicking on Group MPI and selecting Ungroup MPI; this reveals that MPI_Allreduce and MPI_Waitall are the main hotspots in the MPI library.

Speedup and parallel efficiency can then be calculated and plotted separately for the compute time of the application (Figure 7). First, one can see that MPI time is insignificant up to 48 cores (the equivalent of two nodes). Above 96 ranks (4 nodes), the pure computational performance of the application also yields super-linear scaling; however, at around the same data point of 96 ranks, MPI time becomes the main reason for low efficiency.

Figure 7: Compute vs. total parallel efficiency as a function of the number of MPI ranks for the 2D distribution. Compute + MPI is the whole application. Benchmark configuration in Appendix A. This plot is still part of the Global Analysis; it shows that the efficiency is determined by the computation efficiency for small rank counts and by the network overhead for more than 2 nodes. The MPI hotspot functions were also determined by the Function Profile.

To investigate why this is the case, we need to look beyond the flat profile, since it is not clear whether the poor timings shown in calls to MPI routines are caused by slow network performance or by algorithmic inefficiencies causing unnecessary wait time; this is the first decision branch in the proposed methodology chart (Figure 1).

Interconnect time vs. Imbalance time

To understand the relative impact of MPI application imbalance vs. the interconnect (hardware and software stack) on application scalability (see the flow chart of the tuning methodology, Figure 1), we can start by employing the ideal network simulator (invoked through the Advanced -> Idealization menu). This allows us to separate the network stack's performance impact on total

MPI performance from algorithmic inefficiencies like imbalance and dependencies. A simple network model for the transfer time as a function of message volume V is T_trans[V] = L + (1/BW)*V, where L is latency, defined as the time needed to transfer a 0-byte message, and bandwidth BW is the transfer rate for asymptotically large messages. The ideal network may be simulated by setting all transfer times to 0, which corresponds to L = 0 and BW = infinity. The process of ideal trace generation is automated in Intel Trace Analyzer and Collector and can be invoked through the Advanced -> Idealization menu. The analyzer's imbalance diagrams (Advanced -> Application Imbalance Diagram menu) can then be generated using the real and idealized traces. The imbalance diagrams are represented as stacked column charts for the different process distributions (Figure 8).

Figure 8: Poisson solver imbalance diagram for different process (rank) placements in the case of 1536 ranks (= 64 24-core nodes). Timings are looked up automatically by Intel Trace Analyzer and Collector from the original and simulated traces. Benchmark configuration in Appendix A. The 2D distribution and the 1536x1 distribution look quite the same, although the MPI exchange patterns are very different. It was expected that the 1x1536 distribution's compute performance would be much worse, because each row contains only one or two elements. The traces were collected for a reduced runtime of only 100 iterations; the Y axis is time in seconds summed over all ranks. We suggest using the minimum number of iterations and ranks that still captures the main performance features of the application, to minimize analysis time. This graph is not the original imbalance diagram delivered by Intel Trace Analyzer and Collector, because we wanted to combine three different experiments in a single plot.

In theory, the imbalance diagram can be dominated either by transfer times (the algorithm is balanced, but we have to improve the network performance through a different process placement or new network hardware) or by waiting times (the algorithm has to be revisited for better load balancing and removal of dependencies). In the case of the Poisson solver, the imbalance diagram (Figure 8) shows that the application suffers predominantly from high transfer times. Therefore, following the tuning methodology chart (Figure 1), the first step after Global Analysis should be to concentrate on reducing the transfer times (interconnect) through MPI runtime tuning, before any investigations related to imbalance.

Algorithmic Investigations

In the last chapter we determined that investigations related to imbalance are not the most efficient next step for this Poisson solver. However, to illustrate the capabilities of Trace Analyzer and Collector, we briefly describe the root causes of the observed wait times, since each application is unique and the imbalance impact can prevail in other applications. One common reason behind imbalance issues is the inability to map the processor grid perfectly onto the computational grid. For example, if we map 1536 processors onto a computational grid of 3200x3200 points, in the 2D case with the 48x32 processor grid we get a local size of 3200/48 x 3200/32 points, which leads to 32x32 processes with 67x100 grid points and 16x32 processes with 66x100 grid points. This difference of less than 2% in the number of grid points may not be observable. In the 1D case with a 1536x1 process grid, we get 128 processors with 3x3200 grid points and 1408 processors with 2x3200 local grid points. The differences in runtime are observable by clicking on the Trace Analyzer's Load Balance tab, available next to the Flat Profile tab (Figure 9).

Figure 9: Intel Trace Analyzer and Collector load balance information for each MPI process. Processes 0-127 have one additional row of grid points.

Another possible cause of load imbalance is not algorithmic inefficiency but a phenomenon called OS jitter (OS for Operating System). OS jitter describes operating system events that can slow down compute performance: some processes may run slower for some iterations, causing imbalance and MPI wait time. For applications running on thousands of nodes (a common scenario these days), this noise can become crucial, because a single event on a single process may slow down the whole application. The reduction of this noise by using a specialized minimal OS is a current HPC research topic (5). The other source of MPI waiting time is a dependency(1) in the MPI coding techniques used in the Poisson solver. A closer look at the MPI hotspots in the idealized trace file reveals that 85% of the wait time is due to MPI_Allreduce; the rest is due to MPI_Waitall. This shows that

(1) Dependency means that part B of an application can only be started when part A has finished. Each message introduces a dependency, because the program on the receiver rank can only proceed after the message from part A has been sent and received in part B. If the receiver has already arrived at part B while part A has not yet sent the message, we see the receive routine in waiting mode. The time is, however, reported as plain MPI time, and only the idealization can tell whether it is waiting time (imbalance time).

practically all message passing dependencies have been removed. In a previous version of the Poisson solver, the exchange was programmed with blocking MPI_Send/MPI_Recv; that version clearly showed substantial wait time in MPI_Recv in the idealized trace file. Changing the algorithm to MPI_Isend/MPI_Irecv/MPI_Waitall successfully reduced these dependencies.

MPI Runtime tuning

This is the third phase of the methodology, and it is necessary when we observe high transfer times, as shown in the imbalance diagram. We can often improve MPI performance without changing the source code. This can be done by using Intel MPI environment variables or by changing the mapping of ranks to compute nodes. The process-to-node mapping can be altered through machine or configuration files, or by reordering the ranks inside a communicator; the MPI standard also contains support for Cartesian topologies, as described in chapter 4 of (6). For applications with high transfer times it is also beneficial to use faster communication hardware, in contrast to high wait times, which cannot be removed even by an ideal network. We may start the tuning by concentrating on global operations. The simple Poisson solver uses only MPI_Allreduce with an MPI_SUM operation for a global sum: a single double-precision (8-byte) value is summed over all processes. This is necessary for building the residual, which is a measure of the difference between the result arrays computed in two subsequent iterations; the solver iterations stop when the residual falls below a predefined threshold. It is always good advice to set the environment variable I_MPI_DEBUG to the integer value 5; this prints valuable information about the variables used, the network fabrics, and the process placement. Setting I_MPI_DEBUG to 6 further reveals the default algorithms for collective operations. The Intel MPI reference guide reveals that we can select among 8 different algorithms for MPI_Allreduce. Some of these algorithms are not appropriate for single 8-byte values, but we can simply test all of them for our application and find out whether a non-default value provides better performance. The algorithm can easily be changed by setting the environment variable I_MPI_ADJUST_ALLREDUCE to an integer value in the range 1-8, corresponding to the algorithms listed in the Intel MPI reference manual. A comparison of run times for each algorithm is shown in Figure 10.
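The algorithm sweep described above can be scripted. This is a hypothetical job-script fragment (the rank count and binary name follow the paper's example; the log-file naming is ours), not a verbatim recipe from the paper:

```shell
# Print fabric/pinning details at startup; a value of 6 would also show
# the collective-algorithm defaults.
export I_MPI_DEBUG=5

# Try each of the 8 MPI_Allreduce implementations listed in the
# Intel MPI reference manual and record the runtime of each run.
for alg in 1 2 3 4 5 6 7 8; do
    I_MPI_ADJUST_ALLREDUCE=$alg mpirun -n 1536 ./poisson.x \
        > allreduce_alg_${alg}.log 2>&1
done
```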

Figure 10: Poisson speedup for the 8 different MPI_Allreduce algorithms. Algorithms 1 and 5 are pretty close, but #5 delivers the best performance for 1536 ranks. The default is algorithm #1 (recursive doubling) for 1-48 processors and algorithm #5 (binomial gather + scatter) above that. We see that we can slightly improve the performance for 96-384 processors by choosing algorithm #1. Benchmark configuration in Appendix A.

Analyzing Intra-Node performance using Intel VTune Amplifier XE

Hotspot Analysis of MPI functions

VTune Amplifier was designed for single-node analysis, including threading. Many performance events can be read from the Performance Monitoring Unit (PMU) for a detailed analysis of Intel processor core and uncore behavior under a specific program. For a complete analysis of parallel programs, Intel Trace Analyzer and Collector alone is not sufficient, due to its primary focus on MPI performance. This becomes even more obvious when we start analyzing hybrid codes that combine parallel MPI processes with threading for a more efficient exploitation of computing resources. We show first how to further analyze MPI hotspots with VTune Amplifier; then we measure the bandwidth in search of a better understanding of the efficiency curve plotted in Figure 7. For a VTune Amplifier hotspot analysis, we launch the amplifier command line interface as the parallel MPI program with N MPI ranks. The Poisson solver invocation comes as a

parameter to the amplifier command line:

mpirun -n N amplxe-cl -result-dir hotspots_n -collect hotspots -- poisson.x

This command line runs poisson.x on the N ranks and produces, for each rank, a result directory containing a hotspot analysis for that rank; the result directory for rank m is named hotspots_n.m. The hotspot analysis for a chosen rank is usually done by transferring the result directory, the executable, and the sources to a workstation for further inspection using the VTune Amplifier GUI. After unpacking the results on a workstation, we open the VTune Amplifier Bottom-up tab and set the Call Stack Mode to "user functions + 1". This shows MPI functions (prefixed with P for the profile version) in the VTune Amplifier GUI (Figure 11). From the Intel Trace Analyzer and Collector analysis we know which MPI functions are the hotspots, but we do not know which occurrence of the MPI_Waitall function actually has the largest contribution to the application runtime. By revealing call stack information, VTune Amplifier can point to the specific MPI_Waitall call dominating the runtime. This is a useful starting point for implementing code changes to improve application performance.

Figure 11: MPI functions in the VTune Amplifier GUI. One of the MPI_Waitall call stacks dominates with 57.7% of the runtime.

As a result, we see that the last MPI_Waitall call stack (Appendix D, poisson.c, line 226, function call #1) dominates with 57.7% of the runtime; the second call stack (Appendix D, call #2) gets 34.6%, and the first call stack (Appendix D, #3) only 7.7%. The corresponding source can be found in Appendix D. The reason is that the first exchange starts at almost the same time on all processes but generates an imbalance; the second call is slowed down because of this imbalance, and this gets worse in the last exchange, until all ranks are synchronized again by MPI_Allreduce and a new iteration starts. The corresponding Intel Trace Analyzer and Collector snapshot for a single iteration is depicted in Figure 12.

Figure 12: Intel Trace Analyzer and Collector snapshot for a single iteration of the Poisson solver, showing call stacks #3, #2, and #1. A single iteration could be detected by using more advanced user-function instrumentation; for this simple implementation we just know that there is an MPI_Allreduce at the end of each iteration, so we zoom in after one MPI_Allreduce up to and including the following one. The numbering of the call stacks is done by VTune Amplifier XE; the first call stack is associated with the largest time fraction.

Such a VTune Amplifier hotspot analysis of the MPI hotspot functions may be conducted for all ranks, but usually we can begin with a single rank, at least in the case of homogeneous clusters (clusters consisting of Intel Xeon or Xeon Phi only, but not a mixture of both). Since the speedup curve for a single node (Figure 3) shows saturation on the node (24 cores per node),

we may anticipate some bandwidth saturation issues. Fortunately, VTune Amplifier XE provides a bandwidth analysis collection to verify this assumption. Bandwidth Analysis To use VTune Amplifier bandwidth analysis in conjunction with MPI, we can use the same trick as in the previous example for interposing VTune Amplifier with the MPI invocations of poisson.x, but with an added wrinkle to restrict VTune Amplifier to a single rank. Here is an example of a command to start 59 ranks as usual with VTune Amplifier bandwidth analysis data being collected only for the first rank: mpirun -n 1 amplxe-cl -start-paused --result-dir snb-bandwidth_60 --collect snb-bandwidth \ -- poisson.x : -n 59 poisson.x We have used above the snb-bandwidth analysis type, which also incorporates the architecture bandwidth analysis for the microarchitecture code name Ivy Town (as follows from <vtune_installation_dir>/config/analysis_type/snb_bandwidth.cfg) in the current release of VTune Amplifier XE 2013 Update 16. Since bandwidth analysis employs hardware collection sampling, the SEP (sampling) driver must be installed on each node where data will be collected. We also had to disable NMI watchdog to enable collection with hardware counters (Please see Appendix B). Since we are not interested in analyzing the MPI startup or data initialization section of the application we would like to collect VTune Amplifier data only for a specific time period, when the application runs a computational kernel. The command line arguments allow data collection to begin after a specified number of seconds (through that -start-paused command line option). This option was used in conjunction with VTune Amplifier API functions itt_resume() and itt_pause() surrounding computation kernel in the source code. It should be understood that even though we specified only one rank to invoke VTune Amplifier in the command above, that invocation will collect everything running on the node executing that rank. 
When collecting event-based sampling data such as LLC (Last Level Cache) misses, the events can be linked to the process under which they occurred. With several MPI ranks on a single node, the VTune Amplifier GUI therefore shows the event-based data (LLC misses) divided up among these MPI processes, as well as the total number of LLC misses. The bandwidth data, however, are reported per memory channel and package (not per rank/process) and then summed up for the whole node. A summary of the collected bandwidth data is printed to standard output after the collection is done; in our runs the LSF scheduler redirects it into the job report. Alternatively, one can generate the summary report on the command line:

  amplxe-cl -R summary -r <results-directory>

The total bandwidth, in GB/s, is reported separately for each package on the node. We summed the reported bandwidths of both packages to obtain the total bandwidth for the node. Plotting the bandwidth and the parallel efficiency together as we scale out (Figure 13), we observe an inverse correlation between them. The bandwidth in our experiments saturates at about 87 GB/s. This is about 88% of the STREAM benchmark (7) result of 98 GB/s, measured during the same job on the same node. The STREAM result in turn is roughly 80% of the theoretical peak for the quad-channel RAM installed on the nodes used for this test.

[Figure 13: Efficiency vs. bandwidth (GB/s) as a function of the number of MPI ranks (1, 6, 12, 24, 48, 72, 96); bandwidth data labels 83.548, 86.569, 80.717, 61.836, 30.392, 15.839, 8.908 GB/s. Ranks 1-24 are located on a single IVT node. Benchmark configuration in Appendix A.]

Summary

We have presented a methodology for the performance analysis of HPC applications using Intel Trace Analyzer and Collector and VTune Amplifier. The methodology starts from the whole application (or the part of the program that should be tuned) and proceeds to detailed analysis focused on the communication patterns and single MPI routines. A key role is played by the Intel Trace Analyzer and Collector idealizer, which can simulate program execution on an ideal network with infinitely fast communication but the same processor speed. The outcome of this simulation guides the next steps: algorithmic investigations or tuning of the MPI library. While algorithmic investigations may lead to code

changes, we can also choose to use Intel MPI environment variables, or to place MPI processes on specific physical cores, to reduce the MPI communication runtime. Intel VTune Amplifier XE can be used for call stack analysis of hotspot MPI functions (MPI_Waitall in the Poisson case). VTune Amplifier-based bandwidth analysis has been shown to be useful in finding performance bottlenecks of the Poisson solver application on an HPC cluster; it clearly explains the reasons behind the scaling saturation on a single node.

References

1. Intel Trace Collector. Reference Guide. [Online] http://software.intel.com/sites/products/documentation/hpc/ics/itac/81/itc_reference_guide/index.htm.
2. Intel Trace Analyzer. Reference Guide. [Online] http://software.intel.com/sites/products/documentation/hpc/ics/itac/81/ita_reference_guide/index.htm.
3. Intel VTune Amplifier XE 2013. [Online] http://software.intel.com/en-us/intel-vtune-amplifier-xe.
4. Intel Xeon Processor E5-2695 v2 (30M Cache, 2.40 GHz). [Online] http://ark.intel.com/products/75281/intel-xeon-processor-e5-2695-v2-30m-cache-2_40-GHz.
5. OS Jitter Mitigation Techniques. [Online] https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/w51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/os%20jitter%20mitigation%20techniques.
6. William Gropp, Ewing L. Lusk, Anthony Skjellum. Using MPI, 2nd Edition: Portable Parallel Programming with the Message Passing Interface. s.l.: The MIT Press, 1999.
7. STREAM benchmark. [Online] https://www.nersc.gov/systems/nersc-8-procurement/trinitynersc-8-rfp/nersc-8-trinity-benchmarks/stream.
8. Intel MPI Library. Reference Manual for Linux* OS. [Online] http://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/reference_manual/index.htm.
9. Intel Xeon Processor E5 v2 and E7 v2 Product Families Uncore Performance Monitoring Reference Manual. [Online] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/xeon-e5-2600-v2-uncore-manual.pdf.

Appendix A Benchmark Environment

- Intel Xeon E5 v2 processors (Ivy Town) with 12 cores; frequency: 2.7 GHz
- 2 processors per node (24 cores per node)
- Mellanox QDR InfiniBand
- Operating system: Red Hat EL 6.4
- Intel MPI 4.1.3.045

Appendix B Disabling the Non-Maskable Interrupt in a Cluster Environment

The Non-Maskable Interrupt (NMI) watchdog has to be disabled for VTune Amplifier to function properly. The NMI watchdog can be used by the Linux kernel to periodically detect whether a CPU is locked up. However, the NMI watchdog occupies a hardware performance counter, so performance tools such as VTune Amplifier cannot use PMU event-based sampling data collection while it is active.

To permanently disable the nmi_watchdog interrupt:

1. Under the root account, edit /boot/grub/grub.conf and add the nmi_watchdog=0 parameter to the kernel line so that it looks like:
   /boot/vmlinuz-2.6.32-131.0.15.el6.x86_64 ro root=/dev/sda8 panic=60 nmi_watchdog=0
2. Reboot the system.
3. After rebooting, verify that nmi_watchdog is disabled: grep NMI /proc/interrupts. If you see zeroes, nmi_watchdog is successfully disabled.

To temporarily disable the nmi_watchdog interrupt, enter:
   echo 0 > /proc/sys/kernel/nmi_watchdog

On the Endeavor cluster, disabling the NMI interrupt is implemented through the run-time variable NMI_WATCHDOG=OFF. This variable defines a new behavior in the modified LSF job manager, where VTune Amplifier is treated as a resource enabled by a user request in the cluster manager prologue. In the LSF epilogue, this resource is cleaned up once the job is done.

Appendix C Detailed Methodology

The methodology consists of four main phases:

1. Global analysis of the whole application, which gives first indications of performance issues. It can be subdivided into:
   a. Run time and scaling analysis
   b. Message passing performance analysis on an inter-/intra-node level, including finding MPI hotspots
   c. Network idealization, which yields an imbalance diagram and guidance on how to proceed: to phase 2 below if significant wait time is found, or, in case of high transfer times, directly to phase 3, skipping phase 2.
2. Algorithmic investigation: source code changes to implement better message passing practices or improve the load balance of the application by:
   a. Fixing imbalances in communication patterns of MPI and non-MPI routines. For example, slow sequential I/O often causes imbalances.
   b. Removing unnecessary synchronization. For example, message passing patterns using blocking send and receive may cause a send/receive order that increases wait times. This may be resolved by using non-blocking MPI_Isend/MPI_Irecv pairs.
3. MPI run-time tuning: Intel MPI can be tuned without changing the source code using:
   a. Environment variables for tuning collective operations, e.g., I_MPI_ADJUST_ALLREDUCE
   b. Environment variables for changing the message passing characteristics, e.g., I_MPI_DAPL_DIRECT_COPY_THRESHOLD
   c. Changing the MPI process/rank-to-node mapping for a better inter-/intra-node communication balance
4. Single process/node tuning, which is necessary for serial performance optimizations. Furthermore, single node tuning is important for improving overall application scalability and reducing load imbalance.
   a. We suggest conducting a hotspot analysis for each rank, or for the critical ranks identified in phases 1 and 2. The call stack information for a specific MPI routine may also help refine the analysis in phase 1(b).
   b. Bandwidth analysis on the node is important for understanding deficiencies in cluster-level scaling. This technique is used in this paper to explain the dive in the parallel efficiency curve of our Poisson solver.

After each tuning step the analysis can be repeated, starting again with phase 1. At least steps 1(a-c) should be conducted again to get new guidance for the next tuning actions.
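As an illustration of point 2(b), here is a hedged sketch of a halo exchange written with non-blocking calls. The neighbor ranks, buffers, counts, and the function name exchange_halo are illustrative, not taken from the poisson.c source; the sketch shows the generic pattern only.

```c
/* Sketch: replace a fixed blocking send/receive order with non-blocking
 * MPI_Isend/MPI_Irecv pairs completed by a single MPI_Waitall, so that
 * no rank is serialized waiting for a particular partner.
 * The 'up'/'down' neighbor ranks, buffers, and counts are hypothetical.
 */
#include <mpi.h>

void exchange_halo(double *send_up, double *recv_up,
                   double *send_down, double *recv_down,
                   int count, int up, int down, MPI_Comm comm)
{
    MPI_Request req[4];

    /* Post all receives first, then all sends; the exchange order no
     * longer depends on when each neighbor reaches its send. */
    MPI_Irecv(recv_up,   count, MPI_DOUBLE, up,   0, comm, &req[0]);
    MPI_Irecv(recv_down, count, MPI_DOUBLE, down, 1, comm, &req[1]);
    MPI_Isend(send_up,   count, MPI_DOUBLE, up,   1, comm, &req[2]);
    MPI_Isend(send_down, count, MPI_DOUBLE, down, 0, comm, &req[3]);

    /* Complete all four requests at once. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```

Note that MPI_Waitall then becomes the natural place where wait time accumulates, which is exactly why it shows up as the hotspot in the call stack analysis of the Poisson solver.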

Appendix D Compute Part Source Code

The following source code shows the iteration loop. The enumeration of the exchange function calls (#1 through #3) corresponds to the MPI function hotspot weights: the hottest call is marked #1 below and throughout the text surrounding Figure 11.

[Annotated code listing: iteration index it; ITAC API calls cutting out iterations 100 to 200; copy of the array for the later residuum calculation; call #3, the exchange routine at line 205 containing MPI_Waitall; update of the red points; call #2, the exchange routine at line 216; update of the black points; call #1, the exchange routine at line 226; residuum calculation containing MPI_Allreduce.]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. For more information regarding performance and optimization choices in Intel software products, visit http://software.intel.com/en-us/articles/optimization-notice. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804. © 2015, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Cilk Plus, and Intel VTune are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.