ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel Programs

Xu Liu
Department of Computer Science, College of William and Mary, Williamsburg, VA
xl10@cs.wm.edu

Bo Wu
Department of EECS, Colorado School of Mines, Golden, CO
bwu@mines.edu

ABSTRACT

It is difficult to scale parallel programs in a system that employs a large number of cores. To identify scalability bottlenecks, existing tools principally pinpoint poor thread synchronization strategies or unnecessary data communication. The memory subsystem is one of the key contributors to poor parallel scaling in multicore machines. State-of-the-art tools, however, either lack sophisticated capabilities for pinpointing scalability bottlenecks arising from the memory subsystem or ignore them entirely. To address this issue, we develop a tool, ScaAnalyzer, to pinpoint scaling losses due to poor memory access behaviors of parallel programs. ScaAnalyzer collects, attributes, and analyzes memory-related metrics during program execution while incurring very low overhead. ScaAnalyzer provides high-level, detailed guidance to programmers for scalability optimization. We demonstrate the utility of ScaAnalyzer with case studies of three parallel programs. For each benchmark, ScaAnalyzer identifies scalability bottlenecks caused by poor memory access behaviors and provides optimization guidance that yields significant improvement in scalability.

Keywords: memory bottlenecks, scalability, parallel profiler.

1. INTRODUCTION

The number of hardware threads in emerging multi-core processors is growing dramatically. For example, an IBM POWER7 processor [35] has 32 threads, an IBM Blue Gene/Q processor [9] has 64 threads, and an Intel Xeon Phi processor [14] has more than 240 threads. Moreover, multiple processors can be integrated in the same node, forming a non-uniform memory access (NUMA) architecture. For example, the SGI UV 1000 [33] has 8192 Intel Nehalem hardware threads interconnected via the NUMALink technology [34]. Scalability bottlenecks, however, prevent applications from benefiting from the large number of threads on existing or emerging shared-memory architectures.

Given the importance of scalability, it is necessary to identify and eliminate scalability bottlenecks in multithreaded applications. However, it is difficult for programmers to conduct such analysis and apply appropriate optimizations. There are three principal challenges. First, applications can be complex. A typical HPC application or industrial workload usually consists of hundreds of thousands of lines of code, hiding scalability bottlenecks deep in the code. Second, modern parallel architectures have sophisticated microarchitectures, integrating many hardware threads and multiple levels of memory.
Identifying which hardware component incurs scalability bottlenecks is challenging. Finally, program execution can have complicated behaviors, such as interactions between threads as well as interactions between user code and the operating system. Hence, understanding abnormal behaviors in a long-running parallel program is challenging. Given these three complexities, programmers need performance tools that automatically identify scalability bottlenecks and provide insightful guidance for optimization.

State-of-the-art tools [41, 6, 7, 37, 43] identify poor scalability caused only by thread load imbalance, frequent synchronization, thread serialization, and excessive data communication. They omit one important cause of poor scalability: the memory subsystem. Modern architectures adopt deep memory hierarchies that are shared among multiple cores. Compared to the rapid rise of the number of CPU cores in a machine, the memory bandwidth per core has barely improved, and has in fact decreased in some cases. With this trend, CPU cores compete for the shared resources in the memory subsystem, causing severe scaling bottlenecks. To identify such contention, existing approaches [42, 3, 20, 44, 30, 28] apply heavyweight methods to monitor memory access patterns, which are impractical for long-running parallel programs. Moreover, they do not evaluate the effect of such contention across executions at different scales.

To provide programmers insightful guidance for tuning scalability bottlenecks in memory hierarchies, we developed ScaAnalyzer, a profiler to pinpoint, quantify, and analyze scalability bottlenecks in parallel programs. We make three contributions in ScaAnalyzer.

- We employ lightweight profiling techniques in ScaAnalyzer and derive a new metric, the memory scaling fraction, as well as a new differential analysis to quantify scalability bottlenecks in memory.

- We develop novel schemes in ScaAnalyzer to provide rich but intuitive guidance for programmers to optimize their code. Such schemes include differential analysis and root-cause analysis.

- We demonstrate ScaAnalyzer's low runtime overhead, low space overhead, and superior scaling by profiling parallel programs running on a large number of cores.

ScaAnalyzer is based on HPCToolkit [23], which works on unmodified, fully optimized binary executables compiled by any compiler with any threading model, such as pthreads [5] and OpenMP [29]. To evaluate ScaAnalyzer, we studied three well-known multithreaded benchmarks, all of which have been highly optimized by benchmark developers. Surprisingly, we found that all these benchmarks fail to scale on modern multi-core architectures. With the help of ScaAnalyzer, we easily identify the scalability bottlenecks in memory hierarchies and obtain significant scalability improvements with little code modification.

The paper is organized as follows. Section 2 gives an example that highlights the importance of scalability issues in memory. Section 3 systematically studies the potential causes of scalability bottlenecks in memory hierarchies. Section 4 introduces the monitoring support available in modern hardware for efficient scalability analysis. Section 5 describes the differential analysis for quantifying scaling losses in memory. Section 6 describes the techniques of root-cause analysis used by ScaAnalyzer to provide insights for code optimization. Section 7 illustrates the implementation of ScaAnalyzer and how it achieves efficient analysis. Section 8 studies three benchmarks to evaluate the capabilities of ScaAnalyzer. Section 9 reviews existing work and distinguishes our approaches. Section 10 concludes the paper and previews future work.

2. A MOTIVATING EXAMPLE

To highlight the significant impact of scalability bottlenecks in memory hierarchies, we discuss a concrete example. The example shows that without scaling the performance of memory accesses, memory optimizations for data locality may not always improve performance when the code is running at a large scale. Moreover, existing tools, such as ArrayTool [24] and HPCToolkit [23], fail to provide insights into how these memory locality bottlenecks evolve when using more cores in the system.

Our example is a multithreaded benchmark, IRSmk, which is one of the Sequoia benchmarks [18] from Lawrence Livermore National Laboratory. (The out-of-box IRSmk is a sequential program; we parallelize it with OpenMP.) We ran IRSmk on an AMD Magny-Cours machine with 48 cores distributed in eight sockets. To achieve load balancing, we set the number of cores to be a power of two when running the benchmark. IRSmk was compiled using gcc 4.6 with the -O3 option.

Figure 1: The scalability of IRSmk on a 48-core AMD machine. The x-axis is the number of cores, while the y-axis is the execution time in seconds. Without scalability optimization, locality optimization via array regrouping does not show any speedup when the program runs on 16 and 32 cores.

Figure 1 uses blue circles to show how IRSmk scales from 1 to 32 cores. The code stops scaling when running on more than eight cores. The execution time even increases when running on 16 to 32 cores. We first optimize the data locality of IRSmk without optimizing its memory scalability. Following our earlier studies of this code with ArrayTool [24], we apply array regrouping to improve IRSmk's spatial locality.
The basic idea of this optimization is to regroup arrays that are always accessed together for better cache utilization. The green triangles in Figure 1 show the execution time of IRSmk after applying the array regrouping optimization. We can see that the optimized IRSmk has a 3x speedup when running sequentially. However, when the number of cores grows, the speedup shrinks quickly and disappears when running on 16 and 32 cores. The reason is that IRSmk has memory scalability bottlenecks when running on this architecture. Array regrouping, despite improving locality, does not address the scalability bottlenecks. Thus, to fully exploit the data locality optimization with multiple cores in the system, one needs to first fix the scalability problems in IRSmk. We discuss this benchmark in detail in Section 8. In the next section, we review the memory hierarchy of a typical multi-core system and describe how it prevents a program execution from scaling.

3. SCALABILITY ISSUES IN MEMORY

Memory hierarchies associated with a typical multi-socket, multi-core system have three layers: (1) a private layer (e.g., L1 and L2 caches) for each core, (2) a shared layer (e.g., L3 cache and local memory) shared between cores within the same socket, and (3) an across-socket memory layer, including remote memory and the interconnects between sockets. Each layer may introduce scalability bottlenecks, which may need different optimizations to address. We elaborate on them individually as follows.

Private layers. The aggregate capacity and bandwidth of private memory layers scale with the number of cores. However, due to data sharing between cores, private layers can still hurt program scalability. Because each core keeps a copy of shared data in its own private caches, writing to one copy may cause the invalidation of other copies; such cache invalidations are costly. Usually, the more cores used for computation, the greater the impact of invalidations on performance. Specifically, threads may falsely share cache lines by accessing different words in the same cache line. False sharing is a performance bottleneck that hurts program scalability in private memory layers [20].

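To make the false-sharing pattern concrete, the following is a minimal sketch, not taken from the paper; the per-thread counter array and the 64-byte cache-line size are assumptions for illustration. Each thread updates only its own slot, yet the slots share cache lines, so every write invalidates copies held by other cores. The padded variant previews the fix described in the Solution paragraph below.

```c
#include <omp.h>

#define NTHREADS   8
#define CACHE_LINE 64                    /* assumed cache-line size in bytes */

/* Falsely shared: all eight counters fit in one or two cache lines. */
long counters[NTHREADS];

/* Padding fix: each counter occupies its own cache line. */
struct padded_counter { long value; char pad[CACHE_LINE - sizeof(long)]; };
struct padded_counter padded[NTHREADS];

void count_events(long n) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < n; i++) {
            counters[tid]++;             /* invalidates other cores' cached copies */
            /* padded[tid].value++;         no cross-thread invalidation traffic   */
        }
    }
}
```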
Solution: To prevent private memory layers from becoming a scalability bottleneck, one should eliminate false sharing in caches. For example, one can pad data structures so that different threads write to different cache lines.

Shared layers. Neither the capacity nor the bandwidth of shared memory layers scales with the number of cores. As a result, threads can contend for resources, e.g., the shared cache in the shared layer. For example, cache lines loaded by one thread can be evicted by other threads sharing the same cache before they have been fully utilized. Moreover, excessive outstanding memory requests issued by threads can saturate the bandwidth, thus delaying data accesses. The contention becomes more severe when more threads are used, which hurts program scalability.

Solution: Several approaches can be used to prevent contention in shared layers. For example, one can reduce contention via efficient cache management, such as cache partitioning [19], bandwidth optimization [24], and contention-aware code transformation [1]. If the CPU cores in the system are not fully subscribed, one can spread threads across different sockets to benefit from multiple shared caches with more space and higher bandwidth.

Socket (NUMA) layers. Memory across sockets has non-uniform access latency. A socket together with the memory attached to it forms a NUMA domain. A core can access local memory in the same NUMA domain and also remote memory in other NUMA domains via interconnects. Because of limited bandwidth, without careful code design, the interconnects may suffer severe congestion. A common problem occurs when data are allocated in one NUMA domain but threads access these data from multiple NUMA domains. The congestion can further degrade program performance when more NUMA nodes are used for computation.

Solution: To alleviate congestion, one can interleave page allocation across all NUMA domains to prevent a single NUMA domain from becoming a bottleneck [23, 16]. Moreover, tasks can be scheduled so that each task runs in the NUMA domain where the data of its primary working set reside [25].

Besides the different layers in memory hierarchies, memory management in the operating system can also incur significant scalability bottlenecks. Section 8 shows that such OS bottlenecks can lead to substantial scaling losses. These are caused by problematic interactions between user-space code and the OS kernel, which are difficult to identify and fix. Given multiple sources of scaling losses in memory, one needs to identify the root causes of scalability bottlenecks in order to apply appropriate optimizations. To the best of our knowledge, ScaAnalyzer is the first lightweight tool that provides such root-cause analysis. In the next section, we introduce the lightweight data collection mechanisms available in modern processors that ScaAnalyzer uses.

4. HARDWARE SAMPLING METHODS

To guarantee lightweight analysis, ScaAnalyzer leverages hardware sampling mechanisms. Modern processors employ performance monitoring units (PMUs) to measure program execution. The PMUs can collect a variety of performance statistics during program execution to help identify performance bottlenecks. In this section, we review the lightweight performance monitoring mechanisms available in PMUs and identify the most efficient support for memory scalability analysis. Section 4.1 reviews event-based sampling, which has been extensively used for performance analysis.
Section 4.2 studies instruction sampling, which was introduced in recent architecture generations.

4.1 Event-based sampling

Event-based sampling (EBS) uses hardware performance counters to trigger an interrupt when a particular count threshold is reached. Microprocessors support several performance counters, which enable multiple events to be monitored in a single execution. However, EBS provides insufficient information to support efficient memory scalability analysis, for two main reasons. First, several factors make precise attribution with EBS difficult, including long pipelines, speculative execution, and the delay between the time an event counter reaches a threshold and the time it triggers an interrupt. As a result, it is difficult to use EBS to monitor individual memory access instructions. Second, EBS only monitors the occurrences of events but does not interpret their effects. Without an accurate, detailed model or simulator, it is hard to leverage EBS to identify and quantify which memory layer causes the most serious scalability bottleneck. Thus, EBS is not an appropriate measurement method for efficient memory scalability analysis.

4.2 Instruction sampling

In recent processors, PMUs can monitor instructions rather than events. PMUs periodically select an instruction to monitor when a specific event occurs more times than a pre-defined threshold. Common features of these instruction sampling mechanisms are (1) support for memory access sampling, (2) reporting of memory-related metrics, and (3) support for the identification of a precise instruction pointer (IP) and an effective data address for a sampled access. With these features, instruction sampling enables accurate attribution and provides more information for memory analysis than EBS.

There are four instruction sampling mechanisms available in modern processors. AMD Opteron processors (family 10h and successors) support instruction-based sampling (IBS) [8]. IBM POWER5 and successors support sampling marked events [38] (the PMU monitors a memory access instruction after a predefined number of marked events). Intel processors support precise event-based sampling (PEBS) [10], starting from the Nehalem microarchitecture. The Intel Itanium employs event address registers [11] to sample memory access instructions. However, to effectively support memory scalability analysis, instruction sampling should have two additional features:

- PMUs should assess the cost of sampled memory accesses in terms of data access latency in CPU cycles. This latency information can be used to identify and quantify scaling bottlenecks in the memory hierarchy.

- PMUs should record the details about the memory hierarchy response (i.e., data source) of the fetched data touched by sampled accesses. The data source is one layer in the memory hierarchy, such as the L1/L2/L3 caches and local/remote main memory. This information helps identify in which memory layers scalability bottlenecks occur.

4 caches and local/remote main memory. This information helps identify in which memory layers scalability bottlenecks occur. From our investigation, only AMD IBS and Intel PEBS can glean this extra information. We built ScaAnalyzer based on both IBS and PEBS for efficient scalability analysis. 5. DIFFERENTIAL ANALYSIS FOR MEM- ORY SCALING LOSSES Differential analysis [27, 7] is an effective way to identify scalability bottlenecks in parallel programs. Its basic idea is to run a program twice with two configurations to generate two profiles. The difference between these profiles, if not matching an expected performance pattern, shows scalability problems. For example, a program with the same input is run on four and eight cores to test strong scaling. We expect the eight-core execution to be half the execution time of the four-core execution. If the eight-core execution does not achieve the expected performance, we can quantify the scaling losses as the difference between expected and observed performance. Moreover, differential analysis can measure weak scaling, which proportionally increases both program input size and cores for execution. The expectation of weak scaling is to have the same execution time across runs. However, it is not straightforward to apply differential analysis to quantify scaling losses in memory hierarchies, because there are no obvious expectations we can have for execution time. A key contribution of ScaAnalyzer is that it extends differential analysis for scaling loss in memory. We first describe an insight which is the foundation for our differential analysis: With ideal memory scaling in hardware and software, memory accesses in a program should incur the same or less latency when the program uses more cores and the same input. There are two observations to support this insight. From the hardware perspective, the capacity and bandwidth of memory layers stay constant or increase when more cores are involved in computation. For example, when using more cores in one NUMA socket, the capacity and bandwidth of shared caches do not change, but the aggregate capacity and bandwidth for private caches increase. Ideal memory scaling also assumes that shared caches have unlimited capacity and bandwidth. From the software perspective, because the input is unchanged, the amount of data processed by the program is constant when running at different scales. Thus, software does not trigger more memory accesses while hardware provides the same or more cache space and bandwidth. Ideally, memory access latency should remain constant or become less with more cores. This insight provides an expectation for differential analysis of strong scaling in memory hierarchies. ScaAnalyzer defines a scaling fraction metric, f, as the memory access latency increment when the program runs on more cores, as shown in Equation 1. f = l large l small (1) In this equation, l large is the aggregate latency of memory accesses triggered by a program running on a large number of cores, while l small is the aggregate latency of memory accesses triggered by the program running on a small number of cores. With instruction sampling, we cannot obtain latency for all memory accesses. Therefore, ScaAnalyzer approximates l large and l small using average latency from the sampled memory accesses. Because both IBS and PEBS can randomly sample a large number of memory accesses, the approximation can achieve high accuracy [23]. The value of f indicates the scaling losses in memory. 
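To make Equation 1 concrete, here is a minimal sketch in C; the profile structure and field names are hypothetical, not ScaAnalyzer's format. It computes f for each profiled context from the sampled latencies of a small-core and a large-core run; the interpretation of the resulting values follows in the text below.

```c
#include <stdio.h>

/* Hypothetical per-context profile record: aggregate (or average) sampled
 * memory-access latency from two runs of the same program and input. */
typedef struct {
    const char *context;      /* calling context or data object name */
    double latency_small;     /* sampled latency, run on fewer cores  */
    double latency_large;     /* sampled latency, run on more cores   */
} profile_entry;

/* Compute and report the scaling fraction f = l_large / l_small (Equation 1). */
void report_scaling_fraction(const profile_entry *prof, int n) {
    for (int i = 0; i < n; i++) {
        double f = prof[i].latency_large / prof[i].latency_small;
        printf("%-30s f = %.2f\n", prof[i].context, f);
    }
}

int main(void) {
    profile_entry prof[] = {
        { "loop at foo.c:120", 150.0, 420.0 },   /* f = 2.8: poor memory scaling */
        { "loop at bar.c:88",  180.0, 175.0 },   /* f < 1: scales well           */
    };
    report_scaling_fraction(prof, 2);
    return 0;
}
```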
If f = 1, the program has perfect scalability in the memory hierarchy. If f < 1, the program has super-linear scalability. If f > 1, the program has sub-linear scalability. If f ≫ 1, the program has significant scalability bottlenecks in memory hierarchies.

It is worth noting that the metric f helps ScaAnalyzer identify root causes of performance degradation for serial code sections in a parallel program due to the side effects of parallel execution. Without any side effect, f is 1 for serial computation because serial computation has constant memory behavior across executions on different numbers of cores. However, if f is exceedingly high, the serial section is affected by the parallel execution. For instance, the parallel execution may allocate data in a NUMA domain other than the one the serial section runs in, thus incurring remote memory accesses. In the next section, we discuss how ScaAnalyzer leverages the latency and scaling factor metrics from differential analysis to identify the root causes of memory scalability bottlenecks.

6. ROOT-CAUSE ANALYSIS OF SCALING LOSSES IN MEMORY

Understanding the root causes of scaling losses in memory is critically important for code optimization. ScaAnalyzer performs root-cause analysis of memory scalability bottlenecks from two perspectives. From the hardware's perspective, ScaAnalyzer identifies the scaling losses in the memory hierarchy, which guides the selection of optimization methods, as described in Section 3. From the program's perspective, ScaAnalyzer pinpoints problematic data objects and their accesses in the source code to guide code optimization.

6.1 Locating scaling losses in memory layers

Given the latency and data source information captured by the instruction sampling PMU, ScaAnalyzer derives three data source latency metrics, l_p, l_s, and l_numa, to represent the aggregate sampled latency caused by fetching data from private memory layers, shared memory layers, and NUMA memory layers, respectively. Using the instruction sampling mechanisms described in Section 4.2, every time ScaAnalyzer captures a memory access sample, it extracts the latency and accumulates it into the corresponding latency metric according to the data source information recorded along with the sample. Equation 2 shows the decomposition of the total observed latency l from samples.

    l = l_p + l_s + l_numa        (2)

With this latency decomposition, ScaAnalyzer determines which memory layer dominates data access latency. Furthermore, by adapting Equation 1, ScaAnalyzer can compute per-layer scaling factors f_p, f_s, and f_numa, using the data source latency metrics l_p, l_s, and l_numa, respectively.

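A small sketch of this decomposition follows; the sample record and the layer classification below are hypothetical simplifications of what IBS/PEBS report. Each sampled access carries a latency and a data source, which is accumulated into l_p, l_s, or l_numa; applying Equation 1 to two such profiles then yields the per-layer factors discussed next.

```c
/* Hypothetical classification of a sample's data source into the three
 * memory layers of Section 3. */
typedef enum { PRIVATE_LAYER, SHARED_LAYER, NUMA_LAYER } layer_t;

typedef struct {
    layer_t source;      /* which layer served the sampled access */
    double  latency;     /* access latency in CPU cycles          */
} mem_sample;

typedef struct { double l_p, l_s, l_numa; } layer_latency;

/* Decompose total sampled latency by data source (Equation 2). */
layer_latency decompose(const mem_sample *samples, int n) {
    layer_latency l = { 0.0, 0.0, 0.0 };
    for (int i = 0; i < n; i++) {
        switch (samples[i].source) {
        case PRIVATE_LAYER: l.l_p    += samples[i].latency; break; /* L1/L2 */
        case SHARED_LAYER:  l.l_s    += samples[i].latency; break; /* L3, local DRAM */
        case NUMA_LAYER:    l.l_numa += samples[i].latency; break; /* remote DRAM */
        }
    }
    return l;  /* f_p, f_s, f_numa follow by applying Equation 1 per layer */
}
```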
We can identify the causes of scaling losses in memory hierarchies using these derived metrics. If a specific memory layer dominates the memory access latency and has a high scaling factor, we can treat this memory layer as the cause of poor scalability and apply the appropriate optimization methods as described in Section 3.

6.2 Pinpointing scaling losses in code

To efficiently guide code optimization, ScaAnalyzer associates performance metrics with data objects and their accesses in the code. Section 6.2.1 describes the metric attribution mechanism, which provides the necessary information for memory optimization. Section 6.2.2 describes the technique of interpreting the derived metrics to identify the data objects or memory accesses that can benefit most from scalability optimization.

6.2.1 Sample attribution

Instruction sampling records the precise instruction pointer (IP) of each sampled memory access and the effective memory address it touches. ScaAnalyzer leverages this information to attribute samples using a mechanism similar to that used in HPCToolkit [23]. ScaAnalyzer uses precise IPs to associate samples with memory access instructions, which can be further mapped to the source code using debugging information generated by the compiler. In addition, ScaAnalyzer determines the call path of sampled memory accesses with a lightweight on-the-fly binary analysis technique [40]. Identifying problematic memory accesses within their calling context is a key component of ScaAnalyzer's support for insightful scalability analysis in memory hierarchies.

At the same time, ScaAnalyzer also leverages effective addresses to associate samples with data objects. This attribution mechanism helps understand which data objects do not scale in the memory hierarchy. There are three kinds of data objects that can be allocated during program execution:

Static data. Data objects allocated in the .bss section are static data. Each static variable has a named entry in the symbol table that identifies the memory range for the variable with an offset from the beginning of the load module.

Heap data. Variables in the heap are allocated dynamically during execution by one of the malloc family of functions (malloc, calloc, realloc).

Stack data. Data objects can be allocated on the execution stack. Automatic variables, local arrays, and objects allocated with the alloca function belong to this category.

Table 1 shows how ScaAnalyzer monitors these three types of data objects. ScaAnalyzer identifies each monitored data object with a unique ID and records the associated allocated memory range. It reads the symbol table in each load module and extracts the names and memory ranges of static data objects. It overloads the allocation functions and triggers synchronous samples at each allocation; ScaAnalyzer uses the allocation call path as the ID of the heap data object and records the allocated memory range. For stack data objects, ScaAnalyzer tracks the stack pointers of function frames in the call stack and uses the memory ranges of stack frames to bound all data allocated on the stack.

Table 1: Methods of monitoring different types of data objects.
    data type | ID              | memory range
    static    | name            | reading from the symbol table
    heap      | allocation path | overloading allocation functions
    stack     | line mapping    | capturing stack frames
ScaAnalyzer uses the mapping of memory accesses to source code lines to identify stack data objects. If a source code line contains more than one data object, one can modify the source code to break the line and put each data object on its own line. Because ScaAnalyzer captures the precise instruction pointer of the sampled memory access, one can easily tell which data object the sample is attributed to by examining the source code that the precise IP maps to.

For all static, heap, and stack data objects, ScaAnalyzer enters their memory ranges and IDs into a map for sample attribution. This map is implemented with a splay tree [36] that uses the memory intervals of data objects as keys and data IDs as values. The splay tree accelerates lookups by self-adjusting upon insertion, deletion, and lookup operations. With the effective address captured in a sample, ScaAnalyzer checks in the map which data object's memory range includes this effective address. If the data object is found in the map, ScaAnalyzer attributes the sample, together with its calling context, to that data object. By associating samples with both data objects and their accesses in full calling contexts, ScaAnalyzer provides deep insights for developers to understand poor memory behaviors.

6.2.2 Metric interpretation

ScaAnalyzer attributes metrics, such as the latency l and the scaling factor f, to data objects and their accesses together with samples. These metrics, besides identifying the root causes of scaling losses in memory hierarchies, can also pinpoint problematic data objects and code regions, e.g., memory access instructions, loops, and functions. ScaAnalyzer provides an effective metric interpretation strategy to report bottlenecks that lead to significant scalability improvement after optimization. ScaAnalyzer interprets the metrics in three steps.

First, ScaAnalyzer identifies whether a program is memory-bound or not. Only memory-bound programs receive our attention for memory scalability analysis and optimization. To obtain this insight, ScaAnalyzer uses a metric named latency per instruction, based on the raw metrics collected by instruction sampling. This metric is described in our previous work [21]. It is worth noting that this metric can be attributed to any context to assess how memory bound it is, including statements, loops, procedures, and the whole program.

Table 2: Optimization decisions according to different l and f pairs.
    l    | f    | optimization decision
    high | high | scalability optimization yields high benefit
    high | low  | memory bottlenecks not related to scalability
    low  | high | scalability optimization yields little benefit
    low  | low  | good memory performance

Next, ScaAnalyzer uses f and l to determine which data objects have the most severe scalability bottlenecks in memory,

as shown in Table 2. Targeting only the data objects with high f and high l, our optimization can improve the whole program's scalability significantly. From our experiments, a data object warrants investigation if its associated f is higher than 1.5 and its l accounts for more than 30% of the total latency. Finally, ScaAnalyzer computes the data source latency metrics for data objects, such as l_p, l_s, and l_numa. Applying the technique described in Section 6.1, ScaAnalyzer reports which memory layers are responsible for the scaling issues of problematic data objects.

With these three steps, ScaAnalyzer performs root-cause analysis of scalability bottlenecks in program code and correlates it with the root-cause analysis in memory layers (Section 6.1). Thus, ScaAnalyzer provides insightful guidance for code optimization: where in the code to apply which optimization strategy for memory scalability improvement.

7. SCAANALYZER IMPLEMENTATION

Figure 2: The workflow of ScaAnalyzer. There are three components: data collector, analyzer, and visualizer. ScaAnalyzer works on the unmodified binary and automatically produces intuitive analysis results for the scalability bottlenecks.

Figure 2 shows the workflow of ScaAnalyzer. ScaAnalyzer extends HPCToolkit [23] and consists of three components: data collector, analyzer, and visualizer. The data collector takes the executable binary and runs it multiple times with the same input but at different core counts. By default, we use a compact thread-to-core placement to evaluate scalability: we first use cores in one socket; when no more idle cores are available in this socket, we spread threads to another socket. Then the data analyzer investigates all profiles, derives scaling metrics, and associates analysis results with program information, such as data or function symbols and source code lines. Typically, users first analyze the two profiles with the most significant scaling losses, like the execution profiles with 8 and 16 cores in Figure 1. All profile data are maintained in a database. Finally, the data visualizer presents the analysis data from the database in a graphical interface, with the metrics sorted. Based on the data collected, analyzed, and presented by ScaAnalyzer, users manually assess the potential benefit of each scaling bottleneck. To eliminate the observed scaling bottlenecks, users transform code or adjust thread placement under the guidance of ScaAnalyzer.

The implementation challenges for each ScaAnalyzer component include keeping measurement overhead low, scaling the analysis to many cores, and providing intuitive analysis results. In the rest of this section, we describe how ScaAnalyzer addresses these challenges in each component.

Online data collector. ScaAnalyzer programs each core's PMU to enable instruction sampling with a pre-defined period. ScaAnalyzer's data collector gleans and attributes samples dynamically. To minimize the runtime overhead and achieve scalable measurement, we avoid any thread synchronization operation during data collection and attribution. Each thread collects its own samples and attributes them to data objects using its own map. Moreover, to avoid the high overhead incurred by monitoring every memory allocation, ScaAnalyzer can be configured to monitor only memory allocations that are larger than a predefined threshold, e.g., 4096 bytes.
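The following is a minimal sketch of this kind of threshold-filtered allocation tracking. It is not ScaAnalyzer's implementation: ScaAnalyzer overloads malloc/calloc/realloc directly, uses the allocation call path as the ID, keeps one map per thread, and stores ranges in a splay tree, whereas this sketch uses an explicit wrapper, a global table, and a linear scan.

```c
#include <stddef.h>
#include <stdlib.h>

#define THRESHOLD   4096      /* track only "large" allocations, as in Section 7 */
#define MAX_OBJECTS 4096

typedef struct { void *start; size_t size; int id; } range_t;

static range_t objects[MAX_OBJECTS];   /* address ranges of tracked heap objects */
static int     nobjects;

/* Record an allocation's address range if it is large enough to be interesting. */
void *tracked_malloc(size_t size) {
    void *p = malloc(size);
    if (p && size >= THRESHOLD && nobjects < MAX_OBJECTS) {
        objects[nobjects].start = p;
        objects[nobjects].size  = size;
        objects[nobjects].id    = nobjects;  /* ScaAnalyzer uses the allocation call path */
        nobjects++;
    }
    return p;
}

/* Attribute a sampled effective address to a tracked data object, or -1 if none. */
int attribute(const void *addr) {
    for (int i = 0; i < nobjects; i++) {
        const char *lo = (const char *)objects[i].start;
        if ((const char *)addr >= lo && (const char *)addr < lo + objects[i].size)
            return objects[i].id;
    }
    return -1;
}
```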
Therefore, the data collector of ScaAnalyzer has very low runtime overhead, which, in addition, does not grow proportionally when more cores are used for execution. We evaluate ScaAnalyzer's overhead in Section 8.

Offline data analyzer. As ScaAnalyzer's data collector produces a profile per thread, the analyzer coalesces the profiles for the whole execution. Such a compact view of profiles scales the analysis of program executions to a large number of cores. The coalescing procedure follows two rules: (1) data objects with the same ID are coalesced into one; and (2) memory accesses with the same calling context are coalesced into one. The latency metrics are accumulated during coalescing. The coalescing technique also provides a way for ScaAnalyzer to perform the differential analysis, which is in turn a coalescing process on two aggregate profiles from different executions. The profile coalescing overhead grows linearly with the number of threads and processes used by the monitored program. ScaAnalyzer leverages the reduction tree technique [39] applied in HPCToolkit to parallelize the merging process. ScaAnalyzer requires less than 10 seconds to produce the aggregate profiles for all of our case study programs.

Data visualizer. To provide intuitive analysis, ScaAnalyzer's data visualizer presents the analysis data in a friendly way. The visualizer shows code- and data-centric profiles, scaling metrics, and source code in a graphical interface. It sorts the metrics and highlights the problematic data objects and their accesses with poor memory scalability. In the next section, we discuss ScaAnalyzer's visualizer using example snapshots.

8. CASE STUDIES

We evaluated ScaAnalyzer on two machines. One is a single-node server with four AMD Magny-Cours processors (each package contains two dies). There are 48 cores in this machine, spread across eight NUMA domains (six cores per NUMA domain), and the machine has 128 GB of memory. This machine supports instruction-based sampling (IBS). The other platform is a single-node server with two Intel Sandy Bridge EP processors and 256 GB of memory. There are 16 cores in this machine, two hardware threads per core, and two NUMA domains. This machine supports precise event-based sampling (PEBS). Our study focused on evaluating measurement and analysis of highly multithreaded executions on one node.

We studied three well-known multithreaded benchmarks, all of which have been optimized by their developers for years. We built these programs using gcc 4.6 on the AMD machine and gcc 4.8 on the Intel machine with the -O3 optimization option. The description of each benchmark is as follows:

LULESH [17], a Lawrence Livermore National Laboratory (LLNL) application benchmark, is an Arbitrary Lagrangian-Eulerian code that solves the Sedov blast wave problem for one material in 3D. In this paper, we study a highly tuned LULESH implementation written in C++ with OpenMP.

IRSmk, an LLNL Sequoia benchmark [18], is an implicit radiation solver. IRSmk is important to LLNL because it has highly representative styles of loops and array indexing for production applications. IRSmk is written in C, parallelized with OpenMP, and thoroughly optimized by LLNL's code team.

Linear Regression, one of the Phoenix benchmarks [31], tries to find the best-fitting straight line through a set of points. It is written in C and parallelized with pthreads.

To perform the differential analysis, ScaAnalyzer measures multiple executions of each benchmark running on different numbers of cores. To give a thorough analysis, ScaAnalyzer monitors every memory allocation. We report the overhead of ScaAnalyzer when running with the maximum number of cores allowed for the monitored program. For all of these benchmarks, ScaAnalyzer adopts a sampling period of 65,535 instructions with AMD IBS and 10,000 memory accesses with Intel PEBS. Table 3 shows the runtime overhead of these three benchmarks.

Table 3: The runtime overhead of each benchmark when monitored by ScaAnalyzer.
    benchmark         | AMD (IBS) native | AMD (IBS) profiling | Intel (PEBS) native | Intel (PEBS) profiling
    LULESH            | 378s             | 403s (+6.6%)        | 105s                | 112s (+6.7%)
    IRSmk             | 62.2s            | 65s (+4.5%)         | 23s                 | 23.6s (+2.6%)
    Linear Regression | 21.6s            | 23.2s (+7.4%)       | 20.6s               | 22s (+6.8%)

From the table, we can see that ScaAnalyzer incurs low runtime overhead. Moreover, ScaAnalyzer has about 4 MB of space overhead per thread to maintain the profiling data. Thus, ScaAnalyzer is an efficient profiler, suitable for real applications parallelized on large-scale multi-core systems.

In the rest of this section, we discuss the optimization guided by ScaAnalyzer for each benchmark in detail. We perform the case studies on both our AMD and Intel machines. We mainly report the results on the AMD machine because it has more cores than the Intel machine and exposes more interesting scalability issues. At the end of this section, we also show a case study from our Intel machine.

8.1 LULESH

Figure 3: The execution time in seconds (left) and parallel efficiency (right) of the original LULESH code running on different numbers of cores.

Figure 3 shows the scalability test of LULESH from 1 to 48 cores on the AMD machine. As shown in the left of the figure, LULESH enjoys speedups on up to 8 cores. When using 16 cores or more, the performance begins to drop dramatically. The parallel efficiency, computed as T_1 / (n × T_n) × 100% where T_i is the execution time of the program running on i cores, is as low as 3.5% when running on 48 cores. We leverage ScaAnalyzer to identify its scalability bottlenecks.

Figure 4: ScaAnalyzer visualizer's data presentation for the scalability analysis of the original LULESH.

Figure 4 shows the differential analysis of two executions on 16 and 32 cores, respectively. There are three panes in ScaAnalyzer's visualizer. The top one shows the program source code; the bottom left shows the program structure, organized as data object IDs and their accesses; the bottom right shows the metrics correlated with any monitored data object and memory accesses in full calling contexts.
ScaAnalyzer identifies that LULESH is a memory-bound benchmark. The average memory access latency for the whole program grows from 110 cycles to nearly 600 cycles when doubling the number of cores, leading to a scaling factor f as high as 5.43, as shown in Figure 4. Therefore, LULESH suffers from significant scaling bottlenecks in the memory hierarchy. Figure 4 also shows that data objects allocated on the stack account for most of the memory access latency: 81.7% and 96.9% in the 16-core and 32-core executions, respectively.

We further expand the program structure in the bottom-left pane for the memory accesses with the most latency. We find that the accesses annotated in the rectangle, to six arrays pfx, pfy, pfz, x1, y1, and z1, have unscalable latency. These arrays are allocated on the stack at line 1510 and an adjacent line. Their average latency grows from 250 cycles to 960 cycles (not shown in the figure) from 16 to 32 cores. The data source latency metrics show that 95% of the accesses to these arrays are served from shared layers, which means that thread contention in the shared cache or memory hurts the scalability, according to our analysis in Section 3.

A further study reveals the cause of the poor scalability of the assignments at lines 1522 to 1527: operating system page initialization. The six arrays on the left-hand side of these assignments are always allocated and freed before and after the parallel loop whenever the enclosing function CalcHourglassControlForElems is called. Figure 4 uses an ellipse to highlight the allocation sites of these arrays but omits the free sites due to limited space. Because of frequent data allocation and reclamation, the OS needs to initialize the allocated pages at a high frequency. The initialization occurs the first time the threads access the pages, so the assignments highlighted in the rectangle are the places where the OS initializes pages. However, these assignments are in a parallel loop, and different threads contend for the page initializer in the OS. To fix the problem, we allocate and initialize all these arrays once in the global section to avoid frequent page initialization across threads. Moreover, ScaAnalyzer identifies three other parallel loops that account for 22.4%, 21%, and 5.1% of the total latency, with high scaling factors of 9.58, 5.95, and 3.4, respectively. We apply similar optimizations to all of these parallel loops.

Figure 6: The left figure shows the parallel efficiency of LULESH after optimizing the page initialization bottleneck. The right figure shows the parallel efficiency of LULESH after continued optimization of NUMA bottlenecks.

The left of Figure 6 shows the scalability of LULESH after this optimization, which is much better than the original code. However, the optimized scalability is still not good enough. The parallel efficiency drops to nearly 50% when running with 48 cores. We continue to profile the optimized LULESH with ScaAnalyzer.

Figure 5: Scalability analysis of LULESH after optimizing the OS's page initialization. ScaAnalyzer highlights the unscalable arrays.

As shown in Figure 5, the memory scaling factor of the overall program drops to 1.26 (from 5.43) after the previous optimization. Moreover, the average memory latency is reduced to 90 and 112 cycles for the 16-core and 32-core executions, respectively. The stack arrays no longer cause significant access latency. These metrics reveal that our first round of optimization significantly improved the scalability of LULESH. However, the scaling factor is not low enough to indicate high memory scalability. The main remaining cause lies in the NUMA layer. When scaling from 16 to 32 cores, the scaling factor in the NUMA layer is 1.9, as shown in Figure 5. The heap-allocated arrays highlighted in the figure are the root causes of the poor scalability in the NUMA layer.
They suffer from high access latency and cause the scaling factor in the NUMA layer to be as high as 2. We further examine the accesses to these arrays and find that they all suffer from many remote NUMA accesses. We also find that these arrays are all allocated in one NUMA domain but accessed by threads from all other NUMA domains, which can easily saturate the bandwidth of the interconnects, thus hurting scalability. Therefore, we continue by optimizing these NUMA problems. We redistribute the pages allocated for these arrays using libnuma [15] to match their access patterns and avoid interconnect congestion. After this optimization, the parallel efficiency improves considerably, especially when running on more than 16 cores, as shown in the right of Figure 6. Although LULESH does not achieve perfect scaling, the memory scaling factor is reduced to around 1, which means that the remaining scaling loss is not due to memory but to other causes, such as serialization or thread synchronization. Continuing the optimization of LULESH is beyond the scope of this paper.

Figure 7: The scalability measurement of LULESH after all optimizations; it is better than the original version shown in Figure 3.

Figure 7 shows the scalable execution of our optimized LULESH, which is significantly better than the original version shown in Figure 3.
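The paper does not show the redistribution code; the following is a minimal sketch, with a hypothetical allocation routine, of one way to spread an array's pages across all NUMA domains with libnuma, in the spirit of the interleaving solution from Section 3. Build with -lnuma.

```c
#include <numa.h>
#include <stdlib.h>

/* Allocate an array whose backing pages are interleaved round-robin across all
 * NUMA nodes, so that no single memory controller or interconnect link becomes
 * the hot spot when threads in every domain access it. */
double *alloc_interleaved_array(size_t nelems) {
    if (numa_available() < 0)                       /* no NUMA support: fall back */
        return malloc(nelems * sizeof(double));
    return numa_alloc_interleaved(nelems * sizeof(double));
}

/* Free with numa_free(ptr, nelems * sizeof(double)) when the pages came from
 * numa_alloc_interleaved(); if the access pattern is known, allocating each
 * block on the node that uses it (numa_alloc_onnode) can work even better. */
```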

8.2 IRSmk

Figure 8: The left figure is the parallel efficiency of the original IRSmk, while the right figure is the parallel efficiency of IRSmk after the NUMA optimization.

The left of Figure 8 shows the poor parallel efficiency of IRSmk executions on our AMD machine. We use ScaAnalyzer to measure the scalability of IRSmk. For the overall program, ScaAnalyzer reports that the average latency per memory access grows from 183 cycles to 362 cycles when doubling the number of cores from 16 to 32. ScaAnalyzer also reports that the latency related to remote NUMA accesses grows rapidly from 31.5% to 53.2% of the total latency. Moreover, the NUMA scaling factor is more than 2, both for the whole program and for the array highlighted in Figure 9. Therefore, optimizing the NUMA bottlenecks is the first step in improving the scalability of IRSmk.

Figure 9: The array with unscalable accesses in IRSmk. The high scaling factor and NUMA latency indicate that the root cause is in the NUMA layer.

We further investigate the profiles provided by ScaAnalyzer. The reason for the high NUMA-related latency is that all of these arrays are allocated in a single NUMA domain but accessed by threads from all NUMA domains, similar to the NUMA problem in LULESH. Therefore, we slightly modify the code to match the computation with the data distribution. The right of Figure 8 shows IRSmk's parallel efficiency after the optimization. Compared to the original version, shown in the left of the figure, this optimization improves IRSmk's scalability significantly.

However, the optimized IRSmk still does not achieve optimal scalability. From the right of Figure 8, we can see that parallel efficiency drops significantly from 1 to 8 cores and is stable with 16 and 32 cores. To identify the scalability bottleneck, we continue to use ScaAnalyzer to profile the optimized IRSmk code and compare its executions on 2 and 4 cores.

Figure 10: IRSmk has an improved scaling factor after the NUMA optimization. The scalability bottleneck now comes from memory contention in the shared layer.

Figure 10 shows the measurement and analysis result. The scaling factor for the whole program is 1.55, and the most significant array, ABCD, still suffers from a scaling factor of 1.56. With the data source latency metrics, ScaAnalyzer identifies that the unscalable accesses are served from the shared layers: the shared L3 caches and local memory, due to both limited capacity and limited bandwidth. The average latency per access is around 37 cycles running on two cores and 57 cycles running on four cores. The reason is that all threads are placed in one NUMA domain by default, incurring high contention. To address this issue, we use all NUMA domains instead of just one, placing the threads in a round-robin manner. This replaces the default compact thread placement, and threads benefit from more cache space and memory bandwidth. Moreover, because the previous optimization on IRSmk already matches the data distribution to the memory access patterns, utilizing multiple NUMA domains does not saturate the NUMA interconnects.

Figure 11: The parallel efficiency (left) and execution time (right) of IRSmk after optimization, running on 1 to 32 cores.

Figure 11 shows the scalability measurement for IRSmk after our continued optimization. The parallel efficiencies, on the left of the figure, are all higher than 75%.
The scaling factor is reduced to almost 1, which means that we have fixed the memory scalability bottlenecks in IRSmk. The right of the figure shows the execution time of the optimized IRSmk, which now scales well.
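The paper does not show how the round-robin thread placement was implemented; one possible sketch, using libnuma inside an OpenMP parallel region (build with -fopenmp -lnuma), is shown below. The function name is hypothetical.

```c
#include <numa.h>
#include <omp.h>

/* Spread OpenMP threads round-robin across NUMA domains instead of packing
 * them into one domain, so they see more aggregate shared-cache capacity and
 * memory bandwidth than under the default compact placement. */
void spread_threads_across_numa_domains(void) {
    #pragma omp parallel
    {
        if (numa_available() >= 0) {
            int nodes = numa_num_configured_nodes();
            numa_run_on_node(omp_get_thread_num() % nodes);  /* bind thread t to node t mod N */
        }
    }
}
```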

It is worth noting that after addressing the memory scalability issues, the array regrouping optimization described in Section 2 becomes effective, showing a 3.4x speedup on 32 cores.

8.3 Linear Regression

Figure 12: The poor scalability of Linear Regression. The left figure shows the execution time, while the right figure shows the parallel efficiency.

Figure 12 shows the execution time and parallel efficiency of Linear Regression running on 1 to 16 cores of our AMD machine. Obviously, this benchmark does not scale even to two cores. The parallel efficiency is less than 10% when running on 16 cores. To identify the scaling bottlenecks, we leverage ScaAnalyzer to monitor executions of Linear Regression with four and eight threads.

Figure 13: ScaAnalyzer's differential analysis shows a high scaling factor. The latency metrics show that the scalability bottleneck comes from the private layer.

As shown in Figure 13, the scaling factor for the whole program is very high; for the memory accesses highlighted in the rectangles, it is higher still. Therefore, accesses to the array args are unscalable. From the latency and data source information, ScaAnalyzer shows that more than 95% of the scaling losses come from the private layer, especially from private cache invalidation. According to our analysis in Section 3, the poor scalability comes from inefficient data sharing. We examine the allocation site of args and find that the number of array elements is equal to the number of threads. Each thread accesses one element to update its computation results. Each element is a structure. However, the structure is only 52 bytes, which is less than the cache line size of 64 bytes. Thus, two threads can access the same cache line, causing false sharing. With this insight, we optimize the code by padding the structure of args, separating the accesses from different threads onto different cache lines.

Figure 14: Linear Regression has good scalability in execution time (left) and parallel efficiency (right) after removing the false sharing.

Figure 14 shows the optimization results for Linear Regression. We can see that this benchmark now achieves high scalability, with a parallel efficiency of more than 90%.

8.4 Scalability issues on the Intel machine

We run IRSmk on our Intel machine with 1 to 16 cores.

Figure 15: The original (left) and optimized (right) parallel efficiency of IRSmk running on the Intel machine.

Figure 15 shows the scalability improvement with the optimizations described in Section 8.2. IRSmk has good scalability when running on 1 to 8 cores. However, there is no significant speedup when running on 16 cores over 8 cores. ScaAnalyzer shows that the scaling factor is 1.3. The problem is in the shared layer. The average latency of accessing the shared L3 cache increases from 99 to 166 cycles; the average latency of accessing main memory increases from 2,493 to 3,523 cycles. The poor scaling comes from bandwidth contention in both the L3 cache and main memory. Addressing this problem requires hardware with more scalable memory bandwidth.

9. RELATED WORK

To the best of our knowledge, ScaAnalyzer is the first lightweight profiler to study memory scalability bottlenecks and provide insightful guidance for optimization.

9.1 Tools for lightweight memory analysis

We review existing lightweight tools that identify memory bottlenecks such as false sharing, poor cache locality, and NUMA bottlenecks.
These tools provide insights for optimizing programs based on sampling, incurring overhead ranging from 10% to 5x. To detect false sharing, existing tools [20, 12, 13] leverage sampling methods, via either software instrumentation or hardware PMUs. They monitor memory accesses and examine whether they write to the same cache line but different words.

Table 4: The speedups of all the benchmarks we studied on the maximum number of cores, after eliminating the memory scalability bottlenecks.
    benchmark | #cores | scalability issue                        | speedup
    LULESH    | 48     | OS page initialization, NUMA contention  | 43
    IRSmk     | 32     | NUMA contention, shared cache contention | 20
    LR        | 16     | false sharing                            | 5.7

Moreover, tools like HPCToolkit [21], SLO [2], ThreadSpotter [32], Cache Scope [4], and MACPO [30] capture memory reuse distance information with lightweight sampling methods to identify memory bottlenecks with poor data locality. These tools map reuse distance information to source code to provide intuitive guidance for locality optimization. Finally, tools such as HPCToolkit-NUMA [22], MemProf [16], and Memphis [26] pinpoint NUMA bottlenecks with data collected from hardware PMUs. All these tools focus on identifying individual memory bottlenecks in a single execution, without considering a program's scalability. Moreover, these tools do not give insights into which kinds of memory bottlenecks incur the most significant performance issues when we attempt to scale a program on a system with many cores. In contrast, ScaAnalyzer analyzes all kinds of memory bottlenecks related to the scalability problem. It performs scalability studies to understand how bottlenecks in different layers of the memory hierarchy hinder the whole program's scalability.

9.2 Tools for scalability bottlenecks

IPM [41] analyzes the scalability of parallel programs running on large-scale clusters, but only characterizes the scalability of the whole program without providing any insights for fixing the scaling problems. Calotoiu et al. [6] developed a model to automatically identify scalability bottlenecks in MPI programs. HPCToolkit [7] identifies the scalability bottlenecks in parallel programs; it utilizes differential analysis [27] on calling contexts to quantify the scaling losses. Scal-Tool [37] uses lightweight hardware counters to analyze program scalability in distributed shared-memory systems. Scalasca [43] traces program executions on different numbers of cores to identify scaling bottlenecks in MPI communication routines. These tools can successfully pinpoint the scalability bottlenecks in program source code to guide optimization. However, unlike ScaAnalyzer, they do not provide insightful information to guide scalability optimization in memory hierarchies. Such information includes metrics to quantify the potential performance gains, attribution to identify problematic data objects and their accesses, and root-cause analysis to choose appropriate optimization methods.

There is also work that studies the scalability of memory hierarchies using heavyweight memory instrumentation. CHiP [3] identifies code sections with cache contention by computing reuse distance. Wu et al. [42] also use reuse distance to study multi-core processor scaling. PSnAP [28] collects address streams to study memory performance under strong scaling. Unlike ScaAnalyzer, it does not associate performance bottlenecks with data objects or with specific layers of the memory hierarchy. All these tools incur high overhead to collect memory-related metrics; typically, such overhead can be more than 1000%. In contrast, ScaAnalyzer incurs less than 10% runtime overhead.

10. CONCLUSIONS AND FUTURE WORK

In conclusion, this paper describes ScaAnalyzer, a profiler to identify, quantify, and analyze scalability bottlenecks in the memory subsystem.
ScaAnalyzer incurs little measurement overhead but provides insightful feedback for program optimization with novel differential analysis and root-cause analysis. Guided by ScaAnalyzer, we are able to pinpoint the scalability bottlenecks in three parallel benchmarks and identify the causes of the bottlenecks in the memory hierarchy or the operating system. The optimization involves little code modification but achieves significant improvement in scalability for all these programs. Table 4 shows the speedups of these benchmarks after optimization, when running on the maximum number of cores allowed in their configurations.

Our future work is to extend ScaAnalyzer to perform auto-tuning of the program on the fly. Given its low overhead, ScaAnalyzer has the intrinsic features of an online analyzer that can guide thread and data migration for better scalability. Moreover, we will study scalability bottlenecks in more complicated memory hierarchies, such as heterogeneous architectures consisting of both CPUs and accelerators.

Acknowledgements

This research was supported by the National Science Foundation (NSF) under Grant No.

REFERENCES

[1] B. Bao and C. Ding. Defensive loop tiling for shared cache. In CGO.
[2] K. Beyls and E. D'Hollander. Discovery of locality-improving refactorings by reuse path analysis. In Proc. of the 2nd Intl. Conf. on High Performance Computing and Communications (HPCC).
[3] B. Brett, P. Kumar, M. Kim, and H. Kim. CHiP: A profiler to measure the effect of cache contention on scalability. In IPDPS Workshops. IEEE.
[4] B. R. Buck and J. K. Hollingsworth. Data centric cache measurement on the Intel Itanium 2 processor. In SC.
[5] D. R. Butenhof. Programming with POSIX Threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[6] A. Calotoiu, T. Hoefler, M. Poke, and F. Wolf. Using automated performance modeling to find scalability bugs in complex codes. In SC.
[7] C. Coarfa, J. Mellor-Crummey, N. Froyd, and Y. Dotsenko. Scalability analysis of SPMD codes using expectations. In ICS.
[8] P. J. Drongowski. Instruction-based sampling: A new performance analysis technique for AMD family 10h processors, Fall.
[9] R. Haring, M. Ohmacht, T. Fox, M. Gschwind, D. Satterfield, K. Sugavanam, P. Coteus, P. Heidelberger, M. Blumrich, R. Wisniewski, A. Gara, G.-T. Chiu, P. Boyle, N. Christ, and C. Kim. The IBM Blue Gene/Q compute chip. IEEE Micro, 32(2):48-60, March.
[10] Intel Corporation. Intel 64 and IA-32 architectures software developer's manual, Volume 3B: System programming guide, Part 2. June 2010.

Our future work is to extend ScaAnalyzer to perform autotuning of programs on the fly. Given its low overhead, ScaAnalyzer is well suited to serve as an online analyzer that guides thread and data migration for better scalability. Moreover, we will study scalability bottlenecks in more complicated memory hierarchies, such as heterogeneous architectures consisting of both CPUs and accelerators.

Acknowledgements

This research was supported by the National Science Foundation (NSF) under Grant No.

REFERENCES

[1] B. Bao and C. Ding. Defensive loop tiling for shared cache. In CGO.
[2] K. Beyls and E. D'Hollander. Discovery of locality-improving refactorings by reuse path analysis. In Proc. of the 2nd Intl. Conf. on High Performance Computing and Communications (HPCC).
[3] B. Brett, P. Kumar, M. Kim, and H. Kim. CHiP: A profiler to measure the effect of cache contention on scalability. In IPDPS Workshops, IEEE.
[4] B. R. Buck and J. K. Hollingsworth. Data centric cache measurement on the Intel Itanium 2 processor. In SC.
[5] D. R. Butenhof. Programming with POSIX Threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[6] A. Calotoiu, T. Hoefler, M. Poke, and F. Wolf. Using automated performance modeling to find scalability bugs in complex codes. In SC.
[7] C. Coarfa, J. Mellor-Crummey, N. Froyd, and Y. Dotsenko. Scalability analysis of SPMD codes using expectations. In ICS.
[8] P. J. Drongowski. Instruction-based sampling: A new performance analysis technique for AMD family 10h processors, Fall.
[9] R. Haring, M. Ohmacht, T. Fox, M. Gschwind, D. Satterfield, K. Sugavanam, P. Coteus, P. Heidelberger, M. Blumrich, R. Wisniewski, A. Gara, G.-T. Chiu, P. Boyle, N. Christ, and C. Kim. The IBM Blue Gene/Q compute chip. IEEE Micro, 32(2):48-60, March.
[10] Intel Corporation. Intel 64 and IA-32 architectures software developer's manual, Volume 3B: System programming guide, Part 2, Number, June 2010.
[11] Intel Corporation. Intel Itanium Processor 9300 series reference manual for software development and optimization, Number, March.
[12] Intel Corporation. Intel Performance Tuning Utility 4.0 Update 5. articles/intel-performance-tuning-utility, October. Last accessed: Aug. 10.
[13] S. Jayasena, S. Amarasinghe, A. Abeyweera, G. Amarasinghe, H. De Silva, S. Rathnayake, X. Meng, and Y. Liu. Detection of false sharing using machine learning. In SC.
[14] J. Jeffers and J. Reinders. Intel Xeon Phi Coprocessor High Performance Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition.
[15] A. Kleen. A NUMA API for Linux. 10/LibNUMA-WP-fv1.pdf. Last accessed: Dec. 12.
[16] R. Lachaize, B. Lepers, and V. Quéma. MemProf: A memory profiler for NUMA multicore systems. In USENIX ATC.
[17] Lawrence Livermore National Laboratory. Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). Last accessed: Dec. 12.
[18] Lawrence Livermore National Laboratory. LLNL Sequoia Benchmarks. Last accessed: Dec. 12.
[19] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In HPCA.
[20] T. Liu et al. PREDATOR: Predictive false sharing detection. In PPoPP.
[21] X. Liu and J. Mellor-Crummey. Pinpointing data locality bottlenecks with low overheads. In Proc. of the 2013 IEEE Intl. Symp. on Performance Analysis of Systems and Software.
[22] X. Liu and J. Mellor-Crummey. A tool to analyze the performance of multithreaded programs on NUMA architectures. In PPoPP.
[23] X. Liu and J. M. Mellor-Crummey. A data-centric profiler for parallel programs. In SC.
[24] X. Liu, K. Sharma, and J. Mellor-Crummey. ArrayTool: A lightweight profiler to guide array regrouping. In PACT.
[25] Z. Majo and T. R. Gross. Matching memory access patterns and data placement for NUMA systems. In CGO.
[26] C. McCurdy and J. S. Vetter. Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms. In Proc. of the 2010 IEEE Intl. Symp. on Performance Analysis of Systems and Software.
[27] P. E. McKenney. Differential profiling. Software: Practice and Experience, 29(3).
[28] C. R. M. Olschanowsky. HPC Application Address Stream Compression, Replay and Scaling. Ph.D. dissertation, University of California, San Diego.
[29] OpenMP Architecture Review Board. OpenMP application program interface, version 4.0. http://, July.
[30] A. Rane and J. Browne. Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics. In PACT.
[31] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In HPCA.
[32] Rogue Wave Software. ThreadSpotter manual, version. Command=Core_Download&EntryId=1492, August. Last accessed: Dec. 12.
[33] SGI. SGI Altix UV 1000 system user's guide. /pdf/ pdf. Last accessed: Mar. 30.
[34] SGI. Technical advances in the SGI UV architecture. Last accessed: Mar. 30.
[35] B. Sinharoy et al. IBM POWER7 multicore server processor. IBM JRD, 55(3):1:1-29, May.
[36] D. D. Sleator and R. E. Tarjan. Self-adjusting binary search trees. Journal of the ACM, 32(3).
[37] Y. Solihin, V. Lam, and J. Torrellas. Scal-Tool: Pinpointing and quantifying scalability bottlenecks in DSM multiprocessors. In SC.
[38] M. Srinivas et al. IBM POWER7 performance modeling, verification, and evaluation. IBM JRD, 55(3):4:1-19, May/June.
[39] N. R. Tallent, L. Adhianto, and J. M. Mellor-Crummey. Scalable identification of load imbalance in parallel executions using call path profiles. In SC.
[40] N. R. Tallent, J. Mellor-Crummey, and M. W. Fagan. Binary analysis for measurement and attribution of program performance. In PLDI.
[41] N. J. Wright, W. Pfeiffer, and A. Snavely. Characterizing parallel scaling of scientific applications using IPM. In Proc. of the 10th LCI International Conference on High-Performance Clustered Computing.
[42] M.-J. Wu, M. Zhao, and D. Yeung. Studying multicore processor scaling via reuse distance analysis. In ISCA.
[43] B. J. N. Wylie, M. Geimer, and F. Wolf. Performance measurement and analysis of large-scale parallel applications on leadership computing systems. Sci. Program., 16(2-3).
[44] Q. Zhao, D. Koh, S. Raza, D. Bruening, W.-F. Wong, and S. Amarasinghe. Dynamic cache contention detection in multi-threaded applications. In VEE, 2011.
