ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel Programs

Xu Liu
Department of Computer Science
College of William and Mary
Williamsburg, VA

Bo Wu
Department of EECS
Colorado School of Mines
Golden, CO

ABSTRACT

It is difficult to scale parallel programs on systems that employ a large number of cores. To identify scalability bottlenecks, existing tools principally pinpoint poor thread synchronization strategies or unnecessary data communication. The memory subsystem, however, is one of the key contributors to poor parallel scaling in multicore machines, and state-of-the-art tools either lack sophisticated capabilities for pinpointing scalability bottlenecks arising from the memory subsystem or ignore them entirely. To address this issue, we develop ScaAnalyzer, a tool that pinpoints scaling losses due to poor memory access behaviors of parallel programs. ScaAnalyzer collects, attributes, and analyzes memory-related metrics during program execution while incurring very low overhead. It provides high-level, detailed guidance to programmers for scalability optimization. We demonstrate the utility of ScaAnalyzer with case studies of three parallel programs. For each benchmark, ScaAnalyzer identifies scalability bottlenecks caused by poor memory access behaviors and provides optimization guidance that yields significant improvement in scalability.

Keywords

Memory bottlenecks, scalability, parallel profiler.

1. INTRODUCTION

The number of hardware threads in emerging multi-core processors is growing dramatically. For example, an IBM POWER7 processor [35] has 32 threads, an IBM Blue Gene/Q processor [9] has 64 threads, and an Intel Xeon Phi processor [14] has more than 240 threads. Moreover, multiple processors can be integrated in the same node, forming a non-uniform memory access (NUMA) architecture. For example, the SGI UV 1000 [33] has 8192 Intel Nehalem hardware threads interconnected via the NUMALink technology [34].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SC '15, November 15-20, 2015, Austin, TX, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Scalability bottlenecks, however, prevent applications from benefiting from the large number of threads on existing or emerging shared-memory architectures. Given the importance of scalability, it is necessary to identify and eliminate scalability bottlenecks in multithreaded applications. However, it is difficult for programmers to conduct this analysis and apply appropriate optimizations. There are three principal challenges. First, applications can be complex. A typical HPC application or industrial workload usually consists of hundreds of thousands of lines of code, hiding scalability bottlenecks deep in the code. Second, modern parallel architectures have sophisticated microarchitectures, integrating many hardware threads and multiple levels of memory. Identifying which hardware component incurs scalability bottlenecks is challenging. Finally, program execution can have complicated behaviors, such as interactions between threads as well as interactions between user and operating system. Hence, understanding abnormal behaviors in a long-running parallel program is challenging.
Given these three challenges, developers need performance tools that automatically identify scalability bottlenecks and provide insightful guidance for optimization. State-of-the-art tools [41, 6, 7, 37, 43] identify poor scalability caused only by thread load imbalance, frequent synchronization, thread serialization, and excessive data communication. They omit one important cause of poor scalability: the memory subsystem. Modern architectures adopt deep memory hierarchies that are shared among multiple cores. Compared to the rapid rise in the number of CPU cores per machine, memory bandwidth per core has barely improved, and has in fact decreased in some cases. With this trend, CPU cores compete for the shared resources in the memory subsystem, causing severe scaling bottlenecks. To identify such contention, existing approaches [42, 3, 20, 44, 30, 28] apply heavyweight methods to monitor memory access patterns, which are impractical for long-running parallel programs. Moreover, they do not evaluate the effect of such contention across executions at different scales.

To provide programmers insightful guidance for fixing scalability bottlenecks in memory hierarchies, we developed ScaAnalyzer, a profiler that pinpoints, quantifies, and analyzes scalability bottlenecks in parallel programs. We make three contributions in ScaAnalyzer. We employ lightweight profiling techniques and derive a new metric, memory scaling fraction, as well as a new differential analysis to quantify scalability bottlenecks in memory. We develop novel schemes to provide rich but intuitive guidance for programmers to optimize their code; such schemes include differential analysis and root-cause analysis. Finally, we demonstrate ScaAnalyzer's low runtime overhead, low space overhead, and superior scaling by profiling parallel programs running on a large number of cores.

Figure 1: The scalability of IRSmk on a 48-core AMD machine. The x-axis is the number of cores, while the y-axis is the execution time in seconds. Without scalability optimization, locality optimization via array regrouping does not show any speedup when the program runs on 16 and 32 cores.

ScaAnalyzer is based on HPCToolkit [23], which works on unmodified, fully optimized binary executables compiled by any compiler with any threading model, such as pthreads [5] and OpenMP [29]. To evaluate ScaAnalyzer, we studied three well-known multithreaded benchmarks, all of which have been highly optimized by benchmark developers. Surprisingly, we found that all these benchmarks fail to scale on modern multi-core architectures. With the help of ScaAnalyzer, we easily identified the scalability bottlenecks in memory hierarchies and obtained significant scalability improvements with little code modification.

The paper is organized as follows. Section 2 gives an example that highlights the importance of scalability issues in memory. Section 3 systematically studies the potential causes of scalability bottlenecks in memory hierarchies. Section 4 introduces the monitoring support in modern hardware that ScaAnalyzer uses for efficient scalability analysis. Section 5 describes the differential analysis for quantifying scaling losses in memory. Section 6 describes the root-cause analysis techniques ScaAnalyzer uses to provide insights for code optimization. Section 7 illustrates the implementation of ScaAnalyzer and how it achieves efficient analysis.
Section 8 studies three benchmarks to evaluate the capabilities of ScaAnalyzer. Section 9 reviews existing work and distinguishes our approaches. Section 10 concludes the paper and previews future work.

2. A MOTIVATING EXAMPLE

To highlight the significant impact of scalability bottlenecks in memory hierarchies, we discuss a concrete example. The example shows that, without scaling the performance of memory accesses, memory optimizations for data locality may not improve performance when the code runs at a large scale. Moreover, existing tools, such as ArrayTool [24] and HPCToolkit [23], fail to provide insight into how these memory locality bottlenecks evolve when using more cores in the system.

Our example is a multithreaded benchmark, IRSmk¹, one of the Sequoia benchmarks [18] from Lawrence Livermore National Laboratory. We ran IRSmk on an AMD Magny-Cours machine with 48 cores distributed across eight sockets. To achieve load balance, we set the number of cores to a power of two when running the benchmark. IRSmk was compiled using gcc 4.6 with the -O3 option.

Figure 1 uses blue circles to show how IRSmk scales from 1 to 32 cores. The code stops scaling when running on more than eight cores; the execution time even increases when running on 16 to 32 cores. We first optimize the data locality of IRSmk without optimizing its memory scalability. Following our earlier studies of this code with ArrayTool [24], we apply array regrouping to improve IRSmk's spatial locality. The basic idea of this optimization is to regroup arrays that are always accessed together for better cache utilization. The green triangles in Figure 1 show the execution time of IRSmk after applying the array regrouping optimization. The optimized IRSmk has a 3x speedup when running sequentially. However, as the number of cores grows, the speedup shrinks quickly and disappears when running on 16 and 32 cores.
The reason is that IRSmk has memory scalability bottlenecks when running on this architecture. Array regrouping, despite improving locality, does not address these bottlenecks. Thus, to fully exploit the data locality optimization with multiple cores in the system, one needs to first fix the scalability problems in IRSmk. We discuss this benchmark in detail in Section 8. In the next section, we review the memory hierarchy of a typical multi-core system and describe how it prevents a program execution from scaling.

3. SCALABILITY ISSUES IN MEMORY

Memory hierarchies in a typical multi-socket, multi-core system have three layers: (1) a private layer (e.g., L1 and L2 cache) for each core, (2) a shared layer (e.g., L3 cache and local memory) between cores within the same socket, and (3) an across-socket memory layer, including remote memory and the interconnects between sockets. Each layer may introduce scalability bottlenecks, which may need different optimizations to address. We elaborate on each layer as follows.

Private layers. The aggregate capacity and bandwidth of private memory layers scale with the number of cores. However, due to data sharing between cores, private layers can still hurt program scalability. Because each core keeps a copy of shared data in its own private caches, writing to one copy may cause the invalidation of the other copies; such cache invalidations are costly. Usually, the more cores used for computation, the greater the impact of invalidations on performance. In particular, threads may falsely share cache lines by accessing different words in the same cache line. False sharing is a performance bottleneck that hurts program scalability in private memory layers [20].

¹ The out-of-box IRSmk is a sequential program. We parallelize it with OpenMP.
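The false-sharing pattern just described can be sketched as follows. This is an illustrative example, not code from the paper, and the 64-byte cache-line size is an assumption about the target processor:

```cpp
#include <thread>
#include <vector>

// Illustrative sketch of false sharing. In SharedCounter, eight 8-byte
// counters fit in one 64-byte cache line, so concurrent increments by
// different threads keep invalidating each other's private-cache copies.
// PaddedCounter aligns each counter to its own cache line, so no line is
// shared between threads.
struct SharedCounter { long value = 0; };
struct alignas(64) PaddedCounter { long value = 0; };

template <typename Counter>
long parallel_count(int nthreads, long iters) {
    std::vector<Counter> counters(nthreads);
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&counters, t, iters] {
            for (long i = 0; i < iters; ++i)
                counters[t].value += 1;   // each thread writes only its slot
        });
    for (auto& w : workers) w.join();
    long total = 0;
    for (auto& c : counters) total += c.value;
    return total;
}
```

Both layouts compute the same result; they differ only in how often private-cache lines ping-pong between cores, which is exactly the scaling loss attributed to the private layer above.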

Solution: To prevent private memory layers from becoming a scalability bottleneck, one should eliminate false sharing in caches. For example, one can pad data to cache-line boundaries to force threads to access different cache lines.

Shared layers. Neither the capacity nor the bandwidth of shared memory layers scales with the number of cores. As a result, threads can contend for resources in the shared layer, e.g., the shared cache. For example, cache lines loaded by one thread can be evicted before they have been fully utilized by other threads sharing the same cache. Moreover, excessive outstanding memory requests issued by many threads can saturate the bandwidth, delaying data accesses. The contention becomes more severe as more threads are used, which hurts program scalability.

Solution: Several approaches can prevent contention in shared layers. For example, one can reduce contention via efficient cache management, such as cache partitioning [19], bandwidth optimization [24], and contention-aware code transformation [1]. If the CPU cores in the system are not fully subscribed, one can also spread threads across different sockets to benefit from multiple shared caches with more aggregate space and higher bandwidth.

Socket (NUMA) layers. Memory accesses across sockets have non-uniform latency. A socket together with the memory attached to it forms a NUMA domain. A core can access local memory in its own NUMA domain as well as remote memory in other NUMA domains via interconnects. Because of limited bandwidth, without careful code design, the interconnects may suffer severe congestion. A common problem occurs when data are allocated in one NUMA domain but threads access the data from multiple NUMA domains. The congestion further degrades program performance as more NUMA domains are used for computation.

Solution: To alleviate congestion, one can interleave page allocation across all NUMA domains to prevent a single NUMA domain from becoming a bottleneck [23, 16].
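A related mitigation can be sketched in portable code by relying on the OS's first-touch page placement policy (an assumption that holds for Linux defaults; this is our own illustration, not the paper's code): pages are physically allocated in the NUMA domain of the thread that first writes them, so having each worker fault in its own chunk places data near its consumers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Sketch assuming a first-touch page placement policy (Linux default).
// The buffer is allocated but deliberately not initialized by the main
// thread; each worker zero-fills its own chunk, so the backing pages are
// faulted in from, and placed in, that worker's NUMA domain.
double* numa_aware_alloc(size_t n, int nthreads) {
    double* data = static_cast<double*>(std::malloc(n * sizeof(double)));
    const size_t chunk = (n + nthreads - 1) / nthreads;
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([=] {
            size_t begin = static_cast<size_t>(t) * chunk;
            size_t end = std::min(n, begin + chunk);
            for (size_t i = begin; i < end; ++i)
                data[i] = 0.0;            // first touch places the page
        });
    for (auto& w : workers) w.join();
    return data;
}
```

For this to help, the compute phase must partition the data among threads the same way the initialization did.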
Moreover, tasks can be scheduled so that each task runs in the NUMA domain where the data of its primary working set reside [25].

Besides the different layers of the memory hierarchy, memory management in the operating system can also incur significant scalability bottlenecks. Section 8 shows that such OS bottlenecks can lead to substantial scaling losses. They are caused by problematic interactions between user-space code and the OS kernel, which are difficult to identify and fix.

Given these multiple sources of scaling losses in memory, one needs to identify the root causes of scalability bottlenecks in order to apply appropriate optimizations. To the best of our knowledge, ScaAnalyzer is the first lightweight tool that provides such root-cause analysis. In the next section, we introduce the lightweight data collection mechanisms available in modern processors that ScaAnalyzer uses.

4. HARDWARE SAMPLING METHODS

To guarantee lightweight analysis, ScaAnalyzer leverages hardware sampling mechanisms. Modern processors employ performance monitoring units (PMUs) to measure program execution. PMUs can collect a variety of performance statistics during program execution to help identify performance bottlenecks. In this section, we review the lightweight performance monitoring mechanisms available in PMUs and identify the most efficient support for memory scalability analysis. Section 4.1 reviews event-based sampling, which has been used extensively for performance analysis. Section 4.2 studies instruction sampling, which was introduced in recent architecture generations.

4.1 Event-based sampling

Event-based sampling (EBS) uses hardware performance counters to trigger an interrupt when a particular count threshold is reached. Microprocessors support several performance counters, which enable multiple events to be monitored in a single execution. However, EBS provides insufficient information to support efficient memory scalability analysis, for two main reasons.
First, several factors make precise attribution with EBS difficult, including long pipelines, speculative execution, and the delay between the time an event counter reaches its threshold and the time it triggers an interrupt. As a result, it is difficult to use EBS to monitor individual memory access instructions. Second, EBS only monitors the occurrences of events but does not interpret their effects. Without an accurate, detailed model or simulator, it is hard to leverage EBS to identify and quantify which memory layer causes the most serious scalability bottleneck. Thus, EBS is not an appropriate measurement method for efficient memory scalability analysis.

4.2 Instruction sampling

In recent processors, PMUs can monitor instructions rather than events. The PMU periodically selects an instruction to monitor when a specific event occurs more times than a pre-defined threshold. Common features of these instruction sampling mechanisms are (1) support for memory access sampling, (2) reporting of memory-related metrics, and (3) support for identifying a precise instruction pointer (IP) and an effective data address for each sampled access. With these features, instruction sampling enables accurate attribution and provides more information for memory analysis than EBS.

There are four instruction sampling mechanisms available in modern processors. AMD Opteron processors (family 10h and successors) support instruction-based sampling (IBS) [8]. IBM POWER5 and successors support sampling marked events [38]². Intel processors support precise event-based sampling (PEBS) [10], starting from the Nehalem microarchitecture. The Intel Itanium employs event address registers [11] to sample memory access instructions. However, to effectively support memory scalability analysis, instruction sampling should have two additional features:

PMUs should assess the cost of sampled memory accesses in terms of data access latency in CPU cycles.
This latency information can be used to identify and quantify scaling bottlenecks in the memory hierarchy.

PMUs should record the details of the memory hierarchy response (i.e., the data source) for the data touched by sampled accesses. The data source is one layer in the memory hierarchy, such as the L1/L2/L3 caches and local/remote main memory. This information helps identify in which memory layers scalability bottlenecks occur.

From our investigation, only AMD IBS and Intel PEBS can glean this extra information. We built ScaAnalyzer on both IBS and PEBS for efficient scalability analysis.

² The PMU monitors a memory access instruction after a predefined number of marked events.

5. DIFFERENTIAL ANALYSIS FOR MEMORY SCALING LOSSES

Differential analysis [27, 7] is an effective way to identify scalability bottlenecks in parallel programs. Its basic idea is to run a program twice with two configurations to generate two profiles. The difference between these profiles, if it does not match an expected performance pattern, exposes scalability problems. For example, to test strong scaling, a program with the same input is run on four and then eight cores. We expect the eight-core execution time to be half of the four-core execution time. If the eight-core execution does not achieve the expected performance, we can quantify the scaling loss as the difference between expected and observed performance. Differential analysis can also measure weak scaling, which proportionally increases both the program input size and the number of cores; the expectation for weak scaling is the same execution time across runs.

However, it is not straightforward to apply differential analysis to quantify scaling losses in memory hierarchies, because there is no obvious expectation for execution time. A key contribution of ScaAnalyzer is that it extends differential analysis to scaling losses in memory. We first describe the insight that is the foundation for our differential analysis: with ideal memory scaling in hardware and software, memory accesses in a program should incur the same or less latency when the program uses more cores and the same input. Two observations support this insight. From the hardware perspective, the capacity and bandwidth of memory layers stay constant or increase when more cores are involved in computation.
For example, when using more cores in one NUMA socket, the capacity and bandwidth of shared caches do not change, but the aggregate capacity and bandwidth of private caches increase. Ideal memory scaling also assumes that shared caches have unlimited capacity and bandwidth. From the software perspective, because the input is unchanged, the amount of data processed by the program is constant when running at different scales. Thus, software does not trigger more memory accesses, while hardware provides the same or more cache space and bandwidth. Ideally, then, memory access latency should remain constant or decrease with more cores.

This insight provides an expectation for differential analysis of strong scaling in memory hierarchies. ScaAnalyzer defines a scaling fraction metric, f, as the increase in memory access latency when the program runs on more cores, as shown in Equation 1.

f = l_large / l_small    (1)

In this equation, l_large is the aggregate latency of memory accesses triggered by the program running on a large number of cores, while l_small is the aggregate latency of memory accesses triggered by the program running on a small number of cores. With instruction sampling, we cannot obtain the latency of every memory access; therefore, ScaAnalyzer approximates l_large and l_small using the average latency of the sampled memory accesses. Because both IBS and PEBS can randomly sample a large number of memory accesses, the approximation achieves high accuracy [23].

The value of f indicates the scaling loss in memory. If f = 1, the program has perfect scalability in the memory hierarchy. If f < 1, the program has super-linear scalability. If f > 1, the program has sub-linear scalability; if f >> 1, the program has significant scalability bottlenecks in memory hierarchies. It is worth noting that the metric f also helps ScaAnalyzer identify root causes of performance degradation for serial code sections in a parallel program due to side effects of parallel execution.
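As an illustration (a sketch, not ScaAnalyzer's implementation), Equation 1 can be approximated from two sets of sampled access latencies, using the average sampled latency as a stand-in for the aggregate latency:

```cpp
#include <numeric>
#include <vector>

// Sketch: approximate the scaling fraction f = l_large / l_small from two
// latency samples (in cycles), one taken at a small core count and one at
// a large core count, using average sampled latency as the estimator.
double scaling_fraction(const std::vector<double>& lat_small,
                        const std::vector<double>& lat_large) {
    auto avg = [](const std::vector<double>& v) {
        return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
    };
    return avg(lat_large) / avg(lat_small);
}
```

With this estimator, f near 1 indicates that accesses did not slow down at scale, while f well above 1 flags a memory scaling loss.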
Without any side effect, f is 1 for serial computation, because serial computation has constant memory behavior across executions on different numbers of cores. However, if f is exceedingly high, the serial section is affected by the parallel execution. For instance, the parallel execution may allocate data in a NUMA socket other than the one the serial section runs in, incurring remote memory accesses. In the next section, we discuss how ScaAnalyzer leverages the latency and scaling fraction metrics from differential analysis to identify the root causes of memory scalability bottlenecks.

6. ROOT-CAUSE ANALYSIS OF SCALING LOSSES IN MEMORY

Understanding the root causes of scaling losses in memory is critically important for code optimization. ScaAnalyzer performs root-cause analysis of memory scalability bottlenecks from two perspectives. From the hardware's perspective, ScaAnalyzer identifies the scaling losses in the memory hierarchy, which guides the selection of optimization methods, as described in Section 3. From the program's perspective, ScaAnalyzer pinpoints problematic data objects and their accesses in the source code to guide code optimization.

6.1 Locating scaling losses in memory layers

Given the latency and data source information captured by instruction sampling, ScaAnalyzer derives three data source latency metrics, l_p, l_s, and l_numa, representing the aggregate sampled latency caused by fetching data from the private memory layers, shared memory layers, and NUMA memory layers, respectively. Using the instruction sampling mechanisms described in Section 4.2, every time ScaAnalyzer captures a memory access sample, it extracts the latency and accumulates it into the corresponding latency metric according to the data source information recorded with the sample. Equation 2 shows the decomposition of the total observed latency l from samples.
l = l_p + l_s + l_numa    (2)

With this latency decomposition, ScaAnalyzer determines which memory layer dominates data access latency. Furthermore, by adapting Equation 1, ScaAnalyzer can compute per-layer scaling fractions f_p, f_s, and f_numa from the data source latency metrics l_p, l_s, and l_numa, respectively. We can identify the causes of scaling losses in memory hierarchies using these derived metrics: if a specific memory layer dominates the memory access latency and has a high scaling fraction, we can treat this memory layer as the cause of poor scalability and apply the appropriate optimization methods described in Section 3.

6.2 Pinpointing scaling losses in code

To efficiently guide code optimization, ScaAnalyzer associates performance metrics with data objects and their accesses in the code. Section 6.2.1 describes the metric attribution mechanism, which provides the information necessary for memory optimization. Section 6.2.2 describes how the derived metrics are interpreted to identify the data objects and memory accesses that can benefit most from scalability optimization.

6.2.1 Sample attribution

Instruction sampling records the precise instruction pointer (IP) of each sampled memory access and the effective memory address it touches. ScaAnalyzer leverages this information to attribute samples using a mechanism similar to that of HPCToolkit [23]. ScaAnalyzer uses precise IPs to associate samples with memory access instructions, which can be further mapped to the source code using debugging information generated by the compiler. In addition, ScaAnalyzer determines the call path of each sampled memory access with a lightweight on-the-fly binary analysis technique [40]. Identifying problematic memory accesses within their calling context is a key component of ScaAnalyzer's scalability analysis in memory hierarchies.

At the same time, ScaAnalyzer leverages effective addresses to associate samples with data objects. This attribution helps one understand which data objects do not scale in the memory hierarchy. Three kinds of data objects can be allocated during program execution:

Static data. Data objects allocated in the .bss section are static data.
Each static variable has a named entry in the symbol table that identifies the memory range of the variable as an offset from the beginning of the load module.

Heap data. Variables in the heap are allocated dynamically during execution by one of the malloc family of functions (malloc, calloc, realloc).

Stack data. Data objects can be allocated on the execution stack. Automatic variables, local arrays, and objects allocated with the alloca function belong to this category.

Table 1 shows how ScaAnalyzer monitors these three types of data objects. ScaAnalyzer identifies each monitored data object with a unique ID and records the memory range allocated for it. It reads the symbol table in each load module to extract the names and memory ranges of static data objects. It overloads the allocation functions and triggers a synchronous sample at each allocation; ScaAnalyzer uses the allocation call path as the ID of the heap data object and records the memory range allocated. For stack data objects, ScaAnalyzer tracks the stack pointers of function frames in the call stack and uses the memory ranges of stack frames to bound all data allocated on the stack.

Table 1: Methods of monitoring different types of data objects.

  data type | ID              | memory range
  ----------|-----------------|----------------------------------
  static    | name            | reading from the symbol table
  heap      | allocation path | overloading allocation functions
  stack     | line mapping    | capturing stack frames

Table 2: Optimization decisions according to different l and f pairs.

  l    | f    | optimization decision
  -----|------|------------------------------------------------
  high | high | scalability optimization yields high benefit
  high | low  | memory bottleneck not related to scalability
  low  | high | scalability optimization yields little benefit
  low  | low  | good memory performance

ScaAnalyzer uses the mapping of memory accesses to source code lines to identify stack data objects. If a source line contains more than one data object, one can modify the source code to break the line and put each data object on its own line.
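The range-based lookup that this bookkeeping enables, from an effective address to the data object whose memory range contains it, can be sketched as follows. The types here are hypothetical, and an ordered std::map stands in for the splay tree ScaAnalyzer actually uses:

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical sketch of address-to-data-object attribution: each data
// object's range start maps to its end address and ID; an effective
// address is resolved to the nearest preceding range that contains it.
struct Range { std::uintptr_t end; std::string id; };
using ObjectMap = std::map<std::uintptr_t, Range>;  // key: range start

std::optional<std::string> attribute(const ObjectMap& m, std::uintptr_t addr) {
    auto it = m.upper_bound(addr);        // first range starting after addr
    if (it == m.begin()) return std::nullopt;
    --it;                                 // candidate containing range
    if (addr < it->second.end) return it->second.id;
    return std::nullopt;                  // addr falls between ranges
}
```

A self-adjusting splay tree suits this workload because sampled accesses tend to hit the same few data objects repeatedly, keeping hot ranges near the root.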
Because ScaAnalyzer captures the precise instruction pointer of each sampled memory access, one can easily determine which data object a sample is attributed to by examining the source code that the precise IP maps to. For all static, heap, and stack data objects, ScaAnalyzer enters their memory ranges and IDs into a map for sample attribution. This map is implemented as a splay tree [36] that uses the memory intervals of data objects as keys and data object IDs as values. The splay tree accelerates lookups by self-adjusting upon insertion, deletion, and lookup operations. With the effective address captured in a sample, ScaAnalyzer checks the map for the data object whose memory range includes the effective address. If a data object is found, ScaAnalyzer attributes the sample, together with its calling context, to that data object. By associating samples with both data objects and their accesses with full calling contexts, ScaAnalyzer provides deep insights for developers to understand poor memory behaviors.

6.2.2 Metric interpretation

ScaAnalyzer attributes metrics, such as latency l and scaling fraction f, to data objects and their accesses along with the samples. Besides identifying the root causes of scaling losses in memory hierarchies, these metrics can also pinpoint problematic data objects and code regions, e.g., memory access instructions, loops, and functions. ScaAnalyzer provides an effective metric interpretation strategy to report the bottlenecks that lead to significant scalability improvement after optimization.

ScaAnalyzer interprets the metrics in three steps. First, ScaAnalyzer identifies whether a program is memory-bound; only memory-bound programs receive attention for memory scalability analysis and optimization. To obtain this insight, ScaAnalyzer uses a metric named latency per instruction, based on the raw metrics collected by instruction sampling. This metric is described in our previous work [21].
It is worth noting that this metric can be attributed to any context to assess how memory-bound it is, including statements, loops, procedures, and the whole program.

Next, ScaAnalyzer uses f and l to determine which data objects have the most severe scalability bottlenecks in memory, as shown in Table 2. By targeting only the data objects with high f and high l, our optimization can improve the whole program's scalability significantly. From our experiments, a data object warrants investigation if its associated f is higher than 1.5 and its l accounts for more than 30% of the total latency. Finally, ScaAnalyzer computes the data source latency metrics l_p, l_s, and l_numa for these data objects. Applying the technique described in Section 6.1, ScaAnalyzer reports the scaling behavior of problematic data objects in the different memory layers.

With these three steps, ScaAnalyzer performs root-cause analysis of scalability bottlenecks in program code and correlates it with the root-cause analysis in memory layers (Section 6.1). Thus, ScaAnalyzer provides insightful guidance for code optimization: where in the code to apply which optimization strategy for memory scalability improvement.

Figure 2: The workflow of ScaAnalyzer. There are three components: data collector, data analyzer, and data visualizer. ScaAnalyzer works on the unmodified binary and automatically produces intuitive analysis results for the scalability bottlenecks.

7. SCAANALYZER IMPLEMENTATION

Figure 2 shows the workflow of ScaAnalyzer. ScaAnalyzer extends HPCToolkit [23] and consists of three components: a data collector, an analyzer, and a visualizer. The data collector takes the executable binary and runs it multiple times with the same input but at different scales of cores. By default, we use a compact thread-to-core placement to evaluate scalability: we first use the cores in one socket; when no more idle cores are available in that socket, we spread threads to another socket. The data analyzer then investigates all profiles, derives the scaling metrics, and associates the analysis results with program information, such as data or function symbols and source code lines.
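For illustration, the decision rules of Table 2, combined with the thresholds reported earlier (f higher than 1.5 and l above 30% of total latency), could be encoded as a small classifier. This is a sketch of the paper's guidance, not ScaAnalyzer code:

```cpp
#include <string>

// Sketch of Table 2's decision rules per data object; the thresholds
// (f > 1.5, latency share > 30%) are those the paper reports as worth
// investigating.
std::string decide(double f, double latency_share) {
    bool high_f = f > 1.5;
    bool high_l = latency_share > 0.30;
    if (high_l && high_f)  return "scalability optimization yields high benefit";
    if (high_l && !high_f) return "memory bottleneck not related to scalability";
    if (!high_l && high_f) return "scalability optimization yields little benefit";
    return "good memory performance";
}
```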
Typically, users first analyze the two profiles with the most significant scaling losses, such as the 8-core and 16-core execution profiles in Figure 1. All profile data are maintained in a database. Finally, the data visualizer presents the analysis data from the database in a graphical interface, with the metrics sorted. Based on the data collected, analyzed, and presented by ScaAnalyzer, users assess the potential benefit of fixing each scaling bottleneck. To eliminate the observed bottlenecks, users transform code or adjust thread placement under the guidance of ScaAnalyzer.

The implementation challenges for each ScaAnalyzer component include keeping measurement overhead low, scaling the analysis to many cores, and providing intuitive analysis results. In the rest of this section, we describe how ScaAnalyzer addresses these challenges in each component.

Online data collector. ScaAnalyzer programs each core's PMU to enable instruction sampling with a pre-defined period. ScaAnalyzer's data collector gleans and attributes samples dynamically. To minimize runtime overhead and achieve scalable measurement, we avoid any thread synchronization during data collection and attribution: each thread collects its own samples and attributes them to data objects using its own map. Moreover, to avoid the high overhead of monitoring every memory allocation, ScaAnalyzer can be configured to monitor only memory allocations larger than a predefined threshold, e.g., 4096 bytes. As a result, the data collector has very low runtime overhead, which does not grow proportionally when more cores are used for execution. We evaluate ScaAnalyzer's overhead in Section 8.

Offline data analyzer. Because ScaAnalyzer's data collector produces one profile per thread, the analyzer coalesces the profiles for the whole execution. Such a compact view of profiles scales the analysis of program executions to a large number of cores.
The coalescing procedure follows two rules: (1) data objects with the same ID are coalesced into one; and (2) memory accesses with the same calling context are coalesced into one. The latency metrics are accumulated during coalescing. The coalescing technique also provides a way for ScaAnalyzer to perform differential analysis, which is in turn a coalescing process over two aggregate profiles from different executions. The profile coalescing overhead grows linearly with the number of threads and processes used by the monitored program. ScaAnalyzer leverages the reduction tree technique [39] applied in HPCToolkit to parallelize the merging process. ScaAnalyzer requires less than 10 seconds to produce the aggregate profiles for all of our case study programs.

Data visualizer. To provide intuitive analysis, ScaAnalyzer's data visualizer presents the analysis data in a friendly way. The visualizer shows code- and data-centric profiles, scaling metrics, and source code in a graphical interface. It sorts the metrics and highlights the problematic data objects and the accesses with poor memory scalability. In the next section, we discuss ScaAnalyzer's visualizer with example snapshots.

8. CASE STUDIES

We evaluated ScaAnalyzer on two machines. One is a single-node server with four two-die AMD Magny-Cours processors. It has 48 cores spread across eight NUMA domains (six cores per NUMA domain) and 128GB of memory. This machine supports instruction-based sampling (IBS). The other platform is a single-node server with two Intel Sandy Bridge EP processors and 256GB of memory. It has 16 cores, two hardware threads per core, and two NUMA domains. This machine supports precise event-based sampling (PEBS). Our study focused on evaluating measurement and analysis of highly multithreaded executions on one node. We studied three well-known parallel benchmarks, all of which have been optimized by their developers for years.
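Coalescing rule (1) from Section 7's offline analyzer (merge entries with the same data-object ID, accumulating their metrics) can be sketched as follows; a real profile is also keyed on calling context and uses better data structures, but a flat array keeps the idea short:

```c
#include <assert.h>

/* A profile entry: a data-object ID with its accumulated access latency. */
typedef struct { int obj_id; double latency; } entry;

/* Merge n_src entries of src into dst (which must have spare capacity);
 * returns the new number of distinct entries in dst. */
static int coalesce(entry *dst, int n_dst, const entry *src, int n_src) {
    for (int i = 0; i < n_src; i++) {
        int j;
        for (j = 0; j < n_dst; j++) {
            if (dst[j].obj_id == src[i].obj_id) {
                dst[j].latency += src[i].latency;   /* rule: accumulate metrics */
                break;
            }
        }
        if (j == n_dst)
            dst[n_dst++] = src[i];   /* unseen ID: keep as a new entry */
    }
    return n_dst;
}
```

Differential analysis then reduces to coalescing two aggregate profiles from different core counts and comparing the accumulated metrics.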
We built these programs using gcc 4.6 on the AMD machine and gcc 4.8 on the Intel machine with the -O3 optimization option.

benchmark          | AMD (IBS): native / profiling | Intel (PEBS): native / profiling
LULESH             | 378s / 403s (+6.6%)           | 105s / 112s (+6.7%)
IRSmk              | 62.2s / 65s (+4.5%)           | 23s / 23.6s (+2.6%)
Linear Regression  | 21.6s / 23.2s (+7.4%)         | 20.6s / 22s (+6.8%)
Table 3: The runtime overhead of each benchmark when monitored by ScaAnalyzer.

The description of each benchmark is as follows: LULESH [17], a Lawrence Livermore National Laboratory (LLNL) application benchmark, is an Arbitrary Lagrangian-Eulerian code that solves the Sedov blast wave problem for one material in 3D. In this paper, we study a highly tuned LULESH implementation written in C++ with OpenMP. IRSmk, an LLNL Sequoia benchmark [18], is an implicit radiation solver. IRSmk is important to LLNL because its styles of loops and array indexing are highly representative of production applications. IRSmk is written in C, parallelized with OpenMP, and thoroughly optimized by LLNL's code team. Linear Regression, from the Phoenix benchmarks [31], tries to find the best-fitting straight line through a set of points. It is written in C and parallelized with Pthreads.

Figure 3: The execution time in seconds (left) and parallel efficiency (right) of the original LULESH code running on different numbers of cores.

To perform the differential analysis, ScaAnalyzer measures multiple executions of each benchmark running on different numbers of cores. To give a thorough analysis, ScaAnalyzer monitors every memory allocation. We report the overhead of ScaAnalyzer when running with the maximum number of cores allowed for the monitored program. For all of these benchmarks, ScaAnalyzer adopts a sampling period of 65,535 instructions with AMD IBS and 10,000 memory accesses with Intel PEBS. Table 3 shows the runtime overhead for these three benchmarks. From the table, we can see that ScaAnalyzer incurs low runtime overhead. Moreover, ScaAnalyzer has about 4MB of space overhead per thread to maintain the profiling data.
Thus, ScaAnalyzer is an efficient profiler, suitable for real applications parallelized on large-scale multi-core systems. In the rest of this section, we discuss the optimization guided by ScaAnalyzer for each benchmark in detail. We performed the case studies on both our AMD and Intel machines. We mainly report the results from the AMD machine because it has more cores than the Intel machine and exposes more interesting scalability issues. At the end of this section, we also show a case study from our Intel machine.

8.1 LULESH

Figure 3 shows the scalability test of LULESH from 1 to 48 cores on the AMD machine. As shown in the left of the figure, LULESH enjoys speedups on up to 8 cores. When using 16 cores or more, the performance begins to drop dramatically. The parallel efficiency, computed as T1/(n x Tn) x 100% where Ti is the execution time of the program running on i cores, is as low as 3.5% when running on 48 cores. We leverage ScaAnalyzer to identify its scalability bottlenecks. Figure 4 shows the differential analysis of two executions on 16 and 32 cores, respectively.

Figure 4: ScaAnalyzer visualizer's data presentation for the scalability analysis of the original LULESH.

There are three panes in ScaAnalyzer's visualizer. The top one shows the program source code; the bottom left shows the program structure, organized as data object IDs and their accesses; the bottom right shows the metrics correlated with each monitored data object and its memory accesses in full calling contexts. ScaAnalyzer identifies that LULESH is a memory-bound benchmark. The average memory access latency for the whole program grows from 110 cycles to nearly 600 cycles when doubling the number of cores, leading to a scaling factor f as high as 5.43, as shown in Figure 4. Therefore, LULESH suffers from significant scaling bottlenecks in the memory hierarchy.
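The efficiency and scaling-factor numbers above are simple ratios; the parallel-efficiency formula from the footnote, plus the latency ratio that the reported numbers suggest f to be, written out in C:

```c
#include <assert.h>

/* Parallel efficiency as defined in the footnote above:
 *   E(n) = T1 / (n * Tn) * 100%,  where Ti is the execution time on i cores.
 * The memory scaling factor f behaves as the ratio of average access
 * latency after vs. before scaling the core count (600/110 ~ 5.4 here). */
static double parallel_efficiency(double t1, double tn, int n) {
    return t1 / (n * tn) * 100.0;
}

static double scaling_factor(double lat_before, double lat_after) {
    return lat_after / lat_before;
}
```

With ideal speedup (Tn = T1/n) the efficiency is exactly 100%; LULESH's 3.5% at 48 cores is far from that.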
Figure 4 also shows that data objects allocated on the stack account for most of the memory access latency: 81.7% and 96.9% of the 16-core and 32-core executions, respectively.

Figure 5: Scalability analysis of LULESH after optimizing the OS's page initialization. ScaAnalyzer highlights the unscalable arrays.

Figure 6: The left figure shows the parallel efficiency of LULESH after optimizing the page initialization bottleneck. The right figure shows the parallel efficiency of LULESH after continuous optimization of NUMA bottlenecks.

We further expand the program structure in the bottom left pane for the memory accesses with the most latency. We find that the accesses annotated in the rectangle, to six arrays pfx, pfy, pfz, x1, y1, and z1, have unscalable latency. These arrays are allocated on the stack at line 1510. The average latency grows from 250 cycles to 960 cycles (not shown in the figure) from 16 to 32 cores. The data source latency metrics show that 95% of the accesses to these arrays are served from shared layers, which means that thread contention in the shared cache or memory hurts the scalability, according to our analysis in Section 3. A further study reveals the cause of the poor scalability in the assignments from line 1522 to 1527: operating system page initialization. The six arrays on the left-hand side of these assignments are allocated and freed before and after the parallel loop whenever the enclosing function CalcHourglassControlForElems is called. Figure 4 uses an ellipse to highlight the allocation sites of these arrays but omits the free sites due to limited space. Because of the frequent allocation and reclamation, the OS needs to initialize the allocated pages at a high frequency. The initialization occurs the first time a thread accesses a page, so the assignments highlighted in the rectangle are where the OS initializes pages. However, these assignments are in a parallel loop, and the different threads contend for the page initializer in the OS.
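The pattern behind this contention is per-call allocation of scratch arrays; replacing it with one long-lived, reused buffer pays the page-initialization cost once instead of on every call. A minimal sketch under that assumption (names are ours, not LULESH's):

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of replacing per-call malloc/free with one long-lived scratch
 * buffer. Repeatedly freeing and reallocating hands pages back to the OS,
 * so threads pay for page zeroing on every first touch inside the parallel
 * loop; a persistent buffer pays that cost only once. */
static double *scratch = NULL;
static size_t  scratch_cap = 0;

static double *get_scratch(size_t n) {
    if (n > scratch_cap) {           /* grow only when the request is larger */
        free(scratch);
        scratch = malloc(n * sizeof *scratch);
        scratch_cap = scratch ? n : 0;
    }
    return scratch;
}
```

A caller that previously did malloc/free around each invocation of the hot function instead calls get_scratch once per invocation and never frees.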
To fix the problem, we allocate and initialize all these arrays once, in the global section, to avoid frequent page initialization across threads. Moreover, ScaAnalyzer identifies three other parallel loops that account for 22.4%, 21%, and 5.1% of the total latency and have high scaling factors of 9.58, 5.95, and 3.4, respectively. We apply similar optimizations to all of these parallel loops. The left of Figure 6 shows the scalability of LULESH after this optimization, which is much better than the original code. However, the optimized scalability is still not good enough. The parallel efficiency drops to nearly 50% when running with 48 cores. We continue to profile the optimized LULESH with ScaAnalyzer. As shown in Figure 5, the memory scaling factor of the overall program is reduced to 1.26 (from 5.43) after the previous optimization. Moreover, the average memory latency is reduced to 90 and 112 cycles for the 16-core and 32-core executions, respectively. The stack arrays no longer cause significant access latency. These metrics reveal that our first round of optimization significantly improved the scalability of LULESH. However, the scaling factor is not low enough to indicate high memory scalability. The main reason is in the NUMA layer. When scaling from 16 to 32 cores, the scaling factor in the NUMA layer is 1.9, as shown in Figure 5. The heap-allocated arrays highlighted in the figure are the root cause of the poor scalability in the NUMA layer. They suffer from high access latency and push the scaling factor in the NUMA layer as high as 2. We further examine the accesses to these arrays and find that they all incur many remote NUMA accesses.

Figure 7: The scalability measurement of LULESH after all optimizations; it is better than the original version shown in Figure 3.
We also find that these arrays are all allocated in one NUMA domain but accessed by threads from all other NUMA domains, which can easily saturate the bandwidth of the interconnects and thus hurt scalability. Therefore, we continue by optimizing these NUMA problems. We redistribute the pages allocated for these arrays using libnuma [15] to match their access patterns and avoid interconnect congestion. After this optimization, the parallel efficiency improves substantially, especially when running on more than 16 cores, as shown in the right of Figure 6. Although LULESH does not achieve perfect scaling, the memory scaling factor is reduced to around 1, which means that the remaining scaling loss is not due to memory but to other causes, such as serialization or thread synchronization. Continuing the optimization of LULESH is beyond the scope of this paper. Figure 7 shows the scalable execution of our optimized LULESH, which is significantly better than the original version shown in Figure 3.

8.2 IRSmk

Figure 8: The left figure is the parallel efficiency of the original IRSmk, while the right figure is the parallel efficiency of IRSmk after NUMA optimization.

Figure 9: The array with unscalable accesses in IRSmk. The high scaling factor and NUMA latency indicate that the root cause is in the NUMA layer.

Figure 10: IRSmk has an improved scaling factor after NUMA optimization. The scalability bottleneck now comes from memory contention in the shared layer.

The left of Figure 8 shows the poor parallel efficiency of IRSmk executions on our AMD machine. We use ScaAnalyzer to measure the scalability of IRSmk. For the overall program, ScaAnalyzer reports that the average latency per memory access grows from 183 cycles to 362 cycles when doubling the execution cores from 16 to 32. ScaAnalyzer also reports that the latency related to remote NUMA accesses grows rapidly, from 31.5% to 53.2% of the total latency. Moreover, the NUMA scaling factor is more than 2, both for the whole program and for the array highlighted in Figure 9. Therefore, optimizing NUMA bottlenecks is the first step toward improving the scalability of IRSmk. We further investigate the profiles provided by ScaAnalyzer. The reason for the high NUMA-related latency is that all of these arrays are allocated in a single NUMA domain but accessed by threads from all NUMA domains, similar to the NUMA problem in LULESH. Therefore, we slightly modify the code to match the computation with the data distribution. The right of Figure 8 shows IRSmk's parallel efficiency after the optimization. Compared to the original version, shown in the left of the figure, this optimization improves IRSmk's scalability significantly. However, the optimized IRSmk still does not achieve optimal scalability.
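Matching the computation with the data distribution, as the IRSmk fix above does, typically relies on first-touch page placement; a minimal OpenMP sketch under that assumption (names are ours):

```c
#include <assert.h>
#include <stdlib.h>

/* First-touch sketch: on Linux, a page is placed in the NUMA domain of the
 * thread that first writes it. Initializing an array with the same static
 * schedule as the compute loop therefore distributes its pages to match the
 * later access pattern. The pragma is ignored (and the loop runs serially)
 * when OpenMP is not enabled. */
static double *alloc_first_touch(size_t n) {
    double *a = malloc(n * sizeof *a);
    if (!a) return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;   /* the first touch decides each page's home domain */
    return a;
}
```

The key design point is that the initialization loop and the compute loops must use the same schedule, so each thread touches exactly the pages it will later read and write.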
From the right of Figure 8, we can see that the parallel efficiency drops significantly from 1 to 8 cores and is stable at 16 and 32 cores. To identify the scalability bottleneck, we continue to use ScaAnalyzer to profile the optimized IRSmk code and compare its executions on 2 and 4 cores.

Figure 11: The parallel efficiency (on the left) and execution time (on the right) of IRSmk after optimization, running on 1 to 32 cores.

Figure 10 shows the measurement and analysis results. The scaling factor for the whole program is 1.55, and the most significant array, ABCD, still suffers from a 1.56 scaling factor. With the data source latency metrics, ScaAnalyzer identifies that the unscalable accesses are served from the shared layers (shared L3 caches and local memory) due to both limited capacity and limited bandwidth. The average latency per access is around 37 cycles on two cores and 57 cycles on four cores. The reason is that all threads are placed in one NUMA domain by default, incurring high contention. To address this issue, we place the threads across all NUMA domains in a round-robin manner instead of using just one domain. This replaces the default compact thread placement, and the threads benefit from more cache space and memory bandwidth. Moreover, because the previous optimization of IRSmk already matches the data distribution to the memory access patterns, utilizing multiple NUMA domains does not saturate the NUMA interconnects. Figure 11 shows the scalability measurement of IRSmk after this further optimization. The parallel efficiencies, on the left of the figure, are all higher than 75%. The scaling factor is reduced to almost 1, which means that we have fixed the memory scalability bottlenecks in IRSmk. In the right of the figure, we show the execution time of the optimized

IRSmk, with good scalability. It is worth noting that after addressing the memory scalability issues, the array regrouping optimization described in Section 2 becomes valid, showing a 3.4x speedup on 32 cores.

8.3 Linear Regression

Figure 12: The poor scalability of Linear Regression. The left figure shows the execution time, while the right figure shows the parallel efficiency.

Figure 13: ScaAnalyzer's differential analysis shows a high scaling factor. The latency metrics show that the scalability bottleneck comes from the private layer.

Figure 14: Linear Regression has good scalability in execution time (left) and parallel efficiency (right) after removing false sharing.

Figure 15: The original (left) and optimized (right) parallel efficiency of IRSmk running on the Intel machine.

Figure 12 shows the execution time and parallel efficiency of Linear Regression running on 1 to 16 cores on our AMD machine. Clearly, this benchmark does not scale even on two cores. The parallel efficiency is less than 10% when running on 16 cores. To identify the scaling bottlenecks, we leverage ScaAnalyzer to monitor executions of Linear Regression with four and eight threads. As shown in Figure 13, the scaling factor for the whole program is high, and for the memory accesses highlighted in the rectangles it is even higher. Therefore, accesses to the array args are unscalable. From the latency and data source information, ScaAnalyzer shows that more than 95% of the scaling losses come from the private layer, especially from private cache invalidations. According to our analysis in Section 3, the poor scalability comes from inefficient data sharing. We examine the allocation site of args and find that the number of array elements equals the number of threads. Each thread accesses one element to update its computation results. Each element is a structure. However, the structure is only 52 bytes, which is less than the cache line size of 64 bytes.
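With 52-byte elements packed back-to-back against a 64-byte cache line, neighboring elements inevitably straddle a line boundary; a quick arithmetic check (sizes from the text, helper names ours):

```c
#include <assert.h>

/* Do packed ELEM-byte elements i and j touch a common cache line?
 * Element i occupies bytes [i*ELEM, i*ELEM + ELEM - 1]. */
#define LINE 64
#define ELEM 52

static int first_line(int i) { return (i * ELEM) / LINE; }
static int last_line(int i)  { return (i * ELEM + ELEM - 1) / LINE; }

static int share_line(int i, int j) {
    return first_line(i) <= last_line(j) && first_line(j) <= last_line(i);
}
```

Element 0 ends at byte 51 and element 1 starts at byte 52, so both fall partly on cache line 0; padding each element to a multiple of 64 bytes makes the line ranges disjoint.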
Thus, two threads can access the same cache line, causing false sharing. With this insight, we optimize the code by padding the structure of args, separating the accesses from different threads onto different cache lines. Figure 14 shows the optimization results for Linear Regression. We can see that this benchmark achieves high scalability, with a parallel efficiency of more than 90%.

8.4 Scalability issues on the Intel machine

We run IRSmk on our Intel machine with 1 to 16 cores. Figure 15 shows the scalability improvement with the optimizations described in Section 8.2. IRSmk has good scalability when running on 1 to 8 cores. However, there is no significant speedup when running on 16 cores over 8 cores. ScaAnalyzer shows that the scaling factor is 1.3. The problem is in the shared layer. The average latency of accessing the shared L3 cache increases from 99 to 166 cycles; the average latency of accessing main memory increases from 2,493 to 3,523 cycles. The poor scaling comes from bandwidth contention in both the L3 cache and main memory. Addressing this problem requires hardware that scales the memory bandwidth.

benchmark  #cores  scalability issues                        speedup
LULESH     48      OS page initialization, NUMA contention   43
IRSmk      32      NUMA contention, shared cache contention  20
LR         16      false sharing                             5.7
Table 4: The speedups of all benchmarks we studied on the maximum number of cores after eliminating the memory scalability bottlenecks.

9. RELATED WORK

To the best of our knowledge, ScaAnalyzer is the first lightweight profiler to study memory scalability bottlenecks and provide insightful guidance for optimization.

9.1 Tools for lightweight memory analysis

We review existing lightweight tools that identify memory bottlenecks such as false sharing, poor cache locality, and NUMA bottlenecks. These tools provide insights to optimize programs based on sampling, incurring overhead ranging from 10% to 5x. To detect false sharing, existing tools [20, 12, 13] leverage sampling methods, via either software instrumentation or hardware PMUs. They monitor memory accesses and examine whether they write to the same cache line but different words. Moreover, tools like HPCToolkit [21], SLO [2], ThreadSpotter [32], Cache Scope [4], and MACPO [30] capture memory reuse distance information with lightweight sampling methods to identify memory bottlenecks with poor data locality. These tools map reuse distance information to source code to provide intuitive guidance for locality optimization. Finally, tools such as HPCToolkit-NUMA [22], MemProf [16], and Memphis [26] pinpoint NUMA bottlenecks with data collected from hardware PMUs. All of these tools focus on identifying individual memory bottlenecks in a single execution, without considering the program's scalability. Moreover, these tools do not give insight into which kinds of memory bottlenecks incur the most significant performance issues when we attempt to scale a program on a system with many cores. In contrast, ScaAnalyzer analyzes all kinds of memory bottlenecks related to the scalability problem. It performs scalability studies to understand how bottlenecks in different layers of the memory hierarchy hinder the whole program's scalability.

9.2 Tools for scalability bottlenecks

IPM [41] analyzes the scalability of parallel programs running on large-scale clusters, but only characterizes the scalability of the whole program, without providing any insight into fixing the scaling problems. Calotoiu et al. [6] developed a model to automatically identify scalability bottlenecks in MPI programs. HPCToolkit [7] identifies the scalability bottlenecks in parallel programs. It utilizes differential analysis [27] on the calling contexts to quantify the scaling losses.
Scal-Tool [37] uses lightweight hardware counters to analyze program scalability in distributed shared-memory systems. Scalasca [43] traces program executions on different numbers of cores to identify scaling bottlenecks in MPI communication routines. These tools can successfully pinpoint scalability bottlenecks in program source code to guide optimization. However, unlike ScaAnalyzer, they do not provide the insightful information needed to guide scalability optimization in memory hierarchies. Such information includes metrics to quantify the potential performance gains, attribution to identify problematic data objects and their accesses, and root-cause analysis to choose appropriate optimization methods. There is also work on studying the scalability of memory hierarchies using heavyweight memory instrumentation. CHiP [3] identifies code sections with cache contention by computing reuse distance. Wu et al. [42] also use reuse distance to study multi-core processor scaling. PSnAP [28] collects address streams to study memory performance under strong scaling. Unlike ScaAnalyzer, it does not associate performance bottlenecks with data objects or with specific layers of the memory hierarchy. All these tools incur high overhead to collect memory-related metrics; typically, such overhead can exceed 1000%. In contrast, ScaAnalyzer incurs less than 10% runtime overhead.

10. CONCLUSIONS AND FUTURE WORK

In conclusion, this paper describes ScaAnalyzer, a profiler to identify, quantify, and analyze scalability bottlenecks in the memory subsystem. ScaAnalyzer incurs little measurement overhead but provides insightful feedback for program optimization through novel differential analysis and root-cause analysis. Guided by ScaAnalyzer, we are able to pinpoint scalability bottlenecks in three parallel benchmarks and identify the causes of the bottlenecks in the memory hierarchies or operating systems.
The optimizations involve little code modification but achieve significant improvements in scalability for all of these programs. Table 4 shows the speedups of these benchmarks after optimization, when running on the maximum number of cores allowed in their configurations. Our future work is to extend ScaAnalyzer to perform auto-tuning of the program on the fly. Given its low overhead, ScaAnalyzer has the intrinsic features of an online analyzer to guide thread and data migration for better scalability. Moreover, we will study scalability bottlenecks in more complicated memory hierarchies, such as heterogeneous architectures consisting of both CPUs and accelerators.

Acknowledgements

This research was supported by the National Science Foundation (NSF) under Grant No.

REFERENCES

[1] B. Bao and C. Ding. Defensive loop tiling for shared cache. In CGO.
[2] K. Beyls and E. D'Hollander. Discovery of locality-improving refactorings by reuse path analysis. In Proc. of the 2nd Intl. Conf. on High Performance Computing and Communications (HPCC).
[3] B. Brett, P. Kumar, M. Kim, and H. Kim. CHiP: A profiler to measure the effect of cache contention on scalability. In IPDPS Workshops. IEEE.
[4] B. R. Buck and J. K. Hollingsworth. Data centric cache measurement on the Intel Itanium 2 processor. In SC.
[5] D. R. Butenhof. Programming with POSIX Threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[6] A. Calotoiu, T. Hoefler, M. Poke, and F. Wolf. Using automated performance modeling to find scalability bugs in complex codes. In SC.
[7] C. Coarfa, J. Mellor-Crummey, N. Froyd, and Y. Dotsenko. Scalability analysis of SPMD codes using expectations. In ICS.
[8] P. J. Drongowski. Instruction-based sampling: A new performance analysis technique for AMD Family 10h processors.
[9] R. Haring, M. Ohmacht, T. Fox, M. Gschwind, D. Satterfield, K. Sugavanam, P. Coteus, P. Heidelberger, M. Blumrich, R. Wisniewski, A. Gara, G.-T. Chiu, P. Boyle, N. Chist, and C. Kim.
The IBM Blue Gene/Q compute chip. IEEE Micro, 32(2):48-60.
[10] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2. June 2010.

[11] Intel Corporation. Intel Itanium Processor 9300 Series Reference Manual for Software Development and Optimization.
[12] Intel Corporation. Intel Performance Tuning Utility 4.0 Update 5. articles/intel-performance-tuning-utility. Last accessed: Aug. 10.
[13] S. Jayasena, S. Amarasinghe, A. Abeyweera, G. Amarasinghe, H. De Silva, S. Rathnayake, X. Meng, and Y. Liu. Detection of false sharing using machine learning. In SC.
[14] J. Jeffers and J. Reinders. Intel Xeon Phi Coprocessor High Performance Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition.
[15] A. Kleen. A NUMA API for Linux. 10/LibNUMA-WP-fv1.pdf. Last accessed: Dec. 12.
[16] R. Lachaize, B. Lepers, and V. Quéma. MemProf: A memory profiler for NUMA multicore systems. In USENIX ATC.
[17] Lawrence Livermore National Laboratory. Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). Last accessed: Dec. 12.
[18] Lawrence Livermore National Laboratory. LLNL Sequoia Benchmarks. Last accessed: Dec. 12.
[19] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In HPCA.
[20] T. Liu et al. PREDATOR: Predictive false sharing detection. In PPoPP.
[21] X. Liu and J. Mellor-Crummey. Pinpointing data locality bottlenecks with low overheads. In Proc. of the 2013 IEEE Intl. Symp. on Performance Analysis of Systems and Software.
[22] X. Liu and J. Mellor-Crummey. A tool to analyze the performance of multithreaded programs on NUMA architectures. In PPoPP.
[23] X. Liu and J. M. Mellor-Crummey. A data-centric profiler for parallel programs. In SC.
[24] X. Liu, K. Sharma, and J. Mellor-Crummey. ArrayTool: A lightweight profiler to guide array regrouping. In PACT.
[25] Z. Majo and T. R. Gross. Matching memory access patterns and data placement for NUMA systems. In CGO.
[26] C. McCurdy and J. S. Vetter.
Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms. In Proc. of the 2010 IEEE Intl. Symp. on Performance Analysis of Systems Software.
[27] P. E. McKenney. Differential profiling. Software: Practice and Experience, 29(3).
[28] C. R. M. Olschanowsky. HPC Application Address Stream Compression, Replay and Scaling. Ph.D. dissertation, University of California, San Diego.
[29] OpenMP Architecture Review Board. OpenMP Application Program Interface, Version 4.0.
[30] A. Rane and J. Browne. Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics. In PACT.
[31] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In HPCA.
[32] Rogue Wave Software. ThreadSpotter manual. Command=Core_Download&EntryId=1492. Last accessed: Dec. 12.
[33] SGI. SGI Altix UV 1000 System User's Guide. /pdf/ pdf. Last accessed: Mar. 30.
[34] SGI. Technical advances in the SGI UV architecture. Last accessed: Mar. 30.
[35] B. Sinharoy et al. IBM POWER7 multicore server processor. IBM JRD, 55(3):1:1-29.
[36] D. D. Sleator and R. E. Tarjan. Self-adjusting binary search trees. Journal of the ACM, 32(3).
[37] Y. Solihin, V. Lam, and J. Torrellas. Scal-Tool: Pinpointing and quantifying scalability bottlenecks in DSM multiprocessors. In SC.
[38] M. Srinivas et al. IBM POWER7 performance modeling, verification, and evaluation. IBM JRD, 55(3):4:1-19.
[39] N. R. Tallent, L. Adhianto, and J. M. Mellor-Crummey. Scalable identification of load imbalance in parallel executions using call path profiles. In SC.
[40] N. R. Tallent, J. Mellor-Crummey, and M. W. Fagan. Binary analysis for measurement and attribution of program performance. In PLDI.
[41] N. J. Wright, W. Pfeiffer, and A. Snavely. Characterizing parallel scaling of scientific applications using IPM. In Proc.
of the 10th LCI International Conference on High-Performance Clustered Computing.
[42] M.-J. Wu, M. Zhao, and D. Yeung. Studying multicore processor scaling via reuse distance analysis. In ISCA.
[43] B. J. N. Wylie, M. Geimer, and F. Wolf. Performance measurement and analysis of large-scale parallel applications on leadership computing systems. Sci. Program., 16(2-3).
[44] Q. Zhao, D. Koh, S. Raza, D. Bruening, W.-F. Wong, and S. Amarasinghe. Dynamic cache contention detection in multi-threaded applications. In VEE, 2011.


More information

Scheduling Task Parallelism" on Multi-Socket Multicore Systems"

Scheduling Task Parallelism on Multi-Socket Multicore Systems Scheduling Task Parallelism" on Multi-Socket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support

Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support Ranjit Noronha and Dhabaleswar K. Panda Network Based Computing Laboratory (NBCL) The Ohio State University Outline

More information

Optimizing Linux for Dual-Core AMD Opteron Processors

Optimizing Linux for Dual-Core AMD Opteron Processors Technical White Paper DATA CENTER Optimizing Linux for Dual-Core * AMD Opteron Processors Optimizing Linux for Dual-Core AMD Opteron Processors Table of Contents: 2.... SUSE Linux Enterprise and the AMD

More information

A Review of Customized Dynamic Load Balancing for a Network of Workstations

A Review of Customized Dynamic Load Balancing for a Network of Workstations A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester

More information

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization An Experimental Model to Analyze OpenMP Applications for System Utilization Mark Woodyard Principal Software Engineer 1 The following is an overview of a research project. It is intended

More information

Basics of VTune Performance Analyzer. Intel Software College. Objectives. VTune Performance Analyzer. Agenda

Basics of VTune Performance Analyzer. Intel Software College. Objectives. VTune Performance Analyzer. Agenda Objectives At the completion of this module, you will be able to: Understand the intended purpose and usage models supported by the VTune Performance Analyzer. Identify hotspots by drilling down through

More information

Full and Para Virtualization

Full and Para Virtualization Full and Para Virtualization Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF x86 Hardware Virtualization The x86 architecture offers four levels

More information

A Topology-Aware Performance Monitoring Tool for Shared Resource Management in Multicore Systems

A Topology-Aware Performance Monitoring Tool for Shared Resource Management in Multicore Systems A Topology-Aware Performance Monitoring Tool for Shared Resource Management in Multicore Systems TADaaM Team - Nicolas Denoyelle - Brice Goglin - Emmanuel Jeannot August 24, 2015 1. Context/Motivations

More information

Delivering Quality in Software Performance and Scalability Testing

Delivering Quality in Software Performance and Scalability Testing Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,

More information

Performance Monitoring of Parallel Scientific Applications

Performance Monitoring of Parallel Scientific Applications Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure

More information

Parallel Processing and Software Performance. Lukáš Marek

Parallel Processing and Software Performance. Lukáš Marek Parallel Processing and Software Performance Lukáš Marek DISTRIBUTED SYSTEMS RESEARCH GROUP http://dsrg.mff.cuni.cz CHARLES UNIVERSITY PRAGUE Faculty of Mathematics and Physics Benchmarking in parallel

More information

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier. Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance M. Rangarajan, A. Bohra, K. Banerjee, E.V. Carrera, R. Bianchini, L. Iftode, W. Zwaenepoel. Presented

More information

Oracle Solaris Studio Code Analyzer

Oracle Solaris Studio Code Analyzer Oracle Solaris Studio Code Analyzer The Oracle Solaris Studio Code Analyzer ensures application reliability and security by detecting application vulnerabilities, including memory leaks and memory access

More information

MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp

MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source

More information

UTS: An Unbalanced Tree Search Benchmark

UTS: An Unbalanced Tree Search Benchmark UTS: An Unbalanced Tree Search Benchmark LCPC 2006 1 Coauthors Stephen Olivier, UNC Jun Huan, UNC/Kansas Jinze Liu, UNC Jan Prins, UNC James Dinan, OSU P. Sadayappan, OSU Chau-Wen Tseng, UMD Also, thanks

More information

Running a Workflow on a PowerCenter Grid

Running a Workflow on a PowerCenter Grid Running a Workflow on a PowerCenter Grid 2010-2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)

More information

An Oracle White Paper September 2013. Advanced Java Diagnostics and Monitoring Without Performance Overhead

An Oracle White Paper September 2013. Advanced Java Diagnostics and Monitoring Without Performance Overhead An Oracle White Paper September 2013 Advanced Java Diagnostics and Monitoring Without Performance Overhead Introduction... 1 Non-Intrusive Profiling and Diagnostics... 2 JMX Console... 2 Java Flight Recorder...

More information

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance. Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance

More information

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures 11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

OpenMP and Performance

OpenMP and Performance Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Tuning Cycle Performance Tuning aims to improve the runtime of an

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Kathlene Hurt and Eugene John Department of Electrical and Computer Engineering University of Texas at San Antonio

More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Session # LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton 3, Onur Celebioglu

More information

Scalability evaluation of barrier algorithms for OpenMP

Scalability evaluation of barrier algorithms for OpenMP Scalability evaluation of barrier algorithms for OpenMP Ramachandra Nanjegowda, Oscar Hernandez, Barbara Chapman and Haoqiang H. Jin High Performance Computing and Tools Group (HPCTools) Computer Science

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

Multicore Parallel Computing with OpenMP

Multicore Parallel Computing with OpenMP Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

An Oracle White Paper March 2013. Load Testing Best Practices for Oracle E- Business Suite using Oracle Application Testing Suite

An Oracle White Paper March 2013. Load Testing Best Practices for Oracle E- Business Suite using Oracle Application Testing Suite An Oracle White Paper March 2013 Load Testing Best Practices for Oracle E- Business Suite using Oracle Application Testing Suite Executive Overview... 1 Introduction... 1 Oracle Load Testing Setup... 2

More information

Accelerating CST MWS Performance with GPU and MPI Computing. CST workshop series

Accelerating CST MWS Performance with GPU and MPI Computing.  CST workshop series Accelerating CST MWS Performance with GPU and MPI Computing www.cst.com CST workshop series 2010 1 Hardware Based Acceleration Techniques - Overview - Multithreading GPU Computing Distributed Computing

More information

Parallel Large-Scale Visualization

Parallel Large-Scale Visualization Parallel Large-Scale Visualization Aaron Birkland Cornell Center for Advanced Computing Data Analysis on Ranger January 2012 Parallel Visualization Why? Performance Processing may be too slow on one CPU

More information

Introduction 1 Performance on Hosted Server 1. Benchmarks 2. System Requirements 7 Load Balancing 7

Introduction 1 Performance on Hosted Server 1. Benchmarks 2. System Requirements 7 Load Balancing 7 Introduction 1 Performance on Hosted Server 1 Figure 1: Real World Performance 1 Benchmarks 2 System configuration used for benchmarks 2 Figure 2a: New tickets per minute on E5440 processors 3 Figure 2b:

More information

Resource Utilization of Middleware Components in Embedded Systems

Resource Utilization of Middleware Components in Embedded Systems Resource Utilization of Middleware Components in Embedded Systems 3 Introduction System memory, CPU, and network resources are critical to the operation and performance of any software system. These system

More information

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...

More information

MAQAO Performance Analysis and Optimization Tool

MAQAO Performance Analysis and Optimization Tool MAQAO Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Evaluation Team, University of Versailles S-Q-Y http://www.maqao.org VI-HPS 18 th Grenoble 18/22

More information

How System Settings Impact PCIe SSD Performance

How System Settings Impact PCIe SSD Performance How System Settings Impact PCIe SSD Performance Suzanne Ferreira R&D Engineer Micron Technology, Inc. July, 2012 As solid state drives (SSDs) continue to gain ground in the enterprise server and storage

More information

Improving the performance of data servers on multicore architectures. Fabien Gaud

Improving the performance of data servers on multicore architectures. Fabien Gaud Improving the performance of data servers on multicore architectures Fabien Gaud Grenoble University Advisors: Jean-Bernard Stefani, Renaud Lachaize and Vivien Quéma Sardes (INRIA/LIG) December 2, 2010

More information

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM 152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

Perfmon2: A leap forward in Performance Monitoring

Perfmon2: A leap forward in Performance Monitoring Perfmon2: A leap forward in Performance Monitoring Sverre Jarp, Ryszard Jurga, Andrzej Nowak CERN, Geneva, Switzerland Sverre.Jarp@cern.ch Abstract. This paper describes the software component, perfmon2,

More information

Intel DPDK Boosts Server Appliance Performance White Paper

Intel DPDK Boosts Server Appliance Performance White Paper Intel DPDK Boosts Server Appliance Performance Intel DPDK Boosts Server Appliance Performance Introduction As network speeds increase to 40G and above, both in the enterprise and data center, the bottlenecks

More information

Achieving a Million I/O Operations per Second from a Single VMware vsphere 5.0 Host

Achieving a Million I/O Operations per Second from a Single VMware vsphere 5.0 Host Achieving a Million I/O Operations per Second from a Single VMware vsphere 5.0 Host Performance Study TECHNICAL WHITE PAPER Table of Contents Introduction... 3 Executive Summary... 3 Software and Hardware...

More information

Locating Cache Performance Bottlenecks Using Data Profiling

Locating Cache Performance Bottlenecks Using Data Profiling Locating Cache Performance Bottlenecks Using Data Profiling Aleksey Pesterev Nickolai Zeldovich Robert T. Morris Massachusetts Institute of Technology Computer Science and Artificial Intelligence Lab {alekseyp,

More information

StACC: St Andrews Cloud Computing Co laboratory. A Performance Comparison of Clouds. Amazon EC2 and Ubuntu Enterprise Cloud

StACC: St Andrews Cloud Computing Co laboratory. A Performance Comparison of Clouds. Amazon EC2 and Ubuntu Enterprise Cloud StACC: St Andrews Cloud Computing Co laboratory A Performance Comparison of Clouds Amazon EC2 and Ubuntu Enterprise Cloud Jonathan S Ward StACC (pronounced like 'stack') is a research collaboration launched

More information

White Paper. Recording Server Virtualization

White Paper. Recording Server Virtualization White Paper Recording Server Virtualization Prepared by: Mike Sherwood, Senior Solutions Engineer Milestone Systems 23 March 2011 Table of Contents Introduction... 3 Target audience and white paper purpose...

More information

FLOW-3D Performance Benchmark and Profiling. September 2012

FLOW-3D Performance Benchmark and Profiling. September 2012 FLOW-3D Performance Benchmark and Profiling September 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: FLOW-3D, Dell, Intel, Mellanox Compute

More information

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009 Performance Study Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009 Introduction With more and more mission critical networking intensive workloads being virtualized

More information

Big Graph Processing: Some Background

Big Graph Processing: Some Background Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

More information

Copyright www.agileload.com 1

Copyright www.agileload.com 1 Copyright www.agileload.com 1 INTRODUCTION Performance testing is a complex activity where dozens of factors contribute to its success and effective usage of all those factors is necessary to get the accurate

More information

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0)

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx

More information

Networking in the Hadoop Cluster

Networking in the Hadoop Cluster Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop

More information

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

benchmarking Amazon EC2 for high-performance scientific computing

benchmarking Amazon EC2 for high-performance scientific computing Edward Walker benchmarking Amazon EC2 for high-performance scientific computing Edward Walker is a Research Scientist with the Texas Advanced Computing Center at the University of Texas at Austin. He received

More information

Hardware performance monitoring. Zoltán Majó

Hardware performance monitoring. Zoltán Majó Hardware performance monitoring Zoltán Majó 1 Question Did you take any of these lectures: Computer Architecture and System Programming How to Write Fast Numerical Code Design of Parallel and High Performance

More information

A Data Structure Oriented Monitoring Environment for Fortran OpenMP Programs

A Data Structure Oriented Monitoring Environment for Fortran OpenMP Programs A Data Structure Oriented Monitoring Environment for Fortran OpenMP Programs Edmond Kereku, Tianchao Li, Michael Gerndt, and Josef Weidendorfer Institut für Informatik, Technische Universität München,

More information

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Measuring Cache and Memory Latency and CPU to Memory Bandwidth White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary

More information

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC

More information

Performance Tuning Guidelines for Relational Database Mappings

Performance Tuning Guidelines for Relational Database Mappings Performance Tuning Guidelines for Relational Database Mappings 1993-2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying,

More information

Chapter 7 Memory Management

Chapter 7 Memory Management Operating Systems: Internals and Design Principles Chapter 7 Memory Management Eighth Edition William Stallings Frame Page Segment A fixed-length block of main memory. A fixed-length block of data that

More information

Building Scalable Applications Using Microsoft Technologies

Building Scalable Applications Using Microsoft Technologies Building Scalable Applications Using Microsoft Technologies Padma Krishnan Senior Manager Introduction CIOs lay great emphasis on application scalability and performance and rightly so. As business grows,

More information

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage

More information

x64 Servers: Do you want 64 or 32 bit apps with that server?

x64 Servers: Do you want 64 or 32 bit apps with that server? TMurgent Technologies x64 Servers: Do you want 64 or 32 bit apps with that server? White Paper by Tim Mangan TMurgent Technologies February, 2006 Introduction New servers based on what is generally called

More information

In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps. Yu Su, Yi Wang, Gagan Agrawal The Ohio State University

In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps. Yu Su, Yi Wang, Gagan Agrawal The Ohio State University In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps Yu Su, Yi Wang, Gagan Agrawal The Ohio State University Motivation HPC Trends Huge performance gap CPU: extremely fast for generating

More information

KVM & Memory Management Updates

KVM & Memory Management Updates KVM & Memory Management Updates KVM Forum 2012 Rik van Riel Red Hat, Inc. KVM & Memory Management Updates EPT Accessed & Dirty Bits 1GB hugepages Balloon vs. Transparent Huge Pages Automatic NUMA Placement

More information

VMWARE WHITE PAPER 1

VMWARE WHITE PAPER 1 1 VMWARE WHITE PAPER Introduction This paper outlines the considerations that affect network throughput. The paper examines the applications deployed on top of a virtual infrastructure and discusses the

More information

Dual Core Architecture: The Itanium 2 (9000 series) Intel Processor

Dual Core Architecture: The Itanium 2 (9000 series) Intel Processor Dual Core Architecture: The Itanium 2 (9000 series) Intel Processor COE 305: Microcomputer System Design [071] Mohd Adnan Khan(246812) Noor Bilal Mohiuddin(237873) Faisal Arafsha(232083) DATE: 27 th November

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION A DIABLO WHITE PAPER AUGUST 2014 Ricky Trigalo Director of Business Development Virtualization, Diablo Technologies

More information

Boosting Long Term Evolution (LTE) Application Performance with Intel System Studio

Boosting Long Term Evolution (LTE) Application Performance with Intel System Studio Case Study Intel Boosting Long Term Evolution (LTE) Application Performance with Intel System Studio Challenge: Deliver high performance code for time-critical tasks in LTE wireless communication applications.

More information

A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin

A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin A Comparison Of Shared Memory Parallel Programming Models Jace A Mogill David Haglin 1 Parallel Programming Gap Not many innovations... Memory semantics unchanged for over 50 years 2010 Multi-Core x86

More information

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,

More information

Performance Monitoring of the Software Frameworks for LHC Experiments

Performance Monitoring of the Software Frameworks for LHC Experiments Proceedings of the First EELA-2 Conference R. mayo et al. (Eds.) CIEMAT 2009 2009 The authors. All rights reserved Performance Monitoring of the Software Frameworks for LHC Experiments William A. Romero

More information

Energy-aware Memory Management through Database Buffer Control

Energy-aware Memory Management through Database Buffer Control Energy-aware Memory Management through Database Buffer Control Chang S. Bae, Tayeb Jamel Northwestern Univ. Intel Corporation Presented by Chang S. Bae Goal and motivation Energy-aware memory management

More information

Performance Tuning Guidelines for PowerExchange for Microsoft Dynamics CRM

Performance Tuning Guidelines for PowerExchange for Microsoft Dynamics CRM Performance Tuning Guidelines for PowerExchange for Microsoft Dynamics CRM 1993-2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying,

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...

More information

Recommended hardware system configurations for ANSYS users

Recommended hardware system configurations for ANSYS users Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range

More information

Lustre Networking BY PETER J. BRAAM

Lustre Networking BY PETER J. BRAAM Lustre Networking BY PETER J. BRAAM A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC. APRIL 2007 Audience Architects of HPC clusters Abstract This paper provides architects of HPC clusters with information

More information

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin D. K. Panda Network Based

More information