

PERFORMANCE ANALYSIS AND OPTIMIZATION OF LARGE-SCALE SCIENTIFIC APPLICATIONS

BY

JINGJIN WU

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the Illinois Institute of Technology

Approved: Advisor

Chicago, Illinois
July 2013


ACKNOWLEDGMENT

First and foremost, I would like to thank my advisor, Prof. Zhiling Lan. Without her guidance and support throughout my Ph.D. study, this dissertation would not have been accomplished. I appreciate all her research direction, technical criticism, and valuable feedback from innumerable office discussions. Her intelligence, wisdom, and diligence are contagious and motivational for me. I will always be in her debt and admire her for the great person she is. I would also like to thank Prof. Nickolay Y. Gnedin, Prof. Andrey V. Kravtsov, Dr. Douglas H. Rudd, and Dr. Roberto E. González, who have also guided me and worked with me on research related to this thesis. I am also thankful to my thesis committee, Prof. Xian-He Sun, Prof. Ioan Raicu, and Prof. Jia Wang, for their time, interest, suggestions, and comments. My special thanks go to my friends and colleagues in the SCS group for their friendship and encouragement. Last, but not least, I would like to thank my husband Xuanxing and our parents for their support at all times, and I hope that I can make them proud.

TABLE OF CONTENTS

ACKNOWLEDGMENT
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
ABSTRACT

CHAPTER
1. INTRODUCTION
   1.1. Motivation
   1.2. Contributions
   1.3. Thesis Organization
2. COSMOLOGY SIMULATIONS
   2.1. Adaptive Mesh Refinement
   2.2. The Adaptive Refinement Tree (ART) Code
   2.3. Performance Analysis
3. PERFORMANCE EMULATION
   3.1. Proposed Approach
   3.2. Load Balancing Schemes
   3.3. Experiments
4. OVERVIEW OF TOPOLOGY MAPPING
   4.1. Background
   4.2. The Topology Mapping Problem
   4.3. Related Works
5. HIERARCHICAL TOPOLOGY MAPPING FOR ART
   5.1. Problem Statement
   5.2. Proposed Approach
   5.3. Performance Evaluation
6. NETWORK AND MULTICORE AWARE TOPOLOGY MAPPING
   6.1. Proposed Methodology
   6.2. Inter-Node Mapping
   6.3. Intra-Node Mapping
   6.4. Performance Evaluation
7. ANALYTICAL TOPOLOGY MAPPING
   7.1. Preliminary
   7.2. Algorithm Overview
   7.3. Global Mapping
   7.4. Legalization
   7.5. Performance Evaluation
8. TOPOMAP: A TOPOLOGY MAPPING LIBRARY
9. CONCLUSION AND FUTURE WORK

BIBLIOGRAPHY

LIST OF TABLES

3.1 Overall Load Balance Ratio of Different Load Balancing Schemes
Average Intra-Socket and Inter-Socket Communication Time (Ping-Ping)
Properties of Sparse Matrices
Overhead of Topology Mapping on Stampede (Time in Seconds)
Overhead of Topology Mapping on Kraken by Using the Recursive Bipartitioning Mapping Algorithm (Time in Seconds)
Overhead of Topology Mapping on Kraken by Using the Recursive Tree Mapping Algorithm (Time in Seconds)
The Relations between Topology Mapping and VLSI Placement
Topology of the Blue Gene/P Supercomputer
Overhead of the Proposed Analytical Mapping Algorithm. # Proc.: the number of processes; Time: the total runtime in seconds; # Iter.: the number of iterations in the global mapping stage

LIST OF FIGURES

2.1 A 2D block-structured adaptive mesh refinement example with a refinement factor r = 2
2.2 A 2D cell-based adaptive mesh refinement example: a quad tree with a refinement factor r = 2
2.3 An example of large-scale ART simulation. Three panels show the dark matter (left), cosmic gas (middle) and stars (right), respectively
2.4 Flow of the ART code. The four steps in the dotted region are the major steps for evolving a time step at each level
2.5 The execution order for evolving time steps on three levels (refinement factor r = 2)
2.6 A space-filling curve traversing a 2D mesh (left), and a parallel partition of the 2D adaptive mesh into four parts (right). The cells with the same color are assigned to the same process
2.7 The communication pattern of ART during a large-scale cosmology simulation with 1536 processes
2.8 Execution time of the ART code on Abe for simulating a single iteration of gf25cv1 at a late physical epoch with a medium amount of particles
2.9 Execution time of the ART code on Ranger for simulating a single iteration of gf25cv1 at an early physical epoch with a large amount of particles
2.10 Total parallel scaling of the ART code on Abe
2.11 Total parallel scaling of the ART code on Ranger
3.1 Flow of performance emulator of ART
3.2 Part of the time axes of two processes
3.3 Comparison of actual runtime and emulated runtime of ART
3.4 Comparison of emulated runtime by using different load balancing schemes for coarse resolution case
3.5 Comparison of emulated runtime by using different load balancing schemes for fine resolution case
3.6 Load Balance Ratio of each level for coarse resolution case
3.7 Load Balance Ratio of each level for fine resolution case
3.8 Average communication time of each level for coarse resolution case
3.9 Average communication time of each level for fine resolution case
Interconnected multiprocessor clusters with multicore CPUs on each node
The recursive bipartitioning algorithm for inter-node mapping
The algorithm for intra-node mapping by minimizing the maximum inter-socket message size (MIMS)
The intra-node topology of Ranger (from TACC Ranger website [79])
Comparison of inter-node mapping and default mapping on Kraken in terms of hop-bytes
Comparison of intra-node mapping and default mapping on Kraken in terms of MIMS (the maximum inter-socket message size)
Comparison of hierarchical mapping and inter-node mapping on Kraken in terms of MIMS (the maximum inter-socket message size)
Communication time reduction of different mapping mechanisms compared to default mapping on Kraken
Comparison of inter-node mapping and default mapping on Ranger in terms of hop-bytes
Comparison of intra-node mapping and default mapping on Ranger in terms of MIMS (the maximum inter-socket message size)
Comparison of hierarchical mapping and inter-node mapping on Ranger in terms of MIMS (the maximum inter-socket message size)
Communication time reduction of different mapping mechanisms compared to default mapping on Ranger
High performance computing systems with multicore CPU(s) on each compute node
Topology mapping framework
A fat-tree network topology, and the neighbor joining tree representing the topology of the nodes allocated to a user application (light green)
6.4 The recursive tree mapping algorithm
A 2D mesh/torus topology, and the neighbor joining tree representing the topology of the nodes allocated to a user application (light green)
Recursive bipartitioning of a 2D mesh/torus topology. The nodes allocated to the user application are light green
The recursive bipartitioning mapping algorithm for inter-node mapping on mesh/torus topology
A machine architecture and its corresponding topology tree
Hop-bytes reduction of sparse matrix tests on Stampede
Communication time reduction of sparse matrix tests on Stampede
Hop-bytes reduction of sparse matrix tests on Kraken
Communication time reduction of sparse matrix tests on Kraken
Performance results of ART tests on Stampede
Performance results of ART tests on Kraken
The analytical mapping algorithm
The process migration algorithm
The communication pattern of SP, CG and ART (1024 processes). nz is the number of blue dots, which indicates the number of communicating process pairs
NPB benchmark results: SP
NPB benchmark results: CG
Cosmology application results
Flow of the topology mapping library TOPOMap

LIST OF SYMBOLS

Symbol      Definition
c(u, v)     The amount of communication in bytes between task u and v
E_p         The set of edges in the topology graph
E_t         The set of edges in the task graph
G_p         The topology graph
G_t         The task graph
V_p         The set of processors
V_t         The set of tasks
ϕ           The mapping from tasks in V_t to processors in V_p

ABSTRACT

Scientific applications are critical for solving complex problems in many areas of research, and often require a large amount of computing resources in terms of both runtime and memory. Massively parallel supercomputers with ever increasing computing power are being built to satisfy the needs of large-scale scientific applications. With the advent of the petascale era, there is an enlarged gap between the computing power of supercomputers and the parallel scalability of many applications. To take full advantage of the massive parallelism of supercomputers, it is indispensable to improve the scalability of large-scale scientific applications through performance analysis and optimization. This thesis work is motivated by cell-based AMR (Adaptive Mesh Refinement) cosmology simulations, in particular the Adaptive Refinement Tree (ART) application. Performance analysis is performed to identify its scaling bottleneck, a performance emulator is designed for efficient evaluation of different load balancing schemes, and topology mapping strategies are explored for performance improvements. More importantly, the exploration of topology mapping mechanisms leads to a generic methodology for network and multicore aware topology mapping, and a set of efficient mapping algorithms for popular topologies. These have been implemented in a topology mapping library, TOPOMap, which can be used to support MPI topology functions.

CHAPTER 1
INTRODUCTION

1.1 Motivation

In recent decades, numerical simulation has become critical for scientific and engineering research. As an important complement to theory and experiment, it has been successfully utilized to study complex phenomena in many areas, such as computational fluid dynamics, electromagnetics, materials science, meteorology, cosmology, and so on. The resultant scientific applications are often highly complicated, and require a large amount of computing resources in terms of both runtime and memory, especially for large-scale simulations. To satisfy the needs of large-scale scientific applications, massively parallel high performance computing (HPC) systems with ever increasing computing power are being built. In the current TOP500 list [81] (June 2013), all the supercomputers have at least several thousand cores. The No. 1 supercomputer, Tianhe-2, has 3,120,000 cores, achieving a Linpack performance (Rmax) of 33,862.7 TFlop/s. More petascale machines will be available in the near future.

In order to take full advantage of the massive parallelism of supercomputers, it is essential to improve the scalability of scientific applications through performance optimization. However, this is very challenging. One important issue is load balancing, which becomes increasingly important as the system size scales up. In practice, the scaling bottleneck of applications is often attributable to an imbalanced workload distribution among processes. Although load balancing schemes have been extensively studied in both theory and application during the past decades, it is still highly non-trivial to achieve ideal load balance on large-scale systems, for the following reasons. First, fine-grained load balancing should be adopted to reduce the imbalance, but this often increases the time complexity of load balancing routines. Second, many scientific simulations are highly dynamic, with the

workload distribution changing dynamically throughout the execution. Hence, highly efficient and scalable (dynamic) load balancing algorithms should be integrated into applications. To achieve this goal, we need to explore and experiment with different load balancing algorithms. However, load balancing schemes are tightly related to the problem decomposition and data structures of the application. Developers are often confined to a few choices, which blocks the exploration of more efficient solutions. To explore new schemes, major code changes are often required, which is costly and risky if the expected performance is not evaluated beforehand. Besides, it often takes a long runtime to evaluate the performance of different schemes through real runs. As a result, we need a smarter way to evaluate various load balancing schemes accurately and efficiently.

Communication optimization is another important issue. If the workload distribution is well balanced, then the communication time often determines the overall scaling of the application. There are at least three approaches to reduce the communication cost. First, we can aggregate multiple small messages into comparatively large messages to fully utilize the bandwidth of the interconnection network. This can substantially reduce the overall communication latency. Second, communication should be overlapped with computation whenever possible, so that the communication overhead can be hidden. These two approaches have been widely used by application developers to improve communication performance. The third approach is topology-aware task mapping [12], or simply topology mapping, which is relatively under-exploited, but critical for scaling on petascale systems and beyond. According to the topology of the interconnection network and the communication pattern of the application, topology-aware task mapping techniques map parallel application tasks (i.e., processes) onto processors (i.e., nodes or cores) properly to minimize the communication cost. The motivation is that message latencies are highly

dependent on the distance between communicating processors and on network contention. From the HPC system perspective, the interconnection networks are always sparse, e.g., fat-tree [55], 3D mesh, or 3D torus [1, 2]. As the system scales up, the diameter of the interconnection network (i.e., the maximum distance between two nodes) increases, and the bisection bandwidth (i.e., the minimum total bandwidth of links connecting one half of the HPC system and the other) often decreases, making communication increasingly expensive. Considering that parallel scientific applications usually have sparse communication patterns, it is critical to map processes onto nodes properly, so that the traffic in the network is localized, leading to better communication performance.

There are other aspects of performance optimization of scientific applications, e.g., I/O optimization [97], fault tolerance [56], and so on. This thesis work focuses on load balancing and communication optimization, as they are the dominating factors for the performance of many applications. Specifically, we consider cosmology simulations, and study the Adaptive Refinement Tree (ART) code [47], which is an advanced hydro+N-body simulation tool for cosmological research. Simulations are critical to making quantum leaps in cosmological research as they provide insight into the evolution of the universe, e.g., the formation of stars and galaxies. There are mainly two categories of cosmology simulation tools: those that only simulate the dark matter (often referred to as "N-body"), and those that model gas dynamics (often called "hydro"). Since cosmologically relevant scales are mainly dependent on the dark matter, a hydro simulation tool always includes an N-body component for modeling the dark matter. Typically, hydro simulations are much more compute-intensive than purely N-body simulations. Modern hydro simulations become even more computationally demanding in terms of both runtime and memory as more and more physical processes are included, e.g., gas

cooling, star formation and feedback, radiative transfer, and so on. Adaptive mesh refinement (AMR) [8, 68] has been widely applied to model the dynamics of cosmic baryons (gas and stars) in cosmology simulations, since it can follow the fragmentation of gas down to virtually unlimited small scales. Enzo [32] and the ART code are representative cosmology simulation tools using AMR. In particular, the ART code uses the cell-based AMR algorithm [46, 88], and incorporates many physical calculations for cosmology research, making it unique in its capabilities.

As cosmology simulations usually consume a large amount of computing resources in terms of both runtime and memory, they are typically carried out on massively parallel high performance computing (HPC) platforms, e.g., HPC clusters. Production simulations using the ART code often involve physics computations for thousands of time steps, and can take several weeks or even months using hundreds of processing cores on production HPC platforms. It is important to further improve the ART code, so that it can scale up to petascale systems and enable cosmologists to solve complex problems more efficiently.

1.2 Contributions

To improve the ART code, we analyze its performance on production HPC systems to identify its performance bottleneck, design a performance emulator for efficient evaluation of load balancing schemes, and explore topology-aware task mapping techniques for performance optimization. Moreover, we also propose a generic methodology for network and multicore aware topology mapping, and develop a set of efficient mapping algorithms for popular topologies. The major contributions of this dissertation are as follows.

Performance Emulation. A performance emulator is designed for cell-based AMR cosmology simulations. The emulator follows the flow of the original

application, i.e., the ART code, while the major physical computation and inter-process communication are replaced by runtime estimates provided by performance models. We demonstrate the effectiveness of the emulator by means of realistic cosmology simulation data on production systems. Experiments indicate that the emulator achieves good accuracy. Given the dynamic nature of AMR and the wide range of workload per cell, load balancing is an extremely challenging problem for cell-based cosmology simulations. We further evaluate three load balancing schemes for cell-based AMR cosmology simulations via the performance emulator. The use of the emulator enables us to quickly identify the issues associated with different load balancing schemes.

Hierarchical Topology Mapping for ART. A hierarchical topology mapping algorithm is proposed to map ART processes onto multiprocessor clusters, where each node contains several multicore CPUs. In order to exploit the architectural properties of multiprocessor clusters (the performance gap between inter-node and intra-node communication, as well as the gap between inter-socket and intra-socket communication), we propose to perform topology mapping in a hierarchical manner. First, the mapping of processes onto nodes (i.e., inter-node mapping) is obtained by using the recursive bipartitioning technique to minimize the amount of traffic in the network. Second, for each node, the mapping of processes onto multicore CPUs (i.e., intra-node mapping) is derived by minimizing the maximum size of messages transmitted between CPU sockets. Experiments on production HPC systems show that significant communication time reduction can be achieved by using the optimized mappings. This hierarchical approach has wide applicability for cell-based AMR cosmology simulations, and the general methodology of performing hierarchical mapping can be used for many parallel applications.

Network and Multicore Aware Topology Mapping. A generic topology mapping methodology, which considers both the topology of the interconnection network and the hierarchical architecture of multicore compute nodes, is proposed for communication optimization of parallel applications on HPC systems. It is a broad extension of the hierarchical topology mapping of ART. Specifically, the mapping is still performed in two phases. In the first phase, the mapping of processes onto compute nodes is derived by utilizing efficient algorithms, which exploit the structure of the interconnection network to reduce inter-node message traffic. In the second phase, the mapping of processes onto logical processors is determined by partitioning the communication graph according to the intra-node hierarchical architecture, which is represented as a tree. We consider supercomputers with popular fat-tree and mesh/torus topologies, and develop efficient mapping algorithms, including a recursive tree mapping algorithm for generic tree topologies, and a recursive bipartitioning algorithm, which partitions the compute nodes of mesh/torus topologies by using their geometric coordinates. These have been integrated in the topology mapping library TOPOMap, which can be employed to support MPI topology functions. Experiments on production HPC systems show that significant performance gains can be achieved by using TOPOMap to find optimized mappings.

Analytical Topology Mapping. An analytical mapping algorithm is proposed for topology mapping onto 3D mesh/torus-connected supercomputers. The design of this algorithm is motivated by mapping applications with irregular communication patterns onto regular 3D mesh/torus topologies, and it is inspired by the analytical placement technique [84, 85] for VLSI design. Specifically, a two-stage strategy is used to determine the mapping. The first stage derives a rough global mapping solution by using quadratic programming, which intends to minimize the amount of traffic in the network. The second stage legalizes

the rough global mapping solution to get a proper mapping by using a diffusion-like migration algorithm to move processes between nodes. The resulting mapping is highly optimized due to the global optimization through quadratic programming. Experiments with popular benchmarks and the ART application on an IBM Blue Gene/P system [41] show that the analytical mapping algorithm derives highly optimized mappings, which lead to significant performance gains.

1.3 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 introduces the Adaptive Refinement Tree (ART) application, and analyzes its performance. Chapter 3 elaborates the design of the performance emulator and the evaluation of different load balancing schemes. Chapter 4 presents an overview of topology mapping techniques for communication optimization of HPC applications. Chapter 5 proposes the hierarchical topology mapping algorithm for ART. Chapter 6 introduces the methodology and algorithms for network and multicore aware topology mapping. Chapter 7 proposes the analytical topology mapping algorithm for regular 3D mesh/torus topologies. Chapter 8 introduces the topology mapping library TOPOMap. Finally, Chapter 9 summarizes this thesis and discusses future work.

CHAPTER 2
COSMOLOGY SIMULATIONS

This chapter introduces a cosmology simulation application, the Adaptive Refinement Tree (ART) code, which is based on adaptive mesh refinement (AMR) [8, 68], and presents its performance analysis on production HPC systems.

2.1 Adaptive Mesh Refinement

Numerical simulations of many multiscale physical phenomena consume enormous computing resources in terms of both runtime and memory, because their multiple spatial and/or temporal scales are discretized at the finest resolution for solution accuracy. This excessive resource requirement usually makes large-scale numerical simulations prohibitive. However, some resources are often underutilized, being spent on the computation of spatial and/or temporal regions which do not require the finest resolution. To achieve better computing efficiency, AMR [8, 68] has been proposed to employ high resolution only in those regions that require it, so that we can focus resources on compute-intensive regions. AMR allows the user to perform simulations that are completely intractable on a uniform mesh. It has been studied extensively for more than two decades, and has proved to be successful in modeling multiscale phenomena for a variety of disciplines, including computational fluid dynamics, thermal dynamics, materials science, geophysics, meteorology, cosmology, astrophysics, and so on.

Generally, there are two kinds of AMR strategies: structured and unstructured. Structured AMR uses unions of regular meshes or cells to cover the computational domain, while unstructured AMR utilizes mesh distortion, which provides greater geometric flexibility at the cost of storing all neighborhood relations explicitly. In this dissertation, we focus on structured AMR, which is often implemented in two

Figure 2.1. A 2D block-structured adaptive mesh refinement example with a refinement factor r = 2.

ways: block-structured AMR [8, 18] and cell-based AMR [46, 88]. The former achieves high spatial resolution by inserting smaller meshes ("blocks") at places where high resolution is needed. The latter instead refines the computational domain on a cell-by-cell basis. In practice, these two methods use different data structures and very different methods for distributing the computational load across a large number of processes. In this section, we briefly review these two structured AMR approaches.

2.1.1 Block-structured AMR. The basic principle of block-structured AMR is straightforward. Initially, a uniform mesh is adopted for the entire computational domain. In the regions which require higher resolution, finer meshes ("blocks") are added. If some regions still need more resolution, even finer meshes are added. The computational domain is refined recursively in this way, and turns into a tree of meshes. The initial uniform mesh, as the tree's root, is at the top level. Each finer level decreases the mesh size by a factor r, which is called the refinement factor. Figure 2.1 shows a block-structured AMR example on a 2D mesh, including both the mesh hierarchy and the overall structure. As each mesh block is regular, the sequential

Figure 2.2. A 2D cell-based adaptive mesh refinement example: a quad tree with a refinement factor r = 2.

code of regular meshes can be reused for block-structured AMR implementations. Load balancing can be achieved by evenly distributing mesh blocks to processors.

2.1.2 Cell-based AMR. Cell-based AMR implements the AMR algorithm by performing grid refinement on each individual cell. If higher spatial resolution is required for a cell, then it is refined into smaller cells, which are at the finer level of the overall grid hierarchy. With each level up, the cell size is decreased by the refinement factor r. A simple example of cell-based AMR on a 2D mesh is shown in Figure 2.2. For simplicity, there are only 3 levels of cells, from level 0 to level 2. Initially, a uniform grid on level 0 covers the overall computational domain. The dotted cells need higher resolution, so they are further refined. Throughout the execution of the AMR application, the grid hierarchy changes adaptively, and the cells are organized in refinement trees [46, 88]. For each refinement tree, its root is the corresponding cell at level 0. A cell with children is a non-leaf cell; otherwise, it is a leaf cell. Compared with block-structured AMR, cell-based AMR is more flexible, and can easily adapt to high resolution in localized regions.
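To make the cell-by-cell refinement concrete, the following is a minimal Python sketch of a refinement tree with refinement factor r = 2. It illustrates the idea only and is not the ART data structure; the refinement criterion `near` is a made-up placeholder.

```python
# Minimal sketch of cell-based AMR: each refined cell splits into 2**d children
# (a quad in 2D, an oct in 3D); unrefined cells are the leaves of the tree.
class Cell:
    def __init__(self, center, size, level):
        self.center, self.size, self.level = center, size, level
        self.children = []                       # empty -> leaf cell

    def refine(self, needs_refinement, max_level):
        """Recursively subdivide wherever the refinement criterion asks for it."""
        if self.level >= max_level or not needs_refinement(self):
            return
        half = self.size / 2.0
        dim = len(self.center)
        for corner in range(2 ** dim):           # 4 children in 2D, 8 in 3D
            offset = [(((corner >> k) & 1) - 0.5) * half for k in range(dim)]
            child = Cell([c + o for c, o in zip(self.center, offset)],
                         half, self.level + 1)
            child.refine(needs_refinement, max_level)
            self.children.append(child)

    def leaves(self):
        if not self.children:
            yield self
        else:
            for c in self.children:
                yield from c.leaves()

# Toy usage: refine a unit square wherever a cell lies close to the point (0.3, 0.3).
root = Cell(center=[0.5, 0.5], size=1.0, level=0)
near = lambda cell: sum((a - b) ** 2 for a, b in zip(cell.center, (0.3, 0.3))) < cell.size ** 2
root.refine(near, max_level=4)
print(sum(1 for _ in root.leaves()), "leaf cells")
```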

2.2 The Adaptive Refinement Tree (ART) Code

Figure 2.3. An example of large-scale ART simulation. Three panels show the dark matter (left), cosmic gas (middle), and stars (right), respectively.

The ART code is an advanced hydro+N-body simulation tool for cosmological research. It simulates the evolution of the universe, or more specifically, the formation of stars and galaxies. It employs a combination of multi-level particle-mesh and shock-capturing Eulerian methods for simulating the evolution of dark matter and cosmic gas, respectively. High dynamic range is achieved by applying adaptive mesh refinement to both the gas dynamics and the gravity calculations. The ART code is distinguished from other cosmological simulation tools by the large number of physical processes it includes, which enable comprehensive simulation of cosmological phenomena and provide deep insight for cosmologists. Figure 2.3 shows the visualization of a large-scale ART simulation, including dark matter, cosmic gas, and stars.

In particular, ART is a hybrid MPI+OpenMP C program, with Fortran functions for compute-intensive routines. The MPI parallelization is used between separate compute nodes and the OpenMP parallelization is used inside a multicore node. This mixed-mode parallelization enables us to take full advantage of modern multicore architectures. The ART code employs the cell-based AMR algorithm, performs refinement locally on individual cells, and organizes cells in refinement trees. In order to model the universe, it adopts a cubic computational volume with a refinement factor of 2. For each cubic cell, the refinement operation evenly subdivides the cell

Figure 2.4. Flow of the ART code. The four steps in the dotted region are the major steps for evolving a time step at each level.

into 8 cells, namely an oct. Hence, the refinement tree is also called an oct-tree [46, 88]. In a typical cosmology simulation, the highest resolution regions can reach 7 to 10 refinement levels, resulting in a large range of dynamic multidimensional regions. With cell-based AMR, the ART code is able to control the computational mesh on the level of individual cells, such that the refinement mesh can easily be built and modified and, therefore, can effectively match the complex geometry of cosmologically interesting regions. In the following subsections, we briefly review its flow, domain decomposition, and communication pattern. Additional details about the ART code can be found in [47, 48, 97].

2.2.1 Flow of ART. Figure 2.4 shows the basic flow of the ART code. First, it reads input files, including parameter files and cosmology data. Second, it initializes the oct-tree and cell buffer. Then it checks whether the simulation time reaches the user-specified time limit. If yes, the simulation stops; otherwise the ART code performs load balance

Figure 2.5. The execution order for evolving time steps on three levels (refinement factor r = 2).

and simulates another iteration by evolving time steps for the overall computational domain, including the cells at all levels. For each level, the evolution of a time step mainly consists of four steps: collect boundary information, perform physics computation, project physics data to the coarser level, and adaptively refine/de-refine the cells. At the end of each iteration, the ART code generates output files, including log files and cosmology data files, if any.

Specifically, in each iteration of the simulation, ART evolves a time step dt for the overall computational domain, which is adaptively refined into multiple levels of cells during simulation. This refinement in the spatial domain is accompanied by a refinement in the time domain, where finer level meshes or cells evolve with a smaller time step according to the refinement factor. Since the ART code uses a refinement factor of 2, the time step size at level l is dt/2^l. As a result, ART evolves 2^l time steps at level l in each iteration. Figure 2.5 shows the recursive execution order for evolving these time steps on three levels. Basically, finer levels evolve first, then coarser levels follow. Except for the finest level, each level l evolves a new time step when and only when level l + 1 has evolved two time steps ahead.
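The recursive ordering can be expressed compactly. The sketch below is only an illustration of the rule just described (not ART code); for three levels it reproduces the 1st–7th ordering shown in Figure 2.5.

```python
# With refinement factor 2, advancing level l by one step first requires level l+1
# to advance two steps, so finer levels always run ahead of coarser ones.
def evolve(level, finest, order):
    if level < finest:
        evolve(level + 1, finest, order)     # level+1 evolves two steps first
        evolve(level + 1, finest, order)
    order.append(level)                      # then this level advances one step of size dt/2**level

order = []
evolve(0, 2, order)                          # three levels: 0, 1, 2
print(order)                                 # [2, 2, 1, 2, 2, 1, 0] -> 2**l steps at level l
```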

2.2.2 Domain Decomposition. The computational domain decomposition is critical to the performance of cosmology applications using AMR. In order to take advantage of the cell-based AMR and exploit the spatial locality, the ART code adopts a domain decomposition scheme [65] based on Hilbert's space-filling curve (SFC) [21]. The SFC is identified by a traversal of all the root cells according to their spatial coordinates. A parallel partition of root cells is obtained by dividing the curve into N_p (N_p is the number of processes) equal parts, where each part is weighted by the total workload of the corresponding computational domain. It is to be noted that each root cell keeps all its child cells at finer refinement levels as a single composite unit, thus being the basic object for domain decomposition.

Figure 2.6. A space-filling curve traversing a 2D mesh (left), and a parallel partition of the 2D adaptive mesh into four parts (right). The cells with the same color are assigned to the same process.

Figure 2.6 presents a space-filling curve on a 2D mesh and a parallel partition of cells into four parts. Because of the spatial locality preserved by the curve, each part is a continuous domain consisting of nearby cells, and this property also holds for the 3D case of the parallel partition of the cubic universe in ART. This SFC-based domain decomposition scheme enables efficient partitioning of the adaptive mesh by transforming a multidimensional problem (e.g., 2D or 3D) into a unidimensional one, and it has been widely employed in parallel AMR implementations [19, 20, 35, 47, 75].

Since the computational meshes and cosmological objects evolve dynamically during a cosmology simulation, the workload distribution between processes changes. In order to ensure load balance, ART regularly examines the workload distribution

during simulation, and performs domain decomposition to re-balance the workload when necessary.

2.2.3 Communication Pattern. With the SFC-based domain decomposition, each process of ART mainly performs computation for its local computational domain, and communicates with other processes to get the boundary information, which is the data associated with the external boundary cells of each process. It is worth noting that each process only keeps the data associated with its local computational domain, enabling a fully parallel solution for both computation and memory.

Updating the boundary information is the dominating communication routine of ART. Specifically, in order to exchange boundary information with other processes, each process first posts non-blocking receives (MPI_Irecv), then sends data by non-blocking sends (MPI_Isend), and finalizes the communication by an MPI_Waitall for all sends and receives. Such MPI communication is performed per time step at each level, and cannot be overlapped with computation to reduce runtime, because the physics computation of each process depends on updated boundary data. Generally, each process only communicates with a relatively small number of processes whose computational domains are nearby, and the amount of communication between two processes is mainly dependent on the number of boundary cells between their computational domains. Figure 2.7 shows the communication pattern of ART for a production simulation with 1536 processes. Each blue dot at (i, j) represents communication between processes i and j, and nz denotes the total number of blue dots, i.e., the total number of communicating process pairs. Clearly, each process only communicates with a few other processes, and most communication is between neighboring processes, since most blue dots are along the diagonal. The communication pattern may vary for different mesh structures and different numbers of processes, yet the sparse and diagonally dominant property always holds due to the spatial locality provided by the SFC-based domain decomposition.
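The following sketch illustrates this non-blocking exchange pattern using mpi4py rather than the C MPI calls in ART; the neighbor list and per-neighbor buffer sizes are hypothetical placeholders for the processes that actually share boundary cells.

```python
# Sketch of the Irecv/Isend/Waitall boundary update described above (mpi4py).
import numpy as np
from mpi4py import MPI

def exchange_boundaries(comm, neighbors, send_bufs, recv_counts):
    """Post all receives, then all sends, then wait for everything to complete."""
    recv_bufs = {p: np.empty(recv_counts[p], dtype=np.float64) for p in neighbors}
    reqs = [comm.Irecv([recv_bufs[p], MPI.DOUBLE], source=p, tag=0) for p in neighbors]
    reqs += [comm.Isend([send_bufs[p], MPI.DOUBLE], dest=p, tag=0) for p in neighbors]
    MPI.Request.Waitall(reqs)          # computation cannot proceed without updated buffers
    return recv_bufs

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    neighbors = [p for p in (rank - 1, rank + 1) if 0 <= p < size]   # toy 1D neighborhood
    send = {p: np.full(4, float(rank)) for p in neighbors}
    recv = exchange_boundaries(comm, neighbors, send, {p: 4 for p in neighbors})
    print(rank, {p: buf[0] for p, buf in recv.items()})
```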

Figure 2.7. The communication pattern of ART during a large-scale cosmology simulation with 1536 processes.

2.3 Performance Analysis

In order to study the scalability of the ART code, we instrument it with performance counters and timers, and perform simulations with practical cosmology datasets on various production HPC systems. In this section, we present the performance analysis of ART on two HPC systems.

2.3.1 Two Cluster Systems. The two cluster systems are the Intel 64 cluster Abe at the National Center for Supercomputing Applications (NCSA), and the Sun cluster Ranger at the Texas Advanced Computing Center (TACC). These systems are chosen because they have different hardware/software configurations, and they are representative HPC clusters widely used for scientific computing.

The first system is an Intel 64 cluster called Abe, which is located at NCSA. It was ranked 82nd in the TOP500 list [81] (June 2010) with a peak performance of 89.47 TFLOPS. Abe is equipped with an InfiniBand network and the Lustre parallel filesystem. It contains 9600 processing cores, arranged as 1200 symmetric multiprocessing (SMP) nodes. Each node has either 8GB or 16GB of memory, and two quad-core CPUs running at 2.33 GHz. Each CPU has 4MB of L2 cache. The operating system on Abe is Red Hat Enterprise Linux 4.

The second system is a Sun Constellation Linux cluster called Ranger, which is located at TACC. It was ranked 11th in the TOP500 list [81] (June 2010) with a peak performance of 579.4 TFlops. It is composed of 3936 nodes, with a total of 62,976 processing cores. The nodes are connected by a full-CLOS InfiniBand interconnect, which provides a 1GB/sec point-to-point bandwidth. Each node has 32GB of memory and four quad-core AMD 64-bit Barcelona CPUs. Each CPU has a clock frequency of 2.3 GHz and three levels of cache for fast memory access: 2MB of L3 cache shared by 4

cores, 512KB of L2 cache dedicated to each core, and 64KB of L1 cache. Ranger is also equipped with the Lustre file system and runs a Linux kernel.

2.3.2 Cosmology Dataset. A practical cosmology dataset of ART called gf25cv1 is employed in our experiments. This cosmological simulation uses a box of 36 comoving Mpc on a side, covered with a uniform top-level grid of cells. A small fraction of the volume (containing the Lagrangian regions of 5 galaxies with masses between and M⊙) is further resolved in the initial conditions by 3 additional levels, to the effective grid. This high resolution region is allowed to refine dynamically by another 6 levels (10 levels total, counting the top grid). This dataset is chosen because it is a practical dataset used in ongoing cosmological research, and it is similar to the datasets which will be used in the ultimate large-scale cosmology simulations in the future. In fact, the computational domain of a large-scale simulation can be 100 times the box size of gf25cv1, and these 100 box pieces will be made essentially independent of each other. Therefore, this dataset is representative of both current cosmology simulations and future large-scale simulations. The scalability analysis and performance optimization based on this dataset will be beneficial for general simulations. In practice, the simulation of this dataset evolves thousands of iterations, which take a large amount of runtime (up to a few months using hundreds of cores), so it is prohibitive to simulate all the iterations. In our experiments, we selectively simulate this dataset for a few iterations at different physical epochs with different grid resolutions for performance evaluation.

2.3.3 Experimental Setup. On Abe, we use the ART code to simulate gf25cv1 for a few iterations at a late physical epoch with a medium amount of particles. Here, only a medium amount of particles is employed for the experiments

on Abe, because the disk quota in the home directory of each user on Abe is only 50GB, and it cannot accommodate the data files for simulating more particles. The experiments are carried out with 256, 512, and 1024 processors (i.e., cores), respectively. As each node of Abe has 8 processors, there are actually 32, 64, and 128 nodes used in these experiments. Smaller numbers of nodes are not adopted for experimentation because they cannot accommodate the data for simulating this dataset. As ART is a hybrid MPI+OpenMP code and there are 8 processors available on each node of Abe, we assign an MPI process with 8 OpenMP threads to each node. This kind of configuration is employed for two reasons. First, this configuration often provides good performance, and it is adopted in most production simulations. Second, to simulate the practical cosmology dataset, each process of the ART code needs a large amount of memory. In practice, each node of the Abe machine cannot accommodate two or more MPI processes of ART. Therefore, we use one MPI process and 8 OpenMP threads on each node for performance analysis.

As Ranger has a sufficient disk quota per user, we simulate the dataset at an early physical epoch with a large amount of particles. A series of experiments with 256, 512, 1024, and 2048 processors (i.e., cores) has been performed. Similarly, we assign one MPI process to each node, and one OpenMP thread to each processor. Recall that each node of Ranger has 16 processors, so there are 16 OpenMP threads on each node.

2.3.4 Results. Figures 2.8 and 2.9 show the average runtime per process of the ART code on Abe and Ranger, respectively. The overall runtime of the ART code is divided into three parts: physics computation, MPI communication, and others. The physics computation time includes all the runtime spent on solving the physics equations and managing the cells; the MPI communication time is the time spent on message passing function calls; and the other runtime mainly includes I/O and load balance time.

Figure 2.8. Execution time of the ART code on Abe for simulating a single iteration of gf25cv1 at a late physical epoch with a medium amount of particles.

Figure 2.9. Execution time of the ART code on Ranger for simulating a single iteration of gf25cv1 at an early physical epoch with a large amount of particles.

Figure 2.10. Total parallel scaling of the ART code on Abe.

Since different simulations are performed on these two clusters, the absolute runtimes of these two sets of experiments are not comparable. As the "others" part consumes less than 10% of the total runtime, we focus on the analysis of the physics computation time and the MPI communication time. The scaling curves of these experiments are illustrated in Figures 2.10 and 2.11, respectively.

For the experiments on both Abe and Ranger, the physics computation time has a linear speedup as the number of processors increases. In comparison with the physics computation time, the MPI communication time has an inferior scaling trend on both clusters, and this inferior scaling results in the poor scaling of the total runtime. Basically, it is expected that the average physics computation time has good scaling, because the overall workload of the simulation is dependent on the given cosmology data. It is observed that the physics computation time of different processes often varies significantly, which causes a large amount of synchronization time (i.e., waiting time) between processes. As such synchronization time constitutes a large part of the MPI communication time, it is necessary to explore more efficient load balancing schemes, so that the physics computation time (i.e., workload) of different

Figure 2.11. Total parallel scaling of the ART code on Ranger.

processes can be well balanced and the synchronization time can be reduced. Moreover, it would also be beneficial to explore topology-aware mapping techniques for reducing the communication cost (i.e., data transmission time).

CHAPTER 3
PERFORMANCE EMULATION

In this chapter, we present the design of the performance emulator for cell-based AMR cosmology simulations, and evaluate three load balancing schemes via the performance emulator.

3.1 Proposed Approach

The execution time of a cosmology simulation can be divided into three parts: physics computation, MPI communication, and others. Here, the physics computation time includes the runtime spent on solving physics equations and managing the cells; the MPI communication time is the time spent on MPI function calls; and the other runtime mainly includes I/O and load balance time. As the sum of physics computation and MPI communication time accounts for more than 95% of the overall runtime, we focus on these two parts during the design of the emulator. In the following, we first present our performance models to estimate physics computation time and MPI communication time, and then describe the performance emulator.

3.1.1 Performance Models. Runtime performance models are built to estimate physics computation time and MPI communication time. Considering that different levels have different resolutions and time step sizes, our focus is to build models to estimate these runtimes for each time step per level. During simulation, each application process stores two types of cells: the cells within its computational domain, namely local cells, and the external neighboring cells of its computational domain, namely buffer cells. Each process conducts physical calculation on local cells, and communicates with other processes to get buffer cell data, which serve as important boundary information for simulating its local computational domain. These two types of cells are the main indicators of physics computation time and MPI communication time, respectively.

Physics Computation Time. In order to characterize the physics computation time accurately, we use principal component analysis (SPCA) [73] to analyze the runtime, and observe that the numbers of local cells and particles are the dominant terms in determining the physics computation time. This is expected, because the ART code solves two kinds of physics equations at each level: the physics equations for hydrodynamics, and the physics equations for the N-body simulation. The former is only solved for leaf local cells, and the solution is projected to obtain the solution of non-leaf local cells at the coarser level, while the latter is solved for particles. Thus, we use the following linear model for the physics computation time T_P of each level:

    T_P = T_P^{nllc} + T_P^{llc} + T_P^{particle}
        = w_1 N_{nllc} + w_2 N_{llc} + w_3 N_{particle},    (3.1)

where w_i (i = 1, 2, 3) are constant coefficients for the level of interest. T_P^{nllc}, T_P^{llc}, and T_P^{particle} denote the computation time of non-leaf local cells, leaf local cells, and particles, respectively. N_{nllc}, N_{llc}, and N_{particle} denote the numbers of non-leaf local cells, leaf local cells, and particles, respectively. Note that Eq. (3.1) is defined for each level of each process, while all the processes share common constant coefficients for each level. To extract these coefficients, we use the cell and particle counts along with the physics computation time at each level of each process to formulate Eq. (3.1), and then solve a linear system in the form of Ax = b for each level. As the number of equations is usually larger than the number of coefficients, we apply linear regression to compute the least-squares fit of these coefficients.
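As an illustration of the coefficient extraction (a sketch, not the emulator's implementation), the per-level least-squares fit can be written as follows; the sample counts and timings are made up.

```python
# Fit the shared per-level coefficients w1..w3 of Eq. (3.1) from per-process samples.
import numpy as np

def fit_level_coefficients(n_nllc, n_llc, n_particle, t_physics):
    A = np.column_stack([n_nllc, n_llc, n_particle])      # one row per (process, sample)
    w, *_ = np.linalg.lstsq(A, t_physics, rcond=None)     # least-squares solution of A x = b
    return w

def predict_physics_time(w, counts):
    return float(np.dot(w, counts))                       # evaluate Eq. (3.1) for one process/level

# Toy data: 6 samples generated from "true" coefficients plus noise.
rng = np.random.default_rng(0)
counts = rng.integers(1_000, 50_000, size=(6, 3)).astype(float)
t_meas = counts @ np.array([2e-6, 5e-6, 1e-6]) + rng.normal(0, 1e-3, 6)
w = fit_level_coefficients(counts[:, 0], counts[:, 1], counts[:, 2], t_meas)
print("fitted coefficients:", w)
print("predicted time:", predict_physics_time(w, [20_000, 30_000, 10_000]))
```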

MPI Communication Time. The MPI communication time can be further divided into two parts: data transmission time and synchronization time. Data transmission time is simply the runtime spent on transmitting data between processes. As updating the boundary information is a major communication routine and the amount of boundary data for each process is proportional to its buffer cell count, the data transmission time is largely dependent on the number of buffer cells. In our model, data transmission time is modeled as follows:

    T_trans = t_s + n · t_c,    (3.2)

where t_s can be considered as the latency for message passing, t_c is the inverse of the bandwidth, and n is the data size for one data transmission. The latency and bandwidth can be obtained by using the Intel MPI Benchmarks (IMB) [43], and the number of transmitted bytes can be calculated using the number of buffer cells.

Synchronization time is incurred when processes do not start their communication routines simultaneously. For example, consider two processes with a maximum refinement level of 3 and 6, respectively, and assume that these two processes need to exchange boundary information from level 0 to 3. The first process starts evolving time steps at level 3, while the second process starts at level 6. Once the first process finishes a time step at level 3, it sits idle until the second process catches up to it, and then they can communicate to exchange boundary information. Even if two processes have the same refinement level, when there is load imbalance, a process may still have to sit idle waiting for the boundary information from the other process. Such extra waiting time is recorded as part of the MPI communication time, because the process is stalled while executing MPI function calls, and we refer to it as synchronization time. The MPI communication time, including both data transmission time and synchronization time, can be characterized by emulating time steps, as detailed in the next subsection.
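A minimal sketch of Eq. (3.2), assuming the latency and inverse bandwidth are extracted from two ping-ping style measurements; all numbers below are made up.

```python
# Estimate t_s and t_c from two (message size, time) measurements, then predict T_trans.
def fit_linear_cost(size1, time1, size2, time2):
    t_c = (time2 - time1) / (size2 - size1)   # inverse bandwidth (seconds per byte)
    t_s = time1 - t_c * size1                 # latency
    return t_s, t_c

def transmission_time(n_bytes, t_s, t_c):
    return t_s + n_bytes * t_c                # T_trans = t_s + n * t_c

t_s, t_c = fit_linear_cost(1_024, 12e-6, 1_048_576, 1.06e-3)
print(transmission_time(64 * 1_024, t_s, t_c))   # predicted time for a 64 KB boundary message
```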

Figure 3.1. Flow of the performance emulator of ART.

To build these models, we need performance data so as to extract the model coefficients. In practice, we can either use the performance data from previous simulations, or conduct a few iterations of the simulation to collect performance data for model construction.

3.1.2 Emulator Design. Figure 3.1 presents the design of our emulator. Comparing Figure 3.1 and Figure 2.4, we can see that the emulator uses the performance models to estimate physics computation and MPI communication time for each time step, rather than conducting the actual simulation. During each iteration, we use a load balancer to determine the workload distribution among processes, and then emulate time steps in exactly the same order as the ART code. For each time step, the emulator first analyzes the communication relationships among processes. Specifically, for each process, the emulator derives the amount of data that needs to be transmitted to and received from all the other processes. Next, according to the cell and particle counts of each process and the communication relationships, the emulator estimates

physics computation time and data transmission time using the performance models shown in Equations (3.1) and (3.2), respectively. Finally, the emulator estimates the total runtime. This is achieved by maintaining a time axis for each process to record the computation and communication intervals. Figure 3.2 shows the time axes of two processes, P0 and P1. The lengths of the computation time intervals are the estimated physics computation times, while the arrows represent data transmissions. Clearly, the MPI communication time intervals are determined by the computation time intervals and data transmissions. For example, the second MPI communication time interval of P0 can be computed as T_c = t_2 - t_1 = (t_0 + T_trans) - t_1, and the start time of P0 for the communication in the next time step is t_3 = t_2 + T_P. Therefore, using the estimated physics computation time and data transmission time for each time step of each process, the emulator is able to estimate the MPI communication time, including both data transmission time and synchronization time, without evaluating the synchronization time separately. In summary, by emulating time steps with the performance models to characterize runtime components, the emulator can estimate the performance of the ART code without executing physics solvers.

Figure 3.2. Part of the time axes of two processes.
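The time-axis bookkeeping can be sketched for two processes as follows. This is a simplified illustration of the interval arithmetic described above (one exchange per step and a single transmission time), not the emulator's actual code; all inputs are made-up estimates.

```python
# Two-process time-axis emulation: waiting (synchronization) time falls out of the
# interval arithmetic, so it never has to be evaluated separately.
def emulate_pair(comp_p0, comp_p1, t_trans):
    clock = [0.0, 0.0]               # current position on each process's time axis
    comm = [0.0, 0.0]                # accumulated MPI time (transfer + waiting)
    for c0, c1 in zip(comp_p0, comp_p1):
        ready = [clock[0] + c0, clock[1] + c1]       # when each process finishes computing
        for me, other in ((0, 1), (1, 0)):
            arrived = ready[other] + t_trans         # partner's boundary data arrives
            finished = max(ready[me], arrived)       # idle time appears implicitly here
            comm[me] += finished - ready[me]         # e.g. T_c = (t_0 + T_trans) - t_1
            clock[me] = finished                     # adding the next T_P gives t_3 = t_2 + T_P
    return comm

print(emulate_pair([1.0, 1.2], [0.6, 0.7], t_trans=0.05))   # the faster process accumulates waiting time
```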

3.2 Load Balancing Schemes

One of the major design goals of our performance emulator is to evaluate the performance of different load balancing schemes, thus avoiding time-consuming and complicated implementation in code without knowing the potential effects of the modification. The ART code performs cosmology simulations using a cubic computational domain, which represents the universe. The computational domain is initially divided into many uniform cubic cells at level 0. We refer to such cells at level 0 as root cells, since they are the roots of oct-trees. Each root cell keeps all its child cells at finer refinement levels as a single composite unit, thus being the basic unit for load balancing. In this section, we present three representative load balancing schemes, which will be evaluated later in Section 3.3.

3.2.1 SFC-Based Load Balancing Scheme (SFCLB). Currently, the ART code employs a load balancing scheme based on the Hilbert space-filling curve (SFC) [21]. We denote this scheme as SFCLB. It assigns a unique SFC ID to each root cell according to its spatial coordinates, then generates an SFC curve by connecting root cells with continuous SFC IDs, and finally divides the SFC curve into N_p (N_p is the number of processes) segments with similar amounts of workload. Specifically, this scheme considers the total workload of each root cell, and adopts a greedy algorithm to split the SFC curve, so that the workload can be evenly distributed among all the processes. Currently, SFCLB is widely used for parallel AMR [19, 35, 75]. One salient feature of the SFCLB scheme is its good spatial locality, where each process gets root cells with continuous SFC IDs, resulting in a continuous computational domain. However, SFCLB restricts the assignment of root cells to processes by the SFC curve, and does not consider the communication among processes when splitting the SFC curve.
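A minimal sketch of the greedy curve splitting, assuming the root cells are already ordered by SFC ID; the particular greedy rule used here is one reasonable choice and may differ from the heuristic in the ART code.

```python
# Cut the SFC into N_p contiguous segments whose workloads are close to the average.
def split_sfc(workloads, n_procs):
    """workloads[i] is the workload of the root cell with SFC ID i."""
    target = sum(workloads) / n_procs
    owner, rank, acc = [], 0, 0.0
    for w in workloads:
        # close the current segment if adding this cell would move it away from the target
        if rank < n_procs - 1 and acc and abs(acc - target) <= abs(acc + w - target):
            rank, acc = rank + 1, 0.0
        owner.append(rank)
        acc += w
    return owner                      # owner[i] = process owning root cell i

print(split_sfc([5, 1, 1, 1, 4, 4, 1, 1, 2, 4], n_procs=4))   # -> per-process loads of 6, 6, 6, 6
```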

3.2.2 Graph Partitioning-Based Load Balancing Scheme (GraphLB). Graph partitioning is an alternative approach for the load balancing of cell-based AMR applications. We denote it as GraphLB. To apply the graph method, we need to map the load balancing problem into a graph partitioning problem. One straightforward mapping is to use vertices to represent root cells, and edges to represent communication relationships between neighboring root cells. The weight of each vertex is the workload of the corresponding root cell. Although this mapping does make sense, it results in a large graph, which is difficult to partition with an acceptable amount of runtime and memory. For example, in a medium-sized cosmology simulation, a cubic computational domain of root cells maps into a graph with vertices and about edges. Partitioning such a large graph is almost prohibitive. Therefore, we must reduce the graph size for efficient partitioning. In our preliminary experiments, it is observed that only the root cells in a few localized regions are deeply refined to finer levels, while most root cells are not refined. Thus, we use a single vertex to represent the unrefined root cells with continuous SFC IDs, and create the edges accordingly in order to generate a manageable graph. The assignment of root cells to processes can be obtained by partitioning the vertices in the graph into N_p partitions. Note that graph partitioning algorithms typically minimize the total edge-cut subject to the constraint that the partitions are of similar size. When they are applied to the load balancing of cell-based AMR applications, we are actually minimizing the amount of communication among processes while ensuring that the workload is well balanced. In our implementation, we use the graph partitioning tool METIS [60] to partition the graph.
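The graph-size reduction can be sketched as follows: consecutive unrefined root cells along the SFC are collapsed into a single weighted vertex, while refined root cells remain individual vertices. This is only an illustration, not the GraphLB implementation; edge construction and the subsequent call to a partitioner such as METIS are omitted.

```python
# Collapse runs of unrefined root cells into single vertices to obtain a small weighted graph.
def reduce_root_cells(workloads, refined):
    """workloads[i] and refined[i] describe the root cell with SFC ID i."""
    vertices, weights = [], []              # vertices[v] = SFC IDs merged into vertex v
    for sfc_id, (w, is_refined) in enumerate(zip(workloads, refined)):
        prev_is_run = vertices and not refined[vertices[-1][-1]]
        if is_refined or not prev_is_run:
            vertices.append([sfc_id])       # refined cell, or start of a new unrefined run
            weights.append(w)
        else:
            vertices[-1].append(sfc_id)     # extend the current run of unrefined cells
            weights[-1] += w
    return vertices, weights

cells_refined = [False, False, True, True, False, False, False, True]
cells_work    = [1, 1, 9, 7, 1, 1, 1, 5]
print(reduce_root_cells(cells_work, cells_refined))
# -> ([[0, 1], [2], [3], [4, 5, 6], [7]], [2, 9, 7, 3, 5])
```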

3.2.3 Group-Based Load Balancing Scheme (GroupLB). The aforementioned two load balancing schemes only consider the total workload of each root cell. However, as the ART code evolves time steps for each level, it is also critical to balance the workload of each level in order to reduce the synchronization time. Besides, it is observed that the communication at deep refinement levels usually results in a large synchronization cost, so it is important to minimize the communication at deep refinement levels. To meet such requirements, we design a new load balancing scheme called GroupLB. This scheme first assigns neighboring root cells into groups, where each group has the lowest possible boundary level and satisfies a set of group workload constraints to control the granularity. The assignment of root cells to groups is based on the friends-of-friends algorithm [29]. Second, GroupLB assigns root cell groups to processes by solving a constrained bin packing problem to balance the workload of each level, where each bin corresponds to a process. Specifically, we sort the groups in non-increasing order according to their workload, and pack them into bins sequentially. To achieve good spatial locality, we compute the distances between groups using Voronoi tessellations [86], and try to assign each group to the process which holds its neighboring groups. In this way, GroupLB is able to achieve good level-by-level load balance and spatial locality.
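A simplified sketch of the packing step: groups are handled in non-increasing order of workload and assigned to a process that already holds a neighboring group whenever that does not overload it, otherwise to the least-loaded process. Group construction (friends-of-friends) and the per-level constraints are omitted; the neighbor map and the 10% slack factor are assumptions of this sketch, not details of GroupLB.

```python
# Greedy packing of root-cell groups onto processes, trading locality against balance.
def pack_groups(group_loads, neighbors, n_procs):
    """neighbors[g] = set of group IDs adjacent to group g."""
    target = sum(group_loads) / n_procs          # average workload per process
    load = [0.0] * n_procs
    owner = {}
    for g in sorted(range(len(group_loads)), key=lambda i: -group_loads[i]):
        w = group_loads[g]
        nearby = {owner[h] for h in neighbors[g] if h in owner}
        fitting = [q for q in nearby if load[q] + w <= 1.1 * target]
        pool = fitting if fitting else range(n_procs)
        p = min(pool, key=lambda q: load[q])     # least-loaded acceptable candidate
        owner[g] = p
        load[p] += w
    return owner, load

loads = [8.0, 6.0, 5.0, 4.0, 3.0, 2.0]
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(pack_groups(loads, adj, n_procs=3))        # per-process loads end up near the average
```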

3.3 Experiments

In this section, we present two sets of experiments. In the first experiment, we assess the accuracy of the emulator. In the second experiment, we examine the load balancing schemes using the emulator with realistic cosmology data, and discuss their advantages and shortcomings.

3.3.1 Experimental Setup. We instrument the ART code with performance counters and timers for analysis. We use a real cosmological simulation of a box of 36 comoving Mpc on a side, covered with a uniform top-level grid of root cells. This simulation represents a scaled-down version of our future petascale cosmological simulations. In fact, the petascale simulation is 100 times the volume of this dataset, and these 100 box pieces are essentially independent of each other. Our testbed is the Intel 64 cluster Abe located at NCSA [77]. It is equipped with an InfiniBand network and the Lustre parallel file system. Each node has either 8 GB or 16 GB of memory, and two quad-core CPUs running at 2.33 GHz. As ART is a hybrid MPI+OpenMP code and there are 8 cores available on each node of Abe, we assign one MPI process with 8 OpenMP threads to each node.

3.3.2 Accuracy of Performance Emulator. To measure the accuracy of the emulator, we conduct experiments using the emulator with the current load balancing scheme of ART, SFCLB, and compare the emulated performance results with the actual runtime of the ART code. As the emulator evolves time steps without running the physics solvers, it only takes a few minutes to estimate the performance of ART. Figure 3.3 presents the comparison of actual runtime and emulated runtime. Here, the runtime includes physics computation time and MPI communication time. The difference between the actual runtime and the emulated runtime is within 12%. The error is mainly due to some extra communication routines in the ART code other than those exchanging boundary information between processes. The emulator does not model such extra communication because it will be removed in the next version of ART. Therefore, the emulator is sufficiently accurate. More importantly, we notice that both curves in the figure have exactly the same trend. This further indicates that our emulator can accurately predict the performance and scalability of realistic cosmology simulations, and the performance analysis based on this emulator is reliable.

3.3.3 Evaluation of Load Balancing Schemes. In this set of experiments, we assess the three load balancing schemes presented in Section 3.2 for cosmology simulations by means of the emulator. For the same cosmology dataset described in Section 3.3.1, we test two different resolution cases: the coarse resolution case and the fine resolution case.

Figure 3.3. Comparison of actual runtime and emulated runtime of ART.

The coarse resolution case represents a simulation with an intermediate resolution, which reaches a maximum refinement level of 6. The fine resolution case represents a simulation with an extremely high resolution, which is allowed to refine dynamically to level 9. We use three metrics, namely execution time, load balance ratio, and communication time per level, for evaluation. Specifically, the load balance ratio represents the quality of the workload distribution among processes. It is defined as

    \text{Load Balance Ratio} = \frac{\frac{1}{N_p} \sum_{i=0}^{N_p - 1} W_i}{\max_{0 \le i \le N_p - 1} W_i} \times 100\%,

where W_i is the workload of process i and N_p is the number of processes. Note that the Load Balance Ratio is always smaller than or equal to 100%, and a value closer to 100% indicates better load balance quality.
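A minimal sketch of evaluating this metric from per-process workloads (the example numbers are illustrative):

    # Sketch: compute the Load Balance Ratio from per-process workloads,
    # following the definition above.

    def load_balance_ratio(workloads):
        n_p = len(workloads)
        average = sum(workloads) / n_p
        return average / max(workloads) * 100.0

    # Example: four processes with workloads 90, 100, 110, 140 give
    # an average of 110 and a ratio of 110 / 140 * 100% = 78.6%.
    print(round(load_balance_ratio([90, 100, 110, 140]), 1))   # 78.6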

Figure 3.4. Comparison of emulated runtime by using different load balancing schemes for the coarse resolution case.

Figures 3.4 and 3.5 present the emulated runtime obtained with the different load balancing schemes for the coarse resolution and fine resolution cases, respectively. In both figures, the runtime of the GroupLB scheme is smaller than that of the other two schemes, especially in the coarse resolution case. In Figure 3.5, we notice that as the number of processors increases, the three curves trend toward the same value. This trend is caused by granularity, which is determined by the maximum workload of the root cells. In the fine resolution case, there are several root cells whose individual workload is much larger than the average workload per process when running on 512 and 1024 processors, thus introducing significant synchronization cost.

Table 3.1 presents the overall Load Balance Ratio among processes for the different load balancing schemes. The closer the metric is to 100%, the better the load balance. Both SFCLB and GroupLB achieve better load balance for the coarse resolution case than for the fine resolution case. This is because the finer grid resolution increases the workload of root cells and results in larger granularity, which makes load balancing more challenging. GraphLB behaves differently since its primary objective is to minimize the communication among processes instead of balancing the workload. Comparing these schemes, GroupLB achieves the best load balance for the coarse resolution case and for the first two tests of the fine resolution case. For the fine resolution case with 512 and 1024 processors, all three schemes fail to provide good load balance because of the granularity problem.

Figure 3.5. Comparison of emulated runtime by using different load balancing schemes for the fine resolution case.

Table 3.1. Overall Load Balance Ratio of Different Load Balancing Schemes

    Number of        Coarse Resolution Case             Fine Resolution Case
    Processors    SFCLB     GraphLB    GroupLB      SFCLB     GraphLB    GroupLB
       192                   64.89%     96.63%      90.21%     85.48%     91.17%
       256                   73.54%     96.43%      71.35%     78.81%     88.51%
       512                   53.76%     95.84%      53.74%     54.02%     51.73%
      1024                   42.74%     92.21%      26.98%     27.01%     26.75%

Figure 3.6. Load Balance Ratio of each level for the coarse resolution case.

Figures 3.6 and 3.7 show the Load Balance Ratio of each level for the coarse resolution case and the fine resolution case, respectively. In the coarse resolution tests, the Load Balance Ratio of GraphLB is clearly smaller than that of the other two schemes at almost all levels; SFCLB and GroupLB provide similar load balance at the deep refinement levels 4 to 6, and GroupLB also delivers a well-balanced workload distribution at levels 0 to 3. In the fine resolution tests, the schemes achieve comparable load balance at levels 6 to 9, while GroupLB is more effective in balancing the workload at the lower levels. In general, GroupLB achieves better level-by-level load balance than SFCLB and GraphLB for both the coarse and fine resolution cases.

Figures 3.8 and 3.9 illustrate the average communication time of each level for the two resolution cases, respectively. In Figure 3.8, as the number of processors increases, the level-by-level communication time decreases. Generally, GroupLB incurs less level-by-level communication time than SFCLB and GraphLB because it achieves better load balance at each level.

Figure 3.7. Load Balance Ratio of each level for the fine resolution case.

In Figure 3.9, the communication time does not scale down with the increasing number of processors, because the granularity problem results in poor load balance. GroupLB and GraphLB have much smaller communication time than SFCLB at levels 7 to 9 since both methods try to reduce communication. However, for 512 and 1024 processors, GroupLB and GraphLB still incur large communication time due to the granularity.

In summary, by comparing these three load balancing schemes, we conclude that GroupLB provides the best performance. It achieves good load balance quality by balancing both the overall and the level-by-level workload, and minimizes communication cost by preserving spatial locality. While SFCLB maintains an overall load balance and keeps spatial locality using the SFC curve, it does not take level-by-level load balance into consideration, thereby introducing non-trivial synchronization cost. Although GraphLB minimizes communication cost, it does not provide satisfactory load balance quality.

Figure 3.8. Average communication time of each level for the coarse resolution case.

Figure 3.9. Average communication time of each level for the fine resolution case.

CHAPTER 4
OVERVIEW OF TOPOLOGY MAPPING

Supercomputers with up to hundreds of thousands of cores have been built to fulfill the demands of large-scale and complex scientific applications. These machines usually have sparse network topologies with large network diameters (i.e., the maximum distance between two nodes). As the system size scales up, the diameter of the interconnection network increases, and the bisection bandwidth (i.e., the minimum total bandwidth of the links connecting one half of the supercomputer to the other) often decreases. As a result, communication in the network becomes increasingly expensive due to the large distance between nodes and network contention, which becomes a scaling bottleneck for parallel applications. Topology-aware task mapping is an essential technique for communication optimization of parallel applications on high performance computing platforms. According to the topology of the interconnection network and the communication pattern of the application, it maps parallel application tasks onto processors to reduce communication cost. The following sections briefly review topology-aware task mapping techniques.

4.1 Background

Today's supercomputers often use fat-tree, mesh or torus network topologies. A fat-tree network uses switches to connect compute nodes into a tree structure [55]. The compute nodes are at the leaves, while intermediate nodes represent switches. From the leaves to the root, the available bandwidth of the links increases, i.e., the links become fatter. InfiniBand networks are representative examples of the fat-tree topology. To optimize communication on a fat-tree, we prefer local communication over global communication for lower latency.

An n-dimensional mesh network connects compute nodes into a mesh structure, where each node is directly connected to two other nodes in each physical dimension. An n-dimensional torus network is a mesh with additional links connecting the pair of nodes at the ends of each physical dimension [1], so that the network diameter is halved. In practice, the 3-dimensional (3D) torus is commonly used in supercomputers, including the IBM Blue Gene family and the Cray XT family, and it can easily be configured into a mesh when required. In petascale machines with 3D torus networks, the network diameter can be a few dozen hops. As studied in [12], for small and medium sized messages, the message latency depends on the distance in hops. More importantly, [12] also shows that congestion in the network can increase message latencies significantly. For better communication performance in torus networks, it is essential to reduce network congestion by mapping heavily communicating tasks onto neighboring nodes.
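To make the notion of hop distance concrete, the following minimal sketch computes the shortest-path hop count between two nodes of a torus from their coordinates; the coordinates and dimensions are illustrative, and a mesh would simply use the absolute coordinate difference in each dimension instead of the wraparound term.

    # Sketch: shortest-path hop count between two nodes of a d-dimensional
    # torus, given their coordinates and the torus dimensions. In each
    # dimension, the shorter of the two directions (with wraparound) is taken.

    def torus_distance(a, b, dims):
        hops = 0
        for x, y, size in zip(a, b, dims):
            delta = abs(x - y)
            hops += min(delta, size - delta)
        return hops

    # Example: on an 8 x 8 x 8 torus, nodes (0, 0, 0) and (7, 4, 1)
    # are 1 + 4 + 1 = 6 hops apart thanks to the wraparound links.
    print(torus_distance((0, 0, 0), (7, 4, 1), (8, 8, 8)))   # 6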

In general, it is often advantageous to consider the network topology when mapping application tasks onto nodes, so that the communication cost can be reduced. For communication-intensive applications, it is indispensable to exploit topology-aware task mapping techniques for better scaling toward petascale computing platforms. Beyond the performance improvement, a proper mapping often reduces the total amount of communication in the network, thereby reducing the power spent on network transmission. As power consumption becomes increasingly critical in modern supercomputers, topology-aware mapping techniques are also favorable due to these potential power savings. However, in practice, it is very challenging to perform topology-aware mapping for parallel applications at runtime, since we have to solve the complicated mapping problem, which is detailed in the next section.

4.2 The Topology Mapping Problem

For topology-aware mapping, the communication relation between parallel application tasks is represented as a task graph, which is also called a communication graph. It is to be noted that each task is an MPI process. The network topology of the target computing platform is represented as a topology graph. The problem of mapping parallel application tasks onto a computing platform with a specific topology can be considered as a graph embedding problem [4], where a mapping of tasks onto the target topology is an embedding of the task graph onto the topology graph.

Specifically, the task graph G_t(V_t, E_t) is an undirected graph which characterizes the communication pattern of the parallel application. Each vertex in V_t represents a task (i.e., a process). The edges in E_t represent the communication between tasks, and each edge (u, v) ∈ E_t has a weight c(u, v), which denotes the amount of communication (in bytes) between the two tasks represented by vertices u and v. The topology graph G_p(V_p, E_p) represents the network topology of the target computing platform. Typically, a vertex in V_p denotes a processor (i.e., a compute node), and each edge in E_p denotes a direct link. The definition of the topology graph can vary for different network topologies. For example, the topology graph of a fat-tree network may also need to include switches. In order to model various topologies, [38] introduces a generic definition of the topology graph; [83] utilizes edge weights to model the communication cost between processors. The mapping of tasks onto processors is specified by a function ϕ : V_t → V_p.

There are at least four metrics to evaluate the quality of a mapping, including the widely used hop-bytes [12] and dilation, and the recently proposed maximum interconnective message size [26] and worst-case congestion [38]. Hop-bytes is the total amount of inter-processor communication (in bytes) weighted by the distance (in hops) between processors. Recall that the edge weight c(u, v) denotes the amount of communication (in bytes) on edge (u, v) ∈ E_t. Let d(ϕ(u), ϕ(v)) be the distance, usually measured by the length of the shortest path (in hops) between ϕ(u) and ϕ(v) in the topology graph G_p(V_p, E_p). Then the hop-bytes of a mapping ϕ can be computed as

    \text{hop-bytes}(\varphi) = \sum_{(u,v) \in E_t} c(u, v) \, d(\varphi(u), \varphi(v)).

It represents the total communication volume in the interconnection network, and also indicates the amount of power consumed by data transmission in the network. The dilation of an edge (u, v) ∈ E_t is the length of the shortest path connecting ϕ(u) and ϕ(v) in G_p(V_p, E_p). The dilation of a mapping is the maximum dilation among all edges in E_t, and it measures the most stretched edge. The maximum interconnective message size is the maximum size of the messages transmitted in the network. It measures the mapping quality on interconnected multicore platforms, where each compute node is assigned several heavily communicating processes. The congestion of a link in the network is measured by the amount of traffic on that link divided by the link capacity (i.e., bandwidth). The worst-case congestion over all links denotes the worst-case contention in the network, and it is a lower bound on the communication time. The mapping problem is usually defined as finding a mapping that minimizes one or more of these metrics, i.e., hop-bytes, dilation, the maximum interconnective message size, or the worst-case congestion.
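As a small illustration (not part of any particular mapping library), the two classical metrics can be evaluated for a given mapping ϕ as follows; the graph representation and the dist function are assumptions of the sketch.

    # Sketch: evaluate hop-bytes and dilation for a given mapping phi.
    # task_edges: list of (u, v, c_uv) triples from the task graph
    # phi:        dict mapping each task to its processor
    # dist(p, q): shortest-path distance in hops in the topology graph
    #             (e.g., the torus distance sketched in Section 4.1)

    def hop_bytes(task_edges, phi, dist):
        return sum(c * dist(phi[u], phi[v]) for u, v, c in task_edges)

    def dilation(task_edges, phi, dist):
        return max(dist(phi[u], phi[v]) for u, v, _ in task_edges)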

Such a problem definition is based on certain assumptions, as summarized in [38]. First, intra-node communication is much faster than inter-node communication, so that heavily communicating tasks should be mapped onto the same compute node whenever possible. Second, we assume that switches are ideal, so it is not necessary to model the internal switch structure for task mapping. Third, oblivious routing is assumed, i.e., the route of a message does not depend on other ongoing traffic in the network. Fourth, the routing algorithm is assumed to be fixed and does not depend on the communication relation between processors; in other words, the mapping of tasks onto processors will not change the routing algorithm. These assumptions are often satisfied in many cases, and a fairly good mapping can be obtained by solving the task mapping problem.

In practice, the number of tasks (i.e., processes) is no less than the number of processors (i.e., compute nodes) used for running the application, i.e., |V_t| ≥ |V_p|. If each compute node executes a single process, we have |V_t| = |V_p|, and ϕ is a one-to-one mapping from V_t to V_p. If compute nodes execute more than one process, then |V_t| > |V_p|, and a graph partitioner [45, 60, 72] can be employed to divide the processes into groups before the mapping strategies are applied. The common case is that each compute node executes the same number of processes. Generally, finding the optimal mapping ϕ that minimizes some of the aforementioned metrics is NP-hard [14] or NP-complete [38]. A variety of algorithms have been explored for efficient task mapping, as described in the next section.

4.3 Related Works

Topology-aware task mapping has been extensively studied. Many algorithms have been proposed to map parallel application tasks onto different topologies. These algorithms typically fall into the categories of physical optimization techniques and heuristic approaches. Specifically, physical optimization techniques include simulated annealing [15, 16, 53], genetic algorithms [7, 23], graph contraction [9, 58] and particle swarm optimization [71]. Although these optimization algorithms often provide good mapping quality, they take a long time to derive the mapping, and hence are not suitable for efficient mapping at runtime; they are seldom used in practice. In contrast, many heuristic approaches can better exploit the structure of the topology graph and the communication pattern, and are thus more efficient for practical usage. The pioneering work [14] uses sequences of pairwise exchanges with probabilistic jumps to search for a good mapping. The subsequent research [54] proposes to derive an initial mapping first, and then use pairwise exchanges to refine the mapping.

However, it can take a long time for pairwise exchanges to reach good mapping solutions. Graph heuristics have also been investigated for task mapping. [70] generates an initial mapping by partitioning the task graph, where communicating processes are mapped onto the same processor or neighboring processors; the initial mapping is then further tuned by a boundary refinement procedure to improve load balance among processors. A recursive bipartitioning algorithm based on the Kernighan-Lin graph-bisection heuristic is proposed in [33] to map tasks onto a hypercube topology. The task graph is bipartitioned recursively with the objective of minimizing the communication between the two parts while maintaining a specified workload balance, and each task's processor assignment is gradually determined through partitioning the task graph at each level. This recursive bipartitioning technique is extended to handle general topologies in [66], where both the task graph and the topology graph are bipartitioned recursively to derive the mapping of tasks onto processors. It has been implemented in the software package SCOTCH [67, 72], and has proved successful in deriving static mappings.

The earlier works in the 1980s (e.g., [14, 33, 54, 70]) mainly consider hypercubes, array processors, or shuffle-exchange networks, which were popular topologies at that time. Their experimental evaluations are limited to relatively small problem sizes, since parallel computing platforms were then of small scale. With the advent of faster networks and wormhole routing, the communication latency was largely reduced and became distance-insensitive, so the research in this area died down. With the emergence of very large supercomputers in recent years, communication latency has again proved to be a performance bottleneck, leading to the necessity of topology-aware task mapping. However, the network topologies of modern supercomputers are often fat-trees and tori. Most earlier works are either unscalable or target different topologies, and cannot be used in the present context. Hence, recent studies continue to explore efficient task mapping techniques.

In the late 1990s, the Cray T3D and T3E machines first showed that communication performance depends on the mapping of tasks onto processors, and the impact of network contention and the benefit of topology-aware mapping were evaluated [27, 40, 62]. The later supercomputers from Cray (e.g., the Cray XT families) use faster interconnects, which has relieved this communication issue to some extent. However, the evaluation study with sophisticated benchmarks in [12] demonstrates that the communication issue still exists, and [89] shows the benefit of topology-aware job scheduling. Different from Cray machines, IBM supercomputers like the Blue Gene families acknowledge that message latencies depend on the distance, and encourage users to take advantage of the topology for performance optimization [2, 6, 42]. Their topologies have been utilized by both system designers [5, 28] and application developers [11, 13, 34, 36, 64]. Graph embedding schemes are exploited to map tasks onto nodes for 3D mesh and torus topologies in [74, 96], and have been employed for scalable support of the MPI topology functions in the IBM Blue Gene/L (BG/L) MPI library. A general method for task mapping on BG/L is proposed in [10]. As it uses Monte Carlo simulation to find a good mapping, it takes a long time and the solution is derived offline. Several mapping algorithms for regular 2D and 3D topologies are studied in [12]. MPI benchmarks are developed to evaluate the message latency in order to quantify the effects of network congestion, and geometric approaches are exploited to map regular 2D and 3D task graphs effectively. For example, the Maximum Overlap algorithm recursively overlaps the largest possible area of a 2D task graph with a 2D topology graph in order to derive a good mapping. The Expand From Corner algorithm starts at one corner of the task graph and of the topology graph, respectively, and sorts the tasks and processors according to the Manhattan distance. The tasks are mapped onto processors according to this order, so that communicating tasks can be placed on neighboring processors. A mapping algorithm

based on space-filling curves is also investigated. Both tasks and processors are ordered into a one-dimensional list according to the space-filling curve, and tasks are mapped onto processors following this order. Moreover, generic mapping techniques are studied in [38], including a greedy heuristic, a recursive bipartitioning approach, and a mapping strategy based on graph similarity. These have been implemented in the topology mapping library [57], and the derived solution can be further improved by simulated annealing. Results on fat-tree, torus, and the PERCS network topologies show that these techniques can effectively reduce network congestion, and the third approach using graph similarity takes the minimum runtime, thus being a good candidate for time-efficient mapping on large-scale systems. [3] proposes an iterative approach to map tasks onto processors heuristically. Using an estimation function that estimates how critical it is to map an unallocated task onto an available processor, it iteratively selects a pair of an unallocated task and an available processor for mapping. For systems with a hierarchical communication architecture, e.g., clusters of SMP (Symmetric Multiprocessor) nodes, [83] employs graph partitioning to derive an appropriate mapping. In order to solve large-scale mapping problems, a hierarchical mapping algorithm is proposed in [25]. The tasks are first divided into groups called supernodes, then these groups are mapped onto the target topology to derive an initial mapping, and finally the mapping of the tasks within each group is improved by fine tuning. For task mapping on multicore (or multiprocessor) clusters, the tasks are usually partitioned into subsets, which are mapped onto nodes by using ordinary mapping techniques. [31] evaluates the performance of different mappings using benchmark applications. [26] introduces the maximum interconnective message size (MIMS) to measure the quality of a mapping, and designs packing algorithms to find the optimal mapping.

Beyond the aforementioned works on inter-node mapping, intra-node mapping has also been studied. Jeannot et al. [44] proposed the TreeMatch algorithm for mapping processes onto cores within a multicore compute node, and this algorithm was used in [59] to support the MPI_Dist_graph_create function. Rashti et al. [69] proposed to use a weighted graph to model the whole physical topology of the computing system, including both the inter-node topology and the intra-node topology, and used the recursive bipartitioning algorithm in SCOTCH [67] to find a proper mapping. Although this is an effective approach to tackle both inter-node mapping and intra-node mapping in a uniform way, it has a few limitations. First, this approach cannot exploit the structure of the physical topology, e.g., the network hierarchy of the fat-tree topology, the cache hierarchy within a compute node, or the regularity of a mesh/torus. As a result, both its overhead and its mapping quality can be further improved. Second, in some cases, a weighted graph may not be a proper representation of the physical topology. For example, the InfiniBand network topology and the intra-node topology can be better represented by topological trees (without edge weights), and the topology of non-contiguous nodes allocated to a user application on Cray XT5 machines cannot be modeled by a sparse weighted graph straightforwardly.

In summary, the majority of works focus on inter-node mapping only and overlook the mapping within a compute node. Generic mapping algorithms like recursive bipartitioning [38, 66] and greedy heuristics [3, 38] can find good mappings, but they are likely to be outperformed by other heuristics like graph partitioning [76] (according to the fat-tree topology) and graph embedding [96] (onto a regular mesh/torus), which are well designed to exploit the structure of specific mapping problems. Typically, a proper inter-node mapping reduces the communication between nodes by placing heavily communicating processes on the same compute node, leading to a

large amount of intra-node communication. It then becomes important to optimize the intra-node mapping in order to achieve the best performance. Although the TreeMatch algorithm [44] has been proposed as an effective heuristic for intra-node mapping, to the best of our knowledge, there is little work on concurrent support of both inter-node mapping and intra-node mapping except [69], which does not exploit the structure of the problem.

CHAPTER 5
HIERARCHICAL TOPOLOGY MAPPING FOR ART

In this chapter, we present the hierarchical topology mapping algorithm for reducing the communication cost of ART on interconnected multiprocessor clusters.

5.1 Problem Statement

We consider the problem of mapping parallel ART processes onto interconnected multiprocessor clusters as illustrated in Figure 5.1. Each node has several multicore CPUs. The communication pattern of ART is represented as a task graph G_t(V_t, E_t), which can be extracted from the decomposed subdomains. Each vertex in V_t denotes a process, and each edge (u, v) ∈ E_t represents the communication between processes u and v. A weight c(u, v) is introduced for each edge to denote the amount of communication in bytes between the respective processes. The topology of a multiprocessor cluster is characterized by a topology graph G_p(V_p, E_p), where each vertex in V_p represents a node, and each edge in E_p denotes the link between the respective nodes. We also introduce edge weights for the topology graph to represent the distance in hops between nodes, so that both direct and indirect links can be modeled properly. The mapping of tasks onto nodes is specified by a function ϕ : V_t → V_p. As each node can accommodate multiple processes, without loss of generality, we assume that |V_t| is a multiple of |V_p|, so that each node is assigned |V_t|/|V_p| processes. There are several different metrics to evaluate the quality of a mapping, as discussed in Chapter 4. We choose hop-bytes in order to reduce the total amount of traffic in the interconnection network.

To the best of our knowledge, most studies focus on mapping tasks onto nodes (i.e., a single-level mapping), while the mapping of tasks within nodes is left undefined.

Figure 5.1. Interconnected multiprocessor clusters with multicore CPUs on each node.

A proper inter-node mapping typically reduces the communication between nodes by grouping the most heavily communicating processes within nodes, leading to a large amount of intra-node communication. Once the inter-node mapping has been optimized properly, it becomes critical to optimize the intra-node mapping by exploiting the topology within a node. Hence, we optimize both inter-node mapping and intra-node mapping in a hierarchical manner.

5.2 Proposed Approach

Typically, inter-node communication is more expensive than intra-node communication. Likewise, within a node, inter-socket communication is often more expensive than intra-socket communication.¹ To fully exploit these communication characteristics on multicore clusters, we propose to perform task mapping hierarchically in two phases: first, perform inter-node mapping; second, perform intra-node mapping within each multiprocessor node.

¹ This is our observation from experiments on several production multiprocessor clusters. The performance of intra-node communication is determined by the specific MPI implementation.

In the first phase, the mapping of tasks onto nodes can be derived by using conventional task mapping techniques. In the second phase, a novel technique is required to map tasks onto the multicore CPUs within each node.

5.2.1 Inter-Node Mapping. We employ the recursive bipartitioning heuristic [33, 38, 66] for mapping ART processes onto multiprocessor nodes, aiming at minimizing hop-bytes. This approach solves the task mapping problem in a divide-and-conquer manner. It performs recursive bipartitioning on both the task graph and the topology graph, and maps subsets of processes to subsets of nodes until a final mapping is obtained. Many topologies and communication patterns can be handled by recursive bipartitioning, and it has proved to be a successful task mapping technique in the software package SCOTCH [67].

Figure 5.2 presents our recursive bipartitioning algorithm for inter-node mapping. Recall that both the task graph and the topology graph are weighted, and their edge weights represent the amount of communication in bytes between processes and the distance in hops between nodes, respectively. In order to reduce hop-bytes, it is critical to map heavily communicating processes onto the same multiprocessor node, or at least onto nearby nodes. To achieve this goal, the task graph is partitioned with the minimum edge-cut, while the topology graph is split with the maximum edge-cut. In each step of the algorithm, the resulting subsets of processes (V_t0, V_t1) and subsets of nodes (V_p0, V_p1) can be mapped in two ways: the direct mapping V_t0 → V_p0, V_t1 → V_p1; and the exchanged mapping V_t0 → V_p1, V_t1 → V_p0. To choose a proper mapping, we heuristically estimate the cost of these two mappings. Let C_u be the amount of communication in bytes associated with process u, and D_i be the aggregate distance between node i and the other nodes.

Algorithm 1 Recursive Mapping
Input: task graph G_t(V_t, E_t), topology graph G_p(V_p, E_p).
Output: mapping ϕ : V_t → V_p.

    recursive_mapping(V_t, V_p)
    {
        if (|V_p| = 1) {
            ϕ(u) = V_p, for each process u ∈ V_t;
            return;
        }
        (V_t0, V_t1) ← bipartition(G_t(V_t, E_t));
        (V_p0, V_p1) ← bipartition(G_p(V_p, E_p));
        Calculate C_0, C_1, D_0, D_1;
        if (C_0 D_0 + C_1 D_1 ≤ C_0 D_1 + C_1 D_0) {
            recursive_mapping(V_t0, V_p0);
            recursive_mapping(V_t1, V_p1);
        } else {
            recursive_mapping(V_t0, V_p1);
            recursive_mapping(V_t1, V_p0);
        }
    }

Figure 5.2. The recursive bipartitioning algorithm for inter-node mapping.

Let C_k and D_k be the averages of C_u and D_i over the subsets of processes V_tk and the subsets of nodes V_pk (k = 0, 1), respectively:

    C_u = \sum_{(u,v) \in E_t} c(u, v),
    D_i = \sum_{\text{node } j \neq i} d(i, j),
    C_k = \frac{1}{|V_{tk}|} \sum_{u \in V_{tk}} C_u,
    D_k = \frac{1}{|V_{pk}|} \sum_{i \in V_{pk}} D_i.

The cost of the direct mapping is estimated by (C_0 D_0 + C_1 D_1), while that of the exchanged mapping is (C_0 D_1 + C_1 D_0). This estimated cost can be considered a prediction of hop-bytes. The mapping with the smaller estimated cost is selected in order to minimize hop-bytes.

Theorem 1. The time complexity of recursive bipartitioning is O((|E_t| + |E_p|) log |V_p|), and the time complexity of cost estimation is O((|V_t| + |V_p|) log |V_p|) based on precomputed C_u and D_i.

Proof 1. The multilevel k-way partitioning scheme [45] computes a bipartition of a graph G(V, E) in O(|E|) time. The depth of the recursive bipartitioning is log_2 |V_p|, and the size of the graph is halved in each step. Hence, the total runtime for partitioning the task graph G_t(V_t, E_t) is

    \sum_{k=0}^{\log_2 |V_p| - 1} 2^k \, O(|E_t|)/2^k = O(|E_t| \log |V_p|).

Similarly, the topology graph G_p(V_p, E_p) is recursively bipartitioned in O(|E_p| log |V_p|) time. Thus the overall runtime of recursive bipartitioning is O((|E_t| + |E_p|) log |V_p|). For cost estimation, both C_u and D_i can be precomputed and then used to calculate C_k and D_k in the different recursive mapping calls. The C_u of all processes can be computed in O(|E_t|) time. The distance between a pair of nodes can be obtained by using platform-dependent techniques; e.g., the pairwise node distance in a 3D mesh

or 3D torus network can be computed from the coordinates. The D_i of all nodes can be calculated in O(|V_p|^2) time using the pairwise node distances. In the recursive mapping procedure, the runtime for computing C_k and D_k is

    \sum_{k=0}^{\log_2 |V_p| - 1} 2^k \, O(|V_t| + |V_p|)/2^k = O((|V_t| + |V_p|) \log |V_p|).

In order to reduce the runtime of recursive mapping, the original task graph G_t(V_t, E_t) is partitioned into |V_p| equal parts by minimizing inter-part communication, where each part has |V_t|/|V_p| processes. Then an induced task graph Ĝ_t(V̂_t, Ê_t), which represents the communication between the groups of processes, is used for efficient recursive mapping.

Theorem 2. With the induced task graph Ĝ_t(V̂_t, Ê_t), the time complexity of recursive bipartitioning is O((|Ê_t| + |E_p|) log |V_p|), and the time complexity of cost estimation is O(|V_p| log |V_p|).

We use the graph partitioning tool hmetis [37] for partitioning the original task graph, and for the recursive bipartitioning of both the induced task graph and the topology graph. The resulting subsets of processes and subsets of nodes may be unbalanced in some rare cases; a greedy approach is applied to achieve a balanced bipartition by moving the vertex which leads to the optimal edge-cut. In practice, if |V_p| ≠ 2^k, the number of nodes in some step of the recursive mapping will be odd; in this case the cost estimation is not required, and a direct mapping which maps the process groups to equal numbers of nodes is adopted. The mapping produced by the recursive bipartitioning algorithm could be further improved to reduce hop-bytes: the local search algorithm in [25] and the heuristic in [38] with the threshold accepting technique can be employed to further optimize the mapping, but such mapping optimization has not been exploited in this work.
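To summarize the cost-estimation step in code, the following minimal sketch chooses between the direct and the exchanged mapping, assuming the per-process communication volumes C[u] and the per-node aggregate distances D[i] have been precomputed as defined above; it is an illustration of the heuristic, not the actual implementation.

    # Sketch: the cost-estimation step of Figure 5.2. Given the two process
    # subsets and the two node subsets produced by bipartitioning, return the
    # pairing with the smaller estimated hop-bytes.

    def average(values):
        values = list(values)
        return sum(values) / len(values)

    def choose_orientation(Vt0, Vt1, Vp0, Vp1, C, D):
        # C: dict process -> communication volume C_u
        # D: dict node    -> aggregate distance D_i
        C0, C1 = average(C[u] for u in Vt0), average(C[u] for u in Vt1)
        D0, D1 = average(D[i] for i in Vp0), average(D[i] for i in Vp1)
        if C0 * D0 + C1 * D1 <= C0 * D1 + C1 * D0:
            return [(Vt0, Vp0), (Vt1, Vp1)]   # direct mapping
        else:
            return [(Vt0, Vp1), (Vt1, Vp0)]   # exchanged mapping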

5.2.2 Intra-Node Mapping. The motivation of intra-node mapping is to exploit the performance gap between intra-socket communication and inter-socket communication for performance optimization. For each multiprocessor node, we choose to minimize the maximum inter-socket message size (MIMS), which is computed as

    \text{MIMS} = \max_{(u,v) \in E_t} c(u, v),

subject to the condition that processes u and v are on different CPU sockets of the same node. MIMS is the maximum size of the process-pairwise bidirectional messages transmitted between CPU sockets. The resulting mapping places heavily communicating processes on the same multicore CPU.

Figure 5.3 shows a sketch of our algorithm for intra-node mapping. It utilizes the intra-node task graph G̃_t(Ṽ_t, Ẽ_t), which represents the communication between processes within a multiprocessor node, and maps processes onto multicore CPUs. We assume that the number of processes |Ṽ_t| is a multiple of ncpus (the number of CPUs), so that each CPU is assigned |Ṽ_t|/ncpus processes. The intra-node mapping is performed in two steps: first, an initial mapping by graph partitioning; second, fine tuning with a greedy heuristic. The processes are partitioned into ncpus equal parts P_i (0 ≤ i < ncpus) by minimizing inter-part communication, and each part is mapped onto the corresponding CPU. In each iteration of fine tuning, the edge (u, v) ∈ Ẽ_t resulting in MIMS is identified, i.e., u ∈ P_i, v ∈ P_j, i ≠ j, MIMS = c(u, v).

Algorithm 2 Intra-Node Mapping
Input: intra-node task graph G̃_t(Ṽ_t, Ẽ_t).
Output: mapping of processes onto multicore CPUs.

    Partition the intra-node task graph into ncpus equal parts P_i (0 ≤ i < ncpus);
    Map the processes in P_i onto CPU i;
    Loop
        Identify the edge (u, v) ∈ Ẽ_t leading to MIMS, i.e., u ∈ P_i, v ∈ P_j, i ≠ j, MIMS = c(u, v);
        If the minimum IMS of exchanging a pair of processes to group both u and v onto either CPU i or CPU j is smaller than MIMS
            Exchange the pair of processes with the minimum IMS;
        Else
            Break;
        End If
    End Loop

Figure 5.3. The algorithm for intra-node mapping by minimizing the maximum inter-socket message size (MIMS).

Then we evaluate whether we can group both processes u and v onto CPU i or CPU j by exchanging a pair of processes to reduce MIMS. The resulting inter-socket message size (IMS) of exchanging process u ∈ P_i with a process w ∈ P_j \ {v} can be computed as

    \text{IMS}(u \leftrightarrow w) = \max\Big( \max_{x \in P_i,\, x \neq u} c(u, x),\; \max_{x \in P_j,\, x \neq w} c(w, x) \Big).

Likewise, the resulting IMS of exchanging process v ∈ P_j with a process w ∈ P_i \ {u} can also be computed. The minimum IMS of exchanging a pair of processes is derived by evaluating all possible process pairs: (u ↔ w), u ∈ P_i, w ∈ P_j \ {v}, and (w ↔ v), w ∈ P_i \ {u}, v ∈ P_j. If it is less than MIMS, we exchange the pair of processes with the minimum IMS; otherwise, the algorithm terminates.

The initial mapping can be obtained in O(|Ẽ_t|) time by using the multilevel k-way partitioning algorithm [45]. In each iteration of fine tuning, the edge (u, v) leading to MIMS can be identified in O(|Ẽ_t|) time. In order to evaluate the resulting IMS of exchanging process pairs, for each process in P_i and P_j, the maximum amount of communication with another process in the same part needs to be computed. This procedure takes O(|Ẽ_ti| + |Ẽ_tj|) time, where Ẽ_ti and Ẽ_tj are the sets of edges representing the communication between processes within P_i and P_j, respectively. It then requires O(|Ṽ_t|/ncpus) comparisons to find the minimum IMS of exchanging two processes.

Theorem 3. The overall time complexity of each iteration of fine tuning is O(|Ẽ_t| + |Ẽ_ti| + |Ẽ_tj| + |Ṽ_t|/ncpus).

hmetis [37] is employed to partition the intra-node task graph into balanced parts for the initial mapping, which is further improved through fine tuning.
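The fine-tuning loop can be sketched as follows; this is a simplified illustration of Algorithm 2 under the assumptions that comm[u][v] stores the intra-node communication volume between processes u and v (with no self-entries) and part[u] stores u's current CPU socket, and the iteration cap is an added safeguard that is not part of the original algorithm.

    # Sketch: greedy fine tuning that repeatedly tries to group the two
    # endpoints of the largest inter-socket message onto the same socket.

    def mims_edge(comm, part):
        """Return (size, u, v) of the largest inter-socket message."""
        best = (0, None, None)
        for u in comm:
            for v, c in comm[u].items():
                if part[u] != part[v] and c > best[0]:
                    best = (c, u, v)
        return best

    def ims_after_swap(comm, part, u, w):
        """IMS estimate for exchanging u and w: each process's largest
        message to its current socket becomes inter-socket after the swap."""
        max_u = max((c for x, c in comm[u].items() if part[x] == part[u]), default=0)
        max_w = max((c for x, c in comm[w].items() if part[x] == part[w]), default=0)
        return max(max_u, max_w)

    def fine_tune(comm, part, max_iter=50):
        for _ in range(max_iter):
            mims, u, v = mims_edge(comm, part)
            if u is None:
                break                        # no inter-socket communication left
            # Candidate swaps that would place u and v on the same socket.
            candidates = [(u, w) for w in part if part[w] == part[v] and w != v]
            candidates += [(v, w) for w in part if part[w] == part[u] and w != u]
            if not candidates:
                break
            a, b = min(candidates, key=lambda s: ims_after_swap(comm, part, *s))
            if ims_after_swap(comm, part, a, b) >= mims:
                break                        # no improving exchange: terminate
            part[a], part[b] = part[b], part[a]
        return part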

It is worth noting that the initial mapping has the minimum total amount of inter-socket communication, and that the fine tuning procedure usually terminates in a few iterations.

5.3 Performance Evaluation

Experimental Setup. Experiments are carried out on two production multiprocessor clusters, NICS Kraken [78] and TACC Ranger [79], which have different network topologies. Kraken is a Cray XT5 system with a 3D torus interconnect topology. It comprises 9,408 compute nodes, and each node contains two 2.6 GHz six-core AMD Opteron processors. Ranger is a Sun Constellation Linux cluster with a full-CLOS fat-tree topology. It has a total of 3,936 compute nodes, and each node has four 2.3 GHz quad-core AMD Opteron processors.

A communication benchmark similar to IMB (Intel MPI Benchmarks) [43] is designed to measure the intra-socket communication time and the inter-socket communication time for performance analysis. Note that IMB cannot be employed for this purpose because we would like to test all possible communicating pairs of processors. We are particularly interested in the performance of PingPing, since the communication between ART processes can be viewed as concurrent PingPing operations between many process pairs.

To evaluate the performance of the proposed hierarchical task mapping algorithm, we extract the communication part of a production ART simulation (with a box of 36 comoving Mpc on a side and a uniform top-level mesh of root cells) for the tests. For comparison, we evaluate five different mapping mechanisms: (1) the system default mapping, which is topology-agnostic; (2) an optimized mapping using the MPI topology function MPI_GRAPH_CREATE [61]; (3) the pure inter-node mapping using the algorithm in Section 5.2.1; (4) the pure intra-node mapping using the algorithm in Section 5.2.2; (5) the hierarchical mapping integrating both inter-node mapping and intra-node mapping.

Table 5.1. Average Intra-Socket and Inter-Socket Communication Time (PingPing)
(Columns: No. of Bytes; No. of Repetitions; Kraken (us): Intra, Inter, Difference; Ranger (us): Intra, Inter, Difference.)

Hop-bytes, the maximum inter-socket message size (MIMS) and communication time are used as evaluation metrics.

All the experiments were conducted in production mode without dedicated nodes, and other users were sharing the interconnection network. For each set of experiments with a particular number of processes, we obtain the topology information between nodes at runtime, generate the different mappings by using the proposed algorithms, and run all the tests with the different mappings in a single batch script. It is worth noting that different runs often get nodes with different pairwise distances, and the interference from other running applications may also differ. Hence, the results obtained from different runs may not be comparable.

Results. Table 5.1 presents the average intra-socket and inter-socket PingPing communication time over all possible communicating processor pairs. The overhead of inter-socket communication compared to intra-socket communication is reported under the columns labeled Difference, and the larger difference values between Kraken and Ranger are highlighted in bold. Clearly, Ranger shows a larger performance gap between intra-socket and inter-socket communication for most message sizes, because it has a more complicated intra-node topology. As shown in Figure 5.4, there are four interconnected CPU sockets on Ranger and no direct link exists between socket 0 and socket 3, whereas there are only two CPU sockets on each node of Kraken. This architectural difference results in different communication savings when applying the pure intra-node mapping and the hierarchical mapping (see Figures 5.8 and 5.12).

Figure 5.5 compares the pure inter-node mapping and the system default mapping on Kraken in terms of hop-bytes. The inter-node mapping algorithm (listed in Figure 5.2) effectively reduces hop-bytes for all the test cases, and the maximum reduction is up to 59%.

Figure 5.4. The intra-node topology of Ranger (from the TACC Ranger website [79]).

Figure 5.5. Comparison of inter-node mapping and default mapping on Kraken in terms of hop-bytes.

Figure 5.6. Comparison of intra-node mapping and default mapping on Kraken in terms of MIMS (the maximum inter-socket message size).

Figure 5.7. Comparison of hierarchical mapping and inter-node mapping on Kraken in terms of MIMS (the maximum inter-socket message size).

Figure 5.8. Communication time reduction of different mapping mechanisms compared to default mapping on Kraken.

Figure 5.6 compares the pure intra-node mapping and the system default mapping on Kraken in terms of MIMS, and Figure 5.7 compares the hierarchical mapping and the pure inter-node mapping on Kraken in terms of MIMS. The intra-node mapping algorithm (listed in Figure 5.3) reduces MIMS substantially, by up to 83%. For the first test case with 24 processes, the system default mapping happens to have the minimum MIMS, and the MIMS of the pure inter-node mapping is also close to the minimum, so no MIMS reduction or only a limited reduction can be achieved.

The communication time reduction of the different mapping mechanisms compared to the system default mapping is illustrated in Figure 5.8, where MPI_Graph, Intra-Node, Inter-Node and Hierarchical represent the MPI topology mapping, the pure intra-node mapping, the pure inter-node mapping and the hierarchical mapping, respectively. MPI_Graph does not achieve much performance improvement except for the first test case with 24 processes, and it fails for the largest test with 1,536 processes.

Figure 5.9. Comparison of inter-node mapping and default mapping on Ranger in terms of hop-bytes.

The pure intra-node mapping often provides only minor performance improvement, while the pure inter-node mapping performs much better. In contrast, the hierarchical mapping always outperforms both the pure intra-node mapping and the pure inter-node mapping, achieving a communication time reduction of up to 25%.

The experiments on Ranger show similar performance results, as illustrated in Figures 5.9 to 5.12. For all the test cases, the inter-node mapping algorithm (listed in Figure 5.2) is able to reduce hop-bytes by up to 76%, and the intra-node mapping algorithm (listed in Figure 5.3) reduces MIMS by up to 79%. MPI_Graph provides similar performance to the system default mapping. Both the pure intra-node mapping and the pure inter-node mapping achieve communication time reduction for most test cases. For the test with 32 processes, the pure inter-node mapping results in more communication time despite the reduced hop-bytes. This phenomenon is mainly attributable to the fact that intra-node PingPing communication can be slower than nearby inter-node PingPing communication for large message sizes on Ranger. This is due to the small memory bandwidth per core on Ranger, and possibly also to an inefficient implementation of the intra-node communication algorithms in the mvapich library.

Figure 5.10. Comparison of intra-node mapping and default mapping on Ranger in terms of MIMS (the maximum inter-socket message size).

Figure 5.11. Comparison of hierarchical mapping and inter-node mapping on Ranger in terms of MIMS (the maximum inter-socket message size).

Figure 5.12. Communication time reduction of different mapping mechanisms compared to default mapping on Ranger.

The hierarchical mapping often achieves much better performance than both the pure intra-node mapping and the pure inter-node mapping, reducing the communication time by up to 50%.

By comparing Figure 5.8 and Figure 5.12, we can observe that the hierarchical mapping is much more effective on Ranger (relative to the performance of the pure inter-node mapping), and that the pure intra-node mapping generally achieves a larger communication time reduction (in percentage) on Ranger, because Ranger has a larger performance gap between intra-socket and inter-socket communication, as shown in Table 5.1. Essentially, this performance gap indicates how critical it is to perform intra-node task mapping, and we can expect intra-node task mapping to become increasingly important as more processors are included in each node.

CHAPTER 6
NETWORK AND MULTICORE AWARE TOPOLOGY MAPPING

In this chapter, we introduce a topology mapping methodology and a set of mapping algorithms which consider both the network topology and the hierarchical architecture of multicore compute nodes to find a proper mapping.

6.1 Proposed Methodology

Figure 6.1. High performance computing systems with multicore CPU(s) on each compute node.

We consider the problem of mapping parallel application processes onto HPC systems with multicore compute nodes (see Figure 6.1), where each compute node is assigned multiple MPI processes. The physical topology of the computing system consists of two parts: the network topology (i.e., the inter-node topology) and the intra-node topology. As they have different characteristics, which can be exploited by different mapping techniques, we propose to perform inter-node mapping and intra-node mapping in two phases, so that the mapping can be better optimized. This two-phase approach naturally exploits the hierarchy of the computing system, and is thus computationally efficient.
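A minimal sketch of this two-phase structure is shown below; the two phase functions are placeholders for the inter-node and intra-node algorithms developed in this chapter, and only the control flow is illustrated.

    # Sketch: the two-phase mapping driver. The phase functions are
    # assumed, illustrative callables, not part of the actual library.

    def two_phase_mapping(task_graph, network_topology, node_topology,
                          inter_node_mapping, intra_node_mapping):
        # Phase 1: assign processes to compute nodes using the network
        # (inter-node) topology.
        node_of = inter_node_mapping(task_graph, network_topology)

        # Phase 2: within each node, place its processes onto sockets/cores
        # using the intra-node topology.
        placement = {}
        for node in set(node_of.values()):
            local = [p for p, n in node_of.items() if n == node]
            placement.update(intra_node_mapping(task_graph, node_topology, local))
        return node_of, placement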


More information

Multicore Parallel Computing with OpenMP

Multicore Parallel Computing with OpenMP Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large

More information

HPC Deployment of OpenFOAM in an Industrial Setting

HPC Deployment of OpenFOAM in an Industrial Setting HPC Deployment of OpenFOAM in an Industrial Setting Hrvoje Jasak h.jasak@wikki.co.uk Wikki Ltd, United Kingdom PRACE Seminar: Industrial Usage of HPC Stockholm, Sweden, 28-29 March 2011 HPC Deployment

More information

Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC

Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC Outline Dynamic Load Balancing framework in Charm++ Measurement Based Load Balancing Examples: Hybrid Load Balancers Topology-aware

More information

Sun Constellation System: The Open Petascale Computing Architecture

Sun Constellation System: The Open Petascale Computing Architecture CAS2K7 13 September, 2007 Sun Constellation System: The Open Petascale Computing Architecture John Fragalla Senior HPC Technical Specialist Global Systems Practice Sun Microsystems, Inc. 25 Years of Technical

More information

Recommended hardware system configurations for ANSYS users

Recommended hardware system configurations for ANSYS users Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range

More information

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

More information

Symmetric Multiprocessing

Symmetric Multiprocessing Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called

More information

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Oracle Database Scalability in VMware ESX VMware ESX 3.5 Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

More information

OpenMP Programming on ScaleMP

OpenMP Programming on ScaleMP OpenMP Programming on ScaleMP Dirk Schmidl schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign

More information

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Stan Posey, MSc and Bill Loewe, PhD Panasas Inc., Fremont, CA, USA Paul Calleja, PhD University of Cambridge,

More information

1 Bull, 2011 Bull Extreme Computing

1 Bull, 2011 Bull Extreme Computing 1 Bull, 2011 Bull Extreme Computing Table of Contents HPC Overview. Cluster Overview. FLOPS. 2 Bull, 2011 Bull Extreme Computing HPC Overview Ares, Gerardo, HPC Team HPC concepts HPC: High Performance

More information

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Roy D. Williams, 1990 Presented by Chris Eldred Outline Summary Finite Element Solver Load Balancing Results Types Conclusions

More information

Reliable Systolic Computing through Redundancy

Reliable Systolic Computing through Redundancy Reliable Systolic Computing through Redundancy Kunio Okuda 1, Siang Wun Song 1, and Marcos Tatsuo Yamamoto 1 Universidade de São Paulo, Brazil, {kunio,song,mty}@ime.usp.br, http://www.ime.usp.br/ song/

More information

Real-Time Scheduling 1 / 39

Real-Time Scheduling 1 / 39 Real-Time Scheduling 1 / 39 Multiple Real-Time Processes A runs every 30 msec; each time it needs 10 msec of CPU time B runs 25 times/sec for 15 msec C runs 20 times/sec for 5 msec For our equation, A

More information

Planning the Installation and Installing SQL Server

Planning the Installation and Installing SQL Server Chapter 2 Planning the Installation and Installing SQL Server In This Chapter c SQL Server Editions c Planning Phase c Installing SQL Server 22 Microsoft SQL Server 2012: A Beginner s Guide This chapter

More information

Distributed communication-aware load balancing with TreeMatch in Charm++

Distributed communication-aware load balancing with TreeMatch in Charm++ Distributed communication-aware load balancing with TreeMatch in Charm++ The 9th Scheduling for Large Scale Systems Workshop, Lyon, France Emmanuel Jeannot Guillaume Mercier Francois Tessier In collaboration

More information

Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV)

Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV) Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Interconnection Networks 2 SIMD systems

More information

Big Graph Processing: Some Background

Big Graph Processing: Some Background Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

More information

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems 202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric

More information

Delivering Quality in Software Performance and Scalability Testing

Delivering Quality in Software Performance and Scalability Testing Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,

More information

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP

More information

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) ( TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx

More information

159.735. Final Report. Cluster Scheduling. Submitted by: Priti Lohani 04244354

159.735. Final Report. Cluster Scheduling. Submitted by: Priti Lohani 04244354 159.735 Final Report Cluster Scheduling Submitted by: Priti Lohani 04244354 1 Table of contents: 159.735... 1 Final Report... 1 Cluster Scheduling... 1 Table of contents:... 2 1. Introduction:... 3 1.1

More information

Cosmological simulations on High Performance Computers

Cosmological simulations on High Performance Computers Cosmological simulations on High Performance Computers Cosmic Web Morphology and Topology Cosmological workshop meeting Warsaw, 12-17 July 2011 Maciej Cytowski Interdisciplinary Centre for Mathematical

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

Methodology for predicting the energy consumption of SPMD application on virtualized environments *

Methodology for predicting the energy consumption of SPMD application on virtualized environments * Methodology for predicting the energy consumption of SPMD application on virtualized environments * Javier Balladini, Ronal Muresano +, Remo Suppi +, Dolores Rexachs + and Emilio Luque + * Computer Engineering

More information

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing /35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of

More information

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,

More information

A Pattern-Based Approach to. Automated Application Performance Analysis

A Pattern-Based Approach to. Automated Application Performance Analysis A Pattern-Based Approach to Automated Application Performance Analysis Nikhil Bhatia, Shirley Moore, Felix Wolf, and Jack Dongarra Innovative Computing Laboratory University of Tennessee (bhatia, shirley,

More information

Data Centric Systems (DCS)

Data Centric Systems (DCS) Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems

More information

Parallel Scalable Algorithms- Performance Parameters

Parallel Scalable Algorithms- Performance Parameters www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

Why the Network Matters

Why the Network Matters Week 2, Lecture 2 Copyright 2009 by W. Feng. Based on material from Matthew Sottile. So Far Overview of Multicore Systems Why Memory Matters Memory Architectures Emerging Chip Multiprocessors (CMP) Increasing

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

High Performance Computing in CST STUDIO SUITE

High Performance Computing in CST STUDIO SUITE High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver

More information

Understanding the Benefits of IBM SPSS Statistics Server

Understanding the Benefits of IBM SPSS Statistics Server IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster

More information

Fast Two-Point Correlations of Extremely Large Data Sets

Fast Two-Point Correlations of Extremely Large Data Sets Fast Two-Point Correlations of Extremely Large Data Sets Joshua Dolence 1 and Robert J. Brunner 1,2 1 Department of Astronomy, University of Illinois at Urbana-Champaign, 1002 W Green St, Urbana, IL 61801

More information

MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp

MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

Cray: Enabling Real-Time Discovery in Big Data

Cray: Enabling Real-Time Discovery in Big Data Cray: Enabling Real-Time Discovery in Big Data Discovery is the process of gaining valuable insights into the world around us by recognizing previously unknown relationships between occurrences, objects

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

LS DYNA Performance Benchmarks and Profiling. January 2009

LS DYNA Performance Benchmarks and Profiling. January 2009 LS DYNA Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center The

More information

On-Demand Supercomputing Multiplies the Possibilities

On-Demand Supercomputing Multiplies the Possibilities Microsoft Windows Compute Cluster Server 2003 Partner Solution Brief Image courtesy of Wolfram Research, Inc. On-Demand Supercomputing Multiplies the Possibilities Microsoft Windows Compute Cluster Server

More information

Parallel Large-Scale Visualization

Parallel Large-Scale Visualization Parallel Large-Scale Visualization Aaron Birkland Cornell Center for Advanced Computing Data Analysis on Ranger January 2012 Parallel Visualization Why? Performance Processing may be too slow on one CPU

More information

Department of Computer Sciences University of Salzburg. HPC In The Cloud? Seminar aus Informatik SS 2011/2012. July 16, 2012

Department of Computer Sciences University of Salzburg. HPC In The Cloud? Seminar aus Informatik SS 2011/2012. July 16, 2012 Department of Computer Sciences University of Salzburg HPC In The Cloud? Seminar aus Informatik SS 2011/2012 July 16, 2012 Michael Kleber, mkleber@cosy.sbg.ac.at Contents 1 Introduction...................................

More information

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality

More information

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009 Performance Study Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009 Introduction With more and more mission critical networking intensive workloads being virtualized

More information

IS-ENES/PrACE Meeting EC-EARTH 3. A High-resolution Configuration

IS-ENES/PrACE Meeting EC-EARTH 3. A High-resolution Configuration IS-ENES/PrACE Meeting EC-EARTH 3 A High-resolution Configuration Motivation Generate a high-resolution configuration of EC-EARTH to Prepare studies of high-resolution ESM in climate mode Prove and improve

More information

A Flexible Cluster Infrastructure for Systems Research and Software Development

A Flexible Cluster Infrastructure for Systems Research and Software Development Award Number: CNS-551555 Title: CRI: Acquisition of an InfiniBand Cluster with SMP Nodes Institution: Florida State University PIs: Xin Yuan, Robert van Engelen, Kartik Gopalan A Flexible Cluster Infrastructure

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION. Marc-Olivier Briat, Jean-Luc Monnot, Edith M.

SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION. Marc-Olivier Briat, Jean-Luc Monnot, Edith M. SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION Abstract Marc-Olivier Briat, Jean-Luc Monnot, Edith M. Punt Esri, Redlands, California, USA mbriat@esri.com, jmonnot@esri.com,

More information

FLOW-3D Performance Benchmark and Profiling. September 2012

FLOW-3D Performance Benchmark and Profiling. September 2012 FLOW-3D Performance Benchmark and Profiling September 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: FLOW-3D, Dell, Intel, Mellanox Compute

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

PARALLEL PROGRAMMING

PARALLEL PROGRAMMING PARALLEL PROGRAMMING TECHNIQUES AND APPLICATIONS USING NETWORKED WORKSTATIONS AND PARALLEL COMPUTERS 2nd Edition BARRY WILKINSON University of North Carolina at Charlotte Western Carolina University MICHAEL

More information

Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012

Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012 Scientific Application Performance on HPC, Private and Public Cloud Resources: A Case Study Using Climate, Cardiac Model Codes and the NPB Benchmark Suite Peter Strazdins (Research School of Computer Science),

More information

FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG

FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG) Lehrstuhl für Informatik 10 (Systemsimulation) Massively Parallel Multilevel Finite

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

Scalability evaluation of barrier algorithms for OpenMP

Scalability evaluation of barrier algorithms for OpenMP Scalability evaluation of barrier algorithms for OpenMP Ramachandra Nanjegowda, Oscar Hernandez, Barbara Chapman and Haoqiang H. Jin High Performance Computing and Tools Group (HPCTools) Computer Science

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*

More information

Performance of the JMA NWP models on the PC cluster TSUBAME.

Performance of the JMA NWP models on the PC cluster TSUBAME. Performance of the JMA NWP models on the PC cluster TSUBAME. K.Takenouchi 1), S.Yokoi 1), T.Hara 1) *, T.Aoki 2), C.Muroi 1), K.Aranami 1), K.Iwamura 1), Y.Aikawa 1) 1) Japan Meteorological Agency (JMA)

More information

Control 2004, University of Bath, UK, September 2004

Control 2004, University of Bath, UK, September 2004 Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of

More information

BSC vision on Big Data and extreme scale computing

BSC vision on Big Data and extreme scale computing BSC vision on Big Data and extreme scale computing Jesus Labarta, Eduard Ayguade,, Fabrizio Gagliardi, Rosa M. Badia, Toni Cortes, Jordi Torres, Adrian Cristal, Osman Unsal, David Carrera, Yolanda Becerra,

More information

Modeling Parallel Applications for Scalability Analysis: An approach to predict the communication pattern

Modeling Parallel Applications for Scalability Analysis: An approach to predict the communication pattern Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'15 191 Modeling Parallel Applications for calability Analysis: An approach to predict the communication pattern Javier Panadero 1, Alvaro Wong 1,

More information

A Comparison of General Approaches to Multiprocessor Scheduling

A Comparison of General Approaches to Multiprocessor Scheduling A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University

More information

Contributions to Gang Scheduling

Contributions to Gang Scheduling CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,

More information

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage

More information

ParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008

ParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008 ParFUM: A Parallel Framework for Unstructured Meshes Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008 What is ParFUM? A framework for writing parallel finite element

More information

Introduction History Design Blue Gene/Q Job Scheduler Filesystem Power usage Performance Summary Sequoia is a petascale Blue Gene/Q supercomputer Being constructed by IBM for the National Nuclear Security

More information

Parallelism and Cloud Computing

Parallelism and Cloud Computing Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication

More information

Algorithms of Scientific Computing II

Algorithms of Scientific Computing II Technische Universität München WS 2010/2011 Institut für Informatik Prof. Dr. Hans-Joachim Bungartz Alexander Heinecke, M.Sc., M.Sc.w.H. Algorithms of Scientific Computing II Exercise 4 - Hardware-aware

More information