

PERFORMANCE ANALYSIS AND OPTIMIZATION OF LARGE-SCALE SCIENTIFIC APPLICATIONS

BY

JINGJIN WU

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the Illinois Institute of Technology

Approved: Advisor

Chicago, Illinois
July 2013


ACKNOWLEDGMENT

First and foremost, I would like to thank my advisor, Prof. Zhiling Lan. Without her guidance and support throughout my Ph.D. study, this dissertation would not have been accomplished. I appreciate all her research direction, technical criticism, and valuable feedback from innumerable office discussions. Her intelligence, wisdom, and diligence are contagious and motivational for me. I will always be in her debt and admire her for the great person she is. I would also like to thank Prof. Nickolay Y. Gnedin, Prof. Andrey V. Kravtsov, Dr. Douglas H. Rudd, and Dr. Roberto E. González, who have also guided me and worked with me on research related to this thesis. I am also thankful to my thesis committee, Prof. Xian-He Sun, Prof. Ioan Raicu, and Prof. Jia Wang, for their time, interest, suggestions, and comments. My special thanks go to my friends and colleagues in the SCS group for their friendship and encouragement. Last, but not least, I would like to thank my husband Xuanxing and our parents for their support at all times, and I hope that I can make them proud.

TABLE OF CONTENTS

ACKNOWLEDGMENT
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
ABSTRACT

CHAPTER
1. INTRODUCTION
   1.1. Motivation
   1.2. Contributions
   1.3. Thesis Organization
2. COSMOLOGY SIMULATIONS
   2.1. Adaptive Mesh Refinement
   2.2. The Adaptive Refinement Tree (ART) Code
   2.3. Performance Analysis
3. PERFORMANCE EMULATION
   3.1. Proposed Approach
   3.2. Load Balancing Schemes
   3.3. Experiments
4. OVERVIEW OF TOPOLOGY MAPPING
   4.1. Background
   4.2. The Topology Mapping Problem
   4.3. Related Works
5. HIERARCHICAL TOPOLOGY MAPPING FOR ART
   5.1. Problem Statement
   5.2. Proposed Approach
   5.3. Performance Evaluation
6. NETWORK AND MULTICORE AWARE TOPOLOGY MAPPING
   6.1. Proposed Methodology
   6.2. Inter-Node Mapping
   6.3. Intra-Node Mapping
   6.4. Performance Evaluation
7. ANALYTICAL TOPOLOGY MAPPING
   7.1. Preliminary
   7.2. Algorithm Overview
   7.3. Global Mapping
   7.4. Legalization
   7.5. Performance Evaluation
8. TOPOMAP: A TOPOLOGY MAPPING LIBRARY
9. CONCLUSION AND FUTURE WORK

BIBLIOGRAPHY

LIST OF TABLES

3.1 Overall Load Balance Ratio of Different Load Balancing Schemes
Average Intra-Socket and Inter-Socket Communication Time (Ping-Ping)
Properties of Sparse Matrices
Overhead of Topology Mapping on Stampede (Time in Seconds)
Overhead of Topology Mapping on Kraken by Using the Recursive Bipartitioning Mapping Algorithm (Time in Seconds)
Overhead of Topology Mapping on Kraken by Using the Recursive Tree Mapping Algorithm (Time in Seconds)
The Relations between Topology Mapping and VLSI Placement
Topology of the Blue Gene/P Supercomputer
Overhead of the Proposed Analytical Mapping Algorithm. # Proc.: the number of processes; Time: the total runtime in seconds; # Iter.: the number of iterations in the global mapping stage

LIST OF FIGURES

2.1 A 2D block-structured adaptive mesh refinement example with a refinement factor r = 2
2.2 A 2D cell-based adaptive mesh refinement example: a quad tree with a refinement factor r = 2
2.3 An example of large-scale ART simulation. Three panels show the dark matter (left), cosmic gas (middle) and stars (right), respectively
2.4 Flow of the ART code. The four steps in the dotted region are the major steps for evolving a time step at each level
2.5 The execution order for evolving time steps on three levels (refinement factor r = 2)
2.6 A space-filling curve traversing a 2D mesh (left), and a parallel partition of the 2D adaptive mesh into four parts (right). The cells with the same color are assigned to the same process
2.7 The communication pattern of ART during a large-scale cosmology simulation with 1536 processes
2.8 Execution time of the ART code on Abe for simulating a single iteration of gf25cv1 at a late physical epoch with a medium amount of particles
2.9 Execution time of the ART code on Ranger for simulating a single iteration of gf25cv1 at an early physical epoch with a large amount of particles
2.10 Total parallel scaling of the ART code on Abe
2.11 Total parallel scaling of the ART code on Ranger
3.1 Flow of performance emulator of ART
3.2 Part of the time axes of two processes
3.3 Comparison of actual runtime and emulated runtime of ART
3.4 Comparison of emulated runtime by using different load balancing schemes for coarse resolution case
3.5 Comparison of emulated runtime by using different load balancing schemes for fine resolution case
3.6 Load Balance Ratio of each level for coarse resolution case
3.7 Load Balance Ratio of each level for fine resolution case
3.8 Average communication time of each level for coarse resolution case
3.9 Average communication time of each level for fine resolution case
Interconnected multiprocessor clusters with multicore CPUs on each node
The recursive bipartitioning algorithm for inter-node mapping
The algorithm for intra-node mapping by minimizing the maximum inter-socket message size (MIMS)
The intra-node topology of Ranger (from TACC Ranger website [79])
Comparison of inter-node mapping and default mapping on Kraken in terms of hop-bytes
Comparison of intra-node mapping and default mapping on Kraken in terms of MIMS (the maximum inter-socket message size)
Comparison of hierarchical mapping and inter-node mapping on Kraken in terms of MIMS (the maximum inter-socket message size)
Communication time reduction of different mapping mechanisms compared to default mapping on Kraken
Comparison of inter-node mapping and default mapping on Ranger in terms of hop-bytes
Comparison of intra-node mapping and default mapping on Ranger in terms of MIMS (the maximum inter-socket message size)
Comparison of hierarchical mapping and inter-node mapping on Ranger in terms of MIMS (the maximum inter-socket message size)
Communication time reduction of different mapping mechanisms compared to default mapping on Ranger
High performance computing systems with multicore CPU(s) on each compute node
Topology mapping framework
A fat-tree network topology, and the neighbor joining tree representing the topology of the nodes allocated to a user application (light green)
6.4 The recursive tree mapping algorithm
A 2D mesh/torus topology, and the neighbor joining tree representing the topology of the nodes allocated to a user application (light green)
Recursive bipartitioning of a 2D mesh/torus topology. The nodes allocated to the user application are light green
The recursive bipartitioning mapping algorithm for inter-node mapping on mesh/torus topology
A machine architecture and its corresponding topology tree
Hop-bytes reduction of sparse matrix tests on Stampede
Communication time reduction of sparse matrix tests on Stampede
Hop-bytes reduction of sparse matrix tests on Kraken
Communication time reduction of sparse matrix tests on Kraken
Performance results of ART tests on Stampede
Performance results of ART tests on Kraken
The analytical mapping algorithm
The process migration algorithm
The communication pattern of SP, CG and ART (1024 processes). nz is the number of blue dots, which indicates the number of communicating process pairs
NPB benchmark results: SP
NPB benchmark results: CG
Cosmology application results
Flow of the topology mapping library TOPOMap

LIST OF SYMBOLS

Symbol      Definition
c(u, v)     The amount of communication in bytes between task u and v
E_p         The set of edges in the topology graph
E_t         The set of edges in the task graph
G_p         The topology graph
G_t         The task graph
V_p         The set of processors
V_t         The set of tasks
ϕ           The mapping from tasks in V_t to processors in V_p

ABSTRACT

Scientific applications are critical for solving complex problems in many areas of research, and often require a large amount of computing resources in terms of both runtime and memory. Massively parallel supercomputers with ever increasing computing power are being built to satisfy the needs of large-scale scientific applications. With the advent of the petascale era, there is an enlarged gap between the computing power of supercomputers and the parallel scalability of many applications. To take full advantage of the massive parallelism of supercomputers, it is indispensable to improve the scalability of large-scale scientific applications through performance analysis and optimization. This thesis work is motivated by cell-based AMR (Adaptive Mesh Refinement) cosmology simulations, in particular the Adaptive Refinement Tree (ART) application. Performance analysis is performed to identify its scaling bottleneck, a performance emulator is designed for efficient evaluation of different load balancing schemes, and topology mapping strategies are explored for performance improvements. More importantly, the exploration of topology mapping mechanisms leads to a generic methodology for network and multicore aware topology mapping, and a set of efficient mapping algorithms for popular topologies. These have been implemented in a topology mapping library, TOPOMap, which can be used to support MPI topology functions.

CHAPTER 1
INTRODUCTION

1.1 Motivation

In recent decades, numerical simulation has become critical for scientific and engineering research. As an important complement to theory and experiment, it has been successfully utilized to study complex phenomena in many areas, such as computational fluid dynamics, electromagnetics, materials science, meteorology, cosmology, and so on. The resultant scientific applications are often highly complicated, and require a large amount of computing resources in terms of both runtime and memory, especially for large-scale simulations. To satisfy the needs of large-scale scientific applications, massively parallel high performance computing (HPC) systems with ever increasing computing power are being built. In the current TOP500 list [81] (June 2013), all the supercomputers have at least several thousand cores. The No. 1 supercomputer, Tianhe-2, has 3,120,000 cores, achieving a Linpack performance (Rmax) of 33,862.7 TFlop/s. More petascale machines will be available in the near future.

In order to take full advantage of the massive parallelism of supercomputers, it is essential to improve the scalability of scientific applications through performance optimization. However, this is very challenging. One important issue is load balancing, which becomes increasingly important as the system size scales up. In practice, the scaling bottleneck of applications is often attributable to an imbalanced workload distribution among processes. Although load balancing schemes have been extensively studied in both theory and application during the past decades, it is still highly non-trivial to achieve ideal load balance on large-scale systems, for the following reasons. First, fine-grained load balancing should be adopted to reduce the imbalance, but this often increases the time complexity of load balancing routines. Second, many scientific simulations are highly dynamic, with the

workload distribution changing dynamically throughout the execution. Hence, highly efficient and scalable (dynamic) load balancing algorithms should be integrated into applications. To achieve this goal, we need to explore and experiment with different load balancing algorithms. However, load balancing schemes are tightly related to the problem decomposition and data structures of the application. Developers are often confined to a few choices, which blocks the exploration of more efficient solutions. To explore new schemes, major code changes are often required, which is costly and risky if the expected performance is not evaluated beforehand. Besides, it often takes a long runtime to evaluate the performance of different schemes through real runs. As a result, we need a smarter way to evaluate various load balancing schemes accurately and efficiently.

Communication optimization is another important issue. If the workload distribution is well balanced, then the communication time often determines the overall scaling of the application. There are at least three approaches to reduce the communication cost. First, we can aggregate multiple small messages into comparatively large messages to fully utilize the bandwidth of the interconnection network. This can substantially reduce the overall communication latency. Second, communication should be overlapped with computation whenever possible, so that the communication overhead can be hidden. These two approaches have been widely used by application developers to improve communication performance. The third approach is topology-aware task mapping [12], or simply topology mapping, which is relatively under-exploited, but critical for scaling on petascale systems and beyond. According to the topology of the interconnection network and the communication pattern of the application, topology-aware task mapping techniques map parallel application tasks (i.e., processes) onto processors (i.e., nodes or cores) properly to minimize the communication cost. The motivation is that message latencies are highly

dependent on the distance between communicating processors and on network contention. From the HPC system perspective, the interconnection networks are always sparse, e.g., fat-tree [55], 3D mesh, or 3D torus [1, 2]. As the system scales up, the diameter of the interconnection network (i.e., the maximum distance between two nodes) increases, and the bisection bandwidth (i.e., the minimum total bandwidth of links connecting one half of the HPC system and the other) often decreases, making communication increasingly expensive. Considering that parallel scientific applications usually have sparse communication patterns, it is critical to map processes onto nodes properly, so that the traffic in the network is localized, leading to better communication performance.

There are other aspects of performance optimization of scientific applications, e.g., I/O optimization [97], fault tolerance [56], and so on. This thesis work focuses on load balancing and communication optimization, as they are the dominating factors for the performance of many applications. Specifically, we consider cosmology simulations, and study the Adaptive Refinement Tree (ART) code [47], which is an advanced hydro+N-body simulation tool for cosmological research. Simulations are critical to making quantum leaps in cosmological research as they provide insight into the evolution of the universe, e.g., the formation of stars and galaxies. There are mainly two categories of cosmology simulation tools: those that only simulate the dark matter (often referred to as "N-body"), and those that model gas dynamics (often called "hydro"). Since cosmologically relevant scales are mainly dependent on the dark matter, a hydro simulation tool always includes an N-body component for modeling the dark matter. Typically, hydro simulations are much more compute-intensive than purely N-body simulations. Modern hydro simulations become even more computationally demanding in terms of both runtime and memory as more and more physical processes are included, e.g., gas

cooling, star formation and feedback, radiative transfer, and so on. Adaptive mesh refinement (AMR) [8, 68] has been widely applied to model the dynamics of cosmic baryons (gas and stars) in cosmology simulations, since it can follow the fragmentation of gas down to virtually unlimited small scales. Enzo [32] and the ART code are representative cosmology simulation tools using AMR. In particular, the ART code uses the cell-based AMR algorithm [46, 88], and incorporates many physical calculations for cosmology research, making it unique in its capabilities.

As cosmology simulations usually consume a large amount of computing resources in terms of both runtime and memory, they are typically carried out on massively parallel high performance computing (HPC) platforms, e.g., HPC clusters. Production simulations using the ART code often involve physics computations for thousands of time steps, and can take several weeks or even months using hundreds of processing cores on production HPC platforms. It is important to further improve the ART code, so that it can scale up to petascale systems and enable cosmologists to solve complex problems more efficiently.

1.2 Contributions

To improve the ART code, we analyze its performance on production HPC systems to identify its performance bottleneck, design a performance emulator for efficient evaluation of load balancing schemes, and explore topology-aware task mapping techniques for performance optimization. Moreover, we also propose a generic methodology for network and multicore aware topology mapping, and develop a set of efficient mapping algorithms for popular topologies. The major contributions of this dissertation are as follows.

Performance Emulation. A performance emulator is designed for cell-based AMR cosmology simulations. The emulator follows the flow of the original

application, i.e., the ART code, while the major physical computation and inter-process communication are replaced by runtime estimates provided by performance models. We demonstrate the effectiveness of the emulator by means of realistic cosmology simulation data on production systems. Experiments indicate that the emulator achieves good accuracy. Given the dynamic nature of AMR and the wide range of workload per cell, load balancing is an extremely challenging problem for cell-based cosmology simulations. We further evaluate three load balancing schemes for cell-based AMR cosmology simulations via the performance emulator. The use of the emulator enables us to quickly identify the issues associated with different load balancing schemes.

Hierarchical Topology Mapping for ART. A hierarchical topology mapping algorithm is proposed to map ART processes onto multiprocessor clusters, where each node contains several multicore CPUs. In order to exploit the architectural properties of multiprocessor clusters (the performance gap between inter-node and intra-node communication, as well as the gap between inter-socket and intra-socket communication), we propose to perform topology mapping in a hierarchical manner. First, the mapping of processes onto nodes (i.e., inter-node mapping) is obtained by using the recursive bipartitioning technique to minimize the amount of traffic in the network. Second, for each node, the mapping of processes onto multicore CPUs (i.e., intra-node mapping) is derived by minimizing the maximum size of messages transmitted between CPU sockets. Experiments on production HPC systems show that significant communication time reduction can be achieved by using the optimized mappings. This hierarchical approach has wide applicability for cell-based AMR cosmology simulations, and the general methodology of performing hierarchical mapping can be used for many parallel applications.

Network and Multicore Aware Topology Mapping. A generic topology mapping methodology, which considers both the topology of the interconnection network and the hierarchical architecture of multicore compute nodes, is proposed for communication optimization of parallel applications on HPC systems. It is a broad extension of the hierarchical topology mapping of ART. Specifically, the mapping is still performed in two phases. In the first phase, the mapping of processes onto compute nodes is derived by utilizing efficient algorithms, which exploit the structure of the interconnection network to reduce inter-node message traffic. In the second phase, the mapping of processes onto logical processors is determined by partitioning the communication graph according to the intra-node hierarchical architecture, which is represented as a tree. We consider supercomputers with popular fat-tree and mesh/torus topologies, and develop efficient mapping algorithms, including a recursive tree mapping algorithm for generic tree topologies, and a recursive bipartitioning algorithm, which partitions the compute nodes of mesh/torus topologies by using their geometric coordinates. These have been integrated in the topology mapping library TOPOMap, which can be employed to support MPI topology functions. Experiments on production HPC systems show that significant performance gains can be achieved by using TOPOMap to find optimized mappings.

Analytical Topology Mapping. An analytical mapping algorithm is proposed for topology mapping onto 3D mesh/torus-connected supercomputers. The design of this algorithm is motivated by mapping applications with irregular communication patterns onto regular 3D mesh/torus topologies, and it is inspired by the analytical placement technique [84, 85] for VLSI design. Specifically, a two-stage strategy is used to determine the mapping. The first stage derives a rough global mapping solution by using quadratic programming, which intends to minimize the amount of traffic in the network. The second stage legalizes

the rough global mapping solution to get a proper mapping by using a diffusion-like migration algorithm to move processes between nodes. The resulting mapping is highly optimized due to the global optimization through quadratic programming. Experiments with popular benchmarks and the ART application on an IBM Blue Gene/P system [41] show that the analytical mapping algorithm derives highly optimized mappings, which lead to significant performance gains.

1.3 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 introduces the Adaptive Refinement Tree (ART) application, and analyzes its performance. Chapter 3 elaborates the design of the performance emulator and the evaluation of different load balancing schemes. Chapter 4 presents an overview of topology mapping techniques for communication optimization of HPC applications. Chapter 5 proposes the hierarchical topology mapping algorithm for ART. Chapter 6 introduces the methodology and algorithms for network and multicore aware topology mapping. Chapter 7 proposes the analytical topology mapping algorithm for regular 3D mesh/torus topologies. Chapter 8 introduces the topology mapping library TOPOMap. Finally, Chapter 9 summarizes this thesis and discusses future work.

CHAPTER 2
COSMOLOGY SIMULATIONS

This chapter introduces a cosmology simulation application, the Adaptive Refinement Tree (ART) code, which is based on adaptive mesh refinement (AMR) [8, 68], and presents its performance analysis on production HPC systems.

2.1 Adaptive Mesh Refinement

Numerical simulations of many multiscale physical phenomena consume enormous computing resources in terms of both runtime and memory, because their multiple spatial and/or temporal scales are discretized at the finest resolution for solution accuracy. This excessive resource requirement usually makes large-scale numerical simulations prohibitive. However, some resources are often underutilized, being spent on the computation of spatial and/or temporal regions which do not require the finest resolution. To achieve better computing efficiency, AMR [8, 68] has been proposed to employ high resolution only in those regions that require it, so that we can focus resources on compute-intensive regions. AMR allows the user to perform simulations that are completely intractable on a uniform mesh. It has been studied extensively for more than two decades, and has proved to be successful in modeling multiscale phenomena for a variety of disciplines, including computational fluid dynamics, thermal dynamics, materials science, geophysics, meteorology, cosmology, astrophysics, and so on.

Generally, there are two kinds of AMR strategies: structured and unstructured. Structured AMR uses unions of regular meshes or cells to cover the computational domain, while unstructured AMR utilizes mesh distortion, which provides greater geometric flexibility at the cost of storing all neighborhood relations explicitly. In this dissertation, we focus on structured AMR, which is often implemented in two

Figure 2.1. A 2D block-structured adaptive mesh refinement example with a refinement factor r = 2.

ways: block-structured AMR [8, 18] and cell-based AMR [46, 88]. The former achieves high spatial resolution by inserting smaller meshes ("blocks") at places where high resolution is needed. The latter instead refines the computational domain on a cell-by-cell basis. In practice, these two methods use different data structures and very different methods for distributing the computational load across a large number of processes. In this section, we briefly review these two structured AMR approaches.

2.1.1 Block-structured AMR. The basic principle of block-structured AMR is straightforward. Initially, a uniform mesh is adopted for the entire computational domain. In the regions which require higher resolution, finer meshes ("blocks") are added. If some regions still need more resolution, even finer meshes are added. The computational domain is refined recursively in this way, and turns into a tree of meshes. The initial uniform mesh, as the tree's root, is at the top level. Each finer level decreases the mesh size by a factor r, which is called the refinement factor. Figure 2.1 shows a block-structured AMR example on a 2D mesh, including both the mesh hierarchy and the overall structure. As each mesh block is regular, the sequential

Figure 2.2. A 2D cell-based adaptive mesh refinement example: a quad tree with a refinement factor r = 2.

code of regular meshes can be reused for block-structured AMR implementations. Load balancing can be achieved by evenly distributing mesh blocks to processors.

2.1.2 Cell-based AMR. Cell-based AMR implements the AMR algorithm by performing grid refinement on each individual cell. If higher spatial resolution is required for a cell, then it is refined into smaller cells, which are at the finer level of the overall grid hierarchy. With each level up, the cell size is decreased by the refinement factor r. A simple example of cell-based AMR on a 2D mesh is shown in Figure 2.2. For simplicity, there are only 3 levels of cells, from level 0 to level 2. Initially, a uniform grid on level 0 covers the overall computational domain. The dotted cells need higher resolution, so they are further refined. Throughout the execution of the AMR application, the grid hierarchy changes adaptively, and the cells are organized in refinement trees [46, 88]. For each refinement tree, its root is the corresponding cell at level 0. A cell with children is a non-leaf cell; otherwise, it is a leaf cell. Compared with block-structured AMR, cell-based AMR is more flexible, and can easily adapt to high resolution in localized regions.
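To make the cell-by-cell refinement concrete, the following is a minimal Python sketch of a refinement tree with refinement factor r = 2. It illustrates the idea only and is not the ART data structure; the refinement criterion `near` is a made-up placeholder.

```python
# Minimal sketch of cell-based AMR: each refined cell splits into 2**d children
# (a quad in 2D, an oct in 3D); unrefined cells are the leaves of the tree.
class Cell:
    def __init__(self, center, size, level):
        self.center, self.size, self.level = center, size, level
        self.children = []                       # empty -> leaf cell

    def refine(self, needs_refinement, max_level):
        """Recursively subdivide wherever the refinement criterion asks for it."""
        if self.level >= max_level or not needs_refinement(self):
            return
        half = self.size / 2.0
        dim = len(self.center)
        for corner in range(2 ** dim):           # 4 children in 2D, 8 in 3D
            offset = [(((corner >> k) & 1) - 0.5) * half for k in range(dim)]
            child = Cell([c + o for c, o in zip(self.center, offset)],
                         half, self.level + 1)
            child.refine(needs_refinement, max_level)
            self.children.append(child)

    def leaves(self):
        if not self.children:
            yield self
        else:
            for c in self.children:
                yield from c.leaves()

# Toy usage: refine a unit square wherever a cell lies close to the point (0.3, 0.3).
root = Cell(center=[0.5, 0.5], size=1.0, level=0)
near = lambda cell: sum((a - b) ** 2 for a, b in zip(cell.center, (0.3, 0.3))) < cell.size ** 2
root.refine(near, max_level=4)
print(sum(1 for _ in root.leaves()), "leaf cells")
```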

2.2 The Adaptive Refinement Tree (ART) Code

Figure 2.3. An example of large-scale ART simulation. Three panels show the dark matter (left), cosmic gas (middle), and stars (right), respectively.

The ART code is an advanced hydro+N-body simulation tool for cosmological research. It simulates the evolution of the universe, or more specifically, the formation of stars and galaxies. It employs a combination of multi-level particle-mesh and shock-capturing Eulerian methods for simulating the evolution of dark matter and cosmic gas, respectively. High dynamic range is achieved by applying adaptive mesh refinement to both the gas dynamics and the gravity calculations. The ART code is distinguished from other cosmological simulation tools by the large number of physical processes it includes, which enable comprehensive simulation of cosmological phenomena and provide deep insight for cosmologists. Figure 2.3 shows the visualization of a large-scale ART simulation, including dark matter, cosmic gas, and stars.

In particular, ART is a hybrid MPI+OpenMP C program, with Fortran functions for compute-intensive routines. The MPI parallelization is used between separate compute nodes and the OpenMP parallelization is used inside a multicore node. This mixed-mode parallelization enables us to take full advantage of modern multicore architectures. The ART code employs the cell-based AMR algorithm, performs refinement locally on individual cells, and organizes cells in refinement trees. In order to model the universe, it adopts a cubic computational volume with a refinement factor of 2. For each cubic cell, the refinement operation evenly subdivides the cell

Figure 2.4. Flow of the ART code. The four steps in the dotted region are the major steps for evolving a time step at each level.

into 8 cells, namely an oct. Hence, the refinement tree is also called an oct-tree [46, 88]. In a typical cosmology simulation, the highest resolution regions can reach 7 to 10 refinement levels, resulting in a large range of dynamic multidimensional regions. With cell-based AMR, the ART code is able to control the computational mesh on the level of individual cells, such that the refinement mesh can easily be built and modified and, therefore, can effectively match the complex geometry of cosmologically interesting regions. In the following subsections, we briefly review its flow, domain decomposition, and communication pattern. Additional details about the ART code can be found in [47, 48, 97].

2.2.1 Flow of ART. Figure 2.4 shows the basic flow of the ART code. First, it reads input files, including parameter files and cosmology data. Second, it initializes the oct-tree and cell buffer. Then it checks whether the simulation time reaches the user-specified time limit. If yes, the simulation stops; otherwise the ART code performs load balance

Figure 2.5. The execution order for evolving time steps on three levels (refinement factor r = 2).

and simulates another iteration by evolving time steps for the overall computational domain, including the cells at all levels. For each level, the evolution of a time step mainly consists of four steps: collect boundary information, perform physics computation, project physics data to the coarser level, and adaptively refine/de-refine the cells. At the end of each iteration, the ART code generates output files, including log files and cosmology data files, if any.

Specifically, in each iteration of the simulation, ART evolves a time step dt for the overall computational domain, which is adaptively refined into multiple levels of cells during simulation. This refinement in the spatial domain is accompanied by a refinement in the time domain, where finer level meshes or cells evolve with a smaller time step according to the refinement factor. Since the ART code uses a refinement factor of 2, the time step size at level l is dt/2^l. As a result, ART evolves 2^l time steps at level l in each iteration. Figure 2.5 shows the recursive execution order for evolving these time steps on three levels. Basically, finer levels evolve first, then coarser levels follow. Except for the finest level, each level l evolves a new time step when and only when level l + 1 has evolved two time steps ahead.
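The recursive ordering can be expressed compactly. The sketch below is only an illustration of the rule just described (not ART code); for three levels it reproduces the 1st–7th ordering shown in Figure 2.5.

```python
# With refinement factor 2, advancing level l by one step first requires level l+1
# to advance two steps, so finer levels always run ahead of coarser ones.
def evolve(level, finest, order):
    if level < finest:
        evolve(level + 1, finest, order)     # level+1 evolves two steps first
        evolve(level + 1, finest, order)
    order.append(level)                      # then this level advances one step of size dt/2**level

order = []
evolve(0, 2, order)                          # three levels: 0, 1, 2
print(order)                                 # [2, 2, 1, 2, 2, 1, 0] -> 2**l steps at level l
```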

2.2.2 Domain Decomposition. The computational domain decomposition is critical to the performance of cosmology applications using AMR. In order to take advantage of the cell-based AMR and exploit the spatial locality, the ART code adopts a domain decomposition scheme [65] based on Hilbert's space-filling curve (SFC) [21]. The SFC is identified by a traversal of all the root cells according to their spatial coordinates. A parallel partition of root cells is obtained by dividing the curve into N_p (N_p is the number of processes) equal parts, where each part is weighted by the total workload of the corresponding computational domain. It is to be noted that each root cell keeps all its child cells at finer refinement levels as a single composite unit, thus being the basic object for domain decomposition.

Figure 2.6. A space-filling curve traversing a 2D mesh (left), and a parallel partition of the 2D adaptive mesh into four parts (right). The cells with the same color are assigned to the same process.

Figure 2.6 presents a space-filling curve on a 2D mesh and a parallel partition of cells into four parts. Because of the spatial locality preserved by the curve, each part is a continuous domain consisting of nearby cells, and this property also holds for the 3D case of the parallel partition of the cubic universe in ART. This SFC-based domain decomposition scheme enables efficient partitioning of the adaptive mesh by transforming a multidimensional problem (e.g., 2D or 3D) into a unidimensional one, and it has been widely employed in parallel AMR implementations [19, 20, 35, 47, 75].

Since the computational meshes and cosmological objects evolve dynamically during a cosmology simulation, the workload distribution between processes changes. In order to ensure load balance, ART regularly examines the workload distribution

during simulation, and performs domain decomposition to re-balance the workload when necessary.

2.2.3 Communication Pattern. With the SFC-based domain decomposition, each process of ART mainly performs computation for its local computational domain, and communicates with other processes to get the boundary information, which is the data associated with the external boundary cells of each process. It is worth noting that each process only keeps the data associated with its local computational domain, enabling a fully parallel solution for both computation and memory.

Updating the boundary information is the dominating communication routine of ART. Specifically, in order to exchange boundary information with other processes, each process first posts non-blocking receives (MPI_Irecv), then sends data by non-blocking sends (MPI_Isend), and finalizes the communication by an MPI_Waitall for all sends and receives. Such MPI communication is performed per time step at each level, and cannot be overlapped with computation to reduce runtime, because the physics computation of each process depends on updated boundary data. Generally, each process only communicates with a relatively small number of processes whose computational domains are nearby, and the amount of communication between two processes is mainly dependent on the number of boundary cells between their computational domains. Figure 2.7 shows the communication pattern of ART for a production simulation with 1536 processes. Each blue dot at (i, j) represents communication between processes i and j, and nz denotes the total number of blue dots, i.e., the total number of communicating process pairs. Clearly, each process only communicates with a few other processes, and most communication is between neighboring processes, since most blue dots are along the diagonal. The communication pattern may vary for different mesh structures and different numbers of processes, yet the sparse and diagonally dominant property always holds due to the spatial locality provided by the SFC-based domain decomposition.
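The following sketch illustrates this non-blocking exchange pattern using mpi4py rather than the C MPI calls in ART; the neighbor list and per-neighbor buffer sizes are hypothetical placeholders for the processes that actually share boundary cells.

```python
# Sketch of the Irecv/Isend/Waitall boundary update described above (mpi4py).
import numpy as np
from mpi4py import MPI

def exchange_boundaries(comm, neighbors, send_bufs, recv_counts):
    """Post all receives, then all sends, then wait for everything to complete."""
    recv_bufs = {p: np.empty(recv_counts[p], dtype=np.float64) for p in neighbors}
    reqs = [comm.Irecv([recv_bufs[p], MPI.DOUBLE], source=p, tag=0) for p in neighbors]
    reqs += [comm.Isend([send_bufs[p], MPI.DOUBLE], dest=p, tag=0) for p in neighbors]
    MPI.Request.Waitall(reqs)          # computation cannot proceed without updated buffers
    return recv_bufs

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    neighbors = [p for p in (rank - 1, rank + 1) if 0 <= p < size]   # toy 1D neighborhood
    send = {p: np.full(4, float(rank)) for p in neighbors}
    recv = exchange_boundaries(comm, neighbors, send, {p: 4 for p in neighbors})
    print(rank, {p: buf[0] for p, buf in recv.items()})
```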

Figure 2.7. The communication pattern of ART during a large-scale cosmology simulation with 1536 processes.

2.3 Performance Analysis

In order to study the scalability of the ART code, we instrument it with performance counters and timers, and perform simulations with practical cosmology datasets on various production HPC systems. In this section, we present the performance analysis of ART on two HPC systems.

2.3.1 Two Cluster Systems. The two cluster systems are the Intel 64 cluster Abe at the National Center for Supercomputing Applications (NCSA), and the Sun cluster Ranger at the Texas Advanced Computing Center (TACC). These systems are chosen because they have different hardware/software configurations, and they are representative HPC clusters widely used for scientific computing.

The first system is an Intel 64 cluster called Abe, which is located at NCSA. It was ranked 82nd in the TOP500 list [81] (June 2010) with a peak performance of 89.47 TFLOPS. Abe is equipped with an InfiniBand network and the Lustre parallel filesystem. It contains 9600 processing cores, arranged as 1200 symmetric multiprocessing (SMP) nodes. Each node has either 8GB or 16GB of memory, and two quad-core CPUs running at 2.33 GHz. Each CPU has 4MB of L2 cache. The operating system on Abe is Red Hat Enterprise Linux 4.

The second system is a Sun Constellation Linux cluster called Ranger, which is located at TACC. It was ranked 11th in the TOP500 list [81] (June 2010) with a peak performance of 579.4 TFlops. It is composed of 3936 nodes, with a total of 62,976 processing cores. The nodes are connected by a full-CLOS InfiniBand interconnect, which provides a 1GB/sec point-to-point bandwidth. Each node has 32GB of memory and four quad-core AMD 64-bit Barcelona CPUs. Each CPU has a clock frequency of 2.3 GHz and three levels of cache for fast memory access: 2MB of L3 cache shared by 4

cores, 512KB of L2 cache dedicated to each core, and 64KB of L1 cache. Ranger is also equipped with the Lustre file system and runs a Linux kernel.

2.3.2 Cosmology Dataset. A practical cosmology dataset of ART called gf25cv1 is employed in our experiments. This cosmological simulation uses a box of 36 comoving Mpc on a side, covered with a uniform top-level grid of cells. A small fraction of the volume (containing the Lagrangian regions of 5 galaxies with masses between and M⊙) is further resolved in the initial conditions by 3 additional levels, to the effective grid. This high resolution region is allowed to refine dynamically by another 6 levels (10 levels total, counting the top grid). This dataset is chosen because it is a practical dataset used in ongoing cosmological research, and it is similar to the datasets which will be used in the ultimate large-scale cosmology simulations in the future. In fact, the computational domain of a large-scale simulation can be 100 times the box size of gf25cv1, and these 100 box pieces will be made essentially independent of each other. Therefore, this dataset is representative of both current cosmology simulations and future large-scale simulations. The scalability analysis and performance optimization based on this dataset will be beneficial for general simulations. In practice, the simulation of this dataset evolves thousands of iterations, which take a large amount of runtime (up to a few months using hundreds of cores), so it is prohibitive to simulate all the iterations. In our experiments, we selectively simulate this dataset for a few iterations at different physical epochs with different grid resolutions for performance evaluation.

2.3.3 Experimental Setup. On Abe, we use the ART code to simulate gf25cv1 for a few iterations at a late physical epoch with a medium amount of particles. Here, only a medium amount of particles is employed for the experiments

on Abe, because the disk quota in the home directory of each user on Abe is only 50GB, and it cannot accommodate the data files for simulating more particles. The experiments are carried out with 256, 512, and 1024 processors (i.e., cores), respectively. As each node of Abe has 8 processors, there are actually 32, 64, and 128 nodes used in these experiments. Smaller numbers of nodes are not adopted for experimentation because they cannot accommodate the data for simulating this dataset. As ART is a hybrid MPI+OpenMP code and there are 8 processors available on each node of Abe, we assign an MPI process with 8 OpenMP threads to each node. This kind of configuration is employed for two reasons. First, this configuration often provides good performance, and it is adopted in most production simulations. Second, to simulate the practical cosmology dataset, each process of the ART code needs a large amount of memory. In practice, each node of the Abe machine cannot accommodate two or more MPI processes of ART. Therefore, we use one MPI process and 8 OpenMP threads on each node for performance analysis.

As Ranger has a sufficient disk quota per user, we simulate the dataset at an early physical epoch with a large amount of particles. A series of experiments with 256, 512, 1024, and 2048 processors (i.e., cores) has been performed. Similarly, we assign one MPI process to each node, and one OpenMP thread to each processor. Recall that each node of Ranger has 16 processors, so there are 16 OpenMP threads on each node.

2.3.4 Results. Figures 2.8 and 2.9 show the average runtime per process of the ART code on Abe and Ranger, respectively. The overall runtime of the ART code is divided into three parts: physics computation, MPI communication, and others. The physics computation time includes all the runtime spent on solving the physics equations and managing the cells; the MPI communication time is the time spent on message passing function calls; and the other runtime mainly includes I/O and load balance time.

Figure 2.8. Execution time of the ART code on Abe for simulating a single iteration of gf25cv1 at a late physical epoch with a medium amount of particles.

Figure 2.9. Execution time of the ART code on Ranger for simulating a single iteration of gf25cv1 at an early physical epoch with a large amount of particles.

Figure 2.10. Total parallel scaling of the ART code on Abe.

Since different simulations are performed on these two clusters, the absolute runtimes of these two sets of experiments are not comparable. As the "others" part consumes less than 10% of the total runtime, we focus on the analysis of the physics computation time and the MPI communication time. The scaling curves of these experiments are illustrated in Figures 2.10 and 2.11, respectively.

For the experiments on both Abe and Ranger, the physics computation time has a linear speedup as the number of processors increases. In comparison with the physics computation time, the MPI communication time has an inferior scaling trend on both clusters, and this inferior scaling results in the poor scaling of the total runtime. Basically, it is expected that the average physics computation time has good scaling, because the overall workload of the simulation is dependent on the given cosmology data. It is observed that the physics computation time of different processes often varies significantly, which causes a large amount of synchronization time (i.e., waiting time) between processes. As such synchronization time constitutes a large part of the MPI communication time, it is necessary to explore more efficient load balancing schemes, so that the physics computation time (i.e., workload) of different

Figure 2.11. Total parallel scaling of the ART code on Ranger.

processes can be well balanced and the synchronization time can be reduced. Moreover, it would also be beneficial to explore topology-aware mapping techniques for reducing the communication cost (i.e., data transmission time).

CHAPTER 3
PERFORMANCE EMULATION

In this chapter, we present the design of the performance emulator for cell-based AMR cosmology simulations, and evaluate three load balancing schemes via the performance emulator.

3.1 Proposed Approach

The execution time of a cosmology simulation can be divided into three parts: physics computation, MPI communication, and others. Here, the physics computation time includes the runtime spent on solving physics equations and managing the cells; the MPI communication time is the time spent on MPI function calls; and the other runtime mainly includes I/O and load balance time. As the sum of physics computation and MPI communication time accounts for more than 95% of the overall runtime, we focus on these two parts during the design of the emulator. In the following, we first present our performance models to estimate physics computation time and MPI communication time, and then describe the performance emulator.

3.1.1 Performance Models. Runtime performance models are built to estimate physics computation time and MPI communication time. Considering that different levels have different resolutions and time step sizes, our focus is to build models to estimate these runtimes for each time step per level. During simulation, each application process stores two types of cells: the cells within its computational domain, namely local cells, and the external neighboring cells of its computational domain, namely buffer cells. Each process conducts physical calculation on local cells, and communicates with other processes to get buffer cell data, which serve as important boundary information for simulating its local computational domain. These two types of cells are the main indicators of physics computation time and MPI communication time, respectively.

Physics Computation Time. In order to characterize the physics computation time accurately, we use principal component analysis (SPCA) [73] to analyze the runtime, and observe that the numbers of local cells and particles are the dominant terms in determining the physics computation time. This is expected, because the ART code solves two kinds of physics equations at each level: the physics equations for hydrodynamics, and the physics equations for the N-body simulation. The former is only solved for leaf local cells, and the solution is projected to obtain the solution of non-leaf local cells at the coarser level, while the latter is solved for particles. Thus, we use the following linear model for the physics computation time T_P of each level:

    T_P = T_P^{nllc} + T_P^{llc} + T_P^{particle}
        = w_1 N_{nllc} + w_2 N_{llc} + w_3 N_{particle},    (3.1)

where w_i (i = 1, 2, 3) are constant coefficients for the level of interest. T_P^{nllc}, T_P^{llc}, and T_P^{particle} denote the computation time of non-leaf local cells, leaf local cells, and particles, respectively. N_{nllc}, N_{llc}, and N_{particle} denote the numbers of non-leaf local cells, leaf local cells, and particles, respectively. Note that Eq. (3.1) is defined for each level of each process, while all the processes share common constant coefficients for each level. To extract these coefficients, we use the cell and particle counts along with the physics computation time at each level of each process to formulate Eq. (3.1), and then solve a linear system in the form of Ax = b for each level. As the number of equations is usually larger than the number of coefficients, we apply linear regression to compute the least-squares fit of these coefficients.
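As an illustration of the coefficient extraction (a sketch, not the emulator's implementation), the per-level least-squares fit can be written as follows; the sample counts and timings are made up.

```python
# Fit the shared per-level coefficients w1..w3 of Eq. (3.1) from per-process samples.
import numpy as np

def fit_level_coefficients(n_nllc, n_llc, n_particle, t_physics):
    A = np.column_stack([n_nllc, n_llc, n_particle])      # one row per (process, sample)
    w, *_ = np.linalg.lstsq(A, t_physics, rcond=None)     # least-squares solution of A x = b
    return w

def predict_physics_time(w, counts):
    return float(np.dot(w, counts))                       # evaluate Eq. (3.1) for one process/level

# Toy data: 6 samples generated from "true" coefficients plus noise.
rng = np.random.default_rng(0)
counts = rng.integers(1_000, 50_000, size=(6, 3)).astype(float)
t_meas = counts @ np.array([2e-6, 5e-6, 1e-6]) + rng.normal(0, 1e-3, 6)
w = fit_level_coefficients(counts[:, 0], counts[:, 1], counts[:, 2], t_meas)
print("fitted coefficients:", w)
print("predicted time:", predict_physics_time(w, [20_000, 30_000, 10_000]))
```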

MPI Communication Time. The MPI communication time can be further divided into two parts: data transmission time and synchronization time. Data transmission time is simply the runtime spent on transmitting data between processes. As updating the boundary information is a major communication routine and the amount of boundary data for each process is proportional to its buffer cell count, the data transmission time is largely dependent on the number of buffer cells. In our model, data transmission time is modeled as follows:

    T_trans = t_s + n · t_c,    (3.2)

where t_s can be considered as the latency for message passing, t_c is the inverse of the bandwidth, and n is the data size for one data transmission. The latency and bandwidth can be obtained by using the Intel MPI Benchmarks (IMB) [43], and the number of transmitted bytes can be calculated using the number of buffer cells.

Synchronization time is incurred when processes do not start their communication routines simultaneously. For example, consider two processes with a maximum refinement level of 3 and 6, respectively, and assume that these two processes need to exchange boundary information from level 0 to 3. The first process starts evolving time steps at level 3, while the second process starts at level 6. Once the first process finishes a time step at level 3, it sits idle until the second process catches up to it, and then they can communicate to exchange boundary information. Even if two processes have the same refinement level, when there is load imbalance, a process may still have to sit idle waiting for the boundary information from the other process. Such extra waiting time is recorded as part of the MPI communication time, because the process is stalled while executing MPI function calls, and we refer to it as synchronization time. The MPI communication time, including both data transmission time and synchronization time, can be characterized by emulating time steps, as detailed in the next subsection.
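A minimal sketch of Eq. (3.2), assuming the latency and inverse bandwidth are extracted from two ping-ping style measurements; all numbers below are made up.

```python
# Estimate t_s and t_c from two (message size, time) measurements, then predict T_trans.
def fit_linear_cost(size1, time1, size2, time2):
    t_c = (time2 - time1) / (size2 - size1)   # inverse bandwidth (seconds per byte)
    t_s = time1 - t_c * size1                 # latency
    return t_s, t_c

def transmission_time(n_bytes, t_s, t_c):
    return t_s + n_bytes * t_c                # T_trans = t_s + n * t_c

t_s, t_c = fit_linear_cost(1_024, 12e-6, 1_048_576, 1.06e-3)
print(transmission_time(64 * 1_024, t_s, t_c))   # predicted time for a 64 KB boundary message
```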

Figure 3.1. Flow of the performance emulator of ART.

To build these models, we need performance data so as to extract the model coefficients. In practice, we can either use the performance data from previous simulations, or conduct a few iterations of the simulation to collect performance data for model construction.

3.1.2 Emulator Design. Figure 3.1 presents the design of our emulator. Comparing Figure 3.1 and Figure 2.4, we can see that the emulator uses the performance models to estimate physics computation and MPI communication time for each time step, rather than conducting the actual simulation. During each iteration, we use a load balancer to determine the workload distribution among processes, and then emulate time steps in exactly the same order as the ART code. For each time step, the emulator first analyzes the communication relationships among processes. Specifically, for each process, the emulator derives the amount of data that needs to be transmitted to and received from all the other processes. Next, according to the cell and particle counts of each process and the communication relationships, the emulator estimates

physics computation time and data transmission time using the performance models shown in Equations (3.1) and (3.2), respectively. Finally, the emulator estimates the total runtime. This is achieved by maintaining a time axis for each process to record the computation and communication intervals. Figure 3.2 shows the time axes of two processes, P0 and P1. The lengths of the computation time intervals are the estimated physics computation times, while the arrows represent data transmissions. Clearly, the MPI communication time intervals are determined by the computation time intervals and data transmissions. For example, the second MPI communication time interval of P0 can be computed as T_c = t_2 - t_1 = (t_0 + T_trans) - t_1, and the start time of P0 for the communication in the next time step is t_3 = t_2 + T_P. Therefore, using the estimated physics computation time and data transmission time for each time step of each process, the emulator is able to estimate the MPI communication time, including both data transmission time and synchronization time, without evaluating the synchronization time separately. In summary, by emulating time steps with the performance models to characterize runtime components, the emulator can estimate the performance of the ART code without executing physics solvers.

Figure 3.2. Part of the time axes of two processes.
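The time-axis bookkeeping can be sketched for two processes as follows. This is a simplified illustration of the interval arithmetic described above (one exchange per step and a single transmission time), not the emulator's actual code; all inputs are made-up estimates.

```python
# Two-process time-axis emulation: waiting (synchronization) time falls out of the
# interval arithmetic, so it never has to be evaluated separately.
def emulate_pair(comp_p0, comp_p1, t_trans):
    clock = [0.0, 0.0]               # current position on each process's time axis
    comm = [0.0, 0.0]                # accumulated MPI time (transfer + waiting)
    for c0, c1 in zip(comp_p0, comp_p1):
        ready = [clock[0] + c0, clock[1] + c1]       # when each process finishes computing
        for me, other in ((0, 1), (1, 0)):
            arrived = ready[other] + t_trans         # partner's boundary data arrives
            finished = max(ready[me], arrived)       # idle time appears implicitly here
            comm[me] += finished - ready[me]         # e.g. T_c = (t_0 + T_trans) - t_1
            clock[me] = finished                     # adding the next T_P gives t_3 = t_2 + T_P
    return comm

print(emulate_pair([1.0, 1.2], [0.6, 0.7], t_trans=0.05))   # the faster process accumulates waiting time
```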

3.2 Load Balancing Schemes

One of the major design goals of our performance emulator is to evaluate the performance of different load balancing schemes, thus avoiding time-consuming and complicated implementation in code without knowing the potential effects of the modification. The ART code performs cosmology simulations using a cubic computational domain, which represents the universe. The computational domain is initially divided into many uniform cubic cells at level 0. We refer to such cells at level 0 as root cells, since they are the roots of oct-trees. Each root cell keeps all its child cells at finer refinement levels as a single composite unit, thus being the basic unit for load balancing. In this section, we present three representative load balancing schemes, which will be evaluated later in Section 3.3.

3.2.1 SFC-Based Load Balancing Scheme (SFCLB). Currently, the ART code employs a load balancing scheme based on the Hilbert space-filling curve (SFC) [21]. We denote this scheme as SFCLB. It assigns a unique SFC ID to each root cell according to its spatial coordinates, then generates an SFC curve by connecting root cells with continuous SFC IDs, and finally divides the SFC curve into N_p (N_p is the number of processes) segments with similar amounts of workload. Specifically, this scheme considers the total workload of each root cell, and adopts a greedy algorithm to split the SFC curve, so that the workload can be evenly distributed among all the processes. Currently, SFCLB is widely used for parallel AMR [19, 35, 75]. One salient feature of the SFCLB scheme is its good spatial locality, where each process gets root cells with continuous SFC IDs, resulting in a continuous computational domain. However, SFCLB restricts the assignment of root cells to processes by the SFC curve, and does not consider the communication among processes when splitting the SFC curve.
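A minimal sketch of the greedy curve splitting, assuming the root cells are already ordered by SFC ID; the particular greedy rule used here is one reasonable choice and may differ from the heuristic in the ART code.

```python
# Cut the SFC into N_p contiguous segments whose workloads are close to the average.
def split_sfc(workloads, n_procs):
    """workloads[i] is the workload of the root cell with SFC ID i."""
    target = sum(workloads) / n_procs
    owner, rank, acc = [], 0, 0.0
    for w in workloads:
        # close the current segment if adding this cell would move it away from the target
        if rank < n_procs - 1 and acc and abs(acc - target) <= abs(acc + w - target):
            rank, acc = rank + 1, 0.0
        owner.append(rank)
        acc += w
    return owner                      # owner[i] = process owning root cell i

print(split_sfc([5, 1, 1, 1, 4, 4, 1, 1, 2, 4], n_procs=4))   # -> per-process loads of 6, 6, 6, 6
```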

3.2.2 Graph Partitioning-Based Load Balancing Scheme (GraphLB). Graph partitioning is an alternative approach for the load balancing of cell-based AMR applications. We denote it as GraphLB. To apply the graph method, we need to map the load balancing problem into a graph partitioning problem. One straightforward mapping is to use vertices to represent root cells, and edges to represent communication relationships between neighboring root cells. The weight of each vertex is the workload of the corresponding root cell. Although this mapping does make sense, it results in a large graph, which is difficult to partition with an acceptable amount of runtime and memory. For example, in a medium-sized cosmology simulation, a cubic computational domain of root cells maps into a graph with vertices and about edges. Partitioning such a large graph is almost prohibitive. Therefore, we must reduce the graph size for efficient partitioning. In our preliminary experiments, it is observed that only the root cells in a few localized regions are deeply refined to finer levels, while most root cells are not refined. Thus, we use a single vertex to represent the unrefined root cells with continuous SFC IDs, and create the edges accordingly in order to generate a manageable graph. The assignment of root cells to processes can be obtained by partitioning the vertices in the graph into N_p partitions. Note that graph partitioning algorithms typically minimize the total edge-cut subject to the constraint that the partitions are of similar size. When they are applied to the load balancing of cell-based AMR applications, we are actually minimizing the amount of communication among processes while ensuring that the workload is well balanced. In our implementation, we use the graph partitioning tool METIS [60] to partition the graph.
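The graph-size reduction can be sketched as follows: consecutive unrefined root cells along the SFC are collapsed into a single weighted vertex, while refined root cells remain individual vertices. This is only an illustration, not the GraphLB implementation; edge construction and the subsequent call to a partitioner such as METIS are omitted.

```python
# Collapse runs of unrefined root cells into single vertices to obtain a small weighted graph.
def reduce_root_cells(workloads, refined):
    """workloads[i] and refined[i] describe the root cell with SFC ID i."""
    vertices, weights = [], []              # vertices[v] = SFC IDs merged into vertex v
    for sfc_id, (w, is_refined) in enumerate(zip(workloads, refined)):
        prev_is_run = vertices and not refined[vertices[-1][-1]]
        if is_refined or not prev_is_run:
            vertices.append([sfc_id])       # refined cell, or start of a new unrefined run
            weights.append(w)
        else:
            vertices[-1].append(sfc_id)     # extend the current run of unrefined cells
            weights[-1] += w
    return vertices, weights

cells_refined = [False, False, True, True, False, False, False, True]
cells_work    = [1, 1, 9, 7, 1, 1, 1, 5]
print(reduce_root_cells(cells_work, cells_refined))
# -> ([[0, 1], [2], [3], [4, 5, 6], [7]], [2, 9, 7, 3, 5])
```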

3.2.3 Group-Based Load Balancing Scheme (GroupLB). The aforementioned two load balancing schemes only consider the total workload of each root cell. However, as the ART code evolves time steps for each level, it is also critical to balance the workload of each level in order to reduce the synchronization time. Besides, it is observed that the communication at deep refinement levels usually results in a large synchronization cost, so it is important to minimize the communication at deep refinement levels. To meet such requirements, we design a new load balancing scheme called GroupLB. This scheme first assigns neighboring root cells into groups, where each group has the lowest possible boundary level and satisfies a set of group workload constraints to control the granularity. The assignment of root cells to groups is based on the friends-of-friends algorithm [29]. Second, GroupLB assigns root cell groups to processes by solving a constrained bin packing problem to balance the workload of each level, where each bin corresponds to a process. Specifically, we sort the groups in non-increasing order according to their workload, and pack them into bins sequentially. To achieve good spatial locality, we compute the distances between groups using Voronoi tessellations [86], and try to assign each group to the process which holds its neighboring groups. In this way, GroupLB is able to achieve good level-by-level load balance and spatial locality.
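A simplified sketch of the packing step: groups are handled in non-increasing order of workload and assigned to a process that already holds a neighboring group whenever that does not overload it, otherwise to the least-loaded process. Group construction (friends-of-friends) and the per-level constraints are omitted; the neighbor map and the 10% slack factor are assumptions of this sketch, not details of GroupLB.

```python
# Greedy packing of root-cell groups onto processes, trading locality against balance.
def pack_groups(group_loads, neighbors, n_procs):
    """neighbors[g] = set of group IDs adjacent to group g."""
    target = sum(group_loads) / n_procs          # average workload per process
    load = [0.0] * n_procs
    owner = {}
    for g in sorted(range(len(group_loads)), key=lambda i: -group_loads[i]):
        w = group_loads[g]
        nearby = {owner[h] for h in neighbors[g] if h in owner}
        fitting = [q for q in nearby if load[q] + w <= 1.1 * target]
        pool = fitting if fitting else range(n_procs)
        p = min(pool, key=lambda q: load[q])     # least-loaded acceptable candidate
        owner[g] = p
        load[p] += w
    return owner, load

loads = [8.0, 6.0, 5.0, 4.0, 3.0, 2.0]
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(pack_groups(loads, adj, n_procs=3))        # per-process loads end up near the average
```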

3.3 Experiments

In this section, we present two sets of experiments. In the first experiment, we assess the accuracy of the emulator. In the second experiment, we examine the load balancing schemes using the emulator with realistic cosmology data, and discuss their advantages and shortcomings.

3.3.1 Experimental Setup. We instrument the ART code with performance counters and timers for analysis. We use a real cosmological simulation of a box of 36 comoving Mpc on a side, covered with a uniform top-level grid of root cells. This simulation represents a scaled-down version of our future petascale cosmological simulations. In fact, the petascale simulation is 100 times the volume of this dataset, and these 100 box pieces are essentially independent of each other. Our testbed is the Intel 64 cluster Abe located at NCSA [77]. It is equipped with an InfiniBand network and the Lustre parallel file system. Each node has either 8 GB or 16 GB of memory, and two quad-core CPUs running at 2.33 GHz. As ART is a hybrid MPI+OpenMP code and there are 8 cores available on each node of Abe, we assign one MPI process with 8 OpenMP threads to each node.

3.3.2 Accuracy of Performance Emulator. To measure the accuracy of the emulator, we conduct experiments using the emulator with the current load balancing scheme of ART, SFCLB, and compare the emulated performance results with the actual runtime of the ART code. As the emulator evolves time steps without running the physics solvers, it only takes a few minutes to estimate the performance of ART. Figure 3.3 presents the comparison of actual runtime and emulated runtime. Here, the runtime includes physics computation time and MPI communication time. The difference between the actual runtime and the emulated runtime is within 12%. The error is mainly due to some extra communication routines in the ART code other than those exchanging boundary information between processes. The emulator does not model such extra communication because it will be removed in the next version of ART. Therefore, the emulator is sufficiently accurate. More importantly, we notice that both curves in the figure have exactly the same trend. This further indicates that our emulator can accurately predict the performance and scalability of realistic cosmology simulations, and the performance analysis based on this emulator is reliable.

3.3.3 Evaluation of Load Balancing Schemes. In this set of experiments, we assess the three load balancing schemes presented in Section 3.2 for cosmology simulations by means of the emulator. For the same cosmology dataset described in Section 3.3.1, we test two different resolution cases: the coarse resolution case and the fine resolution case.

Figure 3.3. Comparison of actual runtime and emulated runtime of ART.

The coarse resolution case represents a simulation with an intermediate resolution, which reaches a maximum refinement level of 6. The fine resolution case represents a simulation with an extremely high resolution, which is allowed to refine dynamically to level 9. We use three metrics, namely execution time, load balance ratio, and communication time per level, for evaluation. Specifically, the load balance ratio represents the quality of the workload distribution among processes. It is defined as

    \text{Load Balance Ratio} = \frac{\frac{1}{N_p} \sum_{i=0}^{N_p - 1} W_i}{\max_{0 \le i \le N_p - 1} W_i} \times 100\%,

where W_i is the workload of process i and N_p is the number of processes. Note that the Load Balance Ratio is always smaller than or equal to 100%, and a value closer to 100% indicates better load balance quality.
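A minimal sketch of evaluating this metric from per-process workloads (the example numbers are illustrative):

    # Sketch: compute the Load Balance Ratio from per-process workloads,
    # following the definition above.

    def load_balance_ratio(workloads):
        n_p = len(workloads)
        average = sum(workloads) / n_p
        return average / max(workloads) * 100.0

    # Example: four processes with workloads 90, 100, 110, 140 give
    # an average of 110 and a ratio of 110 / 140 * 100% = 78.6%.
    print(round(load_balance_ratio([90, 100, 110, 140]), 1))   # 78.6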

Figure 3.4. Comparison of emulated runtime by using different load balancing schemes for the coarse resolution case.

Figures 3.4 and 3.5 present the emulated runtime obtained with the different load balancing schemes for the coarse resolution and fine resolution cases, respectively. In both figures, the runtime of the GroupLB scheme is smaller than that of the other two schemes, especially in the coarse resolution case. In Figure 3.5, we notice that as the number of processors increases, the three curves trend toward the same value. This trend is caused by granularity, which is determined by the maximum workload of the root cells. In the fine resolution case, there are several root cells whose individual workload is much larger than the average workload per process when running on 512 and 1024 processors, thus introducing significant synchronization cost.

Table 3.1 presents the overall Load Balance Ratio among processes for the different load balancing schemes. The closer the metric is to 100%, the better the load balance. Both SFCLB and GroupLB achieve better load balance for the coarse resolution case than for the fine resolution case. This is because the finer grid resolution increases the workload of root cells and results in larger granularity, which makes load balancing more challenging. GraphLB behaves differently since its primary objective is to minimize the communication among processes instead of balancing the workload. Comparing these schemes, GroupLB achieves the best load balance for the coarse resolution case and for the first two tests of the fine resolution case. For the fine resolution case with 512 and 1024 processors, all three schemes fail to provide good load balance because of the granularity problem.

Figure 3.5. Comparison of emulated runtime by using different load balancing schemes for the fine resolution case.

Table 3.1. Overall Load Balance Ratio of Different Load Balancing Schemes

    Number of        Coarse Resolution Case             Fine Resolution Case
    Processors    SFCLB     GraphLB    GroupLB      SFCLB     GraphLB    GroupLB
       192                   64.89%     96.63%      90.21%     85.48%     91.17%
       256                   73.54%     96.43%      71.35%     78.81%     88.51%
       512                   53.76%     95.84%      53.74%     54.02%     51.73%
      1024                   42.74%     92.21%      26.98%     27.01%     26.75%

Figure 3.6. Load Balance Ratio of each level for the coarse resolution case.

Figures 3.6 and 3.7 show the Load Balance Ratio of each level for the coarse resolution case and the fine resolution case, respectively. In the coarse resolution tests, the Load Balance Ratio of GraphLB is clearly smaller than that of the other two schemes at almost all levels; SFCLB and GroupLB provide similar load balance at the deep refinement levels 4 to 6, and GroupLB also delivers a well-balanced workload distribution at levels 0 to 3. In the fine resolution tests, the schemes achieve comparable load balance at levels 6 to 9, while GroupLB is more effective in balancing the workload at the lower levels. In general, GroupLB achieves better level-by-level load balance than SFCLB and GraphLB for both the coarse and fine resolution cases.

Figures 3.8 and 3.9 illustrate the average communication time of each level for the two resolution cases, respectively. In Figure 3.8, as the number of processors increases, the level-by-level communication time decreases. Generally, GroupLB incurs less level-by-level communication time than SFCLB and GraphLB because it achieves better load balance at each level.

Figure 3.7. Load Balance Ratio of each level for the fine resolution case.

In Figure 3.9, the communication time does not scale down with the increasing number of processors, because the granularity problem results in poor load balance. GroupLB and GraphLB have much smaller communication time than SFCLB at levels 7 to 9 since both methods try to reduce communication. However, for 512 and 1024 processors, GroupLB and GraphLB still incur large communication time due to the granularity.

In summary, by comparing these three load balancing schemes, we conclude that GroupLB provides the best performance. It achieves good load balance quality by balancing both the overall and the level-by-level workload, and minimizes communication cost by preserving spatial locality. While SFCLB maintains an overall load balance and keeps spatial locality using the SFC curve, it does not take level-by-level load balance into consideration, thereby introducing non-trivial synchronization cost. Although GraphLB minimizes communication cost, it does not provide satisfactory load balance quality.

Figure 3.8. Average communication time of each level for the coarse resolution case.

Figure 3.9. Average communication time of each level for the fine resolution case.

CHAPTER 4
OVERVIEW OF TOPOLOGY MAPPING

Supercomputers with up to hundreds of thousands of cores have been built to fulfill the demands of large-scale and complex scientific applications. These machines usually have sparse network topologies with large network diameters (i.e., the maximum distance between two nodes). As the system size scales up, the diameter of the interconnection network increases, and the bisection bandwidth (i.e., the minimum total bandwidth of the links connecting one half of the supercomputer to the other) often decreases. As a result, communication in the network becomes increasingly expensive due to the large distance between nodes and network contention, which becomes a scaling bottleneck for parallel applications. Topology-aware task mapping is an essential technique for communication optimization of parallel applications on high performance computing platforms. According to the topology of the interconnection network and the communication pattern of the application, it maps parallel application tasks onto processors to reduce communication cost. The following sections briefly review topology-aware task mapping techniques.

4.1 Background

Today's supercomputers often use fat-tree, mesh or torus network topologies. A fat-tree network uses switches to connect compute nodes into a tree structure [55]. The compute nodes are at the leaves, while intermediate nodes represent switches. From the leaves to the root, the available bandwidth of the links increases, i.e., the links become fatter. InfiniBand networks are representative examples of the fat-tree topology. To optimize communication on a fat-tree, we prefer local communication over global communication for lower latency.

An n-dimensional mesh network connects compute nodes into a mesh structure, where each node is directly connected to two other nodes in each physical dimension. An n-dimensional torus network is a mesh with additional links connecting the pair of nodes at the ends of each physical dimension [1], so that the network diameter is halved. In practice, the 3-dimensional (3D) torus is commonly used in supercomputers, including the IBM Blue Gene family and the Cray XT family, and it can easily be configured into a mesh when required. In petascale machines with 3D torus networks, the network diameter can be a few dozen hops. As studied in [12], for small and medium sized messages, the message latency depends on the distance in hops. More importantly, [12] also shows that congestion in the network can increase message latencies significantly. For better communication performance in torus networks, it is essential to reduce network congestion by mapping heavily communicating tasks onto neighboring nodes.
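To make the notion of hop distance concrete, the following minimal sketch computes the shortest-path hop count between two nodes of a torus from their coordinates; the coordinates and dimensions are illustrative, and a mesh would simply use the absolute coordinate difference in each dimension instead of the wraparound term.

    # Sketch: shortest-path hop count between two nodes of a d-dimensional
    # torus, given their coordinates and the torus dimensions. In each
    # dimension, the shorter of the two directions (with wraparound) is taken.

    def torus_distance(a, b, dims):
        hops = 0
        for x, y, size in zip(a, b, dims):
            delta = abs(x - y)
            hops += min(delta, size - delta)
        return hops

    # Example: on an 8 x 8 x 8 torus, nodes (0, 0, 0) and (7, 4, 1)
    # are 1 + 4 + 1 = 6 hops apart thanks to the wraparound links.
    print(torus_distance((0, 0, 0), (7, 4, 1), (8, 8, 8)))   # 6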

In general, it is often advantageous to consider the network topology when mapping application tasks onto nodes, so that the communication cost can be reduced. For communication-intensive applications, it is indispensable to exploit topology-aware task mapping techniques for better scaling toward petascale computing platforms. Beyond the performance improvement, a proper mapping often reduces the total amount of communication in the network, thereby reducing the power spent on network transmission. As power consumption becomes increasingly critical in modern supercomputers, topology-aware mapping techniques are also favorable due to these potential power savings. However, in practice, it is very challenging to perform topology-aware mapping for parallel applications at runtime, since we have to solve the complicated mapping problem, which is detailed in the next section.

4.2 The Topology Mapping Problem

For topology-aware mapping, the communication relation between parallel application tasks is represented as a task graph, which is also called a communication graph. It is to be noted that each task is an MPI process. The network topology of the target computing platform is represented as a topology graph. The problem of mapping parallel application tasks onto a computing platform with a specific topology can be considered as a graph embedding problem [4], where a mapping of tasks onto the target topology is an embedding of the task graph onto the topology graph.

Specifically, the task graph G_t(V_t, E_t) is an undirected graph which characterizes the communication pattern of the parallel application. Each vertex in V_t represents a task (i.e., a process). The edges in E_t represent the communication between tasks, and each edge (u, v) ∈ E_t has a weight c(u, v), which denotes the amount of communication (in bytes) between the two tasks represented by vertices u and v. The topology graph G_p(V_p, E_p) represents the network topology of the target computing platform. Typically, a vertex in V_p denotes a processor (i.e., a compute node), and each edge in E_p denotes a direct link. The definition of the topology graph can vary for different network topologies. For example, the topology graph of a fat-tree network may also need to include switches. In order to model various topologies, [38] introduces a generic definition of the topology graph; [83] utilizes edge weights to model the communication cost between processors. The mapping of tasks onto processors is specified by a function ϕ : V_t → V_p.

There are at least four metrics to evaluate the quality of a mapping, including the widely used hop-bytes [12] and dilation, and the recently proposed maximum interconnective message size [26] and worst-case congestion [38]. Hop-bytes is the total amount of inter-processor communication (in bytes) weighted by the distance (in hops) between processors. Recall that the edge weight c(u, v) denotes the amount of communication (in bytes) on edge (u, v) ∈ E_t. Let d(ϕ(u), ϕ(v)) be the distance, usually measured by the length of the shortest path (in hops) between ϕ(u) and ϕ(v) in the topology graph G_p(V_p, E_p). Then the hop-bytes of a mapping ϕ can be computed as

    \text{hop-bytes}(\varphi) = \sum_{(u,v) \in E_t} c(u, v) \, d(\varphi(u), \varphi(v)).

It represents the total communication volume in the interconnection network, and also indicates the amount of power consumed by data transmission in the network. The dilation of an edge (u, v) ∈ E_t is the length of the shortest path connecting ϕ(u) and ϕ(v) in G_p(V_p, E_p). The dilation of a mapping is the maximum dilation among all edges in E_t, and it measures the most stretched edge. The maximum interconnective message size is the maximum size of the messages transmitted in the network. It measures the mapping quality on interconnected multicore platforms, where each compute node is assigned several heavily communicating processes. The congestion of a link in the network is measured by the amount of traffic on that link divided by the link capacity (i.e., bandwidth). The worst-case congestion over all links denotes the worst-case contention in the network, and it is a lower bound on the communication time. The mapping problem is usually defined as finding a mapping that minimizes one or more of these metrics, i.e., hop-bytes, dilation, the maximum interconnective message size, or the worst-case congestion.
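As a small illustration (not part of any particular mapping library), the two classical metrics can be evaluated for a given mapping ϕ as follows; the graph representation and the dist function are assumptions of the sketch.

    # Sketch: evaluate hop-bytes and dilation for a given mapping phi.
    # task_edges: list of (u, v, c_uv) triples from the task graph
    # phi:        dict mapping each task to its processor
    # dist(p, q): shortest-path distance in hops in the topology graph
    #             (e.g., the torus distance sketched in Section 4.1)

    def hop_bytes(task_edges, phi, dist):
        return sum(c * dist(phi[u], phi[v]) for u, v, c in task_edges)

    def dilation(task_edges, phi, dist):
        return max(dist(phi[u], phi[v]) for u, v, _ in task_edges)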

Such a problem definition is based on certain assumptions, as summarized in [38]. First, intra-node communication is much faster than inter-node communication, so that heavily communicating tasks should be mapped onto the same compute node whenever possible. Second, we assume that switches are ideal, so it is not necessary to model the internal switch structure for task mapping. Third, oblivious routing is assumed, i.e., the route of a message does not depend on other ongoing traffic in the network. Fourth, the routing algorithm is assumed to be fixed and does not depend on the communication relation between processors; in other words, the mapping of tasks onto processors will not change the routing algorithm. These assumptions are often satisfied in many cases, and a fairly good mapping can be obtained by solving the task mapping problem.

In practice, the number of tasks (i.e., processes) is no less than the number of processors (i.e., compute nodes) used for running the application, i.e., |V_t| ≥ |V_p|. If each compute node executes a single process, we have |V_t| = |V_p|, and ϕ is a one-to-one mapping from V_t to V_p. If compute nodes execute more than one process, then |V_t| > |V_p|, and a graph partitioner [45, 60, 72] can be employed to divide the processes into groups before the mapping strategies are applied. The common case is that each compute node executes the same number of processes. Generally, finding the optimal mapping ϕ that minimizes some of the aforementioned metrics is NP-hard [14] or NP-complete [38]. A variety of algorithms have been explored for efficient task mapping, as described in the next section.

4.3 Related Works

Topology-aware task mapping has been extensively studied. Many algorithms have been proposed to map parallel application tasks onto different topologies. These algorithms typically fall into the categories of physical optimization techniques and heuristic approaches. Specifically, physical optimization techniques include simulated annealing [15, 16, 53], genetic algorithms [7, 23], graph contraction [9, 58] and particle swarm optimization [71]. Although these optimization algorithms often provide good mapping quality, they take a long time to derive the mapping, and hence are not suitable for efficient mapping at runtime; they are seldom used in practice. In contrast, many heuristic approaches can better exploit the structure of the topology graph and the communication pattern, and are thus more efficient for practical usage. The pioneering work [14] uses sequences of pairwise exchanges with probabilistic jumps to search for a good mapping. The subsequent research [54] proposes to derive an initial mapping first, and then use pairwise exchanges to refine the mapping.

However, it can take a long time for pairwise exchanges to reach good mapping solutions. Graph heuristics have also been investigated for task mapping. [70] generates an initial mapping by partitioning the task graph, where communicating processes are mapped onto the same processor or neighboring processors; the initial mapping is then further tuned by a boundary refinement procedure to improve load balance among processors. A recursive bipartitioning algorithm based on the Kernighan-Lin graph-bisection heuristic is proposed in [33] to map tasks onto a hypercube topology. The task graph is bipartitioned recursively with the objective of minimizing the communication between the two parts while maintaining a specified workload balance, and each task's processor assignment is gradually determined through partitioning the task graph at each level. This recursive bipartitioning technique is extended to handle general topologies in [66], where both the task graph and the topology graph are bipartitioned recursively to derive the mapping of tasks onto processors. It has been implemented in the software package SCOTCH [67, 72], and has proved successful in deriving static mappings.

The earlier works in the 1980s (e.g., [14, 33, 54, 70]) mainly consider hypercubes, array processors, or shuffle-exchange networks, which were popular topologies at that time. Their experimental evaluations are limited to relatively small problem sizes, since parallel computing platforms were then of small scale. With the advent of faster networks and wormhole routing, the communication latency was largely reduced and became distance-insensitive, so the research in this area died down. With the emergence of very large supercomputers in recent years, communication latency has again proved to be a performance bottleneck, leading to the necessity of topology-aware task mapping. However, the network topologies of modern supercomputers are often fat-trees and tori. Most earlier works are either unscalable or target different topologies, and cannot be used in the present context. Hence, recent studies continue to explore efficient task mapping techniques.

In the late 1990s, the Cray T3D and T3E machines first showed that communication performance depends on the mapping of tasks onto processors, and the impact of network contention and the benefit of topology-aware mapping were evaluated [27, 40, 62]. The later supercomputers from Cray (e.g., the Cray XT families) use faster interconnects, which has relieved this communication issue to some extent. However, the evaluation study with sophisticated benchmarks in [12] demonstrates that the communication issue still exists, and [89] shows the benefit of topology-aware job scheduling. Different from Cray machines, IBM supercomputers like the Blue Gene families acknowledge that message latencies depend on the distance, and encourage users to take advantage of the topology for performance optimization [2, 6, 42]. Their topologies have been utilized by both system designers [5, 28] and application developers [11, 13, 34, 36, 64]. Graph embedding schemes are exploited to map tasks onto nodes for 3D mesh and torus topologies in [74, 96], and have been employed for scalable support of the MPI topology functions in the IBM Blue Gene/L (BG/L) MPI library. A general method for task mapping on BG/L is proposed in [10]. As it uses Monte Carlo simulation to find a good mapping, it takes a long time and the solution is derived offline. Several mapping algorithms for regular 2D and 3D topologies are studied in [12]. MPI benchmarks are developed to evaluate the message latency in order to quantify the effects of network congestion, and geometric approaches are exploited to map regular 2D and 3D task graphs effectively. For example, the Maximum Overlap algorithm recursively overlaps the largest possible area of a 2D task graph with a 2D topology graph in order to derive a good mapping. The Expand From Corner algorithm starts at one corner of the task graph and of the topology graph, respectively, and sorts the tasks and processors according to the Manhattan distance. The tasks are mapped onto processors according to this order, so that communicating tasks can be placed on neighboring processors. A mapping algorithm

based on space-filling curves is also investigated. Both tasks and processors are ordered into a one-dimensional list according to the space-filling curve, and tasks are mapped onto processors following this order. Moreover, generic mapping techniques are studied in [38], including a greedy heuristic, a recursive bipartitioning approach, and a mapping strategy based on graph similarity. These have been implemented in the topology mapping library [57], and the derived solution can be further improved by simulated annealing. Results on fat-tree, torus, and the PERCS network topologies show that these techniques can effectively reduce network congestion, and the third approach using graph similarity takes the minimum runtime, thus being a good candidate for time-efficient mapping on large-scale systems. [3] proposes an iterative approach to map tasks onto processors heuristically. Using an estimation function that estimates how critical it is to map an unallocated task onto an available processor, it iteratively selects a pair of an unallocated task and an available processor for mapping. For systems with a hierarchical communication architecture, e.g., clusters of SMP (Symmetric Multiprocessor) nodes, [83] employs graph partitioning to derive an appropriate mapping. In order to solve large-scale mapping problems, a hierarchical mapping algorithm is proposed in [25]. The tasks are first divided into groups called supernodes, then these groups are mapped onto the target topology to derive an initial mapping, and finally the mapping of the tasks within each group is improved by fine tuning. For task mapping on multicore (or multiprocessor) clusters, the tasks are usually partitioned into subsets, which are mapped onto nodes by using ordinary mapping techniques. [31] evaluates the performance of different mappings using benchmark applications. [26] introduces the maximum interconnective message size (MIMS) to measure the quality of a mapping, and designs packing algorithms to find the optimal mapping.

Beyond the aforementioned works on inter-node mapping, intra-node mapping has also been studied. Jeannot et al. [44] proposed the TreeMatch algorithm for mapping processes onto cores within a multicore compute node, and this algorithm was used in [59] to support the MPI_Dist_graph_create function. Rashti et al. [69] proposed to use a weighted graph to model the whole physical topology of the computing system, including both the inter-node topology and the intra-node topology, and used the recursive bipartitioning algorithm in SCOTCH [67] to find a proper mapping. Although this is an effective approach to tackle both inter-node mapping and intra-node mapping in a uniform way, it has a few limitations. First, this approach cannot exploit the structure of the physical topology, e.g., the network hierarchy of the fat-tree topology, the cache hierarchy within a compute node, or the regularity of a mesh/torus. As a result, both its overhead and its mapping quality can be further improved. Second, in some cases, a weighted graph may not be a proper representation of the physical topology. For example, the InfiniBand network topology and the intra-node topology can be better represented by topological trees (without edge weights), and the topology of non-contiguous nodes allocated to a user application on Cray XT5 machines cannot be modeled by a sparse weighted graph straightforwardly.

In summary, the majority of works focus on inter-node mapping only and overlook the mapping within a compute node. Generic mapping algorithms like recursive bipartitioning [38, 66] and greedy heuristics [3, 38] can find good mappings, but they are likely to be outperformed by other heuristics like graph partitioning [76] (according to the fat-tree topology) and graph embedding [96] (onto a regular mesh/torus), which are well designed to exploit the structure of specific mapping problems. Typically, a proper inter-node mapping reduces the communication between nodes by placing heavily communicating processes on the same compute node, leading to a

large amount of intra-node communication. It then becomes important to optimize the intra-node mapping in order to achieve the best performance. Although the TreeMatch algorithm [44] has been proposed as an effective heuristic for intra-node mapping, to the best of our knowledge, there is little work on concurrent support of both inter-node mapping and intra-node mapping except [69], which does not exploit the structure of the problem.

CHAPTER 5
HIERARCHICAL TOPOLOGY MAPPING FOR ART

In this chapter, we present the hierarchical topology mapping algorithm for reducing the communication cost of ART on interconnected multiprocessor clusters.

5.1 Problem Statement

We consider the problem of mapping parallel ART processes onto interconnected multiprocessor clusters as illustrated in Figure 5.1. Each node has several multicore CPUs. The communication pattern of ART is represented as a task graph G_t(V_t, E_t), which can be extracted from the decomposed subdomains. Each vertex in V_t denotes a process, and each edge (u, v) ∈ E_t represents the communication between processes u and v. A weight c(u, v) is introduced for each edge to denote the amount of communication in bytes between the respective processes. The topology of a multiprocessor cluster is characterized by a topology graph G_p(V_p, E_p), where each vertex in V_p represents a node, and each edge in E_p denotes the link between the respective nodes. We also introduce edge weights for the topology graph to represent the distance in hops between nodes, so that both direct and indirect links can be modeled properly. The mapping of tasks onto nodes is specified by a function ϕ : V_t → V_p. As each node can accommodate multiple processes, without loss of generality, we assume that |V_t| is a multiple of |V_p|, so that each node is assigned |V_t|/|V_p| processes. There are several different metrics to evaluate the quality of a mapping, as discussed in Chapter 4. We choose hop-bytes in order to reduce the total amount of traffic in the interconnection network.

To the best of our knowledge, most studies focus on mapping tasks onto nodes (i.e., a single-level mapping), while the mapping of tasks within nodes is left undefined.

Figure 5.1. Interconnected multiprocessor clusters with multicore CPUs on each node.

A proper inter-node mapping typically reduces the communication between nodes by grouping the most heavily communicating processes within nodes, leading to a large amount of intra-node communication. Once the inter-node mapping has been optimized properly, it becomes critical to optimize the intra-node mapping by exploiting the topology within a node. Hence, we optimize both inter-node mapping and intra-node mapping in a hierarchical manner.

5.2 Proposed Approach

Typically, inter-node communication is more expensive than intra-node communication. Likewise, within a node, inter-socket communication is often more expensive than intra-socket communication.¹ To fully exploit these communication characteristics on multicore clusters, we propose to perform task mapping hierarchically in two phases: first, perform inter-node mapping; second, perform intra-node mapping within each multiprocessor node.

¹ This is our observation from experiments on several production multiprocessor clusters. The performance of intra-node communication is determined by the specific MPI implementation.

In the first phase, the mapping of tasks onto nodes can be derived by using conventional task mapping techniques. In the second phase, a novel technique is required to map tasks onto the multicore CPUs within each node.

5.2.1 Inter-Node Mapping. We employ the recursive bipartitioning heuristic [33, 38, 66] for mapping ART processes onto multiprocessor nodes, aiming at minimizing hop-bytes. This approach solves the task mapping problem in a divide-and-conquer manner. It performs recursive bipartitioning on both the task graph and the topology graph, and maps subsets of processes to subsets of nodes until a final mapping is obtained. Many topologies and communication patterns can be handled by recursive bipartitioning, and it has proved to be a successful task mapping technique in the software package SCOTCH [67].

Figure 5.2 presents our recursive bipartitioning algorithm for inter-node mapping. Recall that both the task graph and the topology graph are weighted, and their edge weights represent the amount of communication in bytes between processes and the distance in hops between nodes, respectively. In order to reduce hop-bytes, it is critical to map heavily communicating processes onto the same multiprocessor node, or at least onto nearby nodes. To achieve this goal, the task graph is partitioned with the minimum edge-cut, while the topology graph is split with the maximum edge-cut. In each step of the algorithm, the resulting subsets of processes (V_t0, V_t1) and subsets of nodes (V_p0, V_p1) can be mapped in two ways: the direct mapping V_t0 → V_p0, V_t1 → V_p1; and the exchanged mapping V_t0 → V_p1, V_t1 → V_p0. To choose a proper mapping, we heuristically estimate the cost of these two mappings. Let C_u be the amount of communication in bytes associated with process u, and D_i be the aggregate distance between node i and the other nodes.

Algorithm 1 Recursive Mapping
Input: task graph G_t(V_t, E_t), topology graph G_p(V_p, E_p).
Output: mapping ϕ : V_t → V_p.

    recursive_mapping(V_t, V_p)
    {
        if (|V_p| = 1) {
            ϕ(u) = V_p, for each process u ∈ V_t;
            return;
        }
        (V_t0, V_t1) ← bipartition(G_t(V_t, E_t));
        (V_p0, V_p1) ← bipartition(G_p(V_p, E_p));
        Calculate C_0, C_1, D_0, D_1;
        if (C_0 D_0 + C_1 D_1 ≤ C_0 D_1 + C_1 D_0) {
            recursive_mapping(V_t0, V_p0);
            recursive_mapping(V_t1, V_p1);
        } else {
            recursive_mapping(V_t0, V_p1);
            recursive_mapping(V_t1, V_p0);
        }
    }

Figure 5.2. The recursive bipartitioning algorithm for inter-node mapping.

Let C_k and D_k be the averages of C_u and D_i over the subsets of processes V_tk and the subsets of nodes V_pk (k = 0, 1), respectively:

    C_u = \sum_{(u,v) \in E_t} c(u, v),
    D_i = \sum_{\text{node } j \neq i} d(i, j),
    C_k = \frac{1}{|V_{tk}|} \sum_{u \in V_{tk}} C_u,
    D_k = \frac{1}{|V_{pk}|} \sum_{i \in V_{pk}} D_i.

The cost of the direct mapping is estimated by (C_0 D_0 + C_1 D_1), while that of the exchanged mapping is (C_0 D_1 + C_1 D_0). This estimated cost can be considered a prediction of hop-bytes. The mapping with the smaller estimated cost is selected in order to minimize hop-bytes.

Theorem 1. The time complexity of recursive bipartitioning is O((|E_t| + |E_p|) log |V_p|), and the time complexity of cost estimation is O((|V_t| + |V_p|) log |V_p|) based on precomputed C_u and D_i.

Proof 1. The multilevel k-way partitioning scheme [45] computes a bipartition of a graph G(V, E) in O(|E|) time. The depth of the recursive bipartitioning is log_2 |V_p|, and the size of the graph is halved in each step. Hence, the total runtime for partitioning the task graph G_t(V_t, E_t) is

    \sum_{k=0}^{\log_2 |V_p| - 1} 2^k \, O(|E_t|)/2^k = O(|E_t| \log |V_p|).

Similarly, the topology graph G_p(V_p, E_p) is recursively bipartitioned in O(|E_p| log |V_p|) time. Thus the overall runtime of recursive bipartitioning is O((|E_t| + |E_p|) log |V_p|). For cost estimation, both C_u and D_i can be precomputed and then used to calculate C_k and D_k in the different recursive mapping calls. The C_u of all processes can be computed in O(|E_t|) time. The distance between a pair of nodes can be obtained by using platform-dependent techniques; e.g., the pairwise node distance in a 3D mesh

or 3D torus network can be computed from the coordinates. The D_i of all nodes can be calculated in O(|V_p|^2) time using the pairwise node distances. In the recursive mapping procedure, the runtime for computing C_k and D_k is

    \sum_{k=0}^{\log_2 |V_p| - 1} 2^k \, O(|V_t| + |V_p|)/2^k = O((|V_t| + |V_p|) \log |V_p|).

In order to reduce the runtime of recursive mapping, the original task graph G_t(V_t, E_t) is partitioned into |V_p| equal parts by minimizing inter-part communication, where each part has |V_t|/|V_p| processes. Then an induced task graph Ĝ_t(V̂_t, Ê_t), which represents the communication between the groups of processes, is used for efficient recursive mapping.

Theorem 2. With the induced task graph Ĝ_t(V̂_t, Ê_t), the time complexity of recursive bipartitioning is O((|Ê_t| + |E_p|) log |V_p|), and the time complexity of cost estimation is O(|V_p| log |V_p|).

We use the graph partitioning tool hmetis [37] for partitioning the original task graph, and for the recursive bipartitioning of both the induced task graph and the topology graph. The resulting subsets of processes and subsets of nodes may be unbalanced in some rare cases; a greedy approach is applied to achieve a balanced bipartition by moving the vertex which leads to the optimal edge-cut. In practice, if |V_p| ≠ 2^k, the number of nodes in some step of the recursive mapping will be odd; in this case the cost estimation is not required, and a direct mapping which maps the process groups to equal numbers of nodes is adopted. The mapping produced by the recursive bipartitioning algorithm could be further improved to reduce hop-bytes: the local search algorithm in [25] and the heuristic in [38] with the threshold accepting technique can be employed to further optimize the mapping, but such mapping optimization has not been exploited in this work.
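To summarize the cost-estimation step in code, the following minimal sketch chooses between the direct and the exchanged mapping, assuming the per-process communication volumes C[u] and the per-node aggregate distances D[i] have been precomputed as defined above; it is an illustration of the heuristic, not the actual implementation.

    # Sketch: the cost-estimation step of Figure 5.2. Given the two process
    # subsets and the two node subsets produced by bipartitioning, return the
    # pairing with the smaller estimated hop-bytes.

    def average(values):
        values = list(values)
        return sum(values) / len(values)

    def choose_orientation(Vt0, Vt1, Vp0, Vp1, C, D):
        # C: dict process -> communication volume C_u
        # D: dict node    -> aggregate distance D_i
        C0, C1 = average(C[u] for u in Vt0), average(C[u] for u in Vt1)
        D0, D1 = average(D[i] for i in Vp0), average(D[i] for i in Vp1)
        if C0 * D0 + C1 * D1 <= C0 * D1 + C1 * D0:
            return [(Vt0, Vp0), (Vt1, Vp1)]   # direct mapping
        else:
            return [(Vt0, Vp1), (Vt1, Vp0)]   # exchanged mapping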

5.2.2 Intra-Node Mapping. The motivation of intra-node mapping is to exploit the performance gap between intra-socket communication and inter-socket communication for performance optimization. For each multiprocessor node, we choose to minimize the maximum inter-socket message size (MIMS), which is computed as

    \text{MIMS} = \max_{(u,v) \in E_t} c(u, v),

subject to the condition that processes u and v are on different CPU sockets of the same node. MIMS is the maximum size of the process-pairwise bidirectional messages transmitted between CPU sockets. The resulting mapping places heavily communicating processes on the same multicore CPU.

Figure 5.3 shows a sketch of our algorithm for intra-node mapping. It utilizes the intra-node task graph G̃_t(Ṽ_t, Ẽ_t), which represents the communication between processes within a multiprocessor node, and maps processes onto multicore CPUs. We assume that the number of processes |Ṽ_t| is a multiple of ncpus (the number of CPUs), so that each CPU is assigned |Ṽ_t|/ncpus processes. The intra-node mapping is performed in two steps: first, an initial mapping by graph partitioning; second, fine tuning with a greedy heuristic. The processes are partitioned into ncpus equal parts P_i (0 ≤ i < ncpus) by minimizing inter-part communication, and each part is mapped onto the corresponding CPU. In each iteration of fine tuning, the edge (u, v) ∈ Ẽ_t resulting in MIMS is identified, i.e., u ∈ P_i, v ∈ P_j, i ≠ j, MIMS = c(u, v).

Algorithm 2 Intra-Node Mapping
Input: intra-node task graph G̃_t(Ṽ_t, Ẽ_t).
Output: mapping of processes onto multicore CPUs.

    Partition the intra-node task graph into ncpus equal parts P_i (0 ≤ i < ncpus);
    Map the processes in P_i onto CPU i;
    Loop
        Identify the edge (u, v) ∈ Ẽ_t leading to MIMS, i.e., u ∈ P_i, v ∈ P_j, i ≠ j, MIMS = c(u, v);
        If the minimum IMS of exchanging a pair of processes to group both u and v onto either CPU i or CPU j is smaller than MIMS
            Exchange the pair of processes with the minimum IMS;
        Else
            Break;
        End If
    End Loop

Figure 5.3. The algorithm for intra-node mapping by minimizing the maximum inter-socket message size (MIMS).

Then we evaluate whether we can group both processes u and v onto CPU i or CPU j by exchanging a pair of processes to reduce MIMS. The resulting inter-socket message size (IMS) of exchanging process u ∈ P_i with a process w ∈ P_j \ {v} can be computed as

    \text{IMS}(u \leftrightarrow w) = \max\Big( \max_{x \in P_i,\, x \neq u} c(u, x),\; \max_{x \in P_j,\, x \neq w} c(w, x) \Big).

Likewise, the resulting IMS of exchanging process v ∈ P_j with a process w ∈ P_i \ {u} can also be computed. The minimum IMS of exchanging a pair of processes is derived by evaluating all possible process pairs: (u ↔ w), u ∈ P_i, w ∈ P_j \ {v}, and (w ↔ v), w ∈ P_i \ {u}, v ∈ P_j. If it is less than MIMS, we exchange the pair of processes with the minimum IMS; otherwise, the algorithm terminates.

The initial mapping can be obtained in O(|Ẽ_t|) time by using the multilevel k-way partitioning algorithm [45]. In each iteration of fine tuning, the edge (u, v) leading to MIMS can be identified in O(|Ẽ_t|) time. In order to evaluate the resulting IMS of exchanging process pairs, for each process in P_i and P_j, the maximum amount of communication with another process in the same part needs to be computed. This procedure takes O(|Ẽ_ti| + |Ẽ_tj|) time, where Ẽ_ti and Ẽ_tj are the sets of edges representing the communication between processes within P_i and P_j, respectively. It then requires O(|Ṽ_t|/ncpus) comparisons to find the minimum IMS of exchanging two processes.

Theorem 3. The overall time complexity of each iteration of fine tuning is O(|Ẽ_t| + |Ẽ_ti| + |Ẽ_tj| + |Ṽ_t|/ncpus).

hmetis [37] is employed to partition the intra-node task graph into balanced parts for the initial mapping, which is further improved through fine tuning.
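The fine-tuning loop can be sketched as follows; this is a simplified illustration of Algorithm 2 under the assumptions that comm[u][v] stores the intra-node communication volume between processes u and v (with no self-entries) and part[u] stores u's current CPU socket, and the iteration cap is an added safeguard that is not part of the original algorithm.

    # Sketch: greedy fine tuning that repeatedly tries to group the two
    # endpoints of the largest inter-socket message onto the same socket.

    def mims_edge(comm, part):
        """Return (size, u, v) of the largest inter-socket message."""
        best = (0, None, None)
        for u in comm:
            for v, c in comm[u].items():
                if part[u] != part[v] and c > best[0]:
                    best = (c, u, v)
        return best

    def ims_after_swap(comm, part, u, w):
        """IMS estimate for exchanging u and w: each process's largest
        message to its current socket becomes inter-socket after the swap."""
        max_u = max((c for x, c in comm[u].items() if part[x] == part[u]), default=0)
        max_w = max((c for x, c in comm[w].items() if part[x] == part[w]), default=0)
        return max(max_u, max_w)

    def fine_tune(comm, part, max_iter=50):
        for _ in range(max_iter):
            mims, u, v = mims_edge(comm, part)
            if u is None:
                break                        # no inter-socket communication left
            # Candidate swaps that would place u and v on the same socket.
            candidates = [(u, w) for w in part if part[w] == part[v] and w != v]
            candidates += [(v, w) for w in part if part[w] == part[u] and w != u]
            if not candidates:
                break
            a, b = min(candidates, key=lambda s: ims_after_swap(comm, part, *s))
            if ims_after_swap(comm, part, a, b) >= mims:
                break                        # no improving exchange: terminate
            part[a], part[b] = part[b], part[a]
        return part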

It is worth noting that the initial mapping has the minimum total amount of inter-socket communication, and that the fine tuning procedure usually terminates in a few iterations.

5.3 Performance Evaluation

Experimental Setup. Experiments are carried out on two production multiprocessor clusters, NICS Kraken [78] and TACC Ranger [79], which have different network topologies. Kraken is a Cray XT5 system with a 3D torus interconnect topology. It comprises 9,408 compute nodes, and each node contains two 2.6 GHz six-core AMD Opteron processors. Ranger is a Sun Constellation Linux cluster with a full-CLOS fat-tree topology. It has a total of 3,936 compute nodes, and each node has four 2.3 GHz quad-core AMD Opteron processors.

A communication benchmark similar to IMB (Intel MPI Benchmarks) [43] is designed to measure the intra-socket communication time and the inter-socket communication time for performance analysis. Note that IMB cannot be employed for this purpose because we would like to test all possible communicating pairs of processors. We are particularly interested in the performance of PingPing, since the communication between ART processes can be viewed as concurrent PingPing operations between many process pairs.

To evaluate the performance of the proposed hierarchical task mapping algorithm, we extract the communication part of a production ART simulation (with a box of 36 comoving Mpc on a side and a uniform top-level mesh of root cells) for the tests. For comparison, we evaluate five different mapping mechanisms: (1) the system default mapping, which is topology-agnostic; (2) an optimized mapping using the MPI topology function MPI_GRAPH_CREATE [61]; (3) the pure inter-node mapping using the algorithm in Section 5.2.1; (4) the pure intra-node mapping using the algorithm in Section 5.2.2; (5) the hierarchical mapping integrating both inter-node mapping and intra-node mapping.

Table 5.1. Average Intra-Socket and Inter-Socket Communication Time (PingPing)
(Columns: No. of Bytes; No. of Repetitions; Kraken (us): Intra, Inter, Difference; Ranger (us): Intra, Inter, Difference.)

Hop-bytes, the maximum inter-socket message size (MIMS) and communication time are used as evaluation metrics.

All the experiments were conducted in production mode without dedicated nodes, and other users were sharing the interconnection network. For each set of experiments with a particular number of processes, we obtain the topology information between nodes at runtime, generate the different mappings by using the proposed algorithms, and run all the tests with the different mappings in a single batch script. It is worth noting that different runs often get nodes with different pairwise distances, and the interference from other running applications may also differ. Hence, the results obtained from different runs may not be comparable.

Results. Table 5.1 presents the average intra-socket and inter-socket PingPing communication time over all possible communicating processor pairs. The overhead of inter-socket communication compared to intra-socket communication is reported under the columns labeled Difference, and the larger difference values between Kraken and Ranger are highlighted in bold. Clearly, Ranger shows a larger performance gap between intra-socket and inter-socket communication for most message sizes, because it has a more complicated intra-node topology. As shown in Figure 5.4, there are four interconnected CPU sockets on Ranger and no direct link exists between socket 0 and socket 3, whereas there are only two CPU sockets on each node of Kraken. This architectural difference results in different communication savings when applying the pure intra-node mapping and the hierarchical mapping (see Figures 5.8 and 5.12).

Figure 5.5 compares the pure inter-node mapping and the system default mapping on Kraken in terms of hop-bytes. The inter-node mapping algorithm (listed in Figure 5.2) effectively reduces hop-bytes for all the test cases, and the maximum reduction is up to 59%.

Figure 5.4. The intra-node topology of Ranger (from the TACC Ranger website [79]).

Figure 5.5. Comparison of inter-node mapping and default mapping on Kraken in terms of hop-bytes.

Figure 5.6. Comparison of intra-node mapping and default mapping on Kraken in terms of MIMS (the maximum inter-socket message size).

Figure 5.7. Comparison of hierarchical mapping and inter-node mapping on Kraken in terms of MIMS (the maximum inter-socket message size).

Figure 5.8. Communication time reduction of different mapping mechanisms compared to default mapping on Kraken.

Figure 5.6 compares the pure intra-node mapping and the system default mapping on Kraken in terms of MIMS, and Figure 5.7 compares the hierarchical mapping and the pure inter-node mapping on Kraken in terms of MIMS. The intra-node mapping algorithm (listed in Figure 5.3) reduces MIMS substantially, by up to 83%. For the first test case with 24 processes, the system default mapping happens to have the minimum MIMS, and the MIMS of the pure inter-node mapping is also close to the minimum, so no MIMS reduction or only a limited reduction can be achieved.

The communication time reduction of the different mapping mechanisms compared to the system default mapping is illustrated in Figure 5.8, where MPI_Graph, Intra-Node, Inter-Node and Hierarchical represent the MPI topology mapping, the pure intra-node mapping, the pure inter-node mapping and the hierarchical mapping, respectively. MPI_Graph does not achieve much performance improvement except for the first test case with 24 processes, and it fails for the largest test with 1,536 processes.

Figure 5.9. Comparison of inter-node mapping and default mapping on Ranger in terms of hop-bytes.

The pure intra-node mapping often provides only minor performance improvement, while the pure inter-node mapping performs much better. In contrast, the hierarchical mapping always outperforms both the pure intra-node mapping and the pure inter-node mapping, achieving a communication time reduction of up to 25%.

The experiments on Ranger show similar performance results, as illustrated in Figures 5.9 to 5.12. For all the test cases, the inter-node mapping algorithm (listed in Figure 5.2) is able to reduce hop-bytes by up to 76%, and the intra-node mapping algorithm (listed in Figure 5.3) reduces MIMS by up to 79%. MPI_Graph provides similar performance to the system default mapping. Both the pure intra-node mapping and the pure inter-node mapping achieve communication time reduction for most test cases. For the test with 32 processes, the pure inter-node mapping results in more communication time despite the reduced hop-bytes. This phenomenon is mainly attributable to the fact that intra-node PingPing communication can be slower than nearby inter-node PingPing communication for large message sizes on Ranger. This is due to the small memory bandwidth per core on Ranger, and possibly also to an inefficient implementation of the intra-node communication algorithms in the mvapich library.

Figure 5.10. Comparison of intra-node mapping and default mapping on Ranger in terms of MIMS (the maximum inter-socket message size).

Figure 5.11. Comparison of hierarchical mapping and inter-node mapping on Ranger in terms of MIMS (the maximum inter-socket message size).

Figure 5.12. Communication time reduction of different mapping mechanisms compared to default mapping on Ranger.

The hierarchical mapping often achieves much better performance than both the pure intra-node mapping and the pure inter-node mapping, reducing the communication time by up to 50%.

By comparing Figure 5.8 and Figure 5.12, we can observe that the hierarchical mapping is much more effective on Ranger (relative to the performance of the pure inter-node mapping), and that the pure intra-node mapping generally achieves a larger communication time reduction (in percentage) on Ranger, because Ranger has a larger performance gap between intra-socket and inter-socket communication, as shown in Table 5.1. Essentially, this performance gap indicates how critical it is to perform intra-node task mapping, and we can expect intra-node task mapping to become increasingly important as more processors are included in each node.

CHAPTER 6
NETWORK AND MULTICORE AWARE TOPOLOGY MAPPING

In this chapter, we introduce a topology mapping methodology and a set of mapping algorithms which consider both the network topology and the hierarchical architecture of multicore compute nodes to find a proper mapping.

6.1 Proposed Methodology

Figure 6.1. High performance computing systems with multicore CPU(s) on each compute node.

We consider the problem of mapping parallel application processes onto HPC systems with multicore compute nodes (see Figure 6.1), where each compute node is assigned multiple MPI processes. The physical topology of the computing system consists of two parts: the network topology (i.e., the inter-node topology) and the intra-node topology. As they have different characteristics, which can be exploited by different mapping techniques, we propose to perform inter-node mapping and intra-node mapping in two phases, so that the mapping can be better optimized. This two-phase approach naturally exploits the hierarchy of the computing system, and is thus computationally efficient.
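A minimal sketch of this two-phase structure is shown below; the two phase functions are placeholders for the inter-node and intra-node algorithms developed in this chapter, and only the control flow is illustrated.

    # Sketch: the two-phase mapping driver. The phase functions are
    # assumed, illustrative callables, not part of the actual library.

    def two_phase_mapping(task_graph, network_topology, node_topology,
                          inter_node_mapping, intra_node_mapping):
        # Phase 1: assign processes to compute nodes using the network
        # (inter-node) topology.
        node_of = inter_node_mapping(task_graph, network_topology)

        # Phase 2: within each node, place its processes onto sockets/cores
        # using the intra-node topology.
        placement = {}
        for node in set(node_of.values()):
            local = [p for p, n in node_of.items() if n == node]
            placement.update(intra_node_mapping(task_graph, node_topology, local))
        return node_of, placement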


More information

Multicore Parallel Computing with OpenMP

Multicore Parallel Computing with OpenMP Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large

More information

HPC Deployment of OpenFOAM in an Industrial Setting

HPC Deployment of OpenFOAM in an Industrial Setting HPC Deployment of OpenFOAM in an Industrial Setting Hrvoje Jasak h.jasak@wikki.co.uk Wikki Ltd, United Kingdom PRACE Seminar: Industrial Usage of HPC Stockholm, Sweden, 28-29 March 2011 HPC Deployment

More information

Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC

Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC Outline Dynamic Load Balancing framework in Charm++ Measurement Based Load Balancing Examples: Hybrid Load Balancers Topology-aware

More information

Sun Constellation System: The Open Petascale Computing Architecture

Sun Constellation System: The Open Petascale Computing Architecture CAS2K7 13 September, 2007 Sun Constellation System: The Open Petascale Computing Architecture John Fragalla Senior HPC Technical Specialist Global Systems Practice Sun Microsystems, Inc. 25 Years of Technical

More information

Recommended hardware system configurations for ANSYS users

Recommended hardware system configurations for ANSYS users Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range

More information

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

More information

Symmetric Multiprocessing

Symmetric Multiprocessing Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called

More information

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Oracle Database Scalability in VMware ESX VMware ESX 3.5 Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

More information

OpenMP Programming on ScaleMP

OpenMP Programming on ScaleMP OpenMP Programming on ScaleMP Dirk Schmidl schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign

More information

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Stan Posey, MSc and Bill Loewe, PhD Panasas Inc., Fremont, CA, USA Paul Calleja, PhD University of Cambridge,

More information

1 Bull, 2011 Bull Extreme Computing

1 Bull, 2011 Bull Extreme Computing 1 Bull, 2011 Bull Extreme Computing Table of Contents HPC Overview. Cluster Overview. FLOPS. 2 Bull, 2011 Bull Extreme Computing HPC Overview Ares, Gerardo, HPC Team HPC concepts HPC: High Performance

More information

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Roy D. Williams, 1990 Presented by Chris Eldred Outline Summary Finite Element Solver Load Balancing Results Types Conclusions

More information

Reliable Systolic Computing through Redundancy

Reliable Systolic Computing through Redundancy Reliable Systolic Computing through Redundancy Kunio Okuda 1, Siang Wun Song 1, and Marcos Tatsuo Yamamoto 1 Universidade de São Paulo, Brazil, {kunio,song,mty}@ime.usp.br, http://www.ime.usp.br/ song/

More information

Real-Time Scheduling 1 / 39

Real-Time Scheduling 1 / 39 Real-Time Scheduling 1 / 39 Multiple Real-Time Processes A runs every 30 msec; each time it needs 10 msec of CPU time B runs 25 times/sec for 15 msec C runs 20 times/sec for 5 msec For our equation, A

More information

Planning the Installation and Installing SQL Server

Planning the Installation and Installing SQL Server Chapter 2 Planning the Installation and Installing SQL Server In This Chapter c SQL Server Editions c Planning Phase c Installing SQL Server 22 Microsoft SQL Server 2012: A Beginner s Guide This chapter

More information

Distributed communication-aware load balancing with TreeMatch in Charm++

Distributed communication-aware load balancing with TreeMatch in Charm++ Distributed communication-aware load balancing with TreeMatch in Charm++ The 9th Scheduling for Large Scale Systems Workshop, Lyon, France Emmanuel Jeannot Guillaume Mercier Francois Tessier In collaboration

More information

Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV)

Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV) Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Interconnection Networks 2 SIMD systems

More information

Big Graph Processing: Some Background

Big Graph Processing: Some Background Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

More information

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems 202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric

More information

Delivering Quality in Software Performance and Scalability Testing

Delivering Quality in Software Performance and Scalability Testing Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,

More information

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP

More information

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) ( TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx

More information

159.735. Final Report. Cluster Scheduling. Submitted by: Priti Lohani 04244354

159.735. Final Report. Cluster Scheduling. Submitted by: Priti Lohani 04244354 159.735 Final Report Cluster Scheduling Submitted by: Priti Lohani 04244354 1 Table of contents: 159.735... 1 Final Report... 1 Cluster Scheduling... 1 Table of contents:... 2 1. Introduction:... 3 1.1

More information

Cosmological simulations on High Performance Computers

Cosmological simulations on High Performance Computers Cosmological simulations on High Performance Computers Cosmic Web Morphology and Topology Cosmological workshop meeting Warsaw, 12-17 July 2011 Maciej Cytowski Interdisciplinary Centre for Mathematical

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

Methodology for predicting the energy consumption of SPMD application on virtualized environments *

Methodology for predicting the energy consumption of SPMD application on virtualized environments * Methodology for predicting the energy consumption of SPMD application on virtualized environments * Javier Balladini, Ronal Muresano +, Remo Suppi +, Dolores Rexachs + and Emilio Luque + * Computer Engineering

More information

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing /35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of

More information

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,

More information

A Pattern-Based Approach to. Automated Application Performance Analysis

A Pattern-Based Approach to. Automated Application Performance Analysis A Pattern-Based Approach to Automated Application Performance Analysis Nikhil Bhatia, Shirley Moore, Felix Wolf, and Jack Dongarra Innovative Computing Laboratory University of Tennessee (bhatia, shirley,

More information

Data Centric Systems (DCS)

Data Centric Systems (DCS) Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems

More information

Parallel Scalable Algorithms- Performance Parameters

Parallel Scalable Algorithms- Performance Parameters www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

Why the Network Matters

Why the Network Matters Week 2, Lecture 2 Copyright 2009 by W. Feng. Based on material from Matthew Sottile. So Far Overview of Multicore Systems Why Memory Matters Memory Architectures Emerging Chip Multiprocessors (CMP) Increasing

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

High Performance Computing in CST STUDIO SUITE

High Performance Computing in CST STUDIO SUITE High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver

More information

Understanding the Benefits of IBM SPSS Statistics Server

Understanding the Benefits of IBM SPSS Statistics Server IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster

More information

Fast Two-Point Correlations of Extremely Large Data Sets

Fast Two-Point Correlations of Extremely Large Data Sets Fast Two-Point Correlations of Extremely Large Data Sets Joshua Dolence 1 and Robert J. Brunner 1,2 1 Department of Astronomy, University of Illinois at Urbana-Champaign, 1002 W Green St, Urbana, IL 61801

More information

MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp

MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

Cray: Enabling Real-Time Discovery in Big Data

Cray: Enabling Real-Time Discovery in Big Data Cray: Enabling Real-Time Discovery in Big Data Discovery is the process of gaining valuable insights into the world around us by recognizing previously unknown relationships between occurrences, objects

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

LS DYNA Performance Benchmarks and Profiling. January 2009

LS DYNA Performance Benchmarks and Profiling. January 2009 LS DYNA Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center The

More information

On-Demand Supercomputing Multiplies the Possibilities

On-Demand Supercomputing Multiplies the Possibilities Microsoft Windows Compute Cluster Server 2003 Partner Solution Brief Image courtesy of Wolfram Research, Inc. On-Demand Supercomputing Multiplies the Possibilities Microsoft Windows Compute Cluster Server

More information

Parallel Large-Scale Visualization

Parallel Large-Scale Visualization Parallel Large-Scale Visualization Aaron Birkland Cornell Center for Advanced Computing Data Analysis on Ranger January 2012 Parallel Visualization Why? Performance Processing may be too slow on one CPU

More information

Department of Computer Sciences University of Salzburg. HPC In The Cloud? Seminar aus Informatik SS 2011/2012. July 16, 2012

Department of Computer Sciences University of Salzburg. HPC In The Cloud? Seminar aus Informatik SS 2011/2012. July 16, 2012 Department of Computer Sciences University of Salzburg HPC In The Cloud? Seminar aus Informatik SS 2011/2012 July 16, 2012 Michael Kleber, mkleber@cosy.sbg.ac.at Contents 1 Introduction...................................

More information

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality

More information

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009 Performance Study Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009 Introduction With more and more mission critical networking intensive workloads being virtualized

More information

IS-ENES/PrACE Meeting EC-EARTH 3. A High-resolution Configuration

IS-ENES/PrACE Meeting EC-EARTH 3. A High-resolution Configuration IS-ENES/PrACE Meeting EC-EARTH 3 A High-resolution Configuration Motivation Generate a high-resolution configuration of EC-EARTH to Prepare studies of high-resolution ESM in climate mode Prove and improve

More information

A Flexible Cluster Infrastructure for Systems Research and Software Development

A Flexible Cluster Infrastructure for Systems Research and Software Development Award Number: CNS-551555 Title: CRI: Acquisition of an InfiniBand Cluster with SMP Nodes Institution: Florida State University PIs: Xin Yuan, Robert van Engelen, Kartik Gopalan A Flexible Cluster Infrastructure

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION. Marc-Olivier Briat, Jean-Luc Monnot, Edith M.

SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION. Marc-Olivier Briat, Jean-Luc Monnot, Edith M. SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION Abstract Marc-Olivier Briat, Jean-Luc Monnot, Edith M. Punt Esri, Redlands, California, USA mbriat@esri.com, jmonnot@esri.com,

More information

FLOW-3D Performance Benchmark and Profiling. September 2012

FLOW-3D Performance Benchmark and Profiling. September 2012 FLOW-3D Performance Benchmark and Profiling September 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: FLOW-3D, Dell, Intel, Mellanox Compute

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

PARALLEL PROGRAMMING

PARALLEL PROGRAMMING PARALLEL PROGRAMMING TECHNIQUES AND APPLICATIONS USING NETWORKED WORKSTATIONS AND PARALLEL COMPUTERS 2nd Edition BARRY WILKINSON University of North Carolina at Charlotte Western Carolina University MICHAEL

More information

Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012

Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012 Scientific Application Performance on HPC, Private and Public Cloud Resources: A Case Study Using Climate, Cardiac Model Codes and the NPB Benchmark Suite Peter Strazdins (Research School of Computer Science),

More information

FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG

FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG) Lehrstuhl für Informatik 10 (Systemsimulation) Massively Parallel Multilevel Finite

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

Scalability evaluation of barrier algorithms for OpenMP

Scalability evaluation of barrier algorithms for OpenMP Scalability evaluation of barrier algorithms for OpenMP Ramachandra Nanjegowda, Oscar Hernandez, Barbara Chapman and Haoqiang H. Jin High Performance Computing and Tools Group (HPCTools) Computer Science

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*

More information

Performance of the JMA NWP models on the PC cluster TSUBAME.

Performance of the JMA NWP models on the PC cluster TSUBAME. Performance of the JMA NWP models on the PC cluster TSUBAME. K.Takenouchi 1), S.Yokoi 1), T.Hara 1) *, T.Aoki 2), C.Muroi 1), K.Aranami 1), K.Iwamura 1), Y.Aikawa 1) 1) Japan Meteorological Agency (JMA)

More information

Control 2004, University of Bath, UK, September 2004

Control 2004, University of Bath, UK, September 2004 Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of

More information

BSC vision on Big Data and extreme scale computing

BSC vision on Big Data and extreme scale computing BSC vision on Big Data and extreme scale computing Jesus Labarta, Eduard Ayguade,, Fabrizio Gagliardi, Rosa M. Badia, Toni Cortes, Jordi Torres, Adrian Cristal, Osman Unsal, David Carrera, Yolanda Becerra,

More information

Modeling Parallel Applications for Scalability Analysis: An approach to predict the communication pattern

Modeling Parallel Applications for Scalability Analysis: An approach to predict the communication pattern Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'15 191 Modeling Parallel Applications for calability Analysis: An approach to predict the communication pattern Javier Panadero 1, Alvaro Wong 1,

More information

A Comparison of General Approaches to Multiprocessor Scheduling

A Comparison of General Approaches to Multiprocessor Scheduling A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University

More information

Contributions to Gang Scheduling

Contributions to Gang Scheduling CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,

More information

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage

More information

ParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008

ParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008 ParFUM: A Parallel Framework for Unstructured Meshes Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008 What is ParFUM? A framework for writing parallel finite element

More information

Introduction History Design Blue Gene/Q Job Scheduler Filesystem Power usage Performance Summary Sequoia is a petascale Blue Gene/Q supercomputer Being constructed by IBM for the National Nuclear Security

More information

Parallelism and Cloud Computing

Parallelism and Cloud Computing Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication

More information

Algorithms of Scientific Computing II

Algorithms of Scientific Computing II Technische Universität München WS 2010/2011 Institut für Informatik Prof. Dr. Hans-Joachim Bungartz Alexander Heinecke, M.Sc., M.Sc.w.H. Algorithms of Scientific Computing II Exercise 4 - Hardware-aware

More information