Partitioning and Dynamic Load Balancing for Petascale Applications

Similar documents

Tutorial: Partitioning, Load Balancing and the Zoltan Toolkit

Hypergraph-based Dynamic Load Balancing for Adaptive Scientific Computations

Lecture 12: Partitioning and Load Balancing

New Challenges in Dynamic Load Balancing

A Comparison of Load Balancing Algorithms for AMR in Uintah

Software Design for Scientific Applications

Partitioning and Load Balancing for Emerging Parallel Applications and Architectures

Dynamic Load Balancing for Cluster Computing Jaswinder Pal Singh, Technische Universität München. singhj@in.tum.de

Mesh Generation and Load Balancing

Mesh Partitioning and Load Balancing

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Load Balancing Strategies for Parallel SAMR Algorithms

A Refinement-tree Based Partitioning Method for Dynamic Load Balancing with Adaptively Refined Grids

Fast Multipole Method for particle interactions: an open source parallel library component

Sparse Matrix Decomposition with Optimal Load Balancing

FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG

A FAST AND HIGH QUALITY MULTILEVEL SCHEME FOR PARTITIONING IRREGULAR GRAPHS

HPC Deployment of OpenFOAM in an Industrial Setting

Guide to Partitioning Unstructured Meshes for Parallel Computing

HPC enabling of OpenFOAM R for CFD applications

ParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008

Load Balancing and Communication Optimization for Parallel Adaptive Finite Element Methods

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

A New Unstructured Variable-Resolution Finite Element Ice Sheet Stress-Velocity Solver within the MPAS/Trilinos FELIX Dycore of PISCEES

Mesh Partitioning for Parallel Computational Fluid Dynamics Applications on a Grid

walberla: Towards an Adaptive, Dynamically Load-Balanced, Massively Parallel Lattice Boltzmann Fluid Simulation

Parallel Simplification of Large Meshes on PC Clusters

Load-Balancing Spatially Located Computations using Rectangular Partitions

CSE 4351/5351 Notes 7: Task Scheduling & Load Balancing

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations

Parallel Computing. Benson Muite. benson.

Power-Aware High-Performance Scientific Computing

Characterizing the Performance of Dynamic Distribution and Load-Balancing Techniques for Adaptive Grid Hierarchies

Efficient Storage, Compression and Transmission

Highly Scalable Dynamic Load Balancing in the Atmospheric Modeling System COSMO-SPECS+FD4

Scientific Computing Programming with Parallel Objects

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Yousef Saad University of Minnesota Computer Science and Engineering. CRM Montreal - April 30, 2008

A scalable multilevel algorithm for graph clustering and community structure detection

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

OPTIMIZED AND INTEGRATED TECHNIQUE FOR NETWORK LOAD BALANCING WITH NOSQL SYSTEMS

High Performance Computing in CST STUDIO SUITE

Solvers, Algorithms and Libraries (SAL)

Lecture 2 Parallel Programming Platforms

Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC

A Case Study - Scaling Legacy Code on Next Generation Platforms

Calculation of Eigenmodes in Superconducting Cavities

Load balancing in a heterogeneous computer system by self-organizing Kohonen network

FD4: A Framework for Highly Scalable Dynamic Load Balancing and Model Coupling

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE

Introduction to Cloud Computing

Optimizing Load Balance Using Parallel Migratable Objects

High-fidelity electromagnetic modeling of large multi-scale naval structures

An Improved Spectral Load Balancing Method*

Load Balancing Techniques

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

Weighted Decomposition in High-Performance Lattice-Boltzmann Simulations: Are Some Lattice Sites More Equal than Others?

Fast Setup and Integration of ABAQUS on HPC Linux Cluster and the Study of Its Scalability

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

A first evaluation of dynamic configuration of load-balancers for AMR simulations of flows

A Review of Customized Dynamic Load Balancing for a Network of Workstations

Petascale, Adaptive CFD (ALCF ESP Technical Report)

Load Balance Strategies for DEVS Approximated Parallel and Distributed Discrete-Event Simulations

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

Calculation of Eigenfields for the European XFEL Cavities

The enhancement of the operating speed of the algorithm of adaptive compression of binary bitmap images

So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell

Fast Optimal Load Balancing Algorithms for 1D Partitioning

Partitioning and Divide and Conquer Strategies

A Load Balancing Tool for Structured Multi-Block Grid CFD Applications

High Performance Computing. Course Notes HPC Fundamentals

Transcription:

Partitioning and Dynamic Load Balancing for Petascale Applications Karen Devine, Sandia National Laboratories Erik Boman, Sandia National Laboratories Umit Çatalyürek, Ohio State University Lee Ann Riesen, Sandia National Laboratories Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Partitioning and Load Balancing Assignment of application data to processors for parallel computation. Applied to grid points, elements, matrix rows, particles,.

Static Partitioning Initialize Application Partition Data Distribute Data Compute Solutions Output & End Static partitioning in an application: Data partition is computed. Data are distributed according to partition map. Application computes. Ideal partition: Processor idle time is minimized. Inter-processor communication costs are kept low.

Dynamic Repartitioning (a.k.a. Dynamic Load Balancing) Initialize Application Partition Data Redistribute Data Compute Solutions & Adapt Output & End Dynamic repartitioning (load balancing) in an application: Data partition is computed. Data are distributed according to partition map. Application computes and, perhaps, adapts. Process repeats until the application is done. Ideal partition: Processor idle time is minimized. Inter-processor communication costs are kept low. Cost to redistribute data is also kept low.

What makes a partition good, especially at petascale? Balanced work loads. Even small imbalances result in many wasted processors! 50,000 processors with one processor 5% over average workload is equivalent to 2380 idle processors and 47,620 perfectly balanced processors. Low interprocessor communication costs. Processor speeds increasing faster than network speeds. Partitions with minimal communication costs are critical. Scalable partitioning time and memory use. Scalability is especially important for dynamic partitioning. Low data redistribution costs (for dynamic partitioning). Redistribution costs must be recouped through reduced total execution time.

Partitioning Algorithms in the Zoltan Toolkit Geometric (coordinate-based) methods Recursive Coordinate Bisection (Berger, Bokhari) Recursive Inertial Bisection (Taylor, Nour-Omid) Space Filling Curve Partitioning (Warren&Salmon, et al.) Refinement-tree Partitioning (Mitchell) Hypergraph and graph (connectivity-based) methods Hypergraph Partitioning Hypergraph Repartitioning PaToH (Catalyurek & Aykanat) Zoltan Graph Partitioning ParMETIS (U. Minnesota) Jostle (U. Greenwich)

Geometric Partitioning Recursive Coordinate Bisection: Developed by Berger & Bokhari (987) for Adaptive Mesh Refinement. Idea: Divide work into two equal parts using a cutting plane orthogonal to a coordinate axis. Recursively cut the resulting subdomains. 2nd 3rd st cut 3rd 2nd 3rd 3rd

Geometric Repartitioning Implicitly achieves low data redistribution costs. For small changes in data, cuts move only slightly, resulting in little data redistribution.

Applications of Geometric Methods Adaptive Mesh Refinement Particle Simulations Parallel Volume Rendering Crash Simulations and Contact Detection

RCB Advantages and Disadvantages Advantages: Conceptually simple; fast and inexpensive. All processors can inexpensively know entire partition (e.g., for global search in contact detection). No connectivity info needed (e.g., particle methods). Good on specialized geometries. SLAC S 55-cell Linear Accelerator with couplers: One-dimensional RCB partition reduced runtime up to 68% on 52 processor IBM SP3. (Wolf, Ko) Disadvantages: No explicit control of communication costs. Mediocre partition quality. Can generate disconnected subdomains for complex geometries. Need coordinate information.

Graph Partitioning Kernighan, Lin, Schweikert, Fiduccia, Mattheyes, Simon, Hendrickson, Leland, Kumar, Karypis, et al. Represent problem as a weighted graph. Vertices = objects to be partitioned. Edges = dependencies between two objects. Weights = work load or amount of dependency. Partition graph so that Parts have equal vertex weight. Weight of edges cut by part boundaries is small.

Graph Repartitioning Diffusive strategies (Cybenko, Hu, Blake, Walshaw, Schloegel, et al.) Shift work from highly loaded processors to less loaded neighbors. Local communication keeps data redistribution costs low. 30 30 0 0 20 20 20 20 0 0 20 0 0 Multilevel partitioners that account for data redistribution costs in refining partitions (Schloegel, Karypis) Parameter weights application communication vs. redistribution communication. Coarsen graph Partition coarse graph Refine partition accounting for current part assignment

Applications using Graph Partitioning Multiphysics and multiphase simulations Finite Element Analysis = A Linear solvers & preconditioners (square, structurally symmetric systems) x b

Graph Partitioning: Advantages and Disadvantages Advantages: Highly successful model for mesh-based PDE problems. Explicit control of communication volume gives higher partition quality than geometric methods. Excellent software available. Serial: Parallel: Disadvantages: Chaco (SNL) Jostle (U. Greenwich) METIS (U. Minn.) Party (U. Paderborn) Scotch (U. Bordeaux) Zoltan (SNL) ParMETIS (U. Minn.) PJostle (U. Greenwich) More expensive than geometric methods. Edge-cut model only approximates communication volume.

Hypergraph Partitioning Alpert, Kahng, Hauck, Borriello, Çatalyürek, Aykanat, Karypis, et al. Hypergraph model: Vertices = objects to be partitioned. Hyperedges = dependencies between two or more objects. Partitioning goal: Assign equal vertex weight while minimizing hyperedge cut weight. A A Graph Partitioning Model Hypergraph Partitioning Model

Hypergraph Repartitioning Augment hypergraph with data redistribution costs. Account for data s current processor assignments. Weight dependencies by their size and frequency of use. Hypergraph partitioning then attempts to minimize total communication volume: Data redistribution volume + Application communication volume Total communication volume Lower total communication volume than geometric and graph repartitioning. Best Algorithms Paper Award at IPDPS07 Hypergraph-based Dynamic Load Balancing for Adaptive Scientific Computations Catalyurek, Boman, Devine, Bozdag, Heaphy, & Riesen

Hypergraph Applications Finite Element Analysis Linear programming for sensor placement Multiphysics and multiphase simulations 2 2 2 2 Rg02 R 2C02 C INDUCTOR Vs SOURCE_VOLTAGE Rs R 2 2Cm02 C 2 Rg0 R 2C0 C L2 L INDUCTOR Circuit Simulations 2 2 R2 R R R 2 C2 C 2 2 2 2C C Rg2 R Rl R Rg R 2Cm2 C A Linear solvers & preconditioners (no restrictions on matrix structure) x = b Data Mining

Hypergraph Partitioning: Advantages and Disadvantages Advantages: Communication volume reduced 30-38% on average over graph partitioning (Catalyurek & Aykanat). 5-5% reduction for mesh-based applications. More accurate communication model than graph partitioning. Better representation of highly connected and/or non-homogeneous systems. Greater applicability than graph model. Can represent rectangular systems and non-symmetric dependencies. Disadvantages: More expensive than graph partitioning.

Performance Results Experiments on Sandia s Thunderbird cluster. Dual 3.6 GHz Intel EM64T processors with 6 GB RAM. Infiniband network. Compare RCB, graph and hypergraph methods. Measure Amount of communication induced by the partition. Partitioning time.

Test Data Xyce 680K ASIC Stripped Circuit Simulation 680K x 680K 2.3M nonzeros SLAC *LCLS Radio Frequency Gun 6.0M x 6.0M 23.4M nonzeros SLAC Linear Accelerator 2.9M x 2.9M.4M nonzeros Cage5 DNA Electrophoresis 5.M x 5.M 99M nonzeros

Communication Volume: SLAC 6.0M LCLS Lower is Better Number of parts = number of processors. Xyce 680K circuit RCB Graph Hypergraph SLAC 2.9M Linear Accelerator Cage5 5.M electrophoresis

Partitioning Time: SLAC 6.0M LCLS Lower is better 024 parts. Varying number of processors. Xyce 680K circuit RCB Graph Hypergraph SLAC 2.9M Linear Accelerator Cage5 5.M electrophoresis

Repartitioning Experiments Experiments with 64 parts on 64 processors. Dynamically adjust weights in data to simulate, say, adaptive mesh refinement. Repartition. Measure repartitioning time and total communication volume: Data redistribution volume + Application communication volume Total communication volume

SLAC 6.0M LCLS Repartitioning Results: Lower is Better Xyce 680K circuit Data Redistribution Volume Application Communication Volume Repartitioning Time (secs)

Aiming for Petascale Reducing communication costs for applications. Reducing communication volume. Two-dimensional sparse matrix partitioning (Catalyurek, Aykanat, Bisseling). Partitioning non-zeros of matrix rather than rows/columns. Reducing message latency. Minimize maximum number of neighboring parts. Balancing both computation and communication (Pinar & Hendrickson); balance criterion is complex function of the partition instead of simple sum of object weights. Reducing communication overhead. Map parts onto processors to take advantage of network topology. Minimize distance messages travel in network.

Aiming for Petascale Hierarchical partitioning in Zoltan v3. Partition for multicore/manycore architectures. Partition hierarchically with respect to chips and then cores. Similar to strategies for clusters of SMPs (Teresco, Faik). Treat core-level partitions as separate threads or MPI processes. Support 00Ks processors. Reduce collective communication operations during partitioning. Allow more localized partitioning on subsets of processors. Core Processor... Entire System... Core Core Processor... Core Subsystem Proc Entire System... Subsystem......... Proc Proc... Proc

Aiming for Petascale Improving scalability of partitioning algorithms. Hybrid partitioners (particularly for mesh-based apps.): Use inexpensive geometric methods for initial partitioning; refine with hypergraph/graph-based algorithms at boundaries. Use geometric information for fast coarsening in multilevel hypergraph/graph-based partitioners. Coarsen data using geometric info Partition Refine partition coarse data Refactored partitioners for bigger data sets and processor arrays.

Aiming for Petascale Testing and performance evaluation. Examine effectiveness of partitions in applications. Wanted: Collaborations with application developers!

For More Information Zoltan web page: http://www.cs.sandia.gov/zoltan Download Zoltan v3 (open-source software). Read User s Guide. ITAPS Interface to Zoltan: https://svn.scorec.rpi.edu/wsvn/tstt/distributions/ SciDAC Tutorial, June 29, :30-4:30pm, MIT Load-balancing and partitioning for scientific computing using Zoltan Erik Boman, Karen Devine

S. Attaway (SNL) C. Aykanat (Bilkent U.) A. Bauer (RPI) R. Bisseling (Utrecht U.) D. Bozdag (Ohio St. U.) T. Davis (U. Florida) J. Faik (RPI) J. Flaherty (RPI) R. Heaphy (SNL) B. Hendrickson (SNL) M. Heroux (SNL) D. Keyes (Columbia) K. Ko (SLAC) Thanks SciDAC CSCAPES Institute (A. Pothen, Old Dominion U., PI) SciDAC ITAPS Center (L. Diachin, LLNL, PI) NNSA ASC Program G. Kumfert (LLNL) L.-Q. Lee (SLAC) V. Leung (SNL) G. Lonsdale (NEC) X. Luo (RPI) L. Musson (SNL) S. Plimpton (SNL) J. Shadid (SNL) M. Shephard (RPI) C. Silvio (SNL) J. Teresco (Mount Holyoke) C. Vaughan (SNL) M. Wolf (U. Illinois) http://www.cs.sandia.gov/zoltan

Communication Volume: SLAC 6.0M LCLS Lower is Better 024 parts. Varying number of processors. Xyce 680K circuit RCB Graph Hypergraph SLAC 2.9M Linear Accelerator Cage5 5.M electrophoresis

Partitioning Time: SLAC 6.0M LCLS Lower is better Number of parts = number of processors. Xyce 680K circuit RCB Graph Hypergraph SLAC 2.9M Linear Accelerator Cage5 5.M electrophoresis