Load Balancing Strategies for Parallel SAMR Algorithms




Proposal for a Summer Undergraduate Research Fellowship 2005
Computer Science / Applied and Computational Mathematics

Load Balancing Strategies for Parallel SAMR Algorithms

Randolf Rotta
Institut für Informatik, Technische Universität Cottbus
Postfach 10 13 44, 03013 Cottbus, Germany
E-Mail: rrotta@informatik.tu-cottbus.de

Abstract

Highly resolved solutions of partial differential equations are important in many areas of science and technology. Only adaptive mesh refinement methods reduce the necessary work sufficiently to allow the calculation of realistic problems. Blockstructured SAMR methods are well suited for the time-explicit computation of large-scale dynamical problems, but still require parallelization on distributed systems. Unfortunately, assigning activities in a load-balanced manner is even more difficult than for unstructured AMR applications. Here, common partitioning aims for parallel SAMR algorithms are summarized and strategies for mapping the load balancing problem to simpler unstructured meshes are outlined. The research proposed for participation in SURF 2005 includes the systematization, implementation and quantitative comparison of known modelling and load balancing strategies.

1 Introduction

A large number of physical processes, e.g. in fluid dynamics [3], magnetohydrodynamics, thermodynamics, astrophysics [16], or structural dynamics, are modeled by partial differential equations (PDEs). The continuous physical values are discretized with a finite number of differentiation points (finite difference methods) or volumes containing densities obtained by integration (finite volume methods). Computer programs simulate the evolution in time by numerically approximating the solutions of these equations. Realistic simulations of interesting physical phenomena (shock fronts, material or phase interfaces, turbulence) require a very fine resolution.
With uniform discretizations alone, such simulations would often be impossible because of the implied high memory consumption and computational costs. Fortunately, many problems require such high resolutions only in small subregions where the interesting phenomena occur. Therefore, dynamic adaptive mesh refinement methods can reduce the huge computational costs significantly. Refinement of cells in unstructured or structured meshes and the generation of blockstructured rectangular cell clusters are commonly used methods. The latter approach efficiently exploits the memory caches and vectorization capabilities of modern superscalar processors.

Figure 1: Hierarchy of refinement levels.

2 SAMR on Distributed Memory Systems

The blockstructured adaptive mesh refinement method (SAMR) developed by Berger and Colella [1, 2] for hydrodynamics uses a hierarchy of refinement levels with the same refinement factors in space and time. A coarse base grid is refined patch-wise as shown in Fig. 1. Cells marked for refinement are not just replaced by smaller cells, but are clustered into rectangular patches and stored in the next refinement level. This implies some duplicated work, but allows the use of simpler and more efficient recursive algorithms, because local refinement does not alter the structure of coarser levels. Another advantage of this approach is the ability to perform calculations independently on all patches using existing PDE solvers.

The patch-wise calculations are numerically coupled by the exchange of boundary ghost cells between neighboring patches on the same refinement level (intra-level synchronization) and by averaging, interpolation and special flux corrections between adjacent levels (inter-level synchronization). The global update algorithm advances a level by updating the boundary and ghost cells, calculating one time step, recursively advancing the next refinement level several times in smaller time steps, and then synchronizing with the finer level. Thus most of the computing time is spent on the finer levels, where a larger number of shorter time steps have to be calculated.

As computations can be done independently on different patches, parallelization on distributed systems is possible by simply distributing all patches over the available processors. This implies communication for the intra-level and inter-level synchronizations performed across processor boundaries, and waiting time due to load imbalances. Therefore, efficient parallelization strategies require sophisticated dynamic load balancing methods that additionally minimize the communication costs.
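
The recursive structure of the global update algorithm can be illustrated with a small counting sketch. It is not taken from AMROC or any other package; the function names and the single uniform `refinement_factor` are assumptions made for illustration, and the actual numerical work (ghost cell exchange, time step, flux correction) is reduced to comments. The sketch only counts how many time steps each level performs per base-level step.

```python
from collections import Counter


def advance_level(level, num_levels, refinement_factor, steps):
    """Advance one refinement level by a single time step of that level."""
    # 1. Update boundary and ghost cells (intra-level synchronization) and
    #    calculate one time step; here the step is only counted.
    steps[level] += 1
    if level + 1 < num_levels:
        # 2. Recursively advance the next finer level. With equal refinement
        #    factors in space and time it needs `refinement_factor` sub-steps.
        for _ in range(refinement_factor):
            advance_level(level + 1, num_levels, refinement_factor, steps)
        # 3. Synchronize with the finer level (inter-level synchronization);
        #    omitted in this counting sketch.


def count_steps(num_levels, refinement_factor):
    """Return the number of time steps per level for one base-level step."""
    steps = Counter()
    advance_level(0, num_levels, refinement_factor, steps)
    return [steps[level] for level in range(num_levels)]


print(count_steps(3, 2))  # three levels, refinement factor 2
```

With three levels and a refinement factor of 2, the levels perform 1, 2 and 4 steps respectively, which shows why the finest levels dominate the computing time even before their larger cell counts are taken into account.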
The existing software packages for SAMR applications differ in the methods used for storing and accessing patches, communicating the synchronization data and
especially in their load balancing strategies. Some packages supporting distributed memory parallel computers are: AMROC [4, 5], Paramesh [6, 7], SAMRAI [8, 9], CHOMBO [11, 12], GrACE [13, 14], and ENZO [15, 16, 17].

AMROC is the generic, dynamically adaptive fluid solver framework used by the ASC Alliance Center for the Simulation of Dynamic Response of Materials at the California Institute of Technology in constructing the Virtual Test Facility (VTF). The VTF is an integrated software environment for computing the dynamic response of solid materials to impinging, highly energetic gaseous shock and combustion waves. Combustion simulations in particular are usually very computation-intensive and therefore benefit considerably from the dynamic mesh adaptation and parallelization AMROC provides.

2.1 Load Balancing in Distributed SAMR Applications

Load balancing generates an assignment of single elements, which usually represent some kind of computational task, to the processing units of a distributed system. The set of elements assigned to a processor is called its partition. In the case of SAMR, the partitions should be optimized with regard to a subset of the following load balancing aims:

1. Balance the work load on every level. SAMR algorithms have a multi-level character and by construction require separate level-based work tasks to be balanced. In addition, the calculations are very time-consuming, which amplifies the effect of even small imbalances.
2. Maintain data locality. Adjacent patches that are not in the same partition require communication in the synchronization phase. The two types of synchronization differ in the amount of data to be transferred. Therefore this aim may be split into two:
   (a) Minimize inter-level communication (volumetric data).
   (b) Minimize intra-level communication (surface or boundary data).
3. Minimize the number of patches. Most load balancing schemes split patches for various reasons, but the management algorithms (e.g. computing cuts between sets of patches) have a time complexity of O(n^2) or O(n log n) in the number of patches [10].
4. Maximize the size of patches. Cache and vectorization optimizations can be performed only on reasonably big patches.
5. Minimize data movement while re-balancing (volumetric data).
6. Provide fast repartitioning and scalability. The refinement hierarchy and work load change during runtime, and thus the distribution has to be re-balanced dynamically.
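
The first aim in the list above can be made concrete with a small sketch that measures the work balance separately on every level. The data layout (a patch as a `(level, cells)` tuple, an `owner` list mapping patches to processors) and the imbalance measure (maximum load divided by mean load) are assumptions chosen for illustration, not definitions taken from any of the packages mentioned.

```python
from collections import defaultdict


def level_imbalance(patches, owner, num_procs):
    """Return {level: max_load / mean_load} for a patch-to-processor mapping.

    patches: list of (level, cells) tuples (hypothetical layout).
    owner:   owner[i] is the processor that patch i is assigned to.
    A value of 1.0 means the level is perfectly balanced.
    """
    load = defaultdict(lambda: [0] * num_procs)
    for (level, cells), proc in zip(patches, owner):
        load[level][proc] += cells
    return {
        level: max(per_proc) * num_procs / sum(per_proc)
        for level, per_proc in load.items()
    }


# Two processors, two levels: each processor owns 400 cells in total.
patches = [(0, 300), (0, 100), (1, 100), (1, 300)]
owner = [0, 1, 0, 1]
imbalance = level_imbalance(patches, owner, num_procs=2)
print(imbalance)
```

In this example both processors hold the same total number of cells, yet each level has an imbalance of 1.5. Because the levels are advanced one after another, a processor with little work on a level has to wait there, which is why balance is required per level and not only in total.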

Figure 2: Examples of partitions generated by space filling curves combined with adaptive rigorous domain decomposition (left) and multi-level graph partitioning on a uniform decomposition (right).

2.2 Proposed Research

Software packages for SAMR methods on distributed memory systems are relatively new, and there is still no consensus on appropriate load balancing strategies and the priorities of the partitioning aims. While the first aim is considered to be the most important, the relevance of the communication-related partitioning aims is undetermined, mainly because of their inherent hardware dependence. My investigations in last year's SURF suggest that the communication time for intra-level synchronizations is clearly below the waiting time due to load imbalances. In some cases even the third and fourth aims had a higher influence on the overall runtime of the simulation.

Nowadays a multitude of methods for partitioning unstructured meshes exists, utilizing geometric position information or neighborhood relationships. Several software libraries implementing these methods are freely available to the scientific community (e.g. JOSTLE [18], ZOLTAN [19], and METIS [20, 21]). All of these load balancing algorithms can directly distribute only single elements, not node clusters, and only very few special techniques are available for SAMR applications, which are either sophisticated (Nature+Fable [22]) or apparently too simplistic (e.g. the knapsack routine used in Chombo [12]). Therefore the load balancing problem needs to be modelled by mapping the patch hierarchy to a set of elements that combine several patches, refinement levels or subregions. These can be used as input to the algorithms mentioned before.

It is often overlooked that the strategy used for this mapping may be more important to the overall result than the specific load balancing method used. All SAMR software packages present very different approaches to this problem, ranging from patch-centered methods to (adaptive) rigorous domain decompositions, and these approaches appear to be determined largely by the limitations of the underlying parallel data structures. No detailed fundamental comparison seems to exist, although the importance of this aspect is also underlined by observations made in experiments with graph-diffusion based load balancers in my work for SURF 2004.

Restricting communication to intra-level synchronization avoids volumetric data transfers and simplifies the implementation. But this requires strict locality over all refinement levels, which does not allow values at the same physical position to reside on different processors. Such a rigorous domain decomposition might be generated by overlaying the whole area with a grid of uniform boxes and combining the parts of all patches covered by a box. But it is difficult to determine the right granularity, and small boxes leave too much freedom to the load balancer, as visible in Fig. 2. This strategy also splits patches arbitrarily at box boundaries, easily producing many small patches. Adaptive octrees can be used to automatically adjust the granularity and reduce the number of arbitrary splits. But achieving a good work balance requires the distribution of small entities. Unfortunately, providing such small entities and avoiding inefficient splittings at the same time is nearly impossible using rigorous domain decompositions.

If some inter-level communication is accepted, it is possible to distribute groups of whole patches instead. These groups might again be created with an adaptive octree, but using the center of mass for assigning patches to a box. Now the size and shape of a patch can also be considered when splitting becomes necessary. Of course, special care has to be taken to limit the inter-level communication.
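
The rigorous domain decomposition with uniform boxes described above can be sketched in two dimensions as follows. Everything in this sketch is an assumption made for illustration: patches are given as `(level, x0, y0, x1, y1)` in base-grid cell coordinates with half-open bounds, the work of a patch part is estimated as its number of fine cells times its number of time sub-steps, and the boxes are distributed with a generic greedy heuristic (heaviest box to the least loaded processor). The greedy step is a stand-in for a real load balancer; it is not the knapsack routine of Chombo and it ignores locality entirely.

```python
def box_workloads(patches, box_size, refinement_factor=2):
    """Sum the work of all patch parts covered by each uniform box.

    patches: list of (level, x0, y0, x1, y1), base-grid cells, half-open.
    Returns {(bx, by): work} for every box touched by at least one patch.
    """
    work = {}
    for level, x0, y0, x1, y1 in patches:
        # Assumed work model: fine cells per base cell (2D) times the
        # number of sub-steps per base time step on this level.
        weight = refinement_factor ** (2 * level) * refinement_factor ** level
        for bx in range(x0 // box_size, (x1 - 1) // box_size + 1):
            for by in range(y0 // box_size, (y1 - 1) // box_size + 1):
                # Extent of the patch part that lies inside box (bx, by).
                ox = min(x1, (bx + 1) * box_size) - max(x0, bx * box_size)
                oy = min(y1, (by + 1) * box_size) - max(y0, by * box_size)
                work[(bx, by)] = work.get((bx, by), 0) + ox * oy * weight
    return work


def greedy_assign(work, num_procs):
    """Assign boxes, heaviest first, to the currently least loaded processor."""
    loads = [0] * num_procs
    owner = {}
    for box, w in sorted(work.items(), key=lambda kv: (-kv[1], kv[0])):
        proc = loads.index(min(loads))
        owner[box] = proc
        loads[proc] += w
    return owner, loads


# An 8x8 base grid with one level-1 patch refining its lower-left quarter.
patches = [(0, 0, 0, 8, 8), (1, 0, 0, 4, 4)]
_, coarse_loads = greedy_assign(box_workloads(patches, box_size=4), num_procs=2)
_, fine_loads = greedy_assign(box_workloads(patches, box_size=2), num_procs=2)
print(coarse_loads, fine_loads)
```

With boxes of size 4, the refined quarter falls into a single box that cannot be divided, so the two processors end up with loads of 144 and 48. With boxes of size 2 the loads become 96 and 96, but the refined patch is cut into four parts. This is the granularity trade-off discussed above, reproduced on a toy example.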
Considering the open questions and missing comparisons, a detailed overview of known strategies for modelling the load balancing problem should be created. The integration of several of the most promising methods into AMROC, combined with existing load balancing libraries for unstructured meshes, should provide a base for experimental measurements and quantitative comparisons.

References

[1] M. Berger and P. Colella. Local adaptive mesh refinement for shock hydrodynamics. J. Comput. Phys., 82:64-84, 1988.
[2] J. Bell, M. Berger, J. Saltzman, and M. Welcome. Three-dimensional adaptive mesh refinement for hyperbolic conservation laws. SIAM J. Sci. Comp., 15(1):127-138, 1994.
[3] R. Deiterding. Parallel adaptive simulation of multi-dimensional detonation structures. PhD thesis, Techn. Univ. Cottbus, Sep 2003.
[4] R. Deiterding. AMROC - Blockstructured Adaptive Mesh Refinement in Object-oriented C++. Available at http://amroc.sourceforge.net, Oct 2003.
[5] R. Deiterding. Construction and application of an AMR algorithm for distributed memory computers. Proc. of Chicago Workshop on Adaptive Mesh Refinement Methods, Sep 2003.
[6] P. MacNeice. Paramesh homepage, 2004. http://paramesh.sourceforge.net/
[7] P. MacNeice et al. PARAMESH: A parallel adaptive mesh refinement community toolkit. Computer Physics Communications, 126:330-354, 2000.
[8] SAMRAI homepage, 2004. http://www.llnl.gov/casc/samrai/
[9] A. Wissink, R. Hornung, S. Kohn, S. Smith, and N. Elliott. Large scale parallel structured AMR calculations using the SAMRAI framework. Proceedings of the 2001 ACM/IEEE conference on Supercomputing, 2001.
[10] A. Wissink, D. Hysom, and R. Hornung. Enhancing scalability of parallel structured AMR calculations. ICS '03: Proceedings of the 17th annual international conference on Supercomputing, 336-347, 2003.
[11] CHOMBO homepage, 2004. http://seesar.lbl.gov/anag/chombo/
[12] C. A. Rendleman, V. E. Beckner, M. Lijewski, W. Y. Crutchfield, and J. B. Bell. Parallelization of Structured, Hierarchical Adaptive Mesh Refinement Algorithm. Computing and Visualization in Science, 3:137-147, 2000.
[13] M. Parashar. GrACE homepage, 2004. http://www.caip.rutgers.edu/tassl/projects/grace/
[14] M. Parashar and J. C. Browne. Systems Engineering for High Performance Computing Software: The HDDA/DAGH Infrastructure for Implementation of Parallel Structured Adaptive Mesh Refinement. IMA Volume 117: Structured Adaptive Mesh Refinement (SAMR) Grid Methods, Editors: S. B. Baden, N. P. Chrisochoides, D. B. Gannon, and M. L. Norman, Springer-Verlag, 1-18, 2001.
[15] ENZO homepage, 2004. http://cosmos.ucsd.edu/enzo/
[16] Greg L. Bryan, Tom Abel, and Michael L. Norman. Achieving Extreme Resolution in Numerical Cosmology Using Adaptive Mesh Refinement: Resolving Primordial Star Formation. Proceedings, SC2001, Denver, CO, 2001.
[17] Z. Lan, V. Taylor, and G. Bryan. Dynamic Load Balancing for Structured Adaptive Mesh Refinement Applications. Proc. of 30th International Conference on Parallel Processing 2001, Valencia, Spain, 2001.
[18] JOSTLE - graph partitioning software, Homepage, 2004. http://staffweb.cms.gre.ac.uk/~c.walshaw/jostle/
[19] K. Devine, E. Boman, R. Heaphy, B. Hendrickson, J. Teresco, J. Faik, J. Flaherty, and L. Gervasio. New challenges in dynamic load balancing. Appl. Numer. Maths., 52(2-3):133-152, 2005.
[20] George Karypis and Vipin Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. Supercomputing '96: Proceedings of the 1996 ACM/IEEE conference on Supercomputing, 35, 1996.
[21] Kirk Schloegel, George Karypis, and Vipin Kumar. Parallel Multilevel Algorithms for Multiconstraint Graph Partitioning. Euro-Par '00: Proceedings from the 6th International Euro-Par Conference on Parallel Processing, 296-310, 2000.
[22] Johan Steensland. Efficient Partitioning of Dynamic Structured Grid Hierarchies. PhD thesis, Uppsala University, 2002.