Load Balancing Strategies for Parallel SAMR Algorithms

Proposal for a Summer Undergraduate Research Fellowship 2005
Computer Science / Applied and Computational Mathematics

Randolf Rotta
Institut für Informatik, Technische Universität Cottbus
Postfach 10 13 44, 03013 Cottbus, Germany
E-Mail: rrotta@informatik.tu-cottbus.de

Abstract

Highly resolved solutions of partial differential equations are important in many areas of science and technology. Only adaptive mesh refinement methods reduce the necessary work sufficiently to allow the calculation of realistic problems. Blockstructured SAMR methods are well suited to the time-explicit computation of large-scale dynamical problems, but still require parallelization on distributed systems. Unfortunately, assigning work in a load-balanced manner is even more difficult than for unstructured AMR applications. Here, common partitioning aims for parallel SAMR algorithms are summarized, and strategies for mapping the load balancing problem to simpler unstructured meshes are outlined. The research proposed for participation in SURF 2005 includes the systematization, implementation, and quantitative comparison of known modelling and load balancing strategies.

1 Introduction

A large number of physical processes, e.g. in fluid dynamics [3], magnetohydrodynamics, thermodynamics, astrophysics [16], or structural dynamics, are modeled by partial differential equations (PDEs). The continuous physical quantities are discretized on a finite number of grid points (finite difference methods) or on volumes holding densities obtained by integration (finite volume methods). Computer programs simulate the evolution in time by numerically approximating the solutions of these equations. Realistic simulations of interesting physical phenomena (shock fronts, material or phase interfaces, turbulence) require a very fine resolution.

With uniform discretizations alone, such simulations would often be impossible because of the high memory consumption and computational cost they imply. Fortunately, many problems require such high resolution only in small subregions where the interesting phenomena occur. Dynamic adaptive mesh refinement methods can therefore reduce the computational cost significantly. Commonly used methods are the refinement of individual cells in unstructured or structured meshes and the generation of blockstructured rectangular cell clusters. The latter approach efficiently exploits the memory caches and vector capabilities of modern superscalar processors.

Figure 1: Hierarchy of refinement levels.

2 SAMR on Distributed Memory Systems

The blockstructured adaptive mesh refinement method (SAMR) developed by Berger and Colella [1, 2] for hydrodynamics uses a hierarchy of refinement levels with the same refinement factors in space and time. A coarse base grid is refined patch-wise as shown in Fig. 1. Cells marked for refinement are not simply replaced by smaller cells, but are clustered into rectangular patches and stored in the next refinement level. This implies some duplicated work, but allows the use of simpler and more efficient recursive algorithms, because local refinement does not alter the structure of coarser levels. Another advantage of this approach is the ability to perform calculations independently on all patches using existing PDE solvers.

The patch-wise calculations are numerically coupled by the exchange of boundary ghost cells between neighboring patches on the same refinement level (intra-level synchronization) and by averaging, interpolation, and special flux corrections between adjacent levels (inter-level synchronization). The global update algorithm advances a level by updating the boundary and ghost cells, calculating one time step, recursively advancing the next refinement level several times in smaller time steps, and then synchronizing with the finer level. Thus most of the computing time is spent on the finer levels, where more and shorter time steps have to be calculated.

As computations can be done independently on different patches, parallelization on distributed systems is possible by simply distributing all patches over the available processors. This implies communication for those intra-level and inter-level synchronizations that cross processor boundaries, and waiting time due to load imbalances. Efficient parallelization strategies therefore require sophisticated, dynamic load balancing methods which additionally minimize the communication costs.
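The recursive structure of the global update algorithm described above can be sketched in a few lines of Python. This is only an illustrative skeleton, not the AMROC implementation: the function names, the event log, and the fixed refinement factor shared by all levels are assumptions made for the example, and the actual numerical work is replaced by log entries.

```python
def advance_level(level, num_levels, refine_factor, dt, log):
    """Advance one refinement level by a single time step of size dt.

    Hypothetical skeleton: numerical work is replaced by entries in `log`.
    """
    log.append(("ghost_update", level))  # intra-level boundary exchange
    log.append(("step", level, dt))      # one time step on all patches of the level
    if level + 1 < num_levels:
        # The finer level needs refine_factor smaller steps per coarse step.
        for _ in range(refine_factor):
            advance_level(level + 1, num_levels, refine_factor,
                          dt / refine_factor, log)
        # Inter-level synchronization: averaging and flux correction.
        log.append(("sync", level, level + 1))


def steps_per_level(num_levels, refine_factor):
    """Count the time steps each level takes during one base-grid step."""
    log = []
    advance_level(0, num_levels, refine_factor, 1.0, log)
    counts = [0] * num_levels
    for entry in log:
        if entry[0] == "step":
            counts[entry[1]] += 1
    return counts


if __name__ == "__main__":
    print(steps_per_level(3, 2))  # [1, 2, 4]
```

With three levels and a refinement factor of 2, the finest level takes four steps for every base-grid step, which illustrates why most of the computing time is spent on the finer levels.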
The existing software packages for SAMR applications differ in the methods used for storing and accessing patches, in how they communicate the synchronization data, and especially in their load balancing strategies. Some packages supporting distributed memory parallel computers are AMROC [4, 5], Paramesh [6, 7], SAMRAI [8, 9], CHOMBO [11, 12], GrACE [13, 14], and ENZO [15, 16, 17].

AMROC is the generic, dynamically adaptive fluid solver framework used by the ASC Alliance Center for the Simulation of Dynamic Response of Materials at the California Institute of Technology in constructing the Virtual Test Facility (VTF). The VTF is an integrated software environment for computing the dynamic response of solid materials to impinging high-energy gaseous shock and combustion waves. Combustion simulations in particular are usually very computationally intensive and therefore benefit considerably from the dynamic mesh adaptation and parallelization AMROC provides.

2.1 Load Balancing in Distributed SAMR Applications

Load balancing assigns single elements, which usually represent some kind of computational task, to the processing units of a distributed system. The set of elements assigned to a processor is called its partition. In the case of SAMR, the partitions should be optimized with regard to a subset of the following load balancing aims:

1. Balance work load on every level. SAMR algorithms are multi-level by construction, so the work of each level has to be balanced separately. Additionally, the calculations are very time-consuming, which amplifies the effect of even small imbalances.

2. Maintain data locality. Adjacent patches that are not in the same partition require communication in the synchronization phase. The two types of synchronization differ in the amount of data to be transferred. Therefore this aim may be formulated in two ways:
   (a) Minimize inter-level communication (volumetric data).
   (b) Minimize intra-level communication (surface or boundary data).

3. Minimize number of patches. Most load balancing schemes split patches for various reasons, but the management algorithms (e.g. computing cuts between sets of patches) have a time complexity of O(n^2) or O(n log n) in the number of patches [10].

4. Maximize size of patches. Cache and vectorization optimizations can be performed only on reasonably big patches.

5. Minimize data movements while re-balancing (volumetric data).

6. Provide fast repartitioning and scalability. The refinement hierarchy and the work load change during runtime, and thus the distribution has to be re-balanced dynamically.
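To make the first aim concrete, the following sketch distributes the patches of a single refinement level with a simple greedy heuristic (largest patch first, always onto the least loaded processor) and measures the resulting imbalance as the maximum processor load divided by the average load. The heuristic, the function names, and the patch work estimates are assumptions chosen for illustration. The sketch ignores locality (aims 2 and 5) and does not reproduce the method of any of the packages cited above.

```python
import heapq


def greedy_partition(patch_work, num_procs):
    """Assign patches, largest first, to the currently least loaded processor.

    patch_work: dict mapping a patch id to its estimated work (hypothetical units).
    Returns a dict mapping each patch id to a processor index.
    """
    heap = [(0.0, proc) for proc in range(num_procs)]  # (load, processor)
    heapq.heapify(heap)
    owner = {}
    for patch, work in sorted(patch_work.items(), key=lambda kv: -kv[1]):
        load, proc = heapq.heappop(heap)
        owner[patch] = proc
        heapq.heappush(heap, (load + work, proc))
    return owner


def imbalance(patch_work, owner, num_procs):
    """Maximum processor load divided by the average load (1.0 is perfect)."""
    loads = [0.0] * num_procs
    for patch, proc in owner.items():
        loads[proc] += patch_work[patch]
    return max(loads) / (sum(loads) / num_procs)


if __name__ == "__main__":
    level_work = {"a": 8.0, "b": 7.0, "c": 6.0, "d": 5.0, "e": 4.0}
    owner = greedy_partition(level_work, 2)
    print(owner, imbalance(level_work, owner, 2))
```

In this example the two processors end up with loads of 17 and 13, so one of them waits for about 13% of the step even though the patches were distributed greedily. Reducing the imbalance further would require splitting a patch, which conflicts with aims 3 and 4.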

Figure 2: Examples of partitions generated by space filling curves combined with adaptive rigorous domain decomposition (left) and multi-level graph partitioning on a uniform decomposition (right).

2.2 Proposed Research

Software packages for SAMR methods on distributed memory systems are relatively new, and there is still no consensus on appropriate load balancing strategies or on the priorities of the partitioning aims. While the first aim is considered to be the most important, the relevance of the communication-related partitioning aims is undetermined, mainly because of their inherent hardware dependence. My investigations in last year's SURF suggest that the communication time for intra-level synchronization is clearly below the waiting time due to load imbalances. In some cases even the third and fourth aims had a higher influence on the overall runtime of the simulation.

Today a multitude of methods exists for partitioning unstructured meshes, utilizing geometric position information or neighborhood relationships. Several software libraries implementing these methods are freely available to the scientific community (e.g. JOSTLE [18], ZOLTAN [19], and METIS [20, 21]). All of these load balancing algorithms can directly distribute only single elements, not clusters of nodes, and only very few specialized techniques are available for SAMR applications; these are either sophisticated (Nature+Fable [22]) or apparently too simplistic (e.g. the knapsack routine used in Chombo [12]). Therefore the load balancing problem needs to be modelled by mapping the patch hierarchy to a set of elements, each combining several patches, refinement levels, or subregions. These elements can then be used as input to the algorithms mentioned before.

It is often overlooked that the strategy used for this mapping may be more important to the overall result than the specific load balancing method used. The SAMR software packages take very different approaches to this problem, ranging from patch-centered methods to (adaptive) rigorous domain decompositions, and these approaches appear to be determined mostly by the limitations of the underlying parallel data structures. No detailed fundamental comparison seems to exist, although the importance of this aspect is also underlined by observations made in experiments with graph-diffusion-based load balancers in my work for SURF 2004.

Restricting communication to intra-level synchronization avoids volumetric data transfers and simplifies the implementation. But this requires strict locality over all refinement levels, which does not allow values at the same physical position to reside on different processors. Such a rigorous domain decomposition might be generated by overlaying the whole domain with a grid of uniform boxes and combining the parts of all patches covered by a box. However, it is difficult to determine the right granularity, and small boxes leave too much freedom to the load balancer, as visible in Fig. 2. This strategy also splits patches arbitrarily at box boundaries and easily produces many small patches. Adaptive octrees can be used to adjust the granularity automatically and to reduce the number of arbitrary splits. But achieving a good work balance requires the distribution of small entities. Unfortunately, providing such small entities while avoiding inefficient splits is nearly impossible with rigorous domain decompositions.

If some inter-level communication is accepted, it is possible to distribute groups of whole patches instead. These groups might again be created with an adaptive octree, but using the center of mass to assign patches to a box. The size and shape of a patch can then also be considered when splitting becomes necessary. Of course, special care has to be taken to limit the inter-level communication.
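The granularity problem of the uniform-box overlay described above can be illustrated with a small two-dimensional sketch. It assumes that all patches are given in base-grid cell coordinates as half-open rectangles with a work weight per covered base cell; this representation and the function name are hypothetical simplifications for the example and are not taken from any of the cited packages.

```python
from collections import defaultdict


def overlay_boxes(patches, box_size):
    """Split patches at uniform box boundaries and sum the work per box.

    patches: list of (x0, y0, x1, y1, weight) with half-open cell ranges in
    base-grid coordinates (non-negative integers); weight is the assumed
    work per covered base cell, e.g. higher for finer levels.
    Returns (dict of work per box, number of patch fragments created).
    """
    work = defaultdict(float)
    fragments = 0
    for x0, y0, x1, y1, weight in patches:
        for bx in range(x0 // box_size, (x1 - 1) // box_size + 1):
            for by in range(y0 // box_size, (y1 - 1) // box_size + 1):
                # Intersection of the patch with box (bx, by).
                ix0 = max(x0, bx * box_size)
                ix1 = min(x1, (bx + 1) * box_size)
                iy0 = max(y0, by * box_size)
                iy1 = min(y1, (by + 1) * box_size)
                work[(bx, by)] += weight * (ix1 - ix0) * (iy1 - iy0)
                fragments += 1
    return dict(work), fragments


if __name__ == "__main__":
    # One 8x8 base patch and one refined patch covering base cells [2, 6) x [2, 6).
    patches = [(0, 0, 8, 8, 1.0), (2, 2, 6, 6, 4.0)]
    for size in (8, 4, 2):
        work, fragments = overlay_boxes(patches, size)
        print(size, len(work), fragments)
```

For these two patches, boxes of edge length 8, 4, and 2 yield 1, 4, and 16 distributable boxes but also 2, 8, and 20 patch fragments. The total work stays the same, so finer boxes give the load balancer more freedom only at the price of many small patches.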
Considering the open questions and missing comparisons, a detailed overview of known strategies for modelling the load balancing problem should be created. The integration of several of the most promising methods into AMROC, combined with existing load balancing libraries for unstructured meshes, should provide a basis for experimental measurements and quantitative comparisons.

References

[1] M. Berger and P. Colella. Local adaptive mesh refinement for shock hydrodynamics. J. Comput. Phys., 82:64-84, 1988.
[2] J. Bell, M. Berger, J. Saltzman, and M. Welcome. Three-dimensional adaptive mesh refinement for hyperbolic conservation laws. SIAM J. Sci. Comp., 15(1):127-138, 1994.
[3] R. Deiterding. Parallel adaptive simulation of multi-dimensional detonation structures. PhD thesis, Techn. Univ. Cottbus, Sep 2003.
[4] R. Deiterding. AMROC - Blockstructured Adaptive Mesh Refinement in Object-oriented C++. Available at http://amroc.sourceforge.net, Oct 2003.
[5] R. Deiterding. Construction and application of an AMR algorithm for distributed memory computers. Proc. of Chicago Workshop on Adaptive Mesh Refinement Methods, Sep 2003.
[6] P. MacNeice. Paramesh homepage, 2004. http://paramesh.sourceforge.net/
[7] P. MacNeice et al. PARAMESH: A parallel adaptive mesh refinement community toolkit. Computer Physics Communications, 126:330-354, 2000.
[8] SAMRAI homepage, 2004. http://www.llnl.gov/casc/samrai/
[9] A. Wissink, R. Hornung, S. Kohn, S. Smith, and N. Elliott. Large scale parallel structured AMR calculations using the SAMRAI framework. Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, 2001.
[10] A. Wissink, D. Hysom, and R. Hornung. Enhancing scalability of parallel structured AMR calculations. ICS '03: Proceedings of the 17th Annual International Conference on Supercomputing, 336-347, 2003.
[11] CHOMBO homepage, 2004. http://seesar.lbl.gov/anag/chombo/
[12] C. A. Rendleman, V. E. Beckner, M. Lijewski, W. Y. Crutchfield, and J. B. Bell. Parallelization of Structured, Hierarchical Adaptive Mesh Refinement Algorithm. Computing and Visualization in Science, 3:137-147, 2000.
[13] M. Parashar. GrACE homepage, 2004. http://www.caip.rutgers.edu/tassl/projects/grace/
[14] M. Parashar and J. C. Browne. Systems Engineering for High Performance Computing Software: The HDDA/DAGH Infrastructure for Implementation of Parallel Structured Adaptive Mesh Refinement. IMA Volume 117: Structured Adaptive Mesh Refinement (SAMR) Grid Methods, Editors: S. B. Baden, N. P. Chrisochoides, D. B. Gannon, and M. L. Norman, Springer-Verlag, 1-18, 2001.
[15] ENZO homepage, 2004. http://cosmos.ucsd.edu/enzo/
[16] Greg L. Bryan, Tom Abel, and Michael L. Norman. Achieving Extreme Resolution in Numerical Cosmology Using Adaptive Mesh Refinement: Resolving Primordial Star Formation. Proceedings of SC2001, Denver, CO, 2001.
[17] Z. Lan, V. Taylor, and G. Bryan. Dynamic Load Balancing for Structured Adaptive Mesh Refinement Applications. Proc. of 30th International Conference on Parallel Processing 2001, Valencia, Spain, 2001.
[18] JOSTLE - graph partitioning software, homepage, 2004. http://staffweb.cms.gre.ac.uk/~c.walshaw/jostle/
[19] K. Devine, E. Boman, R. Heaphy, B. Hendrickson, J. Teresco, J. Faik, J. Flaherty, and L. Gervasio. New challenges in dynamic load balancing. Appl. Numer. Maths., 52(2-3):133-152, 2005.
[20] George Karypis and Vipin Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, 35, 1996.
[21] Kirk Schloegel, George Karypis, and Vipin Kumar. Parallel Multilevel Algorithms for Multiconstraint Graph Partitioning. Euro-Par '00: Proceedings of the 6th International Euro-Par Conference on Parallel Processing, 296-310, 2000.
[22] Johan Steensland. Efficient Partitioning of Dynamic Structured Grid Hierarchies. PhD thesis, Uppsala University, 2002.