Dynamic Load Balancing for Cluster Computing Jaswinder Pal Singh, Technische Universität München. singhj@in.tum.de


Abstract: In parallel simulations, partitioning and load-balancing algorithms compute the distribution of application data and work to processors. The effectiveness of this distribution greatly influences the performance of a parallel simulation. Decompositions that balance processor loads while keeping the application's communication costs low are preferred. Although a wide variety of partitioning and load-balancing algorithms have been developed, the load-balancing problem is not yet completely solved, since the effectiveness of an algorithm depends on the characteristics of the application using it. New applications and architectures require new partitioning features. New models are needed for non-square, non-symmetric, and highly connected systems arising from applications in biology, circuits, and materials simulations. Increased use of heterogeneous computing architectures requires partitioners that account for non-uniform computing, network, and memory resources. This paper introduces the topic, proposes an algorithm to parallelize adaptive grid generation on a cluster, and closes with a brief look at the future prospects of DLB.

Keywords: Distributed Systems, Dynamic Load Balancing, Adaptive Grids, Refinement Tree Bisection

1. Introduction

Modern scientific computations are becoming increasingly large, irregular, and computationally intensive, demanding more computing power than a conventional sequential computer can provide. The computing power of a single processor is bounded by the fastest processor available at any given time, but it can be increased dramatically by integrating a set of processors together. The latter are known as Parallel Systems (Fig 1). An in-depth discussion of existing systems is presented in [1], [2], [3].
A cluster, the architecture relevant to our discussion, is a collection of heterogeneous workstations connected by a dedicated high-performance network; it may present a Single System Image (SSI) spanning its nodes. The Message Passing programming model, in which each processor (or process) is assumed to have its own private data space and data must be explicitly moved between spaces by sending messages, is inherent to clusters. The principal challenge of parallel programming is to decompose the program into subcomponents that can be run in parallel; this can be achieved by exploiting either functional or data parallelism. In Functional Parallelism, the problem is decomposed into a large number of smaller tasks, which are assigned to the processors as they become available. Processors that finish quickly are simply assigned more work. It is typically implemented in a Client-Server paradigm: the tasks are allocated to a group of slave processes by a master process that may also perform some of the tasks.

Fig 1. Flynn's classification: SISD (uniprocessor), SIMD (vector and array processors), MISD, and MIMD (shared-memory SMP and NUMA; distributed-memory clusters).

Data parallelism, also known as Domain Decomposition, represents the most common strategy for scientific programs. The application is decomposed by subdividing the data space over which it operates and assigning different processors to the work associated with different data subspaces. This leads to data sharing at the boundaries, and the programmer is responsible for ensuring that these data are correctly synchronized. Domain decomposition methods are useful in two contexts. First, the division of a problem into smaller problems through usually artificial subdivisions of the domain is a means of introducing parallelism into the problem. Second, many problems involve more than one mathematical model, each posed on a different domain, so that domain decomposition occurs naturally; fluid-structure interactions are an example of the latter.
Associated with MIMD architectures, data parallelism originated the single program, multiple data (SPMD) programming model [4]: the same program is executed on different processors, over distinct data sets.

Load Balancing: Motivation

Consider, for example, an application that after domain decomposition can be mapped onto the processors of a parallel architecture. In our case the underlying hardware system is a cluster, and we run into a problem because a static resource problem is mapped to a system with dynamic resources, resulting in a potentially unbalanced execution. Things get even more complicated if we run an application with dynamic run-time behavior on a cluster, i.e. a mapping of a dynamic resource problem onto a dynamic resource machine. System changes such as variation in the availability of individual processor power, variation in the number of

processors, or dynamic changes in the run-time behavior of the application lead to load imbalance among processing elements. These factors contribute to inefficient use of resources and an increase in total execution time, which is a major concern in parallel programming. Load balancing/sharing is a policy that takes advantage of the communication facility between the nodes of a cluster, exchanging status information and jobs between nodes, in order to find the appropriate granularity of tasks and partition them so that each node is assigned load in proportion to its performance. It aims at improving the performance of the system and decreasing the total execution time. A load balancing algorithm can be either static or dynamic. Static load balancing uses only information about the average system behavior at the initialization phase, i.e. it is done at compile time, while dynamic load balancing uses runtime state information to manage task allocation. The paper is outlined as follows. The steps involved in a dynamic load balancing algorithm and its classification criteria are presented in Section 2. Section 3 introduces some popular domain decomposition techniques. Parallelization of adaptive grid generation on a cluster is discussed in Section 4 as an application. Section 5 mentions current challenges and future prospects in domain decomposition methods.

2. Dynamic Load Balancing

Dynamic load balancing is carried out through task migration, the transfer of tasks from overloaded nodes to underloaded nodes. To decide when and how to perform task migration, information about the current workload must be exchanged among nodes. The communication and the computation required to make the balancing decision consume processing power and may result in worse overall performance if the algorithm is not efficient enough.
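As a concrete illustration, one common local scheme has an overloaded node push load toward its less-loaded neighbors until it reaches the neighborhood average (local averaging, revisited below). The following is a minimal sketch; the load values, neighbor lists, and threshold are hypothetical and would come from the runtime system in practice.

```python
def local_average_step(loads, neighbors, node, threshold=None):
    """One local-averaging step: the (possibly overloaded) `node` sends
    load packets to its less-loaded neighbors until its own load drops
    to the neighborhood average (or a given threshold)."""
    target = threshold if threshold is not None else (
        sum(loads[n] for n in neighbors[node] + [node]) / (len(neighbors[node]) + 1))
    transfers = []  # (src, dst, amount) records describing the migrations
    for dst in sorted(neighbors[node], key=lambda n: loads[n]):
        if loads[node] <= target:
            break
        # never push a neighbor above the target level
        amount = min(loads[node] - target, max(target - loads[dst], 0))
        if amount > 0:
            loads[node] -= amount
            loads[dst] += amount
            transfers.append((node, dst, amount))
    return transfers
```

The returned transfer records would drive the actual task migration; the load values themselves only model the balancing decision.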
In the context of SPMD applications, a task migration between two nodes corresponds to transferring all the data associated with this task and necessary to its execution, hence the need for an appropriate DLB algorithm. To understand a DLB algorithm completely, its four main components (initiation, location, information exchange, and load movement) have to be understood.

Initiation: The initiation strategy specifies the mechanism which invokes the load balancing activities. This may be periodic or event-driven initiation. The latter is load dependent, based upon monitoring of the local load, and thus more responsive to load imbalances. It can be either sender- or receiver-initiated. In sender-initiated policies, congested servers attempt to transfer work to lightly loaded ones; the opposite takes place in receiver-initiated policies.

Load-balancer location: Specifies the location at which the algorithm itself is executed. In centralized algorithms, a single processor computes the necessary reassignments and informs the involved processors. A distributed algorithm runs locally within each processor. The former may lead to a bottleneck, while the latter requires load information to be propagated to all the processors, leading to higher communication costs.

Information exchange: Specifies the information and load flow through the system, based upon whether the information used by the algorithm for decision-making is local or global, and upon the communication policy. All processors take part in global schemes, whereas local schemes use only information on the processor or gathered from the surrounding neighborhood. Local schemes involve lower communication costs, but global information exchange strategies tend to give more accurate decisions. The communication policy determines the neighborhood of each processor: it specifies the connection topology, which does not have to represent the actual physical topology.
A uniform topology indicates a fixed set of neighbors to communicate with, while in a randomized topology each processor randomly chooses another processor to exchange information with. The communication policy also specifies the task/load exchange between different processors. In global strategies, task/load transfers may take place between any two processors, while local strategies define groups of processors and allow transfers only between two processors within the same group.

Load movement: Specifies the appropriate load items to be moved or exchanged, as there is a trade-off between the benefits of moving work to balance load and the cost of data movement. Local averaging is one of the common techniques: an overloaded processor sends load packets to its neighbors until its own load drops to a specific threshold or to the average load.

3. Domain Decomposition or Partitioning Algorithms

3.1 The partitioning problem

At its simplest, a partitioning algorithm attempts to assign equal numbers of objects to partitions while minimizing communication costs between partitions. A partition's subdomain, then, consists of the data uniquely assigned to the partition; the union of subdomains is equal to the entire problem domain (Fig 2). Objects may have weights proportional to their computational costs. These nonuniform costs may

result from, e.g., variances in computation time due to different physics being solved on different objects, more degrees of freedom per element in adaptive p-refinement [5], or more small time steps taken on smaller elements to enforce timestep constraints in local mesh-refinement methods [6]. Similarly, nonuniform communication costs may be modeled by assigning weights to connections between objects. Partitioning then has the goal of assigning equal total object weight to each subdomain while minimizing the weighted communication cost.

3.2 Dynamic Repartitioning and Load Balancing Problem

Workloads in dynamic computations evolve in time; for example, in finite element methods with adaptive mesh refinement, process workloads can vary dramatically as elements are added and/or removed from the mesh. Dynamic repartitioning of mesh data, often called dynamic load balancing, becomes necessary. It is also needed to maintain geometric locality in applications like crash simulations, where high parallel efficiency is obtained when subdomains are constructed of geometrically close elements [7].

Fig. 2. An example of a two-dimensional mesh (left) and a decomposition of the mesh into four subdomains (right).

In our case dynamic load balancing has the same goals as partitioning, but with the additional constraints that procedures (i) must operate in parallel on already distributed data, (ii) must execute quickly, as dynamic load balancing may be performed frequently, and (iii) should be incremental (i.e., small changes in workloads produce only small changes in the decomposition), as the cost of redistributing mesh data is often the most significant part of a dynamic load-balancing step. While a more expensive procedure may produce a higher-quality result, it is sometimes better to use a faster procedure to obtain a lower-quality decomposition if the workloads are likely to change again after a short time.
3.3 Partition Quality Assessment

The most obvious measure of partition quality is computational load balance, but load balance alone does not ensure efficient parallel computation. Communication costs must also be minimized, which corresponds to minimizing the cost of sharing data across subdomain boundaries. For mesh-based applications, this cost is often approximated by the number of element faces on boundaries between two or more subdomains. To estimate the cost of interprocess communication, the following metrics have proved to give better results. A subdomain's surface index is the percentage of all element faces within a subdomain that lie on the subdomain boundary. The maximum local surface index is the largest surface index over all subdomains and approximates the maximum communication needed by any one subdomain, while the global surface index measures the percentage of all element faces that are on subdomain boundaries [8] and approximates the total communication volume. Minimizing only the edge cut or global surface index statistics is not enough [8], for the following reasons. First, the number of faces shared by subdomains is not necessarily equal to the communication volume between the subdomains [8]; an element could easily share two or more faces, but the element's data would be communicated only once to the neighbor. Second, interprocess connectivity, i.e. the number of processes with which each process must exchange information during the solution phase, is a significant factor due to its dependence upon interconnection network latency [8]. Third, communication should be balanced, not necessarily minimized [9]. A balanced communication load often corresponds to a small maximum local surface index. Internal connectivity of the subdomains has also proved to be a measure of partition quality.
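The surface-index metrics above can be computed directly from the face list of a partitioned mesh. A minimal sketch follows; it considers only interior (shared) faces, and the data layout (element pairs and a part map) is a hypothetical simplification.

```python
def surface_indices(interior_faces, part):
    """Global and maximum local surface index of a partition.
    interior_faces: list of (elem_a, elem_b) element pairs sharing a face;
    part: dict mapping element id -> subdomain id."""
    faces_of = {}   # subdomain -> number of faces of its elements
    cut_of = {}     # subdomain -> number of its faces on inter-subdomain boundaries
    cut = 0
    for a, b in interior_faces:
        faces_of[part[a]] = faces_of.get(part[a], 0) + 1
        faces_of[part[b]] = faces_of.get(part[b], 0) + 1
        if part[a] != part[b]:
            cut += 1
            cut_of[part[a]] = cut_of.get(part[a], 0) + 1
            cut_of[part[b]] = cut_of.get(part[b], 0) + 1
    global_si = cut / len(interior_faces)       # fraction of faces on boundaries
    max_local_si = max(cut_of.get(p, 0) / faces_of[p] for p in faces_of)
    return global_si, max_local_si
```

For a real mesh, exterior (domain-boundary) faces would be excluded from the boundary counts exactly as here, but included in the per-subdomain face totals.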
Having multiple disjoint connected components within a subdomain (also known as subdomain splitting) can be undesirable, as the solution of the linear systems will converge slowly for partitions with this property [11]. Additionally, if a relatively small disjoint part of one subdomain can be merged into a neighboring subdomain, the boundary size will decrease, thereby improving the surface indices. Subdomain aspect ratio is the ratio of the square of the radius of the smallest circle that contains the entire subdomain to the subdomain's area [10]. It has also been reported as an important factor in partition quality [11], particularly when iterative methods such as Conjugate Gradient (CG) or Multigrid are used to solve the linear systems. The authors of [11] show that the number of iterations needed for a

preconditioned CG procedure grows with the subdomain aspect ratio. Furthermore, large aspect ratios are likely to lead to larger boundary sizes. Geometric locality of elements is an important indicator of partition effectiveness for some applications. While mesh connectivity provides a reasonable approximation to geometric locality in some simulations, it does not represent geometric locality in all simulations. (In a simulation of an automobile crash, for example, the windshield and bumper are far apart in the mesh, but can be quite close together geometrically.) Quality metrics based on connectivity are not appropriate for these types of simulations.

3.4 Partitioning and Dynamic Load Balancing Taxonomy

A variety of partitioning and dynamic load balancing procedures have been developed. Since no single procedure is ideal in all situations, many of these alternatives are commonly used. This section describes many of the approaches, grouping them into geometric methods, global graph-based methods, and local graph-based methods. Geometric methods examine only the coordinates of the objects to be partitioned. Graph-based methods use the topological connections among the objects. Most geometric or graph-based methods operate as global partitioners or repartitioners. Local graph-based methods, however, operate among neighborhoods of processes in an existing decomposition to improve load balance.

Geometric Methods

Geometric methods use only objects' spatial coordinates and computational weights to compute a decomposition that balances the total weight of objects assigned to each partition. They are effective for applications in which objects interact only if they are geometrically close to each other. Examples of such methods are:
1. Recursive Bisection: Recursive bisection methods divide the simulation's objects into two equally weighted sets; the algorithm is then applied recursively to obtain the desired number of partitions. In Recursive Coordinate Bisection (RCB) [12], the two sets are computed by cutting the problem geometry with a plane orthogonal to a coordinate axis. The plane's direction is selected to be orthogonal to the longest direction of the geometry; its position is computed so that half of the object weight is on each side of the plane. RCB is incremental and suitable for DLB. Like RCB, Recursive Inertial Bisection (RIB) [13] uses cutting planes to bisect the geometry; however, the direction of the plane is computed to be orthogonal to the principal axis of inertia. It is not incremental and may not be suitable for dynamic load balancing.

2. Space-Filling Curves: A space-filling curve (SFC) maps n-dimensional space to one dimension [11]. In SFC partitioning, an object's coordinates are converted to an SFC key representing the object's position along an SFC through the physical domain. Sorting the keys gives a linear ordering of the objects. This ordering is cut into appropriately weighted pieces that are assigned to processors.

Global Graph-Based Partitioning

A popular and powerful class of partitioning procedures makes use of connectivity information rather than spatial coordinates. These methods use the fact that the partitioning problem in Section 3.1 can be viewed as the partitioning of an induced graph G = (V, E), where objects serve as the graph vertices (V) and connections between objects serve as the graph edges (E). For example, Figure 3 shows an induced graph for the mesh in Figure 2; here, elements are the objects to be partitioned and thus serve as vertices in the graph, while shared element faces define graph edges.

Fig 3. Two-dimensional mesh and its induced graph (left); its four-way partitioning (right).

A k-way partition of the graph G is obtained by dividing the vertices into subsets V_1, ..., V_k, where V = V_1 ∪ ... ∪ V_k and V_i ∩ V_j = ∅ for i ≠ j. Figure 3 (right) shows one possible decomposition of the graph induced by the mesh. Vertices and edges may have weights associated with them, representing computation and communication costs, respectively. The goal of graph partitioning, then, is to create subsets V_i with equal vertex weights while minimizing the total weight of the edges cut by subset boundaries. An edge e_ij between vertices v_i and v_j is cut when v_i belongs to one subset and v_j belongs to a different one. Algorithms that provide an optimal partitioning are NP-complete [14], so heuristic algorithms are generally used.
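The two objectives of graph partitioning, balanced vertex weight and small weighted edge cut, can be evaluated directly. A minimal sketch follows, assuming a simple dictionary/list graph representation; the vertex weights, edge weights, and partition map are illustrative.

```python
def partition_metrics(vertex_weight, edges, part, k):
    """Edge cut and load imbalance of a k-way partition.
    vertex_weight: vertex -> computation weight; edges: (u, v, comm_weight)
    triples; part: vertex -> subset index in 0..k-1."""
    load = [0.0] * k
    for v, w in vertex_weight.items():
        load[part[v]] += w
    # weighted edge cut: edges whose endpoints lie in different subsets
    cut = sum(w for u, v, w in edges if part[u] != part[v])
    imbalance = max(load) * k / sum(load)   # 1.0 means perfectly balanced
    return cut, imbalance
```

A heuristic partitioner would search for a `part` map that keeps `imbalance` near 1.0 while minimizing `cut`.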

Greedy algorithms, spectral partitioning, and multilevel partitioning are some of the static partitioners, intended for use as a preprocessing step rather than as a dynamic load balancing procedure. Some of the multilevel procedures do operate in parallel and can be used for dynamic load balancing.

Local Graph-Based Methods

In an adaptive computation, dynamic load balancing may be required frequently. Applying global partitioning strategies after each adaptive step can be costly relative to solution time. Thus, a number of dynamic load balancing techniques that are intended to be fast and to incrementally migrate data from heavily to lightly loaded processes have been developed. These are often referred to as local methods. Unlike global partitioning methods, local methods work with only a limited view of the application workloads. They consider workloads within small, overlapping sets of processors to improve balance within each set. Heavily loaded processors within a set transfer objects to less heavily loaded processors in the same set. Sets can be defined by the parallel architecture's processor connectivity or by the connectivity of the application data [16]. Sets overlap, allowing objects to move between sets through several iterations of the local method. Thus, when only small changes in application workloads occur through, say, adaptive refinement, a few iterations of a local method can correct imbalances while keeping the amount of data migrated low. For dramatic changes in application workloads, however, many iterations of a local method are needed to correct load imbalances; in such cases, invocation of a global partitioning method may result in a better, more cost-effective decomposition. Local methods typically consist of two steps: (i) computing a map of how much work (nodal weight) must be shifted from heavily loaded to lightly loaded processors, and (ii) selecting objects (nodes) that should be moved to satisfy that map.
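The map-computation step (i) is commonly based on iterative diffusion. A minimal sketch of one first-order diffusion step follows; the per-process loads, neighbor lists, and diffusion coefficient are hypothetical.

```python
def diffusion_step(load, neighbors, alpha=0.25):
    """One step of first-order diffusive load balancing: each process
    moves a fraction `alpha` of every pairwise load difference toward
    each neighbor.  Total load is conserved; the iteration converges
    to a uniform distribution for sufficiently small alpha."""
    new = load[:]
    for p, nbrs in enumerate(neighbors):
        for q in nbrs:
            new[p] += alpha * (load[q] - load[p])
    return new
```

The per-step changes `new[p] - load[p]` define the work-flow map; step (ii) then picks concrete objects whose weights realize that flow.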
Many different strategies can be used for each step. Most strategies for computing a map of the amount of data to be shifted among processes are based on the diffusive algorithm of Cybenko [15]. Using processor connectivity or application communication patterns to describe a computational mesh, an equation representing the work flow is solved using a first-order finite-difference scheme. Since the stencil of the scheme is compact (using information only from neighboring processes), the method is local. Hu and Blake [16] take a more global view of load distributions, computing a diffusion solution while minimizing work flow over the edges of a graph of the processes.

4. Parallelizing an Adaptive Mesh Generator using Refinement Tree Partitioning and SFC

Adaptive computational techniques provide a reliable, robust, and efficient means of solving problems involving PDEs by finite difference, finite volume, or finite element technologies. With an adaptive approach, the initial mesh used to discretize the computational domain and the numerical method used to discretize the PDEs are enhanced during the course of the solution procedure in order to optimize, e.g., the computational effort for a given level of accuracy. Enhancement typically involves h-refinement, where a mesh is refined or coarsened, respectively, in regions of low or high accuracy; r-refinement, where a mesh of fixed topology is moved to follow evolving dynamic phenomena; and p-refinement, where the method order is increased or decreased, respectively, in regions of low or high accuracy. Unfortunately, parallelism greatly complicates an adaptive computation. The unique feature of the presented algorithm is the integrated use of space-filling curve (SFC) techniques for ordering and of refinement tree partitioning to parallelize the problem at hand.

4.1 The Adaptive Grid and Meshing Strategy

The domain Ω is triangulated automatically by a grid generator based on h-refinement.
Assuming that some sort of error estimator is available, elements with errors larger than some suitable tolerance are subdivided into smaller elements of the same type. The same procedure is repeated until either the error for each element in the newly constructed mesh is no larger than the tolerance or the highest allowable level of mesh refinement is reached. The refinement strategy is based on bisecting a marked edge and a marking algorithm that prevents small angles in the refined mesh. With uniform refinement, every two levels all edges are halved. Note that the algorithm describes only one refinement level and can be applied recursively for each grid level.

Fig 4. Edge marking strategy for 2D refinement by bisecting a marked edge.

Algorithm 4.1. Let each element τ of the triangulation have a marked refinement edge, and let Σ be the set of elements flagged for refinement.

I. bisect each τ ∈ Σ, obtaining τ_1 and τ_2, the daughter elements;
II. mark the edges opposite to the newly inserted node in τ_i (i = 1 : 2), as depicted in figure 4;
III. now, set Σ to be the set of triangles τ with hanging nodes;
IV. IF Σ = ∅ THEN stop, ELSE go to step I.

4.2 Refinement Tree Partitioning

This method is based on the refinement tree that is generated during the process of adaptive grid refinement. It is not as generally applicable as the other fast algorithms, which use only information contained in the final grid, but in the context of adaptive multilevel methods it is able to produce higher quality partitions by taking advantage of the additional information about how the grid was generated. The refinement-tree partitioning algorithm [20] is a recursive bisection method. This means that the core of the algorithm partitions the data into two sets, i.e., bisects the data. The algorithm then bisects those two sets to produce four sets, and so forth until the desired number of sets is produced. The refinement tree of an adaptive triangular grid generated by bisection refinement is a binary tree containing one node for each triangle that appears during the refinement process. (It may actually be a forest, but the individual trees can be connected into a single tree by adding artificial nodes above the roots.) The two children of a node correspond to the two triangles that are formed by bisecting the triangle corresponding to the parent node. In Fig. 5, the numbering of the triangles in the grid and of the nodes in the tree indicates the relationship. The nodes of the tree have two weights associated with them: a personal weight and a subtree weight. The personal weight is a representation of the computational work associated with the corresponding triangle.
For example, a smaller weight can be used for elements containing Dirichlet boundary equations, which require less computation than interior equations. The interior nodes, i.e., those that are not leaves, correspond to triangles in the coarser grids. These nodes can be assigned nonzero weights to represent the computation on the coarser grids of the multigrid algorithm, which is not possible with partitioning algorithms that only consider the finest grid. For simplicity, in this paper a weight of 0 is assigned to the interior nodes, whereas leaves have different weights.

Fig 5. Refinement trees.

The subtree weight of a node is the sum of the personal weights in the subtree rooted at that node and can be computed in O(N) operations for N triangles, using a depth-first post-order traversal of the tree. The algorithm for bisecting the grid into two equally sized sets is given below. For scalability, the refinement tree structure used for dynamic repartitioning must be distributed across the cooperating processes [17]. It must also be constructed automatically in parallel. Each node maintains information about its region of space (bounding box), process ownership, parent and offspring links, and attached objects and their costs for weighted load balancing. In a distributed tree, links may cross process boundaries and must include both a process id and a pointer. A distributed tree also increases the complexity and overhead of interprocess communication, node refinement and pruning, and the insertion of new objects (e.g., elements created or removed by adaptive h-refinement) into the correct node. Objects to be inserted may reside on any process, and some objects will likely reside on processes other than those owning their destination nodes. Such objects are called orphans and must be migrated to the appropriate process.
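The subtree-weight computation and the recursive bisection can be sketched in a few lines of sequential code. This is a simplified, hypothetical rendering (child lists and per-leaf personal weights as plain dictionaries, greedy assignment of whole subtrees to the lighter set), not the distributed implementation described in the text.

```python
def subtree_weights(children, personal, node, subtree):
    """Depth-first post-order traversal: a node's subtree weight is its
    personal weight plus the subtree weights of its children (O(N))."""
    subtree[node] = personal.get(node, 0) + sum(
        subtree_weights(children, personal, c, subtree)
        for c in children.get(node, []))
    return subtree[node]

def bisect_tree(children, subtree, node, sets, weights):
    """Greedy bisection: assign whole subtrees to the currently lighter
    of two sets, recursing only where a subtree must be split."""
    kids = children.get(node, [])
    if not kids:                          # a leaf goes to the smaller set
        smaller = 0 if weights[0] <= weights[1] else 1
        sets[smaller].append(node)
        weights[smaller] += subtree[node]
        return
    if len(kids) == 1:
        bisect_tree(children, subtree, kids[0], sets, weights)
        return
    a, b = kids
    smaller = 0 if weights[0] <= weights[1] else 1
    lighter_child = a if subtree[a] <= subtree[b] else b
    sets[smaller].append(lighter_child)   # whole subtree assigned, no recursion
    weights[smaller] += subtree[lighter_child]
    bisect_tree(children, subtree, b if lighter_child == a else a, sets, weights)
```

Because whole subtrees are assigned wherever possible, each resulting set corresponds to a contiguous region of the refined grid, which is the point of the refinement-tree approach.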
4.3 Space-Filling Curves for Ordering the Unstructured Grid

The first part of the refinement-tree partitioning algorithm sums the personal weights in the tree to compute the subtree weights, using a depth-first post-order traversal. The traversal induces an ordering, or linearization, of the leaf nodes of the refinement tree. Since partitions are formed from contiguous segments of this linearization, its form has a direct effect on the quality of the resulting partitions. Space-filling curves provide a continuous mapping from one-dimensional to d-dimensional space [11] and have been used to linearize spatially distributed data for partitioning [11], [18], storage and memory management [11], and

computational geometry. Herein, we regard space-filling curves as a way of organizing the refinement tree traversals and, hence, linearizing the leaf nodes of the distributed refinement tree.

Algorithm 4.2
algorithm bisect
    compute subtree weights
    bisect_subtree(root)
end algorithm bisect

algorithm bisect_subtree(node)
    if node is a leaf then
        assign node to the smaller set
    else if node has one child then
        bisect_subtree(child)
    else (node has two children)
        select a set for each child
        for each child, examine the sum of the subtree weight with the accumulated weight of the selected set
        for the smaller of the two sums, assign the subtree rooted at that child to the selected set, and add the subtree weight to the weight of the set
        bisect_subtree(other child)
    endif
end algorithm bisect_subtree

In this context we construct a discrete SFC that is fine enough to have a curve node in each element, thus inducing a consecutive numbering of elements. Because an SFC preserves data locality, it guarantees connected and locally compact partitions, where the consecutive numbering is used for partitioning the computational triangulated domain. We use a bitmap-based algorithm [19] for indexing the generated SFC. The following data have to be known a priori: a) the number of triangles in the initial triangulation, N_0; b) the maximum number of refinement levels, l. With these data, each element needs a bit structure of length b = ⌈log_2(N_0)⌉ + l. While the first b - l bits are used for consecutively numbering the initial elements arbitrarily, each level is then represented by one additional bit. To illustrate the algorithm, observe the series of steps in fig 6. The construction of the space-filling curve is given by the following algorithm.

Fig 6. Series of steps in the construction of a space-filling curve in a locally bisection-refined mesh.
Algorithm 4.3. Let τ^k be an element on level k of the grid, and denote by τ^k_i (i = 1 : 2) the two daughters of element τ^(k-1).

(1) The algorithm starts with a zero bitmap of length b in τ^0.
(2) Then, FOR each level (k = 1 : l) DO:
    a. copy the mother's (τ^(k-1)) bitmap to both daughter elements τ^k_i;
    b. determine the left or right side element τ^k_e according to the level: τ^k_e = left if mod(k,2) = 0; τ^k_e = right if mod(k,2) = 1;
    c. set the k-th bit of daughter τ^k_e to 1.
(3) END DO
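The bitmap indexing can be sketched per element in a few lines. This is a hypothetical rendering of the idea in Algorithm 4.3: the element is represented only by the index of its level-0 ancestor plus its left/right descent path, and the interpretation of "k-th bit" as bit position l - k from the least significant end is an assumption of this sketch.

```python
import math

def element_key(initial_index, n0, l, daughters):
    """Bitmap SFC key of an element, in the spirit of Algorithm 4.3.
    initial_index: number of the element's level-0 ancestor (0 .. n0-1);
    daughters: 'left'/'right' choices on the path down to the element;
    n0: size of the initial triangulation; l: maximum refinement level."""
    b0 = max(1, math.ceil(math.log2(n0)))   # first b - l bits: initial numbering
    assert initial_index < (1 << b0)
    key = initial_index << l                # refinement bits start as zeros
    for k, side in enumerate(daughters, start=1):
        # step 2b: level k marks the left daughter on even levels,
        # the right daughter on odd levels
        chosen = 'left' if k % 2 == 0 else 'right'
        if side == chosen:
            key |= 1 << (l - k)             # step 2c: set the k-th bit
    return key
```

Sorting the keys of all leaf elements then yields the consecutive SFC numbering that the partitioner cuts into weighted pieces.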

5. Current Challenges

As parallel simulations and environments become more sophisticated, partitioning algorithms must address new issues and application requirements. Software design that allows algorithms to be compared and reused is an important first step; carefully designed libraries that support many applications benefit application developers while serving as test-beds for algorithmic research. Existing partitioners need additional functionality to support new applications. Partitioning models must more accurately represent a broader range of applications, including those with non-symmetric, non-square, and/or highly connected relationships. And partitioning algorithms should perform resource-aware balancing, i.e. they need to be sensitive to state-of-the-art, heterogeneous computer architectures, adjusting work assignments relative to processing, memory and communication resources. New simulation areas such as electrical systems, computational biology, linear programming and nanotechnology differ from traditional mesh-based PDE simulations: they include high connectivity, heterogeneity in topology, and matrices that are rectangular or non-symmetric. While graph models (see Section 3.4) are often considered the most effective models for mesh-based PDE simulations, they show limitations for such problems. As an alternative to graphs, hypergraphs can be used to model application data. A hypergraph HG = (V, HE) consists of a set of vertices V representing the data objects to be partitioned and a set of hyperedges HE, each connecting two or more vertices of V. By allowing larger sets of vertices to be associated through edges, the hypergraph model overcomes many of the limitations of the graph model. In the hypergraph model, the number of hyperedge cuts is equal to the communication volume, providing a more effective partitioning metric. Another challenge is multi-phase simulations, e.g. crash simulations, as they have different workloads in each phase of a simulation, i.e.
computation of forces and contact detection problem. Often, separate decompositions are used for each phase; data is communicated from one decomposition to the other between phases [21]. Obtaining a single decomposition that is good with respect to both phases would remove the need for communication between phases. Each object would have multiple loads, corresponding to its workload in each phase. The challenge would be computing a single decomposition that is balanced with respect to all loads. Such a multicriteria partitioner could be used in other situations as well, such as balancing both computational work and memory usage. References 1. Rajkumar Buyya et.al. High Performance Cluster Computing Vol 1, Prentice Hall PTR, G. Pfister. In Search of Clusters. Prentice Hall, 2 nd Edition, Grama, Gupta, Karypis, Kumar. Introduction to Parallel Computing, Addision Wesley, 2 nd Edition, Alexandre Plastino, Celso C. Ribeiro, Noemi de La Rocque Rodriguez. Developing SPMD applications with load balancing. Parallel Computing, 2003,Vol: 29, Issue: 6, p James D. Teresco, Karen D. Devine, and Joseph E. Flaherty. Partitioning and Dynamic Load Balancing for the Numerical Solution of PDE 6. Flaherty, Loy, Shephard, Szymanski, Teresco, Ziantz: Adaptive local refinement with octree loadbalancing for the parallel solution of three-dimensional conservation laws. J. Parallel Distrib. Comput., 47: , (1997) 7. Plimpton et. al. Transient dynamics simulations: Parallel algorithms for contact detection and smoothed particle hydrodynamics. J. Parallel Distrib. Comput., 50: , (1998) 8. Flaherty et. al. The quality of partitions produced by an iterative load balancer. Proc. Third Workshop on Languages, Compilers, and Runtime Systems, pages , Troy, (1996) 9. Pinar,Hendrickson, B.: Graph partitioning for complex objectives. 15th I PDPS, San Francisco, CA, (2001) 10. Diekmann et. al: Shape-optimized mesh partitioning and load balancing for parallel adaptive fem. 
Parallel Comput., 26(12): , (2000) 11. Flaherty et. al. Dynamic Octree Load Balancing using SFC. Williams College Department of Computer Science Technical Report CS-03-01, Berger, Bokhari,: A partitioning strategy for nonuniform problems on multiprocessors. IEEE Trans. Computers, 36: , (1987).

9 13. Simon, H. D.: Partitioning of unstructured problems for parallel processing. Comp. Sys. Engg., 2: (1991) 14. Garey, M., Johnson, D., and Stockmeyer, L.: Some simplified NP-complete graph problems. Theoretical Computer Science, 1(3): , (1976) 15. Cybenko, G.: Dynamic load balancing for distributed memory multiprocessors. J. Parallel Distrib. Comput., 7: , (1989) 16. Hu, Blake: An optimal dynamic load balancing algorithm. Preprint DL-P , Daresbury Laboratory, Warrington, WA4 4AD, UK, (1995) 17. Simone, Shephard, Flaherty, and Loy. A distributed octree and neighbor-finding algorithms for parallel mesh generation. Tech. Report , Rensselaer Polytechnic Institute, Scientific Computation Research Center, Troy, S. Aluru and F. Sevilgen. Parallel domain decomposition and load balancing using space-filling curves. In Proc. International Conference on High-Performance Computing, pages , J. Behrens et. al.: Amatos: Parallel adaptive mesh generator for atmospheric and oceanic simulation, Technical Report TR 02-03, BremHLR { Competence Center of High Performance Computing Bremen, Bremen, Germany, 2003, 20. W. F. Mitchell, Refinement Tree Based Partitioning for Adaptive Grids, in Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, SIAM, Philadelphia (1995) pp S. Plimpton, Attaway, Hendrickson, Swegle, Vaughan, Gardner : Transient dynamics simulations: Parallel algorithms for contact detection and smoothed particle hydrodynamics, J. Parallel Distrib. Comput. 50 (1998)

Improved Hybrid Dynamic Load Balancing Algorithm for Distributed Environment International Journal of Scientific and Research Publications, Volume 3, Issue 3, March 2013 1 Improved Hybrid Dynamic Load Balancing Algorithm for Distributed Environment UrjashreePatil*, RajashreeShedge**

More information

A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster

A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster , pp.11-20 http://dx.doi.org/10.14257/ ijgdc.2014.7.2.02 A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster Kehe Wu 1, Long Chen 2, Shichao Ye 2 and Yi Li 2 1 Beijing

More information

Scaling 10Gb/s Clustering at Wire-Speed

Scaling 10Gb/s Clustering at Wire-Speed Scaling 10Gb/s Clustering at Wire-Speed InfiniBand offers cost-effective wire-speed scaling with deterministic performance Mellanox Technologies Inc. 2900 Stender Way, Santa Clara, CA 95054 Tel: 408-970-3400

More information

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

More information

Scientific Computing Programming with Parallel Objects

Scientific Computing Programming with Parallel Objects Scientific Computing Programming with Parallel Objects Esteban Meneses, PhD School of Computing, Costa Rica Institute of Technology Parallel Architectures Galore Personal Computing Embedded Computing Moore

More information

Cloud Computing is NP-Complete

Cloud Computing is NP-Complete Working Paper, February 2, 20 Joe Weinman Permalink: http://www.joeweinman.com/resources/joe_weinman_cloud_computing_is_np-complete.pdf Abstract Cloud computing is a rapidly emerging paradigm for computing,

More information

Load Balancing and Data Locality in Adaptive Hierarchical N-body Methods: Barnes-Hut, Fast Multipole, and Radiosity

Load Balancing and Data Locality in Adaptive Hierarchical N-body Methods: Barnes-Hut, Fast Multipole, and Radiosity Load Balancing and Data Locality in Adaptive Hierarchical N-body Methods: Barnes-Hut, Fast Multipole, and Radiosity Jaswinder Pal Singh, Chris Holt, Takashi Totsuka, Anoop Gupta and John L. Hennessy Computer

More information

A Comparison of General Approaches to Multiprocessor Scheduling

A Comparison of General Approaches to Multiprocessor Scheduling A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University

More information

Control 2004, University of Bath, UK, September 2004

Control 2004, University of Bath, UK, September 2004 Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

More information

Finite Element Method (ENGC 6321) Syllabus. Second Semester 2013-2014

Finite Element Method (ENGC 6321) Syllabus. Second Semester 2013-2014 Finite Element Method Finite Element Method (ENGC 6321) Syllabus Second Semester 2013-2014 Objectives Understand the basic theory of the FEM Know the behaviour and usage of each type of elements covered

More information

Load Balancing on a Grid Using Data Characteristics

Load Balancing on a Grid Using Data Characteristics Load Balancing on a Grid Using Data Characteristics Jonathan White and Dale R. Thompson Computer Science and Computer Engineering Department University of Arkansas Fayetteville, AR 72701, USA {jlw09, drt}@uark.edu

More information

AN APPROACH FOR SECURE CLOUD COMPUTING FOR FEM SIMULATION

AN APPROACH FOR SECURE CLOUD COMPUTING FOR FEM SIMULATION AN APPROACH FOR SECURE CLOUD COMPUTING FOR FEM SIMULATION Jörg Frochte *, Christof Kaufmann, Patrick Bouillon Dept. of Electrical Engineering and Computer Science Bochum University of Applied Science 42579

More information

Multiphase Flow - Appendices

Multiphase Flow - Appendices Discovery Laboratory Multiphase Flow - Appendices 1. Creating a Mesh 1.1. What is a geometry? The geometry used in a CFD simulation defines the problem domain and boundaries; it is the area (2D) or volume

More information

Sanjeev Kumar. contribute

Sanjeev Kumar. contribute RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 sanjeevk@iasri.res.in 1. Introduction The field of data mining and knowledgee discovery is emerging as a

More information

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications Harris Z. Zebrowitz Lockheed Martin Advanced Technology Laboratories 1 Federal Street Camden, NJ 08102

More information