BNL-62052

An Unconventional Method for Load Balancing

Yuefan Deng,* R. Alan McCoy,* Robert B. Marr,† Ronald F. Peierls†

Abstract

A new method of load balancing is introduced, based on the idea of dynamically relocating virtual processes corresponding to computations on an abstract system with a larger number of processors. The algorithm introduced preserves the locality of nearest-neighbor interactions and has been tested on simulated data and a molecular dynamics code.

1 Introduction

A general approach to the problem of load balance for distributed-memory MIMD architectures is developed. It is targeted at computations whose parallel structure is obtained by decomposing a task into components which run mostly independently, each component computation involving its own private data for the most part, with only occasional synchronization points. For such "mostly local" computations, there is often considerable flexibility in how the work is allocated among processors, leading to very different degrees of load balancing. In many cases, complete balance should be possible in principle.

There are a number of factors which make such a balanced computation difficult to achieve. What represents a uniform decomposition from the problem definition may not correspond to a uniform distribution of load, and non-uniform decompositions may be much harder to program. The load distribution may not even be predictable a priori or, worse still, may vary during the course of the calculation. For this reason dynamic load balancing techniques, preferably requiring minimal work from the applications programmer and minimal extra computer resources, are of great interest. We have developed a general method for attacking this problem.

Load balancing is in fact an optimization problem.
Suppose that for a given task distributed among $p$ processors with a particular decomposition the computing times are $t_i$, $1 \le i \le p$, with maximum $t_{\max}$ and average $\bar{t}$. Processor $i$ is therefore idle for time $(t_{\max} - t_i)$, so that the total waste of resources in this computation is

$$ W = \sum_{i=1}^{p} (t_{\max} - t_i) = p\,(t_{\max} - \bar{t}). $$

It is useful to define a normalized imbalance ratio

$$ I = \frac{W}{p\,\bar{t}} = \frac{t_{\max}}{\bar{t}} - 1, $$

which measures the percentage waste of resources due to load imbalance. Ignoring, for the moment, any overhead computing time introduced by the decomposition, the load balancing problem involves finding a decomposition which minimizes $I$.

*Center for Scientific Computing, The University at Stony Brook, Stony Brook, NY 11794-3600. †Department of Applied Sciences, Brookhaven National Laboratory, Upton, NY 11973.
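The quantities just defined are simple to evaluate at run time. The following sketch (function and variable names are ours, not from the paper) computes the waste W and the imbalance ratio I, using the normalization above, from a list of per-processor times:

```python
def imbalance(times):
    """Return total waste W and imbalance ratio I for per-processor times.

    W = sum_i (t_max - t_i) = p * (t_max - t_avg)
    I = W / (p * t_avg) = t_max / t_avg - 1
    """
    p = len(times)
    t_max = max(times)
    t_avg = sum(times) / p
    W = p * (t_max - t_avg)
    I = t_max / t_avg - 1.0
    return W, I
```

For example, three processors with times (1, 1, 2) give a waste of 2 and an imbalance ratio of 50%.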
Our approach is to write the application code as if the number of available processors were some multiple of the actual number, and then run many such virtual processors (VPs) on each physical one. Concurrently with the application we analyze the load on each processor and then move entire VPs from one physical processor to another. This relocation is transparent to the application code being executed. The strategy for such relocation must depend on the particular problem.

There are, in fact, three overhead costs introduced by the load balancing procedure: (1) additional computation introduced by partitioning the problem into a larger number of pieces; (2) extra communication costs in moving VPs away from their originally assigned physical processors, when the problem decomposition is by spatial domain and the communication requirements are spatially local; (3) costs to analyze and determine the desired relocation and to actually move the data associated with a VP. To solve the optimization completely, taking the variation of these overhead factors into account, is prohibitively expensive. Even an approximate solution may cost more than the load imbalance itself unless the load imbalance distribution changes slowly relative to the interval between synchronization points in the computation, so that many time steps are available to achieve improved balance. The approach we have adopted is to minimize $I$, or equivalently $t_{\max}$, constraining the decomposition choices to keep the other costs small without formally including them in the optimization.

2 Virtual Processor Approach

The simplest way of implementing the virtual processor approach is to run a single process on each processor, whose address space is partitioned into blocks corresponding to the virtual processes, and which repeatedly executes the computational code, successively pointing to the data blocks corresponding to the virtual processes associated with that processor.
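In outline, such a driver can be sketched as follows. This is an illustrative reading of the scheme, not the authors' implementation; the class and function names are ours:

```python
# Sketch of the single-process VP driver: one real process holds a list of
# data blocks, one per virtual processor, and repeatedly runs the
# computational kernel over each block.
class VirtualProcessor:
    def __init__(self, cell, data):
        self.cell = cell          # fine-grid cell this VP is responsible for
        self.data = data          # the VP's private data block

def run_step(local_vps, kernel):
    """One time step: run the computational kernel over each local VP's block."""
    for vp in local_vps:
        kernel(vp.data)           # the kernel sees only this VP's data

def relocate(vp, local_vps, remote_vps):
    """Move a VP to another processor's list; in a real run the data block
    would be transmitted over the network and the pointer lists adjusted."""
    local_vps.remove(vp)
    remote_vps.append(vp)
```

Relocating a VP is then just a matter of shipping its data block and updating the two pointer lists; the kernel itself never knows where its blocks live.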
At intervals, such as after a certain number of time steps in a simulation, or whenever it is forced to wait at a synchronization point for a more heavily loaded processor to catch up, the algorithm discussed below is executed. Once a better decomposition has been determined, the data blocks corresponding to the relocated VPs are moved to their new processors, and the lists of pointers are adjusted correspondingly. If the code uses explicit message passing for communication, it is important to trap the message-passing calls and replace them with simple assignment statements when the communicating virtual processors are on the same physical processor.

In implementing the tests discussed here, we have made extensive use of the IPX package [3], which allows one processor to execute asynchronously procedures operating on the data of another processor. This execution does not require waiting for any explicit action by the destination processor, and can interrupt an ongoing computational thread. The destination processor can, however, block such preemption during critical sections of code. This enables the computation of the redistribution, and the actual movement of data, to be completely transparent to the underlying computational code being balanced.

3 Algorithm for Redistributing Load

In a reasonably large problem, with many processors and many virtual processors (VPs) for each physical one, there are generally many different rearrangements which would lead to the same imbalance if it were not for communication costs. The algorithm introduced here is based on the assumption that communication costs are significant, and dominated
by local interactions. Assume we are dealing with a computational region which is decomposed by a rectangular grid into cells, each VP executing the computation associated with a single cell. For simplicity, we discuss here the case of a two-dimensional decomposition, though the algorithm can be generalized straightforwardly to higher-dimensional decompositions and to grids other than rectangular. We consider the processors to be also represented as a (coarser) rectangular grid, each grid cell corresponding to a processor. The two grids are related by the fact that each grid point of the coarse (processor) grid is mapped to a grid point of the fine (virtual process) grid, and the lines bounding the processor cells are mapped into the lines connecting the mapped grid points. We then assign any VP to the processor whose mapped cell contains the center of the VP cell.

We introduce the concept of a pressure associated with each processor cell, proportional to its wasted resources: $p_i = \bar{t} - t_i$. As a result of this pressure, cells associated with light loads will tend to expand, while cells associated with heavy loads (negative pressure) will contract. This expansion or contraction is achieved by computing the net force at each mapped coarse grid point resulting from the pressures in the cells which share that point as a vertex, the boundaries being constrained to remain straight. We proceed iteratively, at each step allowing the mapping to change by zero or one fine grid step, depending on the magnitude and direction of the resulting force. It is clear, by construction, that the algorithm preserves locality and tends to process neighboring regions on the same processor. We assert, without proof, that the configuration will tend to relax towards one with improved load balance. We have carried out a series of simulations for a variety of cases, and the results tend to confirm the hypothesis.
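A minimal sketch of one relaxation step follows, under our reading of the algorithm; the sign convention for the force, the array layout, and all names are our assumptions, not taken from the paper:

```python
import numpy as np

def relax_step(vertices, loads, threshold=0.1):
    """One pressure-driven relaxation step (illustrative sketch).

    vertices : (P+1, Q+1, 2) int array; vertices[I, J] is the fine-grid
               position of coarse-grid point (I, J).
    loads    : (P, Q) array of per-processor compute times t_i.
    Interior vertices move by at most one fine-grid step, in the direction
    of the net force from the four adjoining cell pressures.
    """
    pressure = loads.mean() - loads        # light load -> positive pressure
    P, Q = loads.shape
    new = vertices.copy()
    for I in range(1, P):                  # boundary vertices stay fixed
        for J in range(1, Q):
            a = pressure[I - 1, J - 1]     # cell on the low-x, low-y side
            b = pressure[I,     J - 1]     # high-x, low-y
            c = pressure[I - 1, J]         # low-x, high-y
            d = pressure[I,     J]         # high-x, high-y
            fx = (a + c) - (b + d)         # low-x cells push the vertex +x
            fy = (a + b) - (c + d)         # low-y cells push the vertex +y
            if abs(fx) > threshold:        # quantized move: 0 or 1 grid step
                new[I, J, 0] += int(np.sign(fx))
            if abs(fy) > threshold:
                new[I, J, 1] += int(np.sign(fy))
    return new
```

A heavily loaded cell has negative pressure, so its vertices are pulled inward and its share of the fine grid shrinks; iterating the step drives the mapping toward lower $t_{\max}$, with the threshold suppressing oscillation from the quantized moves.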
We have also applied the algorithm to a real molecular dynamics calculation. The full details will be reported elsewhere.

3.1 Simulation Results

A number of cases were studied, assigning arbitrary loads to the fine grid cells. We examined cases where the heavy load was concentrated in a single region, in several localized regions, or along one or more lines. We studied varying grid shapes and multiplicities. In every case the algorithm generated a significant improvement in the load distribution, though the approach to the best solution was not always monotonic. To avoid instability, we introduced a threshold, inhibiting any rearrangement if the force at a given vertex was too small. This is made necessary by the fact that the movement of a vertex is quantized to be at least one grid unit. If the threshold is too small, instability can occur; if the threshold is too large, the algorithm terminates before the best solution is reached.

In the figures we show the results of two examples. In the first, the imbalance is very large and represents a smooth function peaking in the upper right-hand corner. In the second case, we took an alternating configuration. Initially the algorithm results in zero net force at the interior vertices, but the reflecting boundary breaks the symmetry and the interior points in turn relax after a few iterations.

FIG. 1. These figures show the simulated results of applying the proposed algorithm to two different load distributions. In both cases there were 9 processors with 4 x 4 VPs on each. Figures A1 and B1 show the loads on each processor before (dashed bars) and after (solid bars) the load balancing. Figures A2 and B2 show the distribution of VPs among processors. The grey-scale regions represent the initial VP distribution on the processors, with darker regions corresponding to more heavily loaded processors. Within each processor, the VP loads are randomly assigned. The dark lines indicate the final distribution of VPs after balancing. In case A the imbalance was reduced from 198% to 9%, and in case B from 67% to 10%.

4 Application to Molecular Dynamics Code

We have implemented the load balance algorithm as part of a molecular dynamics (MD) code and observe significant gains in efficiency when load balancing is used. There is considerable interest in load balancing MD algorithms, especially in the study of particles in non-equilibrium phases. Previous efforts to load balance MD codes are based on adjusting the size of regions and thus are not easily extended beyond one-dimensional partitions [1]. We show the results of the proposed algorithm for a two-dimensional partition below.

FIG. 2. Figure C1 shows CPU time for each step of a molecular dynamics code with load balancing (solid line) and with no load balancing (dashed line), for the same simulation of 250000 particles on 25 processors. Figure C2 shows the final configuration of VPs (fine grid) on each processor. A strong attracting point near the origin created an imbalance across the periodic boundaries; the load balance algorithm redistributed the VPs accordingly.

The molecular dynamics algorithm we used for testing is a short-range, link-cell method for distributed-memory computers, similar to those in [4, 5, 2]. Our implementation simulates particles interacting in a three-dimensional parallelepiped with periodic boundary conditions. To create an imbalanced case for comparison, we start with 250000 particles equally distributed in a three-dimensional domain (with periodic boundaries), having a uniform (reduced) density of $\rho = 0.25$. We place an attracting point near the origin and allow the particles to interact for 7500 steps. As the simulation progresses, the attracting point causes the particles to converge toward the origin. We partition the 3D domain according to a two-dimensional grid of 5 x 5 processors. We evenly distribute 225 VPs across the 25 processors, so that each processor initially has an array of 3 x 3 VPs. The load balance algorithm is performed once every 200 steps, and the VPs are redistributed only when $I > 10\%$. For comparison, we also executed the same simulation and partitioning with no load balancing. The tests were performed on a Paragon XP/S-4 parallel computer.
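The rebalancing schedule used in this test (a check every 200 steps, with redistribution only when the imbalance ratio exceeds 10%) can be sketched as a hypothetical driver loop; step, measure_times, and rebalance are stand-ins of our own for the MD kernel, the per-processor timing, and the redistribution algorithm of Section 3:

```python
# Hypothetical driver mirroring the schedule above: rebalancing is
# considered every `interval` steps and performed only when the measured
# imbalance ratio I = t_max / t_avg - 1 exceeds `threshold`.
def run(n_steps, step, measure_times, rebalance, interval=200, threshold=0.10):
    for n in range(1, n_steps + 1):
        step()                                    # one MD time step
        if n % interval == 0:
            times = measure_times()               # per-processor CPU times
            t_max, t_avg = max(times), sum(times) / len(times)
            if t_max / t_avg - 1.0 > threshold:   # I > 10%
                rebalance()
```

Keeping the check cheap and infrequent is what makes the scheme pay off: the cost of measuring and redistributing is amortized over many steps of improved balance.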
Figure 2 shows the efficiency gained by using our load balancing algorithm in an MD code. The non-balanced case shows a steady increase in the amount of time for each MD step; when the region around the attracting point becomes saturated, the rate of increase falls off. The load balanced case shows a great gain in efficiency due to the automatic adjustment of VPs; the time per step does not rise substantially beyond that of the initial configuration. The non-monotonic behavior of the load balanced case occurs because the VPs are only moved when the imbalance $I$ is above the threshold. There is minimal overhead due to the use of multiple VPs per processor. In our implementation, the time
required to relocate a VP was less than 2 CPU seconds, and we observe that this time is recovered in just a few time steps by the increase in load balance efficiency.

5 Future Development

There are a number of ways in which the algorithm as developed and implemented to date can be extended and improved. (1) It should be generalized to three-dimensional decompositions and non-rectangular space-filling grids. (2) The approach should be enhanced with a good graphical interface to allow interactive control of the rearrangement strategy. (3) The pressure concept might be generalized to allow curved boundaries, to achieve better final distributions. (4) To improve convergence for large problems, a nested approach could be taken in which groups of processors are merged into supercells to which the same algorithm is applied, the individual processor mappings then being refined. (5) The algorithm should be extended to allow for heterogeneous programming environments. (6) Some method should be developed to allow for possible memory constraints restricting the number of VPs which a single processor can handle. (7) Other virtual processor algorithms should be developed for cases where the communication pattern is other than local.

6 Conclusions

We have proposed a fairly robust approach to a class of load balancing problems which can be implemented with very little impact on the details of the application code being balanced. Preliminary results indicate that, although crude, the method can significantly reduce the load imbalance in many cases with very little effort on the programmer's part, and very little dependence on the details of the architecture.

Acknowledgements

YFD and RAM thank Professor James Glimm for encouragement and the National Science Foundation for partial financial support (grant DMS-9201581). The work at BNL was supported by the U.S. Department of Energy under contract number DE-AC02-76CH00016.

References

[1] F. Bruge and S. L.
Fornili, A distributed dynamic load balancer and its implementation on multi-transputer systems for molecular dynamics simulation, Computer Physics Comm., 60 (1990), pp. 39-45.

[2] P. S. Lomdahl, P. Tamayo, N. Gronbech-Jensen, and D. M. Beazley, 50 GFlops molecular dynamics on the Connection Machine 5, Proceedings of Supercomputing 1993, IEEE Press.

[3] R. B. Marr, J. E. Pasciak, and R. Peierls, IPX - Preemptive remote procedure execution for concurrent applications, Brookhaven National Laboratory report BNL-60632 (1994).

[4] S. Plimpton, Fast parallel algorithms for short-range molecular dynamics, preprint SAND91-1144, Sandia National Laboratories (1993).

[5] D. C. Rapaport, Multi-million particle molecular dynamics II, Computer Physics Comm., 62 (1991), pp. 217-228.