
BNL-62052

An Unconventional Method for Load Balancing

Yuefan Deng,* R. Alan McCoy,* Robert B. Marr,† Ronald F. Peierls†

Abstract

A new method of load balancing is introduced, based on the idea of dynamically relocating virtual processes corresponding to computations on an abstract system with a larger number of processors. The algorithm preserves the locality of nearest-neighbor interactions and has been tested on simulated data and on a molecular dynamics code.

1 Introduction

A general approach to the problem of load balance for distributed-memory MIMD architectures is developed. It is targeted at computations whose parallel structure is obtained by decomposing a task into components which run mostly independently, each component computation involving its own private data for the most part, with only occasional synchronization points. For such "mostly local" computations there is often considerable flexibility in how the work is allocated among processors, leading to very different degrees of load balancing. In many cases, complete balance should be possible in principle.

A number of factors make such a balanced computation difficult to achieve. What represents a uniform decomposition from the problem definition may not correspond to a uniform distribution of load, and non-uniform decompositions may be much harder to program. The load distribution may not even be predictable a priori or, worse still, may vary during the course of the calculation. For this reason dynamic load balancing techniques, preferably requiring minimal work from the applications programmer and minimal extra computer resources, are of great interest. We have developed a general method for attacking this problem. Load balancing is in fact an optimization problem.
Suppose that for a given task distributed among p processors with a particular decomposition, the computing times are t_i, i = 1, ..., p, with maximum t_max and average t̄. Processor i is therefore idle for a time (t_max − t_i), so that the total waste of resources in this computation is

    W = Σ_{i=1}^{p} (t_max − t_i) = p (t_max − t̄).

It is useful to define a normalized imbalance ratio

    I = W / (p t̄) = (t_max − t̄) / t̄,

which measures the percentage waste of resources due to load imbalance. Ignoring, for the moment, any overhead computing time introduced by the decomposition, the load balancing problem involves finding a decomposition which minimizes I.

*Center for Scientific Computing, The University at Stony Brook, Stony Brook, NY 11794-3600
†Department of Applied Sciences, Brookhaven National Laboratory, Upton, NY 11973
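The two quantities above are direct to compute from per-processor timings; a minimal sketch (the function name is ours, not the paper's):

```python
def imbalance(times):
    """Compute total waste W and normalized imbalance ratio I
    from a list of per-processor computing times t_i."""
    p = len(times)
    t_max = max(times)
    t_bar = sum(times) / p
    W = sum(t_max - t for t in times)   # equals p * (t_max - t_bar)
    I = (t_max - t_bar) / t_bar         # can exceed 100% for severe imbalance
    return W, I

# Example: four processors, one heavily loaded.
W, I = imbalance([1.0, 1.0, 1.0, 5.0])
print(W, I)   # → 12.0 1.5
```

Note that I is normalized by the average time t̄, which is why imbalances such as the 198% reported below can exceed 100%.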

Our approach is to write the application code as if the number of available processors were some multiple of the actual number, and then run many such virtual processors (VPs) on each physical one. Concurrently with the application, we analyze the load on each processor and then move entire VPs from one physical processor to another. This relocation is transparent to the application code being executed. The strategy for such relocation must depend on the particular problem. There are, in fact, three overhead costs introduced by the load balancing procedure: (1) additional computation introduced by partitioning the problem into a larger number of pieces; (2) extra communication costs in moving VPs away from their originally assigned physical processors, when the problem decomposition is by spatial domain and the communication requirements are spatially local; (3) costs to analyze and determine the desired relocation and to actually move the data associated with a VP. To solve the optimization completely, taking the variation of these overhead factors into account, is prohibitively expensive. Even an approximate solution may cost more than the load imbalance itself unless the load imbalance distribution changes slowly relative to the interval between synchronization points in the computation, so that many time steps are available over which to achieve improved balance. The approach we have adopted is to minimize I, or equivalently t_max, constraining the decomposition choices to keep the other costs small without formally including them in the optimization.

2 Virtual Processor Approach

The simplest way of implementing the virtual processor approach is to run a single process on each processor, whose address space is partitioned into blocks corresponding to the virtual processes, and which repeatedly executes the computational code, successively pointing to the data blocks corresponding to the virtual processes associated with that processor.
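The single-process scheme just described can be sketched as follows; the class and function names are illustrative, not the paper's actual implementation:

```python
class VirtualProcessor:
    """One VP: a private block of data plus the computation on it."""
    def __init__(self, data):
        self.data = data

    def step(self):
        # Stand-in for one unit of the application's computation.
        self.data = [x + 1 for x in self.data]

def run_physical_processor(vps, n_steps):
    """A single process per physical processor repeatedly executes the
    same computational code, successively pointing at the data blocks
    of the VPs currently assigned to it. Rebalancing amounts to moving
    VirtualProcessor objects between processors' lists."""
    for _ in range(n_steps):
        for vp in vps:          # iterate over this processor's VP blocks
            vp.step()

vps = [VirtualProcessor([0, 0]) for _ in range(4)]
run_physical_processor(vps, 3)
```

Because each VP is just a data block plus a pointer, relocating one to another processor only requires shipping its block and updating the two pointer lists.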
At intervals, such as after a certain number of time steps in a simulation, or whenever a processor is forced to wait at a synchronization point for a more heavily loaded processor to catch up, the algorithm discussed below is executed. Once a better decomposition has been determined, the data blocks corresponding to the relocated VPs are moved to their new processors, and the lists of pointers are adjusted correspondingly. If the code uses explicit message passing for communication, it is important to trap the message passing calls and replace them with simple assignment statements when the communicating virtual processors are on the same physical processor. In implementing the tests discussed here, we have made extensive use of the IPX package [3], which allows one processor to execute asynchronously procedures operating on the data of another processor. This execution does not require waiting for any explicit action by the destination processor, and can interrupt an ongoing computational thread. The destination processor can, however, block such preemption during critical sections of code. This enables the computation of the redistribution, and the actual movement of data, to be completely transparent to the underlying computational code being balanced.

3 Algorithm for Redistributing Load

In a reasonably large problem, with many processors and many virtual processors (VPs) for each physical one, there are generally many different rearrangements which would lead to the same imbalance if it were not for communication costs. The algorithm introduced here is based on the assumption that communication costs are significant, and dominated
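The message-trapping idea can be illustrated with a hypothetical send wrapper; owner_of, local_blocks, and the return values here are placeholders of our own, not IPX's actual API:

```python
# Hypothetical mapping from VP id to owning physical processor rank.
owner_of = {0: 0, 1: 0, 2: 1}

# Local VP data blocks on this processor, indexed by VP id.
local_blocks = {0: [], 1: []}

def vp_send(src_vp, dst_vp, payload, my_rank=0):
    """Trap a message-passing call: if both VPs live on the same
    physical processor, replace the send with a simple assignment;
    otherwise fall through to a real (here, placeholder) network send."""
    if owner_of[src_vp] == owner_of[dst_vp] == my_rank:
        local_blocks[dst_vp].append(payload)   # plain assignment, no network
        return "local"
    return "network"   # a real message-passing call would go here

print(vp_send(0, 1, 42))   # → local
print(vp_send(0, 2, 42))   # → network
```

The point is that after a VP migrates, the same application-level send may silently switch between the two branches without the application code changing.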


by local interactions. Assume we are dealing with a computational region which is decomposed by a rectangular grid into cells, each VP executing the computation associated with a single cell. For simplicity, we discuss here the case of a two-dimensional decomposition, though the algorithm generalizes straightforwardly to higher-dimensional decompositions and to grids other than rectangular. We consider the processors to be represented as a (coarser) rectangular grid as well, each grid cell corresponding to a processor. The two grids are related by the fact that each grid point of the coarse (processor) grid is mapped to a grid point of the fine (virtual process) grid, and the lines bounding the processor cells are mapped into the lines connecting the mapped grid points. We then assign any VP to the processor whose mapped cell contains the center of the VP cell. We introduce the concept of a pressure associated with each processor cell, proportional to its wasted resources: p_i = t̄ − t_i. As a result of this pressure, cells associated with light loads will tend to expand, while cells associated with heavy loads (negative pressure) will contract. This expansion or contraction is achieved by computing the net force at each mapped coarse grid point resulting from the pressures in the cells which share that point as a vertex, the boundaries being constrained to remain straight. We proceed iteratively, at each step allowing the mapping to change by zero or one fine grid step, depending on the magnitude and direction of the resulting force. It is clear, by construction, that the algorithm preserves locality and tends to process neighboring regions on the same processor. We assert, without proof, that the configuration will tend to relax towards one with improved load balance. We have carried out a series of simulations for a variety of cases, and the results tend to confirm this hypothesis.
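To make the pressure mechanism concrete, here is a simplified one-dimensional analogue of one relaxation iteration, a sketch of our own rather than the paper's code; the threshold parameter anticipates the stabilization discussed in the simulation results:

```python
def cell_times(fine_loads, vertices):
    """Split a row of fine-cell loads into processor cells delimited by
    `vertices` (interior boundaries, in fine-grid units) and sum each."""
    bounds = [0] + list(vertices) + [len(fine_loads)]
    return [sum(fine_loads[a:b]) for a, b in zip(bounds, bounds[1:])]

def relax_once(fine_loads, vertices, threshold=0.0):
    """One iteration of a 1-D analogue of the pressure-driven relaxation.
    Each processor cell gets pressure p_i = t_bar - t_i; a vertex feels
    net force p_left - p_right and moves by one fine-grid step in that
    direction, so light cells expand and heavy cells contract. Forces
    below the threshold are ignored to avoid oscillation."""
    t = cell_times(fine_loads, vertices)
    t_bar = sum(t) / len(t)
    p = [t_bar - ti for ti in t]        # pressure per processor cell
    new = []
    for k, v in enumerate(vertices):
        force = p[k] - p[k + 1]         # net force on vertex k
        step = 1 if force > threshold else (-1 if force < -threshold else 0)
        new.append(v + step)
    return new

# Two processors over 8 fine cells; the right cell is heavily loaded.
fine = [1, 1, 1, 1, 1, 1, 10, 10]
print(relax_once(fine, [4]))   # → [5]: the light cell expands to take work
```

In two dimensions the same force computation is applied at each mapped coarse-grid vertex, with the cell boundaries constrained to remain straight lines.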
We have also applied the algorithm to a real molecular dynamics calculation. The full details will be reported elsewhere.

3.1 Simulation Results

A number of cases were studied, assigning arbitrary loads to the fine grid cells. We examined cases where the heavy load was concentrated in a single region, in several localized regions, or along one or more lines. We studied varying grid shapes and multiplicities. In every case the algorithm generated a significant improvement in the load distribution, though the approach to the best solution was not always monotonic. To avoid instability, we introduced a threshold, inhibiting any rearrangement if the force at a given vertex was too small. This is made necessary by the fact that the movement of a vertex is quantized to be at least one grid unit. If the threshold is too small, instability can occur; if the threshold is too large, the algorithm terminates before the best solution is reached.

In the figures we show the results of two examples. In the first, the imbalance is very large and represents a smooth function peaking in the upper right hand corner. In the second case, we took an alternating configuration. Initially the algorithm results in zero net force at the interior vertices, but the reflecting boundary breaks the symmetry and the interior points in turn relax after a very few iterations.

FIG. 1. These figures show the simulated results of applying the proposed algorithm to two different load distributions. In both cases there were 9 processors with 4 x 4 VPs on each. Figures A1 and B1 show the loads on each processor before (dashed bars) and after (solid bars) the load balancing. Figures A2 and B2 show the distribution of VPs among processors. The grey-scale regions represent the initial VP distribution on the processors, with darker regions corresponding to more heavily loaded processors. Within each processor, the VP loads are randomly assigned. The dark lines indicate the final distribution of VPs after balancing. In case A the imbalance was reduced from 198% to 9%, and in case B from 67% to 10%.

FIG. 2. Figure C1 shows CPU time for each step of a molecular dynamics code with load balancing (solid line) and with no load balancing (dashed line), for the same simulation of 250000 particles on 25 processors. Figure C2 shows the final configuration of VPs (fine grid) on each processor. A strong attracting point near the origin created an imbalance across the periodic boundaries; the load balance algorithm re-distributed the VPs accordingly.

4 Application to Molecular Dynamics Code

We have implemented the load balance algorithm as part of a molecular dynamics (MD) code and observe significant gains in efficiency when load balancing is used. There is large interest in load balancing MD algorithms, especially in the study of particles in non-equilibrium phases. Previous efforts to load balance MD codes are based on adjusting the size of regions and thus are not easily extended beyond one-dimensional partitions [1]. We show the results of the proposed algorithm for a two-dimensional partition below. The molecular dynamics algorithm we used for testing is a short-range, link-cell method for distributed-memory computers, similar to those in [4, 5, 2]. Our implementation simulates particles interacting in a three-dimensional parallelepiped with periodic boundary conditions. To create an imbalanced case for comparison, we start with 250000 particles equally distributed in a three-dimensional domain (with periodic boundaries), having a uniform (reduced) density of ρ = 0.25. We place an attracting point near the origin and allow the particles to interact for 7500 steps. As the simulation progresses, the attracting point causes the particles to converge toward the origin. We partition the 3D domain according to a two-dimensional grid of 5 x 5 processors. We evenly distribute 225 VPs across the 25 processors, so that each processor initially has an array of 3 x 3 VPs. The load balance algorithm is performed once every 200 steps, and the VPs are re-distributed only when I > 10%. For comparison, we also executed the same simulation and partitioning with no load balancing. The tests were performed on a Paragon XP/S-4 parallel computer.
Figure 2 shows the efficiency gained by using our load balancing algorithm in an MD code. The non-balanced case shows a steady increase in the amount of time for each MD step; when the region around the attracting point becomes saturated, the rate of increase falls off. The load balanced case shows a large gain in efficiency due to the automatic adjustment of VPs; the time per step does not rise substantially beyond that of the initial configuration. The non-monotonic behavior of the load balanced case occurs because the VPs are only moved when the imbalance I is above the threshold. There is minimal overhead due to the use of multiple VPs per processor. In our implementation, the time
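The rebalancing schedule used in the MD test (check every 200 steps, redistribute only when I exceeds 10%) can be sketched as a driver loop; the timing model and function names here are illustrative stand-ins for the actual measured per-processor times and VP relocation:

```python
def imbalance_ratio(times):
    """Normalized imbalance I = (t_max - t_bar) / t_bar."""
    t_bar = sum(times) / len(times)
    return (max(times) - t_bar) / t_bar

def run_simulation(step_times, n_steps, check_every=200, threshold=0.10):
    """Drive the simulation loop, invoking the balancer on the paper's
    schedule. Returns the steps at which a redistribution was triggered."""
    triggered = []
    for step in range(1, n_steps + 1):
        # ... one MD step per VP would run here ...
        if step % check_every == 0:
            times = step_times(step)            # measured per-processor times
            if imbalance_ratio(times) > threshold:
                triggered.append(step)          # VP relocation would run here
    return triggered

# Toy timing model: one processor slows down as particles drift toward it.
def toy(step):
    return [1.0, 1.0, 1.0, 1.0, 1.0 + step / 2000.0]

print(run_simulation(toy, 1000))   # → [400, 600, 800, 1000]
```

The threshold is what produces the non-monotonic behavior noted above: small imbalances are tolerated until they grow past 10%, at which point a redistribution pulls the step time back down.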

required to relocate a VP was less than 2 CPU seconds, and we observe that this time is recovered in just a few time steps by the increase in load balance efficiency.

5 Future Development

There are a number of ways in which the algorithm as developed and implemented to date can be extended and improved. (1) It should be generalized to three-dimensional decompositions and non-rectangular space-filling grids. (2) The approach should be enhanced with a good graphical interface to allow interactive control of the rearrangement strategy. (3) The pressure concept might be generalized to allow curved boundaries, to achieve better final distributions. (4) To improve convergence for large problems, a nested approach could be taken in which groups of processors are merged into supercells to which the same algorithm is applied, the individual processor mappings then being refined. (5) The algorithm should be extended to allow for heterogeneous programming environments. (6) Some method should be developed to allow for possible memory constraints restricting the number of VPs which a single processor can handle. (7) Other virtual processor algorithms should be developed for cases where the communication pattern is other than local.

6 Conclusions

We have proposed a fairly robust approach to a class of load balancing problems which can be implemented with very little impact on the details of the application code being balanced. Preliminary results indicate that, although crude, the method can significantly reduce the load imbalance in many cases with very little effort on the programmer's part, and very little dependence on the details of the architecture.

Acknowledgements

YFD and RAM thank Professor James Glimm for encouragement and the National Science Foundation for partial financial support (grant DMS-9201581). The work at BNL was supported by the U.S. Department of Energy under contract number DE-AC02-76CH00016.

References

[1] F. Bruge and S. L.
Fornili, A distributed dynamic load balancer and its implementation on multi-transputer systems for molecular dynamics simulation, Computer Physics Comm., 60 (1990), pp. 39-45.

[2] P. S. Lomdahl, P. Tamayo, N. Gronbech-Jensen, and D. M. Beazley, 50 GFlops Molecular Dynamics on the Connection Machine 5, Proceedings of SUPERCOMPUTING 1993, IEEE Press.

[3] R. B. Marr, J. E. Pasciak, and R. Peierls, IPX - Preemptive remote procedure execution for concurrent applications, Brookhaven National Laboratory report BNL-60632 (1994).

[4] S. Plimpton, Fast parallel algorithms for short-range molecular dynamics, preprint SAND91-1144, Sandia National Laboratories, (1993).

[5] D. C. Rapaport, Multi-million particle molecular dynamics II, Computer Physics Comm., 62 (1991), pp. 217-228.

DISCLAIMER

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.