appeared in: Information Sciences, Vol. 84, Issue 1-2 (May 1995), pp. 115-128

Decentralized Dynamic Load Balancing: The Particles Approach

Hans-Ulrich Heiss
Department of Informatics and Automation
Technical University of Ilmenau, Germany

Michael Schmitz
Department of Informatics
University of Karlsruhe, Germany

Abstract: We consider the problem of mapping tasks to processor nodes at run-time in multiprogrammed multicomputer systems (i.e., message-passing MIMD systems). Besides load balancing, the goal is to place intensively communicating tasks close together to minimize communication delays. Since the placement has to be done at run-time by migrating tasks, migration cost is also considered. Our decentralized approach is inspired by a physical analogy: tasks are regarded as particles acted upon by forces, and each aspect of the allocation goal is modeled by a dedicated force. Migration activities cease when a stable situation with balanced forces is reached. Simulations for multiprogrammed hypercube and 2D-grid topologies confirmed the usefulness of this more general approach and suggest its superiority over other decentralized algorithms for the environment considered.
1 INTRODUCTION

Multicomputers are MIMD machines with local, private memories and an interconnection network for message passing. We assume that each processor is autonomous in the sense that it is controlled by a private operating system kernel which provides basic functions such as task generation, multitasking, and intertask communication (within the node and across node boundaries). In addition, we consider multiprogramming operation, i.e., many independent parallel programs may run at the same time, sharing the machine in time and space. The programs may enter the system at any node and at any time. Each of these programs is assumed to consist of a dynamic set of parallel tasks that communicate with each other by exchanging messages. In such an environment, there exists the problem of how to (re)allocate the particular tasks to processors to achieve low response times. Two classes of approaches can be distinguished:

Static Mapping (contractive embedding of parallel programs). Given a parallel program with m communicating tasks and a multicomputer with n < m processors, the problem is to find a mapping of the tasks to the processors that minimizes the program's execution time. Heuristic algorithms for this NP-hard problem try to find a balanced allocation with minimal communication delays. They are often based on graph partitioning techniques [16] and use modern optimization heuristics such as Simulated Annealing or Genetic Algorithms [5,14]. However, they are designed to be executed off-line (e.g., in a separate step at link time, running on the front-end of the parallel machine) and consider neither the actual load situation at run-time (due to multiprogramming) nor the data-dependent behavior of the program. They are usually also unable to deal with dynamic task creation and deletion during program execution.

Dynamic Load Balancing. The other class of approaches originates from distributed systems where new tasks may enter the system at any time and at any node. The problem is to distribute the tasks as evenly as possible to avoid idling nodes. If, in addition, the tasks are elements of parallel programs, load balancing helps to process tasks at different processors at the same speed, which leads to low synchronization delays and thus to low program response times.
Since the allocation is done at run-time, the basic mechanism is the migration of a task from one node to another, which usually means the transfer of a considerable amount of data. Most of the proposals use a decentralized algorithm where load balancing agents at the particular nodes negotiate the possible transfer of tasks from overloaded to underloaded nodes. Intertask communication is usually ignored, and where it is considered it plays a minor role, since the algorithms are mostly dedicated to systems based on local area networks where all processors can be regarded as adjacent [7,10,18].

For the environment described above, a task allocation mechanism needs a combination of features and goals of both classes. We therefore propose a decentralized dynamic load balancing mechanism that explicitly considers communication and migration overhead. Our algorithm is based on a physical model that uses the notion of forces corresponding to independent optimization goals.

The basic idea behind our approach is the following: imagine a flat container with an even bottom where different amounts of non-mixable fluids of different viscosity are placed at different points (Figure 1). Gravitation forces the fluids to run, but the frictional resistance and cohesion forces that make up the viscosity work against it. Thinner fluids may spread out evenly across the bottom of the container, while more viscous fluids stick together like a lump. After some time, the fluid distribution will reach a stable state with balanced forces. Using this image, we consider the parallel computations as fluids with the tasks as particles. The load potential at each node of the system defines a gravitational force. Communication relations, along with their intensities, correspond to the cohesion forces in direction and magnitude. The cost of migrating tasks acts as frictional resistance and also works against load balancing. By using this model, our algorithm pursues the following goals in parallel:

(i) minimization of load unbalance,
(ii) minimization of communication costs,
(iii) avoidance of unproductive migrations,
(iv) stability, i.e., avoidance of oscillations.
The rest of the paper is organized as follows: In Section 2, we introduce our system model and define the particular forces, which are then used to explain the algorithm in Section 3. Section 4 presents some results from simulations conducted for a 2D-grid machine topology, and Section 5 concludes the paper by summarizing the major features of our approach.

2 THE SYSTEM MODEL

A natural formal model of a multicomputer is a processor connection graph $PCG = (P, L)$ with

P — set of processor elements (vertices),
$L \subseteq P \times P$ — set of interprocessor links (edges),
$a(k,l)$ — time it takes to transport one data unit from processor k to processor l,
$\mu_k$ — speed of processor k (# instructions per second).

A single parallel program consists of a set of interacting tasks or threads and can therefore also be modeled as a graph, known as a task interaction graph $TIG = (T, C)$ with

T — set of tasks (vertices),
$C \subseteq T \times T$ — set of communication channels (edges),
$c(i,j)$ — communication intensity, $(i,j) \in C$ (# data units),
$s_i$ — length of task $t_i$ (# instructions to be executed),
$d_i$ — size of the description of task $t_i$ (amount of data to be migrated).

Programs are generated dynamically with random interarrival times. The generation (or arrival) site is chosen randomly.

2.1 Load Balancing Force

Forces are the result of different levels of potentials. A load balancing force is therefore based on the definition of a load potential. In the framework of our model, three different definitions of the load potential of a node are possible:

(i) the number of tasks assigned:
$$V_{load}^{k} := |\{ j : loc(j) = k \}| \qquad (1a)$$

(ii) the amount of work assigned:
$$V_{load}^{k} := \sum_{j: loc(j)=k} s_j \qquad (1b)$$

(iii) the time it takes to execute the assigned work:
$$V_{load}^{k} := \frac{1}{\mu_k} \sum_{j: loc(j)=k} s_j \qquad (1c)$$

where $loc(i)$ denotes the index of the processor node where task $t_i$ is located. Definition (iii) is the most accurate because it considers different processor speeds and different sizes of the tasks. If all processors are identical, then definition (ii) is equivalent. If all tasks are alike, or if we do not know the lengths of the tasks, we must resort to definition (i). In principle, any other quantity expressing the load-related delay can be used. For the results reported below, we assume definition (1a).
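For illustration, definitions (1a)-(1c) can be written as the following minimal sketch; the Task record and the speed table mu are illustrative assumptions, not part of the paper's model.

    # Minimal sketch of the load potential definitions (1a)-(1c).
    # The Task record and the speed table are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class Task:
        loc: int     # index of the node the task currently resides on
        s: float     # length s_i of the task (# instructions)

    def v_load_count(tasks, k):
        """(1a): number of tasks assigned to node k."""
        return sum(1 for t in tasks if t.loc == k)

    def v_load_work(tasks, k):
        """(1b): total amount of work assigned to node k."""
        return sum(t.s for t in tasks if t.loc == k)

    def v_load_time(tasks, k, mu):
        """(1c): execution time of node k's assigned work; mu[k] = node speed."""
        return v_load_work(tasks, k) / mu[k]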
Based on this load potential, one could define a load balancing force between two adjacent nodes as proportional to the difference of the two potentials. However, since load balancing is intended to achieve roughly equal processing delay at different nodes, a relative measure of the load difference is more appropriate than an absolute one. Assume, for instance, a load potential using (1a) and a local round-robin scheduling policy, and consider two pairs of nodes, one pair with load potentials of 1 and 5, the other pair with 11 and 15: both pairs have the same load difference of 4, but with regard to the processing speed this means a factor of 5 in the former case and a factor of only 15/11 in the latter. Leveling out the load difference at the former processor pair is therefore more important. Thus, we prefer to define the load balancing force as the ratio of the potentials:

$$f_{lb}^{j \to k}(t_i) := c_{lb} \, \frac{V_{load}^{j} + 1}{V_{load}^{k} + 1} \qquad (2)$$

as the magnitude of a force acting on a task $t_i$ which is located at node j, pointing to node k, where j and k are direct neighbors. The increment +1 in (2) is applied simply to deal with potentials of 0. The coefficient $c_{lb}$ can be considered a universal constant used for weighting the different forces.
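As a minimal sketch of (2), with the coefficient value taken from the empirically tuned parameter set of Section 4:

    def f_lb(v_load_j, v_load_k, c_lb=0.8):
        """Load balancing force (2) on a task at node j pointing to
        neighbor k; the +1 increments handle load potentials of 0."""
        return c_lb * (v_load_j + 1.0) / (v_load_k + 1.0)

    # The example from the text: both node pairs differ by 4 tasks, but
    # the force for the (5, 1) pair is much stronger than for (15, 11).
    print(f_lb(5, 1))    # 2.4
    print(f_lb(15, 11))  # ~1.07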
2.2 Communication Force (Cohesion)

The communication force is based on the communication intensities between the tasks. To calculate it, we need to know, at least approximately, the current positions of the communication partners. The communication cost for a pair of communicating tasks can be defined as the product of the amount of data sent and the distance. Assuming that the communication intensity $c(i,j)$ is zero for non-communicating tasks, we define the communication potential of a task $t_i$ residing at node k as its total communication costs (summed over all communication partners):

$$V_{com}^{k}(t_i) := \sum_{j=1}^{|T|} a(k, loc(j)) \, c(i,j) \qquad (3)$$

The force trying to minimize the communication cost of task $t_i$ is then defined as the difference of the potentials:

$$f_{com}^{j \to k}(t_i) := c_{com} \left( V_{com}^{j}(t_i) - V_{com}^{k}(t_i) \right) \qquad (4)$$

In other words, $f_{com}^{j \to k}(t_i)$ indicates the gain in communication cost achieved by moving $t_i$ from node j to node k.

2.3 Damping Forces

Migration means that the current state of the task has to be shipped, which in the worst case is its entire address space. Shipping the task and installing it at the target node incurs both communication and execution costs. Thus, migration is useful only in cases where the gain achieved by migration outweighs the incurred cost. This is analogous to the physical friction of a body, which acts as a counterforce to any other force attracting the body: any attracting force must exceed the friction in magnitude to actually move the body. Consequently, we model migration costs as friction. Let $d_i$ be the description of task $t_i$, i.e., the amount of data that must be transferred for a migration. Then the frictional force is defined as

$$f_{frict}^{j \to k}(t_i) := c_{frict} \, d_i \, a(j,k) \qquad (5)$$

Again, $c_{frict}$ serves as a constant to adjust the combined effect of the different forces.

In addition to friction, we introduce another damping component that counteracts migration: with each task $t_i$, we associate a migration counter $z_i$ that is initialized to zero when the task is generated and incremented each time the task is migrated. A constant called max_migs defines the maximum number of migrations permitted. A migration takes place only if $x_i = 1$ with

$$x_i := \begin{cases} 1, & \text{if } r \ge z_i / \mathit{max\_migs} \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$

where r is a random variable drawn from a uniform distribution over the interval [0,1).
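A sketch of the remaining force components and the eligibility test, under illustrative interfaces (a is a distance table with a[k][k] = 0, c(i,j) an intensity function, locs the current task placement; coefficient defaults are again taken from the tuned parameter set of Section 4):

    import random

    def v_com(i, k, locs, a, c):
        """Communication potential (3): total communication cost of task i
        if it resided at node k; locs[j] is the current node of task j."""
        return sum(a[k][locs[j]] * c(i, j) for j in range(len(locs)) if j != i)

    def f_com(i, j, k, locs, a, c, c_com=6.5):
        """Cohesion force (4): gain in communication cost if task i
        moves from node j to node k."""
        return c_com * (v_com(i, j, locs, a, c) - v_com(i, k, locs, a, c))

    def f_frict(d_i, j, k, a, c_frict=2.5):
        """Frictional force (5): cost of shipping d_i data units of task
        description over the distance a[j][k]."""
        return c_frict * d_i * a[j][k]

    def eligible(z_i, max_migs=15):
        """Eligibility test (6): a task migrated z_i times so far remains
        eligible with probability 1 - z_i/max_migs; after max_migs
        migrations it is pinned to its node."""
        return random.random() >= z_i / max_migs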
Besides reducing the algorithm's overhead by limiting the number of eligible tasks, this mechanism makes sure that no task will be sent around forever, since a task is migrated at most max_migs times. Therefore, after an initial placement of tasks, and if no load is added or removed, migrations will eventually cease and bring the load balancing activities to a stop.

3 THE ALGORITHM

Load balancing should take place when the load situation has changed. The load situation changes when tasks are generated or finished. Generating new tasks makes the generating site a possible sender of load; it therefore initiates a load balancing step. If tasks finish, the finishing site informs all direct neighbors about the load change and continues normal operation. If necessary, one or more of the informed neighbors start the load balancing procedure.

After collecting all required load information from its neighbors, the load balancing node determines which of its tasks are eligible for migration by evaluating the $x_i$'s according to (6). For each eligible task, all forces are calculated that attract the task in one of the possible directions defined by the direct links. The resultant is composed as a linear combination of the different force types:

$$f_{res}^{j \to k}(t_i) := f_{lb}^{j \to k}(t_i) + f_{com}^{j \to k}(t_i) - f_{frict}^{j \to k}(t_i) \qquad (7)$$

At node j, this expression is the force component acting in the direction of k. Because migrations are possible only along the links, it makes no sense to calculate the total resultant force as would be done in Euclidean space. Instead, we take the maximum of all these direction components and use it as the total force acting on a task (Figure 2). Of all tasks that are attracted to a neighbor, we choose the one with the maximum magnitude and initiate its migration. After each migration, the source node informs its neighbors about the new load situation, which may lead to a domino effect of load balancing activities. At each node, the load balancing stops if either no task is eligible for migration or all forces are negative (Figure 3).

Considering the load balancing force only, such asynchronous, decentralized load balancing algorithms can be proven to converge to an equilibrium load distribution [2,3]. With the additional communication force, however, the algorithm realizes a distributed gradient search, which tends to converge to a local minimum unless the damping factors bring the algorithm to a premature stop.
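The per-node selection step described above can be sketched as follows; f_res(i, k) is assumed to evaluate (7) for task i and direct neighbor k of the local node:

    def select_migration(migratable, neighbors, f_res):
        """One load balancing step at a node: evaluate the resultant force
        (7) for every eligible task and every direct neighbor, and return
        the (task, neighbor) pair with the largest force if that force is
        positive. Returns None when the node is locally balanced."""
        best_f, best_pair = float("-inf"), None
        for i in migratable:
            for k in neighbors:
                f = f_res(i, k)
                if f > best_f:
                    best_f, best_pair = f, (i, k)
        return best_pair if best_f > 0 else None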
However, since we are considering a highly dynamic multiprogramming environment with arriving and departing programs, the 'landscape' of the objective function is permanently changing. Thus, every algorithm is bound to trail a moving global optimum in a heuristic way.

The algorithm assumes knowledge of some program characteristics that may not be readily available. The most difficult data to obtain is the communication intensity between tasks, which may be unknown in advance. There are different ways to address this problem, e.g.:

- Over different executions of a program, the operating system can measure its behavior and provide an execution profile of the program. The profile describes quantitatively the expected behavior of the program.
- If no such profile is available, the algorithm can start with the assumption that communication intensities are uniform, i.e., assuming a task interaction graph with unit edge weights. During the course of the program's execution, the actual intensities may be measured and used by the algorithm (see the sketch below).
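A minimal sketch of the second option, assuming messages can be counted at send time; the exponential smoothing and its factor alpha are illustrative choices, not taken from the paper:

    from collections import defaultdict

    class IntensityEstimator:
        """Running estimate of the communication intensities c(i,j):
        start from the uniform unit-edge-weight assumption and blend in
        the traffic actually observed during execution."""

        def __init__(self, alpha=0.2):
            self.alpha = alpha                  # smoothing factor (assumed)
            self.c = defaultdict(lambda: 1.0)   # unit edge weights initially

        def observe(self, i, j, data_units):
            """Record data_units sent between tasks i and j."""
            key = (min(i, j), max(i, j))
            self.c[key] = (1 - self.alpha) * self.c[key] + self.alpha * data_units

        def intensity(self, i, j):
            return self.c[(min(i, j), max(i, j))]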
4 SIMULATION RESULTS

The algorithm was evaluated by simulations using PASTE (Processor Allocation Strategy Test Environment), a testbed designed as a tool to analyze task allocation algorithms in parallel computers [11]. The machine topology used was an 8x8 two-dimensional grid (a 6D hypercube was also simulated but did not produce significantly different results). We assumed a processor speed of 200 Mips per node and a communication capacity of 10 Mbyte/sec for each link. We used an open model with a program arrival rate large enough to produce a considerable load. Programs arrived at randomly selected nodes and immediately started execution (no queuing). Each program consisted on average of 12 interacting tasks with an exponentially distributed execution time with a mean of 50 sec (10^10 instructions). Each task had between 1 and 6 communication partners (node degree of the TIG). The communication was synchronous (rendezvous), requiring the sender to wait until the data was received by the partner. Multitasking at each node was implemented with a round-robin scheduling policy. The task size (amount of data for migration) was uniformly distributed between 50 and 500 Kbyte. Total simulation time was controlled by a confidence-level criterion which stopped the run when the mean response time reached an accuracy of ±5% with 90% confidence; this roughly corresponded to 1000 program generations. Each point in the following graphs represents one simulation run.

To show the versatility of our approach, we focus on the influence of the weighting coefficients on different performance measures. In general, the optimal coefficient values depend mainly on the characteristics of the underlying machine architecture and are rather insensitive to changes in the program behavior. For the hardware parameters used, we found empirically that the parameter set ($c_{lb} = 0.8$, $c_{com} = 6.5$, $c_{frict} = 2.5$, and max_migs = 15) was roughly optimal with regard to the mean response time. This parameter set was the basis for the following figures.

The effect of switching off some forces (simply by setting the corresponding coefficient to 0) on the mean response time is shown in Figure 4; it allows some comparison with other algorithms already proposed. Ignoring both communication and migration costs makes our algorithm similar to the Gradient Model [13]. Ignoring migration cost only roughly corresponds to the mapping algorithm proposed in [4]. Figure 4 indicates that for low load levels pure load balancing is sufficient, because the predominant goal is to avoid idle processor nodes. Due to the rendezvous-type communication, load balancing also helps to keep communication delays low: if all tasks of a program proceed at the same speed, synchronization delays are minimized. At higher utilization levels, however, where the probability of idle nodes is low, the potential gain that can be achieved by sending tasks to idle nodes decreases, and the interest shifts from load sharing to overhead avoidance. Therefore, as we increase the arrival rate, pure load balancing delivers the worst results, and it becomes necessary to consider communication and migration overhead.

For the high-load region (λ > 0.25 in Figure 4), where it is profitable to take into account aspects other than load balancing, Figure 5 shows the sensitivity of three performance indices with regard to changing the values of the weighting coefficients.
Besides the mean response time of the programs, Figure 5 depicts:

- the (effective) utilization, defined as the accumulated user busy time of a processor (excluding overhead caused by the load balancing activities) divided by the total simulation time, averaged over all processors;
- the load unbalance index, defined as the variance of utilization over the different processors, averaged over the simulation time (based on 10-second sampling intervals).

Figure 5a shows the effect of $c_{lb}$ and confirms the expectation that load balancing is necessary to achieve low response times and high utilization. The course of the response time is interesting, since it gives some evidence of the existence of a minimum, which means that there is an optimal value for $c_{lb}$: giving load balancing too much weight compared to the other goals deteriorates the performance.

Figure 5b reports the variation of the communication (cohesion) force coefficient. Here, too, the response time curve exhibits a minimum, meaning that some effort to minimize communication overhead is advantageous to system performance. (Note that for programs with asynchronous communication, the impact of this force on the response time would be somewhat weaker, since communication could be partially overlapped with computation. With the synchronous communication considered here, overlapping occurs only due to multitasking at the nodes.) Not surprisingly, the communication overhead (as well as the migration overhead, not shown here) becomes lower with increasing communication force (high viscosity).

Figure 5c clearly indicates that migration costs have to be taken into account. Up to an optimal value of the coefficient, the response times drop continuously. As for the communication force, however, too much weight on this frictional force hampers migration and load balancing and deteriorates performance. The migration overhead is monotonically decreasing and the load unbalance index monotonically increasing, as expected.

All response time curves (left column in Figure 5) exhibit a convex functional dependency on the coefficients with a clear minimum. This underlines the necessity of including the corresponding forces. The figures indicate that, for the environment under consideration, i.e., multiprogramming of parallel programs, (i) minimizing response times, (ii) maximizing utilization, and (iii) minimizing load unbalance are different goals that must be clearly distinguished.
We can have perfect load balance and high utilization but nevertheless bad response times, since the tasks of the programs may not be well mapped with regard to their communication relations. All figures confirm what could be expected from the model, e.g., that load balance and utilization are monotonically increasing functions of the influence of the load balancing force, while the cohesion and frictional forces have the opposite effect.

Not shown is the effect of varying the maximum number of migrations. Here, too, there is an optimal value with regard to the mean response time, but the sensitivity is rather low. Extensive experiments suggest a value of the order of the diameter of the processor topology. Further increasing max_migs may still improve the load balance, but may deteriorate response times.

5 CONCLUSION

We proposed a new decentralized heuristic algorithm for the problem of dynamically distributing the tasks of parallel programs in large multicomputer systems with multiprogramming. In contrast to other load balancing schemes, it reflects the communication patterns among the tasks of the parallel programs, so it also solves the mapping problem for parallel programs at run-time. The approach is based on a physical analogy and comprises load balancing as well as the minimization of communication and migration overhead. The weighting coefficients that balance and fine-tune the effects of the subgoals can be tailored to pursue a large range of different objectives.

Experiments based on simulations for 2D-grid and hypercube topologies confirm the necessity of accounting for communication and migration overhead in order to achieve low response times. Especially under high loads, load balancing alone is not sufficient, and the consideration of the additional forces can reduce the response times by up to 50%. To further prove its usefulness in real operation, the proposed algorithm will be implemented in our distributed operating system COSY [6], which currently runs on transputer clusters.
REFERENCES

1. Ahmad, I.; Ghafoor, A.: "Semi-Distributed Load Balancing for Massively Parallel Multicomputer Systems." IEEE Trans. on Software Engineering 17,10 (Oct. 1991), pp. 987-1004.
2. Bertsekas, D.P.; Tsitsiklis, J.N.: Parallel and Distributed Computation. Prentice Hall (1989).
3. Boillat, E.J.: "Load Balancing and Poisson Equation in a Graph." Concurrency: Practice and Experience 2,4 (Dec. 1990), pp. 289-313.
4. Boillat, E.J.; Kropf, P.G.: "A Fast Distributed Mapping Algorithm." CONPAR 90 (Int. Conf. on Vector and Parallel Processing), pp. 405-416.
5. Bultan, T.; Aykanat, C.: "A New Mapping Heuristic Based on Mean Field Annealing." Journal of Parallel and Distributed Computing 16 (1992), pp. 292-305.
6. Butenuth, R.: "The COSY Kernel as an Example for Efficient Kernel Call Mechanisms on Transputers." Proc. 6th Transputer/occam Int. Conference (June 16-17, 1994), Tokyo, Japan.
7. Casavant, T.L.; Kuhl, J.G.: "A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems." IEEE Trans. on Software Engineering 14,2 (Feb. 1988), pp. 141-154.
8. Chow, Y.C.; Kohler, W.K.: "Models for Dynamic Load Balancing in a Heterogeneous Multiprocessor System." IEEE Trans. on Computers 28,5 (May 1979), pp. 354-361.
9. Cybenko, G.: "Dynamic Load Balancing for Distributed Memory Multiprocessors." Journal of Parallel and Distributed Computing 7 (1989), pp. 279-301.
10. Eager, D.L.; Lazowska, E.D.; Zahorjan, J.: "A Comparison of Receiver-Initiated and Sender-Initiated Adaptive Load Sharing." Performance Evaluation 6 (1986), pp. 53-68.
11. Heiss, H.-U.; Payer, A.: "PASTE: A Tool for Evaluation of Processor Allocation Strategies." Proc. 6th Int. Conf. on Modelling Tools and Techniques for Computer Performance Evaluation, Edinburgh, Sept. 1992, pp. 367-371.
12. Heiss, H.-U.; Schmitz, M.: "Distributed Load Balancing Using a Physical Analogy." Internal Report No. 5/92, Dept. of Informatics, University of Karlsruhe (May 1992).
13. Lin, F.C.H.; Keller, R.M.: "The Gradient Model Load Balancing Method." IEEE Trans. on Software Engineering 13,1 (Jan. 1987), pp. 32-38.
14. Mühlenbein, H.; Gorges-Schleuter, M.; Krämer, O.: "New Solutions to the Mapping Problem of Parallel Systems: The Evolution Approach." Parallel Computing 4 (1987), pp. 269-279.
15. Nicol, D.M.; Saltz, J.H.: "Dynamic Remapping of Parallel Computations with Varying Resource Demands." IEEE Trans. on Computers 37,9 (Sept. 1988), pp. 1073-1087.
16. Sadayappan, P.; Ercal, F.; Ramanujam, J.: "Cluster Partitioning Approaches to Mapping Parallel Programs onto a Hypercube." Parallel Computing 13 (1990), pp. 1-16.
17. Shin, K.; Chang, Y.: "Load Sharing in Distributed Real-Time Systems with State-Change Broadcasts." IEEE Trans. on Computers 38,8 (Aug. 1989), pp. 1124-1142.
18. Theimer, M.M.; Lantz, K.A.: "Finding Idle Machines in a Workstation-Based Distributed System." IEEE Trans. on Software Engineering 15,11 (Nov. 1989), pp. 1444-1457.
[Figure 1: Situation in a container with fluids of different viscosity. (a) adding a new fluid; (b) after reaching stability (balanced forces).]

[Figure 2: Forces $f_{res}^{1}(t_i), \ldots, f_{res}^{5}(t_i)$ acting on $t_i$ along the links to the neighboring nodes $p_1, \ldots, p_5$. The maximum force component points to $p_4$; thus, $t_i$ will be migrated to $p_4$.]
cycle                                            -- periodic activation
    wait for load change
    if load increase then
        M := { i ∈ T : x_i = 1 }                 -- determine migratable tasks
        load_balanced := false
        while M ≠ ∅ and not load_balanced do
            get load information from neighbors
            for all i ∈ M do                     -- for all migratable tasks,
                for all k ∈ neighbors do         -- calculate the forces in all
                    calculate f(i,k)             -- possible directions
                end for
            end for
            cand := arg max over i ∈ M, k ∈ neighbors of f(i,k)
                                                 -- task with the largest force
            direct := arg max over k ∈ neighbors of f(cand,k)
                                                 -- direction of the largest force
            if f(cand,direct) > 0 then           -- if the force is positive,
                migrate(cand, direct)            -- migrate the candidate in
                send new load data to neighbors  -- that direction and distribute
                                                 -- the new load data
            else
                load_balanced := true
            end if
        end while
    else                                         -- load decrease: update and
        send new load data to neighbors          -- distribute the load data
    end if
end cycle

Figure 3: Outline of the load balancing algorithm
[Figure 5: Impact of varying the strength of particular forces on different performance measures. Each row plots response time [s], utilization [%], and load unbalance index: (a) variation of the load balancing coefficient $c_{lb}$ (0.0-1.2); (b) variation of the communication coefficient $c_{com}$ (0-200); (c) variation of the frictional coefficient $c_{frict}$ (0-50).]
[Figure 4: Mean response time [s] as a function of the total load (arrival rate λ, 0.0-0.5) for different allocation schemes: complete; load balancing and friction; load balancing and communication; load balancing only.]