Performance Comparison of Dynamic Load-Balancing Strategies for Distributed Computing




A. Cortés, A. Ripoll, M. A. Senar and E. Luque
Computer Architecture and Operating Systems Group
Universitat Autònoma de Barcelona. 08193-Bellaterra (Barcelona), Spain
e-mail: {a.cortes, a.ripoll, m.a.senar, e.luque}@cc.uab.es

Abstract *

The DASUD (Diffusion Algorithm Searching Unbalanced Domains) algorithm belongs to the nearest-neighbours class and operates in a diffusion scheme where a processor balances its load with all its neighbours. DASUD detects unbalanced domains and performs local exchanges of load between processors to achieve global balancing. The DASUD algorithm has been evaluated by comparison with another well-known strategy, namely, the SID (Sender Initiated Diffusion) algorithm, across a range of network topologies including ring, torus and hypercube, where the number of processors varies from 8 to 128. From the experiments we have observed that DASUD outperforms the other strategy as it provides the best trade-off between the balance degree obtained at the final state and the number of iterations required to reach such a state. DASUD is able to coerce any initial load distribution into a highly balanced global state and also exhibits good scalability properties.

* This work was supported by the CICYT under contract TIC 95-0868.

1. Introduction

The load-balancing problem in parallel computation is concerned with how to distribute the workload of a computation among the available processors so that each processor has the same, or nearly the same, amount of work to do. In most cases, load balancing is done prior to execution and only once; this is called static load balancing, or static mapping. Static balancing can be quite effective for computations that have predictable run-time behaviours [1]. For computations whose run-time behaviour is non-deterministic or not so predictable, however, performing load balancing only once at the beginning is insufficient. In these cases, it might be better to perform the load balancing more than once, or periodically, during run-time, so that the distribution of work tracks the problem's variable behaviour and the available computational resources. For example, in data parallel applications, the computational requirements associated with different parts of a problem domain may change as the computation proceeds. This occurs when the behaviour of the physical system being modelled changes with time. Such adaptive data parallel computations appear frequently in scientific and engineering applications such as molecular dynamics and computational fluid dynamics.

This paper is about load balancing in distributed-memory message-passing parallel computers. Each processor has its own address space and has to communicate with other processors by message passing. In general, a direct point-to-point interconnection network is used for the communications. Many commercial parallel computers are of this class, including the Intel Paragon, the Thinking Machines CM-5, the IBM SP2, the Origin 2000 and the Cray T3D/T3E. The focus is on nearest-neighbours load-balancing methods, in which every processor at every step communicates simultaneously with all its nearest neighbours in order to reach a local balance. Nearest-neighbour methods are iterative in nature because a global balanced state is reached through processors' successive local operations.
Nevertheless, the strategies proposed in [2,3,4,5,6] assume that workloads are infinitely divisible and hence represent the workload of a processor by a real number. This assumption is valid in parallel programs that exploit very fine-grain parallelism. To cover medium- and large-grain parallelism, the algorithm must be able to handle indivisible tasks; under this more realistic assumption, the previous strategies may fail to guarantee global load balance. A new algorithm, DASUD (Diffusion Algorithm Searching Unbalanced Domains), was proposed in [7,8]; it is flexible in allowing one to control the balancing quality, effective in preserving communication locality, and easily scaled to parallel computers with a direct communication network. This paper compares the DASUD algorithm with the well-known nearest-neighbour SID (Sender Initiated Diffusion) algorithm. The performance characteristics of DASUD have been evaluated using simulation experiments. The results illustrate the benefits offered by DASUD with regard to the balance quality (the maximum difference of load between processors) and the efficiency of the algorithm, measured as the number of steps and the communication cost required to drive an initial workload distribution into a uniform distribution.

The rest of the paper is organised as follows. In section 2 the DASUD strategy is described. The simulation results concerning the goodness of the DASUD strategy are reported in section 3 and, finally, the main conclusions are presented in section 4.

2. Description of the DASUD strategy

DASUD is an asynchronous nearest-neighbours strategy based on the SID (Sender Initiated Diffusion) strategy proposed by Willebeek-LeMair et al. in [5]. SID uses overlapping neighbourhood domains to achieve global load balancing over the network. A threshold identifies the overloaded processors (senders): a sender performs load balancing whenever its load level is greater than the threshold value. Once a sender is identified using the threshold, the next step is to determine the amount of load (number of tasks or data items) to transfer to the sender's neighbours. This dynamic load-balancing strategy uses local state information to guide load distribution. The processor-selection and task-transfer policies are distributed in nature: all processors in the network share the responsibility of achieving global load balance.

This strategy assumes that workloads are infinitely divisible and hence represents the workload of a processor by a real number. Nevertheless, the SID algorithm can be adapted to the integer workload model using floor and ceiling functions. However, this integer approach may fail to guarantee a globally balanced situation: although the load of each processor may differ by at most one unit from that of its neighbours, the global load balance may be very poor. DASUD was developed to solve this problem; it detects unbalanced domains and performs local exchanges of load between processors to achieve a globally balanced state (one where the maximum load difference between any two processors is one unit).

The behaviour of DASUD is summarised in figure 2.1, where the load of processor i at time t is denoted w_i(t) and w(t) = (w_1(t), ..., w_n(t)) represents the global load vector at time t. Each processor executes the same group of operations at each iteration of the load-balancing algorithm. First, each processor sends its load information to all its neighbours and receives the load information from all its neighbours as well (line 4.1). Then, it computes the load average of its domain, w̄_i(t), which includes the load of all its neighbours and itself, and it also computes the value d_ii(t) = w̄_i(t) - w_i(t) (line 4.2). If processor i is an overloaded processor, d_ii(t) will be negative (d_ii(t) < 0); otherwise, if processor i is an under-loaded processor, d_ii(t) will be non-negative (d_ii(t) >= 0). An overloaded processor i (d_ii(t) < 0) performs load balancing by apportioning its excess load only to deficient neighbours.
So, a new weight d+_ij(t) is computed for every neighbour j with a deficit of load (line 4.4), and the total amount of load deficit, td, is accumulated in order to evaluate the proportion P_ij(t) of processor i's load excess that is assigned to neighbour j (line 4.5). The amount of i's excess load to be sent is finally computed as s_ij(t) and sent to processor j (line 4.6).

DASUD Algorithm
(1) while (not converged) do begin
(2)   for ALL processors i
(3)   parbegin
      -- SID part --
(4.1)   exchange load information with all neighbours
(4.2)   compute w̄_i(t) and d_ii(t) = w̄_i(t) - w_i(t)
(4.3)   if (processor i has excess of load) then
(4.4)     evaluate the load deficits of processor i's neighbours:
            if (d_ij(t) > 0) then d+_ij(t) = d_ij(t) else d+_ij(t) = 0
(4.5)     and the portion of excess load to be moved: P_ij(t) = d+_ij(t) / td
(4.6)     send s_ij(t) = floor(-(P_ij(t) * d_ii(t))) units of load to processor j
      -- DASUD extension --
(4.7)   if (s_ij(t) = 0 for all neighbours) then
          compute w_i^max(t), w_vi^max(t), w_i^min(t), w_vi^min(t)
(4.8)   if ((w_i^max(t) - w_i^min(t)) > 1) then begin
(4.9)     if ((w_i(t) = w_i^max(t)) and (w_vi^max(t) = w_vi^min(t))) then
            distribute the excess load unit by unit among the deficient neighbours
(4.10)    if ((w_i(t) = w_i^max(t)) and (w_vi^max(t) ≠ w_vi^min(t))) then
            send one unit of load to one of the lowest-loaded neighbours
(4.11)    if (w_i(t) ≠ w_i^max(t)) then
            send an instruction message to a highest-loaded neighbour saying it must send a unit of load to one of the lowest-loaded processors
          end
(4.12)  if no units of load have already been sent then
          receive and sort all instruction messages and send a unit of load to the first processor of that list
(5)     delete the rest of the instruction messages
(6)   parend
(7) end

Figure 2.1 The DASUD algorithm
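To make figure 2.1 concrete, the following Python sketch performs one synchronous sweep of these rules over a global load vector. This is our own illustration rather than the authors' code: the real algorithm is asynchronous and message-based, and action 3 (the instruction messages of lines 4.11-4.12, described below) is omitted, so this simplified form does not inherit DASUD's convergence guarantee.

```python
import math

def dasud_step(loads, neighbours):
    """One synchronous sweep of the rules of figure 2.1 over all processors.

    loads      -- list of integer workloads, one entry per processor
    neighbours -- adjacency list: neighbours[i] = list of i's neighbours
    Returns the new load vector.
    """
    new = list(loads)
    for i in range(len(loads)):
        domain = [i] + neighbours[i]
        avg = sum(loads[j] for j in domain) / len(domain)   # w̄_i(t), line 4.2
        excess = loads[i] - avg                             # -d_ii(t)
        sent = 0
        if excess > 0:
            # Lines 4.4-4.6: split the excess among deficient neighbours
            # in proportion to their deficits d+_ij(t).
            deficits = {j: max(avg - loads[j], 0.0) for j in neighbours[i]}
            td = sum(deficits.values())
            for j, d in deficits.items():
                units = math.floor(excess * d / td) if d > 0 else 0
                new[i] -= units
                new[j] += units
                sent += units
        if sent == 0:
            # Lines 4.7-4.10: nothing was diffused, so look for an
            # unbalanced domain and correct it one unit at a time.
            w_max = max(loads[j] for j in domain)
            w_min = min(loads[j] for j in domain)
            if w_max - w_min > 1 and loads[i] == w_max:
                j = min(neighbours[i], key=lambda k: loads[k])
                new[i] -= 1
                new[j] += 1
            # Line 4.11 (instruction messages) is omitted in this sketch.
    return new

# Example: an 8-processor ring with a spiked initial distribution. A step
# cap is used because this sketch omits the convergence machinery of [8].
ring = [[(i - 1) % 8, (i + 1) % 8] for i in range(8)]
loads = [3000, 0, 0, 0, 0, 0, 0, 0]
for _ in range(1000):
    loads = dasud_step(loads, ring)
```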

DASUD incorporates into the previous steps new features that detect whether a domain is balanced or not. For that purpose the following four parameters are evaluated (line 4.7): (a) the maximum load value of the domain, w_i^max(t); (b) the minimum load value of the domain, w_i^min(t); (c) the maximum load value of processor i's neighbours, w_vi^max(t); and (d) the minimum load value of processor i's neighbours, w_vi^min(t). If the maximum load difference within the domain is more than one (line 4.8), there is a load imbalance and units of load will be distributed according to the following three actions:

Action 1: If processor i is the processor with the maximum load of its domain and all its immediate neighbours have the same load, then processor i will distribute α = (w_i^max(t) - w_i^min(t) - 1) units of load among its neighbours (line 4.9).

Action 2: If processor i is the processor with the maximum load of its domain but not all the neighbours have the same load, then one unit of load is sent to one of the least-loaded neighbours (line 4.10).

Action 3: If the domain of processor i is not balanced but processor i is not the most loaded processor, then processor i will instruct one of its neighbours with maximum load to send a load unit to one of its neighbours with minimum load (line 4.11).

Finally, each processor that has not sent any load unit in the previous actions waits for the instruction messages generated by other neighbours in action 3. If instruction messages arrive, they are sorted and one unit of load is sent to the first processor in the list (line 4.12). A detailed justification of the actions carried out on the instruction messages, which are needed to ensure the convergence of DASUD, is beyond the scope of this paper; details of the formal proof of DASUD's convergence can be found in [8].

3. The experimental study

In this section, we compare the SID and DASUD algorithms with respect to their stability and efficiency. The stability (or balance quality) measures the ability of an algorithm to coerce any initial load distribution into an equilibrium state, i.e., to reach the globally uniform distribution. The efficiency is reflected by the time incurred in the load communication steps and the number of balancing steps required by the algorithm to drive an initial workload distribution into a stable distribution.

To see the effect of DASUD over SID, different processor networks were simulated with different and representative initial load distributions. The following k-ary n-cube topologies have been used: k-ary 1-cube (ring), 2-ary n-cube (hypercube) and k-ary 2-cube (two-dimensional torus). The sizes of these communication networks were 8, 16, 32, 64 and 128 processors (notice that, in order to have square k-ary 2-cubes, the torus sizes were 9 (3x3), 36 (6x6) and 121 (11x11) instead of 8, 32 and 128).

Synthetic load distributions consist of a set of initial load distributions, w(0). The total workload is denoted as L, so we can evaluate a priori the expected final load at each processor, i.e. the global load average, ⌊L/n⌋ or ⌈L/n⌉, where n is the size of the topology. In our experiments the problem size was chosen as L = 3000. Initial load distributions were classified into two main groups: likely distributions and pathological distributions. Likely distributions cover the situations that are assumed to appear in real scenarios, where most of the processors start from an initial load that is not zero. In this case, each element w_i(0) was obtained by random generation from one of four uniform distribution patterns:

- varying 25% from the global load average: ∀i, w_i(0) ∈ [L/n - 0.25·L/n, L/n + 0.25·L/n]
- varying 50% from the global load average: ∀i, w_i(0) ∈ [L/n - 0.50·L/n, L/n + 0.50·L/n]
- varying 75% from the global load average: ∀i, w_i(0) ∈ [L/n - 0.75·L/n, L/n + 0.75·L/n]
- varying 100% from the global load average: ∀i, w_i(0) ∈ [L/n - L/n, L/n + L/n]

The 25% variation pattern corresponds to the situation where all processors have a similar load at the beginning and these loads are close to the global average, i.e., the initial situation is quite balanced. At the other extreme, the 100% variation pattern corresponds to the situation where the difference of load between processors at the beginning is considerable. The 50% and 75% variation patterns constitute intermediate situations between the other two. For every likely distribution pattern, 10 different initial load distributions were used.

The group of pathological distributions was also used, in order to evaluate the behaviour of the strategies under extreme initial distributions in which a significant number of processors have a zero initial load. These scenarios seem less likely to appear in practice, but we have used them for the sake of completeness in the evaluation of the strategies. The pathological distributions were classified into four groups (see the generator sketch below):

- a spiked initial load distribution, where all the load is located on a single processor;
- 25% of idle processors;
- 50% of idle processors;
- 75% of idle processors.
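The generator sketch referenced above is a minimal Python rendering of this set-up. The function names, the rounding repair that keeps the total at exactly L, and the uniform split of load over the non-idle processors are our assumptions; the paper does not specify these details.

```python
import random

L = 3000  # total workload used in the paper's experiments

def likely_distribution(n, variation, rng=random):
    """Likely pattern: every w_i(0) is drawn uniformly from
    [L/n - variation*L/n, L/n + variation*L/n], variation in {0.25, 0.5, 0.75, 1.0}."""
    avg = L / n
    w = [round(rng.uniform(avg * (1 - variation), avg * (1 + variation)))
         for _ in range(n)]
    w[w.index(max(w))] += L - sum(w)  # crude repair: absorb rounding drift in the largest entry
    return w

def pathological_distribution(n, idle_fraction):
    """Pathological pattern: a fraction of processors holds no load at all;
    idle_fraction = (n - 1) / n gives the spiked distribution."""
    idle = round(n * idle_fraction)
    busy = n - idle
    w = [0] * idle + [L // busy] * busy  # assumed uniform split over busy processors
    w[-1] += L - sum(w)                  # give the integer remainder to the last processor
    return w
```

How the generated values are then placed onto the processors (the single mountain and chain shapes described next) is a separate step not covered here.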

In addition to the above-mentioned distributions, each one was scattered over the processors using two different shapes, defined as follows:

1. Single Mountain (SM), where the load values of the initial distribution are scattered so as to draw a single mountain surface (see figure 3.1).
2. Chain, where the load values of the initial distribution are scattered so as to draw multiple mountain surfaces (see figure 3.1).

Figure 3.1 Single Mountain and Chain shapes

As a consequence, we have evaluated not only the influence of the values of the initial load distribution, but also the influence of how these values are placed onto the processors. To sum up, the total number of distributions tested for a given processor network was 87, obtained in the following way: 10 likely distributions * 4 patterns * 2 shapes + 3 pathological distributions * 2 shapes + 1 spiked pathological distribution.

The simulation process was run until global termination was detected. This termination condition can be a limit on the number of simulation steps set beforehand, or the detection that no load movements have been carried out from one step to the next. In our experiments, simulations were stopped when no new load movements were performed from one step to the next. Although the simulation did not mimic the truly asynchronous behaviour of the algorithms, its results can still help us to understand their performance, since the final load imbalances, for instance, are similar whether the algorithm is implemented synchronously or asynchronously; the main difference is in the convergence speed.

3.1 Stability analysis

As we have mentioned above, the stability reflects the ability of an algorithm to drive any initial load distribution into an equilibrium state, that is, to reach the globally uniform distribution. Since we are dealing with integer load values, the final balanced state is the one where the maximum load difference between any two processors of the topology is zero or one, depending on L and the number of processors: if L is an exact multiple of n, the optimal final balanced state is the one where the maximum load difference between any two processors of the system is zero; otherwise, it is one.

Figure 3.2 shows the maximum load differences (dif_max) reached by the DASUD and SID algorithms for all topologies and architecture sizes when the initial load distribution varies from 25% to 100% from the global load average (i.e. all the likely distributions), and also for the pathological distributions, where the number of idle processors varies from 25% to n-1. These global results indicate that the DASUD strategy outperforms the SID strategy in all cases. On average, the maximum load difference obtained by SID was more than 4 times the maximum load difference obtained by DASUD. Moreover, the maximum load differences obtained by SID grew worse as the initial unbalance degree increased.

Furthermore, table 3.1 shows the standard deviation of load with respect to the load average obtained by both strategies for all the load distributions used in our experiments. As can be seen, DASUD achieves a deviation that is very low for all topologies and distributions (always less than 1 for hypercubes and tori, and less than 7 for rings). In contrast, SID exhibits a high deviation in all cases (always more than 4 times the deviation obtained by DASUD).
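Both quantities used in this analysis, the maximum load difference dif_max and the standard deviation of the load, are direct to compute from a load vector; a minimal sketch (the function names are ours):

```python
from statistics import pstdev

def dif_max(loads):
    """Balance quality: maximum load difference between any two processors."""
    return max(loads) - min(loads)

def load_deviation(loads):
    """Standard deviation of the loads around the global load average,
    the quantity reported in table 3.1."""
    return pstdev(loads)
```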
From the results shown in figure 3.2 and table 3.1, we can conclude that, on the one hand, DASUD achieves a smaller maximum difference than SID on average and, on the other hand, all the processors reach a better final state, i.e., the overall system is closer to the optimal balanced state. Below we analyse the influence of some parameters on the final results. First, we compare the behaviour of SID and DASUD with respect to the topology, and we give an experimentally derived upper bound for the maximum difference that DASUD can obtain for a given topology. Then, as DASUD proved to be the strategy that achieves better stability, we give more detailed information about the influence of the load distribution and its shape on its final results.

3.1.a. Influence of the topology on the stability

The topology influence is shown in figure 3.3 for likely and pathological initial load distributions, respectively. The maximum load difference obtained by SID is always greater than the one obtained by DASUD for all topologies. Moreover, DASUD demonstrates an additional quality: for hypercube and torus topologies the maximum load difference keeps nearly constant for any system size and load distribution pattern (on average, the maximum difference was 1.5). For rings, the maximum difference on average was somewhat higher, and a slight increment was observed when the initial unbalance was 75% or 100%, but it was always less than 10. By contrast, the SID algorithm always obtained a higher maximum difference and, additionally, this difference increased as the initial unbalance grew higher.

Figure 3.2 Maximum load difference for the SID and DASUD algorithms considering (a) likely initial load distributions and (b) pathological initial load distributions

Table 3.1 Standard deviation obtained on average by DASUD and SID for likely and pathological distributions

DASUD:
                  Likely distributions          Pathological distributions
No. of procs      8     16    32    64    128   8     16    32    64    128
Hypercube         0.1   0.5   0.5   0.5   0.6   0.2   0.5   0.5   0.6   0.6
Torus             0.4   0.5   0.5   0.6   0.8   0.5   0.5   0.5   0.7   0.9
Ring              0.5   1     2.3   4.4   5     0.6   1     2.3   4.8   6.2

SID:
                  Likely distributions          Pathological distributions
No. of procs      8     16    32    64    128   8     16    32    64    128
Hypercube         2.6   5.2   5.8   6.3   6.4   1.2   4.7   8.4   8.5   9.3
Torus             2.5   5.7   5.8   7.4   8.2   2.1   4.4   11.3  13.6  14.4
Ring              5.3   5.6   11    18.5  22.1  4.8   12.7  22.9  31.8  41.9

Figure 3.3 Influence of the topology: maximum load difference for the SID and DASUD algorithms considering (a) likely initial load distributions and (b) pathological initial load distributions

As figure 3.3 shows, the worst results for both strategies were obtained for the ring topology. There are two reasons for this. On the one hand, the ring topology gives each processor a small number of neighbours; as a consequence, load movement through the network is slowed down. On the other hand, a platform effect appears. As we have already mentioned, perfect balance is achieved when the maximum difference is one. However, due to the fully distributed nature of our method, a local termination condition is reached when the maximum difference is equal to one or zero within every domain of processors. This behaviour can lead to a situation where the loads finally lie in a platform fashion. Figure 3.4 illustrates the platform effect, which consists of a global unbalance caused by the existence of overlapping domains, in spite of every domain being locally balanced. In the worst case, this effect spreads along the shortest path between two processors located at the maximum distance, i.e. along a path whose length is equal to the diameter of the architecture, d. Table 3.2 shows the value of the diameter for the different topologies and sizes. If we observe the values of the maximum difference obtained, we can derive an upper bound, depending on d, for the maximum difference achieved at the end of the balancing process. We call this bound β (also shown in table 3.2), and it is equal to:

    β = ⌈d/2⌉

Figure 3.4 Globally unbalanced state with locally balanced domains (the "platform effect")

Table 3.2 Diameter of the topologies used and the corresponding β bound

                           n = 8 (9)   n = 16    n = 32 (36)   n = 64    n = 128 (121)
Topology    Diameter       d     β     d    β    d     β       d    β    d     β
Hypercube   log2 n         3     2     4    2    5     3       6    3    7     4
Torus       2·⌊√n/2⌋       2     1     4    2    6     3       8    4    10    5
Ring        ⌊n/2⌋          4     2     8    4    16    8       32   16   64    32

Figure 3.5 shows the variation of β as the value of d increases.

Figure 3.5 Platform effect for increasing diameters d and the corresponding bound β

Table 3.3 shows the maximum value of the maximum difference obtained by DASUD in all our tests. As can be seen, even in the worst case DASUD always achieves a maximum difference that does not exceed the corresponding value of β. This means that, even for highly pathological initial distributions on a ring topology, where there is a small number of intersections between multiple domains, DASUD is able to obtain a final maximum difference bounded by half of the diameter of the architecture.

Table 3.3 Maximum dif_max on average for likely and pathological distributions

                  Likely distributions                Pathological distributions
                  (% load variation)                  (idle processors)
             n:   8     16    32    64    128                8     16    32    64    128
Hypercube  25%    0.4   1     2     2.1   2.9        25%     0     1     2     2     3
           50%    0.8   1     2     2     3          50%     2     1     2     2     3
           75%    1     1     1.9   2.4   2.9        75%     0     1     2     3     3
           100%   0.2   1     1.9   2     3          n-1     2     1     2     3     3
Torus      25%    1     1     1.5   2.9   4          25%     1     1     2     3     4
           50%    0.9   1     1.6   2.4   4          50%     1     2     1     2     4
           75%    1     1.1   1.4   2.3   4          75%     1     1     2     2     4
           100%   1     1     1.7   2.6   4          n-1     1     1     1     2     4
Ring       25%    2     3     7.9   16    10         25%     2     3     8     16    24
           50%    2     3     7.5   16    21.8       50%     2     3     7     16    32
           75%    1.8   3     7.6   16    31.1       75%     2     4     8     16    32
           100%   2     3.1   7.4   16    31.2       n-1     2     3     8     16    32
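Since the diameters of the three topologies have standard closed forms, the β bound of table 3.2 is easy to reproduce; a small sketch (ours), assuming square tori and power-of-two hypercubes as in the experiments:

```python
import math

def diameter(topology, n):
    """Diameter of the three topologies used in the experiments."""
    if topology == "hypercube":          # n a power of two
        return int(math.log2(n))
    if topology == "torus":              # square k x k torus, n = k*k
        k = math.isqrt(n)
        return 2 * (k // 2)
    if topology == "ring":
        return n // 2
    raise ValueError(topology)

def beta(topology, n):
    """Experimental upper bound on DASUD's final maximum load difference."""
    return math.ceil(diameter(topology, n) / 2)

# Reproduces table 3.2, e.g. beta("ring", 32) == 8 and beta("hypercube", 128) == 4.
```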

3.1.b. Influence of the distribution on DASUD's stability

Table 3.4 shows, for likely and pathological initial distributions, the influence of the size of the architecture on the final balance when the DASUD algorithm is applied. We observed in our experiments that DASUD behaves in the same way for all topologies, so we only present the results obtained for hypercubes. From these results, we can conclude that as the number of processors increases, the maximum difference obtained at the end likewise increases. This is due to the completely distributed nature of our policy, where only local information from immediate neighbours is used during the balancing process. Moreover, the increase in the maximum difference is not very significant; for instance, on average, the maximum difference was always less than 3 when the number of processors was 128, for both likely and pathological distributions.

Table 3.4 Maximum load difference (dif_max) for the DASUD algorithm considering likely and pathological initial load distributions for hypercubes from 8 to 128 processors

                  Likely distributions            Pathological distributions
No. of procs      25%     50%     75%     100%    25%    50%    75%    n-1
8                 0.3     0.5     0.6     0.2     0      1      0      2
16                1       1       1       1       1      1      1      1
32                1.5     1.5     1.6     1.65    2      1.5    1.5    2
64                1.75    2       2.2     2       2      2      2      3
128               1.95    2.15    2.45    2.6     2.5    2      2      3

3.1.c. Influence of the shape on DASUD's stability

As a final consideration in the stability analysis, we examined the results according to the original shape used in the initial load distribution. In all the experiments we considered two different shapes for every initial load distribution: a Single Mountain (SM) shape and a Chain shape. For all topologies we observed that the final maximum load difference depends on how the load distribution was scattered through the system. Table 3.5 shows this dependency for hypercubes for both likely and pathological distributions, taking into account the variation from the global load average and the percentage of idle processors, respectively. One can observe that with the chain-shape initial scattering, the final state obtained is slightly more balanced than the final state obtained when the initial scattering corresponds to the single mountain shape. This behaviour can be explained by the fact that, with the single mountain shape, the platform effect has a great influence. With the chain shape, the workload is scattered onto various high-load areas surrounded by low-load areas; as a consequence, fewer platforms appear and they have fewer levels. The same results were obtained for the ring and torus topologies, and we therefore do not include the corresponding tables, as no additional information would thereby be provided.

Table 3.5 Influence of the shape of the initial distribution on dif_max (hypercubes)

         Likely distributions              Pathological distributions
Shape    25%     50%     75%     100%      25%    50%    75%    n-1
SM       1.68    1.76    1.84    1.62      1.6    2      1.8    2.2
Chain    0.94    1.12    1.34    1.36      1.4    1      0.8    -

3.2 Efficiency analysis

In this section we analyse the efficiency of the DASUD and SID algorithms. The efficiency reflects the time required to either reduce the variance of the processors' loads or arrive at the equilibrium state. In order to measure the time needed by both strategies to reach the final load distribution, we measure the number of simulation steps required to reach a final stable distribution, and we introduce the parameter u to measure the load movements incurred in the balancing process.
For a given step s of the simulation process, the maximum amount of load moved from any processor to one of its neighbours is called max_load(s). Under our synchronous simulation paradigm, step s does not end until max_load(s) units of load have been moved from the corresponding processor to its neighbour; therefore, the duration of each step depends directly on the value of max_load(s). We assume a communication model where a processor is able to communicate with all its nearest neighbours simultaneously. The time required to send one unit of load from one processor to any of its nearest neighbours is called the per-hop time, t_h. So, if we multiply the sum of all max_load(s), where s varies from 1 to the number of simulation steps, by t_h, we obtain the total time required for the global simulation process, which we call u:

    u = t_h · Σ_{s=1..last_step} max_load(s)

For simplicity, we assume that t_h is equal to one.

Figure 3.6 shows the SID and DASUD efficiency for likely and pathological initial load distributions in terms of u and of simulation steps, respectively. These results summarise the time required by both strategies to reach the termination condition, for all topologies. As we can observe, the DASUD algorithm needs more time than the SID algorithm to achieve the stable final state, independently of the number of processors. These results were to be expected, because the DASUD algorithm is an extension of the SID algorithm that tries to improve the balance degree of the final load distribution by detecting unbalanced domains and correcting these situations through some extra load movements. The additional time needed by DASUD is, however, moderate on average and, bearing in mind the results of the stability analysis, we can conclude that DASUD exhibits a better trade-off between the degree of global balance and the time needed to achieve it. As in the stability analysis, below we compare the efficiency of SID and DASUD with respect to the topology and we give more detailed information about the influence of the load distribution shape on DASUD's final results.

3.2.a. Influence of the topology on the efficiency

Figures 3.7 and 3.8 give more detailed information about the time, in terms of u and of the number of steps needed on average, for likely and pathological distributions. As can be observed, for tori and hypercubes the time needed by DASUD was moderately higher than the time needed by SID. In particular, for likely distributions the time of DASUD was on average twice the time of SID, but the maximum difference obtained by SID was more than 4 times the maximum difference obtained by DASUD. For pathological distributions the time of DASUD was 30% more than that of SID, while the maximum difference obtained by SID was more than 7 times the maximum difference obtained by DASUD. In that sense, the improvement in the final load balancing obtained by DASUD was not only due to the increase in the number of steps: DASUD obtains a better final load balance because it moves more load at each step and takes advantage of the overlapping of load movements.

Larger differences between DASUD and SID in time and number of steps were obtained for rings, especially for pathological distributions. This, again, is explained by the better load balancing obtained at the end. While SID suffers from the platform effect in the initial steps of the balancing process and is unable to move load to the less loaded processors, DASUD is able to significantly overcome the platform effect and can move load during several additional steps. However, due to the small connectivity exhibited by ring topologies, the amount of load moved by DASUD is very small once a certain balance degree has been achieved, and the strategy goes through many steps in which very little load is moved.

We can also observe a positive effect of system size on the performance of both strategies for a fixed problem size (L): the time required to achieve a stable state decreases as the system size increases. This characteristic holds for any percentage of load average variation. Notice that as the number of processors increases, the value of the load average decreases; consequently, the total number of load units to be moved through the system also decreases. In spite of this, both algorithms, DASUD and SID, exhibit the same behaviour. It has also been observed that if the global load average remains constant as the number of processors increases (i.e. the value of L grows with the number of processors), the time required to reach the final state remains more or less constant for any system size.

3.2.b. Influence of the shape on DASUD's efficiency

The effect of how the initial load distribution is scattered among the processors has also been investigated, in order to observe whether or not it has any influence on the total execution time. Since the time behaviour was observed to follow the same pattern for all topologies, we only show the results for hypercube topologies.
Table 3.6 shows the average time required to reach a stable state using the DASUD algorithm, depending on how the initial load distribution is scattered over the processors. Each value is the mean value over all sizes of hypercubes. For 50%, 75% and 100% variations of the load average, and for the pathological initial distributions, the value of u for chain scattering is bigger than the one obtained when single mountain scattering is applied. This is attributable to the kind of load movements generated in the two cases. When single mountain scattering is applied, all local load movements have the same global direction, from heavily loaded processors to lightly loaded ones, because this kind of load scattering generates a local gradient equal to the global one; consequently, all local load movements are productive movements. On the other hand, when chain scattering is used, some processors can see themselves as a local load maximum while not being the global maximum. This is a consequence of the distributed nature of the algorithm.

Table 3.6 Influence of the shape of the initial distribution on the time (u) incurred in the load movement (hypercubes)

         Likely distributions               Pathological distributions
Shape    25%     50%     75%      100%      25%      50%      75%      n-1
SM       40.3    71.2    97.6     144.1     107.4    163      324.4    781.4
Chain    36.9    80.2    118.7    167.1     126.6    197.4    279      -
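For completeness, the efficiency metric u defined at the start of section 3.2 reduces to a t_h-weighted sum over the per-step maximum transfers; a minimal sketch (ours), where the max_load(s) series would come from the simulator:

```python
def total_time_u(max_load_per_step, t_h=1):
    """u = t_h * sum(max_load(s) over all simulation steps s): under the
    synchronous model, every step lasts as long as its largest transfer."""
    return t_h * sum(max_load_per_step)

# Example: a run whose steps moved at most 12, 7 and 3 units costs u = 22.
```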

Figure 3.6 Efficiency in terms of (a) time incurred in the load movement and (b) number of steps

Figure 3.7 Efficiency results for likely distributions: (a) time incurred in the load movements and (b) number of steps

Figure 3.8 Efficiency results for pathological distributions: (a) time spent in the load movement and (b) number of steps

As we can see in table 3.6, the previous reasoning does not hold for the 25% load average variation. In that situation, the maximum load difference between any two processors in the initial load distribution is not very big; consequently, the local load movements generated by any processor tend to be productive ones and no penalty for unnecessary load thrashing is incurred.

We have also investigated the influence of the initial load scattering on the number of steps needed by the balancing process to reach the termination condition. Table 3.7 shows the average number of steps over all sizes of hypercubes. For single mountain shapes, the number of steps is higher than for chain shapes. This characteristic is independent of the initial unbalance degree and of the kind of distribution (likely or pathological). For all likely distributions, the number of steps required for single mountain shapes is approximately twice the number required for chain distributions.

Table 3.7 Influence of the shape of the initial distribution on the number of steps (hypercubes)

         Likely distributions               Pathological distributions
Shape    25%     50%     75%      100%      25%     50%     75%     n-1
SM       12.5    17.26   19.62    21.8      17.4    20.8    24      24.2
Chain    6.62    9.68    10.9     11.76     14      13.6    10      -

Bearing in mind the information set out in tables 3.5, 3.6 and 3.7, we can deduce that, starting from a chain-shaped placement, DASUD achieves a more balanced final state than the one attained when starting from a single mountain placement and, furthermore, it requires a smaller number of steps, since a greater quantity of load is moved in each step.

4. Conclusions

In this paper, we have compared two algorithms, DASUD (Diffusion Algorithm Searching Unbalanced Domains) and SID (Sender Initiated Diffusion), for dynamic load balancing in parallel systems. The comparison was carried out by considering a large set of load distributions that exhibit different degrees of initial workload unbalance as well as different shapes of workload placement. These distributions were applied to ring, torus and hypercube topologies, and the number of processors ranged from 8 to 128. The experiments were conducted to analyse the balance degree achieved by both strategies at the final state, the time incurred in the load movement and the number of balancing steps.

From these experiments we have observed that DASUD outperforms the SID strategy, as it provides the best trade-off between the global balance degree obtained at the final state and the number of iterations required to reach such a state. For the most common topologies (torus and hypercube), DASUD and SID spent on average a similar number of balancing steps, while the maximum difference achieved by SID was more than 4 times larger than the maximum difference obtained by DASUD. This behaviour was observed independently of the initial unbalance degree, the scattering of the loads and the number of processors. Moreover, DASUD not only obtained a smaller value for the maximum difference, but also achieved a better balance degree across all the processors in the system, as all processors ended with a final load that was very close to the optimal load average.

References

[1] G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon and D. W. Walker, Solving Problems on Concurrent Processors, vol. 1, Prentice-Hall, 1988.
[2] S. H. Hosseini, B. Litow, M. Malkawi, J. McPherson and K. Vairavan, "Analysis of a Graph Coloring Based Distributed Load Balancing Algorithm", Journal of Parallel and Distributed Computing, 10, 1990, pp. 160-166.
[3] V. Kumar, A. Y. Grama and N. R. Vempaty, "Scalable Load Balancing Techniques for Parallel Computers", Journal of Parallel and Distributed Computing, 22(1), 1994, pp. 60-79.
[4] R. Subramanian and I. D. Scherson, "An Analysis of Diffusive Load-Balancing", Proceedings of the 6th ACM Symposium on Parallel Algorithms and Architectures, 1994.
[5] M. Willebeek-LeMair and A. P. Reeves, "Strategies for Dynamic Load Balancing on Highly Parallel Computers", IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 9, September 1993, pp. 979-993.
[6] C. Z. Xu and F. C. M. Lau, Load Balancing in Parallel Computers: Theory and Practice, Kluwer Academic Publishers, 1997.
[7] A. Cortés, A. Ripoll, M. A. Senar and E. Luque, "Dynamic Load Balancing Strategy for Scalable Parallel Systems", PARCO '97, 1997.
[8] A. Cortés, A. Ripoll, M. A. Senar, F. Cedó and E. Luque, "On the Convergence of SID and DASUD Load-Balancing Algorithms", Technical Report, Universitat Autònoma de Barcelona, 1998.