A practical approach of diffusion load balancing algorithms

Emmanuel Jeannot and Flavien Vernier

Abstract

In this paper, a practical approach of diffusion load balancing algorithms and its implementation are studied. Three problems are investigated. The first one is the determination of the load balancing parameters without any global knowledge. The second problem consists in estimating the cost and the benefit of a load exchange. The last one is the convergence detection of the load balancing algorithm. For this last point we give an algorithm based on simulated annealing to reduce the convergence towards a load repartition in steps, which can occur with discrete loads. Several simulations close this paper and illustrate the impact of the various methods and algorithms introduced.

1 Introduction

One of the most important problems in distributed processing consists in balancing the work load among all processors. The purpose of load (work) balancing is to achieve better performance of distributed computations by improving load allocation. The load balancing problem has been studied by several authors from different points of view [1, 2, 3, 4, 5, 6, 7]. In this paper we focus our study on the iterative load balancing algorithms introduced in [1]. These kinds of algorithms assume that a node manages its load only with its nearest neighbors. They are generic algorithms, useful when the system is decentralized or when some nodes cannot directly communicate with all the other nodes.

However, these algorithms face several problems. Firstly, the majority of studies about these algorithms use global knowledge, such as the network or node properties, to determine the load balancing parameters. Secondly, most of these algorithms assume that balancing the load is always beneficial and leads to a reduction of the execution time. Thirdly, since the load is not infinitely divisible, the final load balancing (after convergence of the algorithm) can face a step problem. In this paper we propose a practical approach of load balancing that solves the 3 above problems.
These are 3 main problems that can appear in an implementation of these load balancing algorithms. To the best of our knowledge, no load balancing algorithm of the literature can deal with these 3 issues at the same time. We give methods to determine the diffusion parameters without any global knowledge. We propose an analysis of the cost and benefit of load exchange in order to determine when it is worth exchanging some load. The convergence of the load balancing algorithms with non-infinitely divisible loads is also studied in this paper. Finally, the given methods are efficient and easy to implement.

INRIA - LORIA, campus scientifique, BP 239, 54506 Vandoeuvre les Nancy, France. email: emanuel.jeannot@loria.fr
UHP Nancy-I - LORIA, campus scientifique, BP 239, 54506 Vandoeuvre les Nancy, France. email: flavien.vernier@loria.fr

It is important to note that, in this work, we make very few assumptions. We can deal with either static or dynamic load. The network topology can be of any type as long as it is connected. Nodes and networks can be homogeneous or heterogeneous. The notion of load is very abstract: it can be anything that requires some time to be processed (data, etc.). The proposed methods deal with static networks but the adaptation to dynamic networks is straightforward. Finally, no global knowledge is required to run the algorithm. The knowledge is limited to the neighborhood. The obtained results give, in the best case, performance gains greater than 1% and the algorithm does not always use all the available resources: it is able to find the right amount of resources that gives a good speed-up.

This paper is organized as follows. Section 2 presents the related work: we review diffusion on any static network. In Section 3.1, we study the problem of the heterogeneity of the connection links. Section 3.2 presents a decentralized method to compute the load balancing parameters. Section 3.3 is dedicated to non-infinitely divisible loads and to the detection of the convergence of the load balancing algorithm. In Section 4 we illustrate by experiments the behavior of the load balancing algorithm with the methods that we give.

2 Related Work

The studied algorithms are generally dedicated to static networks. A static network topology is classically represented by a simple undirected connected graph $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges, $E \subseteq V \times V$.
Each processor $i$ is a vertex of the graph and each communication link between two processors $i$ and $j$ is the edge $(i, j) \in E$ between the two vertices $i$ and $j$ ($i, j \in V$). Vertices are labeled from 1 to $n$, where $n$ is the number of processors, hence $|V| = n$. Let $m$ be the number of communication links ($|E| = m$). Let $F$ be the vector of edge weights and let us note $f_{i,j}$ the weight of edge $(i, j)$ ($f_{i,j} = F_k$ with $E_k = (i, j)$). Let $C_n$ be the vector of node weights such that the average of $C_n$ is normalized: $\frac{1}{n}\sum_{i} C_{n_i} = 1$.

In [1], Cybenko introduced a distributed load balancing (LB) algorithm for static networks called the diffusion algorithm or FOS (First Order Scheme). It assumes that a process $i$ balances its load simultaneously with all its neighbors. To balance the load, a ratio $\alpha_{i,j}$ of the load difference between processes $i$ and $j$ is swapped between $i$ and $j$. In the general case (on heterogeneous networks) the LB step of a process $i$ with all its neighbors is given by Equation (1), where $w_i^{(t)}$ is the work load of process $i$ at time $t$.

$$w_i^{(t+1)} = w_i^{(t)} - \sum_{j} \alpha_{i,j} \, f_{i,j} \left( \frac{w_i^{(t)}}{C_{n_i}} - \frac{w_j^{(t)}}{C_{n_j}} \right) \qquad (1)$$

Equation (1) is linear and thus it can be rewritten in matrix form: $W^{(t+1)} = M^T W^{(t)}$, where $W^{(t)}$ is the vector $(w_i^{(t)})$ and $M$ is the diffusion matrix defined by

$$m_{ij} = \begin{cases} \dfrac{\alpha_{i,j} f_{i,j}}{C_{n_i}} & \text{if } (i, j) \in E,\ i \neq j, \\ 1 - \sum_{k \,:\, (i, k) \in E} m_{ik} & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

This algorithm has often been studied and derived: Dimension Exchange Algorithm [1, 2, 8], Second Order Scheme [9, 4], dynamic networks [10, 11, 12], etc. In the literature, various methods can be found that determine the parameters $\alpha_{i,j}$ or $f_{i,j}$. There are three classical methods to compute $\alpha$: the Cybenko Choice [1], the Boillat Choice [5] and the optimal Choice [13]. The optimal Choice and the Cybenko Choice need global knowledge of the network, whereas the Boillat Choice only needs knowledge of the neighbors' degrees to determine $\alpha$:

$$\alpha_{i,j} = \frac{1}{\max(d(i), d(j)) + 1}$$

where $d(i)$ is the degree of node $i$ at time $t$. The parameter $f_{i,j}$ must be determined according to the constraints of the diffusion matrix $M$: $M$ must be stochastic, irreducible and aperiodic [1].

3 A decentralized practical approach

3.1 Cost and benefit of load balancing

Let us start by defining the cost and the benefit of a LB algorithm. The cost is the time lost by exchanging the load; it is generally due to communication. The benefit is the time gained by exchanging the load; it is due to a better balance. In Equation (1) the parameter $f_{i,j}$ corresponds to the weight of edge $(i, j)$. Hence, the parameter $f_{i,j}$ must be determined such that the cost of the LB algorithm is lower than the benefit given by the exchange of load. In our practical approach, $f_{i,j}$ is in $\{0, 1\}$. If the cost of an exchange $L_{ij}^{(t)}$ between $i$ and $j$ is greater than its benefit, then $f_{i,j}$ is set to 0 and there is no exchange between $i$ and $j$; otherwise $f_{i,j}$ is set to 1. It can be noted that, by this definition, $f_{i,j}$ depends on the time, hence it becomes $f_{i,j}^{(t)}$ and its corresponding vector $F$ becomes $F^{(t)}$.

The cost and the benefit of an exchange depend on the size of this exchange. To determine the cost of an exchange we give the following equation:

$$\mathrm{Cost}(L_{ij}^{(t)}) = \mathrm{PreExcCost}(|L_{ij}^{(t)}|) + \mathrm{ExcCost}(|L_{ij}^{(t)}|) + \mathrm{PostExcCost}(|L_{ij}^{(t)}|).$$
The cost of a load exchange $L_{ij}^{(t)}$ is the time to prepare this load for the exchange ($\mathrm{PreExcCost}(|L_{ij}^{(t)}|)$), plus the time of the exchange ($\mathrm{ExcCost}(|L_{ij}^{(t)}|)$), plus the time to integrate it on the receiver ($\mathrm{PostExcCost}(|L_{ij}^{(t)}|)$). PreExcCost and PostExcCost completely depend on the application. ExcCost only depends on the load $L_{ij}^{(t)}$ and on the edge $(i, j)$; a good estimation of this cost can be:

$$\mathrm{ExcCost}(|L_{ij}^{(t)}|) = Lat_{ij} + \frac{|L_{ij}^{(t)}|}{Bw_{ij}},$$

where $Lat_{ij}$ and $Bw_{ij}$ are respectively the latency and the bandwidth of edge $(i, j)$. Let us note that the communication can always be hidden by some computation.

The benefit given by the exchange of $L_{ij}^{(t)}$ can be estimated by the computation time on $i$ and $j$ without exchange minus the computation time on $i$ and $j$ after this exchange. Intuitively, the benefit of a load exchange must be positive if the computation time is reduced by this exchange and negative otherwise. Let us recall that the computation time on $i$ and $j$ is given by the maximum between the computation time on $i$ and the computation time on $j$. The following equation gives the benefit for the cases $L_{ij}^{(t)} > 0$ and $L_{ij}^{(t)} < 0$:

$$\mathrm{Benefit}(L_{ij}^{(t)}) = \begin{cases} \max\left(Cp_i(w_i^{(t)}),\, Cp_j(w_j^{(t)})\right) - \max\left(Cp_i(w_i^{(t)} - |L_{ij}^{(t)}|),\, Cp_j(w_j^{(t)} + |L_{ij}^{(t)}|)\right) & \text{if } L_{ij}^{(t)} > 0, \\ \max\left(Cp_i(w_i^{(t)}),\, Cp_j(w_j^{(t)})\right) - \max\left(Cp_i(w_i^{(t)} + |L_{ij}^{(t)}|),\, Cp_j(w_j^{(t)} - |L_{ij}^{(t)}|)\right) & \text{if } L_{ij}^{(t)} < 0, \end{cases} \qquad (2)$$

where $Cp_i(w_i^{(t)} + \dots)$ is the computation time of $(w_i^{(t)} + \dots)$ on $i$.

In the iterative LB algorithms, the benefit of an exchange of load at a given iteration can increase with the next iterations. The estimation of the benefit that we give in Equation (2) is evaluated on only one iteration. Hence, a parameter $k$ is introduced to estimate the benefit on the $k$ successive iterations after an exchange: $f_{i,j}^{(t)}$ is equal to 1 if and only if $\mathrm{Cost}(L_{ij}^{(t)}) < k \cdot \mathrm{Benefit}(L_{ij}^{(t)})$. The parameter $k$ can be constant or not (in Section 4 the impacts of both cases are compared). One limit of this cost/benefit system appears when the algorithm converges to a load repartition in steps. This problem is studied in Section 3.3.

3.2 Parameter computation

From the general equation of FOS we have determined the parameter $f_{i,j}^{(t)}$ in the previous section. Now, let us study the parameters $\alpha_{i,j}$ and $C_n$. In Section 2 we have seen that only the Boillat Choice does not need global knowledge to compute $\alpha_{i,j}$, but this method is limited to homogeneous networks. In this section, a method that only needs local knowledge is given to determine the ratio $\frac{\alpha_{i,j}}{C_{n_i}}$.

Let us denote by $C$ the vector of the processor speeds. Let $C_r$ be the matrix of relative speeds defined by $C_{r_{i,j}}$, the relative speed of $j$ compared to $i$:

$$C_{r_{i,j}} = \begin{cases} \dfrac{C_j}{C_i + C_j} & \text{if } (i, j) \in E,\ i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

Thus the unit of $C$ is not important: it can be MHz, Mflops or any other. With this definition of a relative speed matrix, a diffusion matrix that we denote $M_r$ can be given.
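Returning to the cost/benefit test of Section 3.1, the decision $f_{i,j}^{(t)} \in \{0, 1\}$ can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used for the simulations: the names `Link`, `exchange_cost`, `benefit` and `should_exchange` are hypothetical, the application-dependent PreExcCost and PostExcCost default to zero, and the load is signed (positive when it moves from $i$ to $j$), so that the two cases of Equation (2) collapse into one expression.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Link:
    """Hypothetical description of edge (i, j)."""
    latency: float    # Lat_ij, in seconds
    bandwidth: float  # Bw_ij, in load units per second


def exchange_cost(load: float, link: Link,
                  pre: Callable[[float], float] = lambda size: 0.0,
                  post: Callable[[float], float] = lambda size: 0.0) -> float:
    """Cost = PreExcCost + ExcCost + PostExcCost, evaluated on |load|.

    `pre` and `post` are application dependent; zero is an assumption.
    """
    size = abs(load)
    return pre(size) + link.latency + size / link.bandwidth + post(size)


def benefit(load: float, w_i: float, w_j: float,
            cp_i: Callable[[float], float],
            cp_j: Callable[[float], float]) -> float:
    """Time gained on the pair (i, j) over one iteration.

    `load` is signed: load > 0 moves work from i to j, load < 0 from j to i.
    """
    before = max(cp_i(w_i), cp_j(w_j))
    after = max(cp_i(w_i - load), cp_j(w_j + load))
    return before - after


def should_exchange(load: float, w_i: float, w_j: float,
                    cp_i: Callable[[float], float],
                    cp_j: Callable[[float], float],
                    link: Link, k: float = 1.0) -> int:
    """Return f_ij: 1 if Cost < k * Benefit, 0 otherwise."""
    if load == 0:
        return 0
    gain = k * benefit(load, w_i, w_j, cp_i, cp_j)
    return 1 if exchange_cost(load, link) < gain else 0


if __name__ == "__main__":
    # Two identical nodes whose computation time is proportional to the load.
    cp = lambda w: w / 1.0
    fast = Link(latency=1.0, bandwidth=10.0)
    slow = Link(latency=1.0, bandwidth=0.5)
    print(should_exchange(50, 100, 0, cp, cp, fast))        # exchange accepted
    print(should_exchange(50, 100, 0, cp, cp, slow))        # exchange refused
    print(should_exchange(50, 100, 0, cp, cp, slow, k=3))   # accepted with larger k
```

With the slow link, the transfer alone costs more than the time saved on one iteration, so the exchange is refused unless $k$ credits the benefit over several iterations.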
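$M_r$ is defined such that:

$$M_{r_{ij}} = \begin{cases} \min\left(\delta_i \, C_{r_{i,j}},\ \delta_j \, C_{r_{i,j}}\right) & \text{if } i \neq j, \\ 1 - \sum_{j (j \neq i)} M_{r_{ij}} & \text{if } i = j, \end{cases} \qquad \text{with } \delta_i = \frac{1}{\sum_j C_{r_{i,j}}}.$$

By construction, it is easy to show that $M_{r_{ij}} \geq 0$ and $\sum_j M_{r_{ij}} = 1$; in other words, the matrix $M_r$ is stochastic.

Theorem 1. The diffusion LB algorithm with $M_r$ as diffusion matrix converges toward a load distribution relative to the node speeds if and only if $M_r$ is irreducible and aperiodic (the graph $G$ must be connected and non-bipartite).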
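As an illustration of Theorem 1, the sketch below builds $C_r$ and $M_r$ for a small network and iterates $W^{(t+1)} = M_r^T W^{(t)}$. It is a minimal sketch under assumptions that are ours: the function names are hypothetical, the load is treated as infinitely divisible, the cost/benefit weights are ignored (all $f_{i,j} = 1$), and the example graph is a triangle, which is connected and non-bipartite.

```python
def relative_speeds(speeds, edges):
    """C_r[i][j] = C_j / (C_i + C_j) for each undirected edge (i, j), 0 elsewhere."""
    n = len(speeds)
    cr = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        total = speeds[i] + speeds[j]
        cr[i][j] = speeds[j] / total
        cr[j][i] = speeds[i] / total
    return cr


def diffusion_matrix(cr):
    """M_r[i][j] = min(delta_i, delta_j) * C_r[i][j], diagonal = 1 - row sum.

    Each node only needs its own delta and the delta of its neighbors.
    Assumes every node has at least one neighbor (connected graph).
    """
    n = len(cr)
    delta = [1.0 / sum(row) for row in cr]
    mr = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                mr[i][j] = min(delta[i], delta[j]) * cr[i][j]
        mr[i][i] = 1.0 - sum(mr[i])  # the diagonal entry is still 0 here
    return mr


def lb_step(mr, w):
    """One synchronous load balancing step: W <- M_r^T W."""
    n = len(w)
    return [sum(mr[j][i] * w[j] for j in range(n)) for i in range(n)]


if __name__ == "__main__":
    speeds = [1.0, 2.0, 3.0]            # any unit: MHz, Mflops, ...
    edges = [(0, 1), (1, 2), (0, 2)]    # triangle: connected, non-bipartite
    mr = diffusion_matrix(relative_speeds(speeds, edges))
    w = [60.0, 0.0, 0.0]                # all the load on the first node
    for _ in range(100):
        w = lb_step(mr, w)
    print([round(x, 6) for x in w])     # close to a split proportional to the speeds
```

Because $M_{r_{ij}} / M_{r_{ji}} = C_j / C_i$, the fixed point of the iteration is proportional to $C$, and the total load is conserved at every step since each row of $M_r$ sums to 1.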

Proof. If $M_r$ is stochastic, irreducible and aperiodic, then the Perron-Frobenius Theorem can be applied, i.e. $\exists \mu$ ($\mu$ is a fixed point vector) such that $M_r^T \mu = \mu$. By construction, $M_r$ is stochastic and $(M_r^T)^t$ tends to the matrix in which each column is $\frac{1}{\sum_i C_i} C$, thus $\mu = hC$ where $h$ is such that $\sum_i w_i^{(0)} = h \sum_i C_i$. Thus, for a given $W^{(0)}$, the invariant distribution $\mu$ is proportional to $C$.

As shown by Theorem 1, the LB algorithm converges if the network $G$ is connected and non-bipartite. The connectivity of the network depends on the set $E$, and the network is not bipartite if $M_r$ is well constructed. The previous method does not ensure that the network is not bipartite; to ensure it, we can use the following definition to compute $C_{r_{i,j}}$:

$$C_{r_{i,j}} = \begin{cases} \dfrac{C_j}{C_i + C_j} & \text{if } (i, j) \in E,\ i \neq j, \\ \dfrac{C_i}{2 C_i} & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

To build the diffusion matrix $M$ with one of these two methods and the cost/benefit defined in Section 3.1, the vector $L_r^{(t)}$ of load exchange predictions must be defined to compute $f_{ij}^{(t)}$. $L_r^{(t)}$ is given by

$$L_{r_{ij}}^{(t)} = M_{r_{ij}} \, w_i^{(t)} - M_{r_{ji}} \, w_j^{(t)}.$$

With $F^{(t)}$ and $M_r$ defined, the diffusion matrix $M^{(t)}$ (as $F$ depends on $t$, $M$ depends on $t$) is given by:

$$m_{ij}^{(t)} = \begin{cases} f_{ij}^{(t)} \, M_{r_{ij}} & \text{if } i \neq j, \\ 1 - \sum_{k (k \neq i)} f_{ik}^{(t)} \, M_{r_{ik}} & \text{if } i = j. \end{cases}$$

3.3 Convergence detection with unit size tokens

The last step that we study in this paper is the termination of the LB algorithm. This step consists in detecting the end of the LB algorithm in order to stop it and avoid the cost of the information exchanges performed by the LB algorithm. This cost can be significant if the network is slow and if the number of neighbors is high. The main problem in detecting the convergence is that the load is not infinitely divisible in real applications. This implies that the LB algorithm cannot always reach a uniform load distribution, hence it does not always reach the convergence point. Some steps of load can appear in the system and block the LB algorithm.

3.3.1 The unit size tokens problem

Let us start by eliminating this step problem.
In the literature, the LB problem of indivisible unit-size tokens is studied in [9], where the authors introduced the I Owe You (IOU) unit on each edge, and in [14], where the authors introduced a randomized algorithm that deals with heterogeneous networks. In this section, a new approach based on simulated annealing algorithms is used. The objective is to shake the system in order to move the load of the most loaded nodes toward the least loaded nodes when the classical LB algorithm is blocked. Hence, the algorithm operates as follows: if a node $i$ is unbalanced with its neighbor $j$ and no load is exchanged between these two nodes, a random value denoted $alea$ is drawn between 0 and 1 ($0 < alea < 1$), and if $alea < e^{(\kappa U_{ij})}$, a part of the load is exchanged. $U_{ij}$ denotes the number of successive LB iterations during which the neighbor nodes $i$ and $j$ are unbalanced and do not exchange load. The parameter $\kappa$ defines the probability to exchange load and can be defined by $\kappa = \frac{\ln(p)}{\tau}$, where $p$ is the probability to exchange load at the iteration $\tau$ of $U_{ij}$. For example, if a 50% probability to exchange is wanted at the second iteration, $\kappa = \frac{\ln(0.5)}{2}$. Let us note that this method does not guarantee that the uniform load distribution is reached, but it can reduce the imbalance.

3.3.2 Convergence detection problem

Let us recall that the first problem presented in this section is the convergence detection of the LB algorithm. Hence, we must detect that no more load is exchanged in the network. In [15] the authors give a decentralized convergence detection algorithm dedicated to parallel iterative asynchronous algorithms. This algorithm is based on the leader election of the IEEE-1394 (FireWire) protocol, and this base can be used to detect a global state in synchronous algorithms without any centralization. These algorithms operate on a tree, hence a spanning tree of the network must be defined [16, 17, 18, 19].

For the LB algorithm, an adaptation of the algorithm given in [15] is used. This adaptation is synchronous and dedicated to binary state detection. The idea of this algorithm is as follows: each node $i$ defines $k_i$ channels, where $k_i$ is the number of neighbors of $i$. In the first stage of the algorithm, if a node has only one channel that is not associated to a neighbor, it associates this channel to its neighbor that has no channel, defines this neighbor as its father and sends to its father the state of its sub-tree. If a node receives the state of a sub-tree from a neighbor, it associates a channel to this neighbor and defines this neighbor as one of its children. It is obvious that the leaf nodes of the spanning tree have exactly one such channel at the start of the protocol. Hence, the algorithm is started by a leaf that sends its state to its father. In the second and last stage of the algorithm, when a node has all its channels associated to all its neighbors and all its neighbors are its children, this node is the root of the tree.
Hence, the state of its sub-tree is the state of the tree; in other words, this node detects the global state of the system. It sends this global state to its children and they do likewise with their children and so on. Thus the information of the global state goes through the whole network.

Finally, if the convergence is detected, the LB algorithm can be stopped if the load is static. In the other cases (dynamic load, dynamic networks or other) the convergence detection algorithm can be used to reduce the frequency of LB steps if the system has become stable, or to increase it if the system has become unbalanced.

4 Simulations

The following simulations are realized with SimGrid [20]. The application that is balanced is represented by an integer that corresponds to the load of the application. Let us recall that the load is static, to illustrate the convergence of the algorithm, and that it is considered homogeneous. The behavior of the FOS algorithm is studied on the worst configuration (a line topology with all the load on the first node) with 64 homogeneous nodes (2MFlops). The program that is balanced can be viewed as a parallel and iterative numerical solver that computes 1 iterations, where the topology is virtual and depends on the data dependencies (communications for data dependencies are simulated). This study is realized for two cases: a first one where the network is a LAN and a second one where the network is a DSL.

4.1 Fast network

In the former, a bandwidth of 1Mb/s is used with .15ms of latency on each edge. Figure 1 shows the gain given by the FOS algorithm with the cost/benefit system and with convergence detection (Algo2) compared to the FOS algorithm without cost/benefit system and without convergence detection (Algo1). Let us note that in Algo2 the cost/benefit parameter $k$ is given by $k^{(t+1)} = k^{(t)} - 1$ with $k^{(0)} = 1$. The gain is given by $\frac{T_1 - T_2}{T_1}$, where $T_1$ and $T_2$ are the computation times of Algo1 and Algo2, respectively. In this figure, the gain depends on the load average $w$ (the global load is given by $64 \times w$) and on the number of LB steps.

[Figure 1: Gain given with the cost/benefit parameter $k^{(t+1)} = k^{(t)} - 1$ and with the convergence detection algorithm on a LAN network. Axes: Gain, w*, Lb Ite.]

The results on Figure 1 show that the gain is significant when $w$ is low and that the gain is null when $w$ is high. This is due to the cost of the LB algorithm itself: when it has converged, its cost is constant and only depends on the network. Hence, when the computation time is low (when $w$ is low) the cost of load balancing is relatively high, and when the computation time is high (when $w$ is high) it is negligible. If the cost of load balancing is negligible, the cost/benefit system and the convergence detection are not useful, but it can be noted that they are not costly with a LAN network: the gain on Figure 1 is never negative when $w$ is high.

4.2 Slow network

In the latter, the same problem is deployed on a DSL network where the bandwidth is 1Mb/s and the latency is 4ms. Figures 2 and 3 show the program computation times depending on the load average $w$ and on the number of LB steps. Figure 2 corresponds to the program with a classical FOS algorithm without cost/benefit system and without convergence detection algorithm.
Here, we can see that the first iterations of the LB algorithm give a gain and that, after some iterations of load balancing, the computation time increases and the time to compute the program becomes much greater than a sequential computation. This problem has two complementary reasons: the cost of load exchanges and the cost of information exchanges after the convergence of the algorithm. Hence, the cost/benefit system and the convergence detection algorithm can be interesting, in particular for small $w$.

[Figure 2: Classical load balancing. Axes: Time, w*, Lb Ite.]

[Figure 3: Load balancing with convergence detection and cost/benefit system with $k$ depending on the computation time of an iteration. Axes: Time, w*, Lb Ite.]

An implementation of the convergence detection algorithm and the cost/benefit system as in the LAN configuration ($k$ defined by $k^{(t+1)} = k^{(t)} - 1$) showed us that this definition of $k$ is not effective in a DSL network. Figure 3 shows the results obtained with the convergence detection algorithm and the cost/benefit system with $k$ depending on the computation time of an iteration. For a given node, when the computation time of its iteration is greater than the computation time of its previous iteration, it divides its value of $k$ by 2. Figure 3 shows that with this system, the LB algorithm is stopped after a few iterations in which the computation time has increased. Thus the LB algorithm is beneficial to the program in almost all configurations. When the global load is small (when a parallel computation is costlier than a sequential one) the LB algorithm is not beneficial but it is stopped fast enough for its cost to be negligible.

Moreover, it can be noted that with this extreme configuration the LB algorithm with this cost/benefit system does not use all the processors, see Table 1. The optimal value is the number of processors needed to reach the minimum computation time with the cost/benefit system. This optimal value is computed using global knowledge. We see that, without global knowledge, we find a result close to the optimal.

| load n x w | 64x1 | 64x5 | 64x1 | 64x5 | 64x1 | 64x5 | 64x1 |
|---|---|---|---|---|---|---|---|
| number of nodes used | 3/64 | 5/64 | 6/64 | 7/64 | 8/64 | 9/64 | 1/64 |
| number of nodes opt | 1/64 | 1/64 | 1/64 | 1/64 | 3/64 | 5/64 | 1/64 |

Table 1: For a given load, the number of nodes used (row "used") and the optimal number of nodes (row "opt") with the cost/benefit system.

5 Conclusion

In this paper we have studied a practical approach of diffusion load balancing. We have proposed an analysis of the cost and benefit of a load exchange. Based on this analysis we are able to decide whether or not to exchange the load. This cost and benefit mechanism increases the well-known step problem. In order to tackle this problem, we propose a new feature based on simulated annealing that shakes the load when required. Finally, we have enhanced the classical convergence detection to take into account these new elements.

In this work very few assumptions are made. We can deal with static or dynamic load, with any kind of network topology, with heterogeneous nodes and networks and with any type of load. Furthermore, no global knowledge is required to perform the algorithm. Results show that the proposed features do not degrade the performance of the load balancing algorithm and can lead (in the best case) to 1% of performance increase. Furthermore, in the case of slow networks, the algorithm does not use all the available resources in order to give a good speed-up.

References

[1] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. Journal of Parallel and Distributed Computing, 7:279-301, 1989.
[2] S.H. Hosseini, B. Litow, M. Malkawi, J. McPherson, and K. Vairavan. Analysis of a graph coloring based distributed load balancing algorithm. Jour. of Para. and Dist. Comp., 10:160-166, 1990.
[3] B. Litow, S.H. Hosseini, K. Vairavan, and G.S. Wolffe. Performance characteristics of a load balancing algorithm. Jour. of Para. and Dist. Comp., 31:159-165, 1995.
[4] R. Diekmann, A. Frommer, and B. Monien. Efficient schemes for nearest neighbor load balancing. Parallel Computing, 25:289-313, 1998.
[5] J.E. Boillat. Load balancing and poisson equation in a graph. Concurrency: Practice and Experience, 2(4):289-313, 1990.
[6] J.M. Bahi and J. Gaber. Load balancing on networks with dynamically changing topology. In Europar 2001 conference, Lecture Notes on Computer Science, pages 175-182, Manchester, UK, 2001.
[7] D.P. Bertsekas and J.N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs NJ, Prentice-Hall, 1989.

[8] C.Z. Xu and F.C.M. Lau. Analysis of the generalized dimension exchange method for dynamic load balancing. Journal of Parallel and Distributed Computing, 16(4):385-393, 1992.
[9] B. Ghosh, S. Muthukrishnan, and M.H. Schultz. First and second order diffusive methods for rapid, coarse, distributed load balancing. In Proc. of the 8th Annual ACM Sympo. on Para. Algo. and Arch., pages 72-81, 1996.
[10] J.M. Bahi, R. Couturier, and F. Vernier. Synchronous distributed load balancing on dynamic networks. Journal of Parallel and Distributed Computing, 65(11):1397-1405, 2005.
[11] R. Elsässer, B. Monien, and S. Schamberger. Load balancing in dynamic networks. In Proc. 7th Inter. Sympo. on Para. Arch., Algo. and Net., 2004.
[12] F. Vernier. Algorithmique itérative pour l'équilibrage de charge dans les réseaux dynamiques. PhD thesis, Univ. de Franche-Comté (France), 2004.
[13] C.Z. Xu, B. Monien, R. Lüling, and F.C.M. Lau. An analytical comparison of nearest neighbor algorithms for load balancing in parallel computers. In Proc. of 9th Inter. Para. Proc. Sympo., pages 472-479. IEEE CSP, 1995.
[14] R. Elsässer, B. Monien, and S. Schamberger. Load balancing of indivisible unit size tokens in dynamic and heterogeneous networks. In Proc. of 12th Annual Euro. Symp. (ESA'04), volume 3221, page 640. Springer, 2004.
[15] J.M. Bahi, S. Contassot-Vivier, R. Couturier, and F. Vernier. A decentralized convergence detection algorithm for asynchronous parallel iterative algorithms. IEEE Trans. on Para. and Dist. Sys., 16(1):4-13, 2005.
[16] R.G. Gallager, P.A. Humblet, and P.M. Spira. A distributed algorithm for minimum-weight spanning trees. ACM Transactions on Programming Languages and Systems, 5(1):66-77, 1983.
[17] I. Lavallee and G. Roucairol. A fully distributed (minimal) spanning tree algorithm. Information Processing Letters, 23(2):55-62, 1986.
[18] B. Awerbuch. Optimal distributed algorithms for minimum weight spanning tree, counting, leader election, and related problems. In The 19th Annual ACM Conf. on Theo. of Comp., pages 230-240. ACM NY, 1987.
[19] I. Lavallee and C. Lavault. Yet another distributed election and (minimum-weight) spanning tree algorithm. RR-1024, INRIA - Rocquencourt, 1989.
[20] H. Casanova, A. Legrand, and L. Marchal. Scheduling Distributed Applications: the SimGrid Simulation Framework. In Proc. of the 3rd IEEE Inter. Sympo. on Clust. Comp. and the Grid (CCGrid'03). IEEE CSP, May 2003.