Transport-Problem-Based Algorithm for Dynamic Load Balancing in Distributed Logic Simulation

Yury V. Ladzhensy, Vatcheslav A. Kourtch
Donetsk National Technical University
Artema Street, 58a
Donetsk, Ukraine, 83005

Abstract: Advantages and disadvantages of a dynamic load balancing algorithm which minimizes the Euclidean norm of data migration are discussed. A new effective algorithm for the dynamic load balancing problem is suggested. The algorithm is based on solving a transport problem and searching for shortest paths in a graph. Experimental results for the algorithm are provided.

1 Introduction

Distributed logic simulation is logic simulation using a network of computers. Achieving a good load balance between the processors of the computers is crucial to minimize the time spent on a simulation process. The first step of load balancing is to cut the problem, that is, to distribute the parts of the problem between processors. In some cases this is enough because the balance will not change. But in other cases the load balance should be periodically restored.

There are many methods and algorithms for static and dynamic load balancing. For static load balancing methods we refer to a review of graph partitioning algorithms [Po96] and a multilevel algorithm [KK95]. For dynamic load balancing methods see, for example, [AT95], [AT96], [JSL94], and also the review [De05].

In [95], the authors suggest a new optimization criterion for load balancing: the migration level. The migration level is the amount of data to be transferred between processors to achieve a load balance. Indeed, after the load balancing algorithm has been executed, a schedule of data or subproblem migration between the processors is ready. But the data migration needs time, and this time should also be minimized.

In this paper, we focus on methods which allow finding a good load balance while minimizing the migration level, and we suggest a new approach to this problem.

2 Definitions

Let $P$ be the number of processors and $G_P(V_P, E_P)$ be a processor graph, where $V_P = \{1, 2, \dots, P\}$ is the set of vertices representing processors and $E_P$ is the set of edges.
An edge $(i, j) \in E_P$ exists if data can be transferred from processor $i$ to processor $j$. In a distributed simulation the processor graph will always be connected, and in many cases it will be a clique, because in a computer network any pair of computers can exchange data.

Let $N$ be the number of subproblems. A subproblem can represent an element of the simulated scheme or a cluster of elements, depending on the simulation problem. Let a graph $G(V, E)$ be a problem graph, where $V = \{1, 2, \dots, N\}$ is the set of vertices representing subproblems and $E$ is the set of edges. An edge $(i, j) \in E$ exists if subproblem $j$ needs information or events from subproblem $i$. We assume all subproblems have equal complexity.

Let $\pi = \{\pi_1, \pi_2, \dots, \pi_P\}$ be a partition of the graph $G(V, E)$. Each $\pi_i \subseteq V$ is computed on processor $i$. The subsets $\pi_i$ and $\pi_j$ do not intersect, i.e. $\pi_i \cap \pi_j = \emptyset$ for $i \ne j$. The problem is to find a new partitioning $\pi' = \{\pi'_1, \pi'_2, \dots, \pi'_P\}$ which concurrently minimizes the misbalance, the interprocessor communication and the migration level.

The misbalance can be found using one of the following formulas [AT95], [AT96], [DS98]:

$$M_1(\pi) = \max_{i=1,\dots,P} l_i - \min_{i=1,\dots,P} l_i,$$

$$M_2(\pi) = \frac{1}{P} \sum_{i=1}^{P} (l_i - \bar{l})^2,$$

$$M_3(\pi) = \max_{i=1,\dots,P} l_i,$$

where $l_i = |\pi_i|$, $l_{max} = \max_{i=1,\dots,P} l_i$, $\bar{l} = \frac{1}{P} \sum_{i=1}^{P} l_i$.

The formula $M_1(\pi)$ gives the difference between the maximum and the minimum load. If $M_1(\pi) = 0$ there is no misbalance. The formula $M_2(\pi)$ is the second central moment of the loads of the different processors. This statistical quantity is difficult to evaluate, therefore it is rarely used. $M_3(\pi)$ is the maximum processor load. The maximum load does not say anything about the misbalance directly, but the processor which has the maximum load will finish later than the others. Thus, if the maximum load is as small as possible, the simulation process is the fastest. The other formulas do not give such an advantage. $M_3(\pi)$ can also be easily evaluated, so it is the most suitable.

The migration level can be evaluated as

$$\mu(\pi') = \frac{1}{2} \sum_{i=1}^{P} \left( |\pi_i \setminus \pi'_i| + |\pi'_i \setminus \pi_i| \right).$$

This function sums the number of subproblems to be transmitted from each processor, $|\pi_i \setminus \pi'_i|$, and the number of subproblems to be transmitted to each processor, $|\pi'_i \setminus \pi_i|$.
The factor ½ is needed because each transmitted subproblem is counted twice in the sum. Hence, $\mu(\pi')$ is the number of transmitted subproblems.

In order to facilitate the algorithm development we assume that interprocessor communication is not important for the task and do not consider it. But in [95] it was noted that reasoning under such an assumption should take into account the future necessity of considering interprocessor communication. To do that, the authors propose to allow data migration between two processors only if there is an edge $(i, j) \in E_P$ in the processor graph and there are subproblems $k \in \pi_i$, $l \in \pi_j$ such that $(k, l) \in E$.

Let us consider a graph $G_\pi(\pi, E_\pi)$, where $\pi$ is a problem graph partitioning and $(\pi_i, \pi_j) \in E_\pi$ if and only if $(i, j) \in E_P$ and the following condition is satisfied: $\exists k \in \pi_i \; \exists l \in \pi_j \; ((k, l) \in E)$.

If there is an $i$ such that $l_i - \bar{l} \ge 1$, one or several subproblems should be transferred to another part of the problem graph. The target part can be selected only from the set $\{\pi_j \mid (\pi_i, \pi_j) \in E_\pi\}$. If there is no $i$ such that $l_i - \bar{l} \ge 1$, the load balance is achieved. We use the rule $l_i - \bar{l} \ge 1$ to check the necessity of data migration because if $l_i \le \bar{l}$, data should be transferred to $\pi_i$, not from $\pi_i$; and if $0 < l_i - \bar{l} < 1$, only a part of a subproblem would have to be transferred to achieve the load balance, but subproblems cannot be divided. The rule $l_i > \bar{l}$ cannot be used because with a non-integer $\bar{l}$ a processor with surplus load always exists, so the load balancing would never terminate.

3 Minimizing Euclidean norm of data migration

The algorithm which minimizes the Euclidean norm of data migration is described in [95]. The authors assume that any subproblem can be infinitely divided and treat a load as a real number. Thus the amount of data to be transferred from the part $\pi_i$ is defined as $l_i - \bar{l}$.

3.1 Algorithm description

The amount of data to be transferred from the part $\pi_i$ to the part $\pi_j$ is denoted $\delta_{ij}$ ($\delta_{ij} = -\delta_{ji}$). To make the balance optimal the amounts $\delta_{ij}$ should satisfy the following linear system of equations:

$$\sum_{j:\,(\pi_i, \pi_j) \in E_\pi} \delta_{ij} = l_i - \bar{l}, \quad i = 1, \dots, P.$$

The number of independent equations is less than $P$, therefore the system of equations has an unlimited number of solutions.
A solution with minimum data migration should be selected, that is, we should

Minimize $\sum_{(\pi_i, \pi_j) \in E_\pi} \delta_{ij}^2$ subject to $\sum_{j:\,(\pi_i, \pi_j) \in E_\pi} \delta_{ij} = l_i - \bar{l}$.

This problem can be replaced by another system of equations $L\lambda = b$, where $\lambda$ is a variable vector of size $P$, $L$ is a matrix of size $P \times P$, and $b$ is a vector of size $P$:

$$(L)_{ij} = \begin{cases} -1, & \text{if } i \ne j \text{ and } (\pi_i, \pi_j) \in E_\pi, \\ \deg(\pi_i), & \text{if } i = j, \\ 0, & \text{otherwise,} \end{cases} \qquad (b)_i = l_i - \bar{l}.$$

The degree of the vertex $\pi_i$ of the graph $G_\pi(\pi, E_\pi)$ is $\deg(\pi_i) = |\{\pi_j \mid (\pi_i, \pi_j) \in E_\pi\}|$. After the vector $\lambda$ has been found, the data migration amount can be evaluated as $\delta_{ij} = \lambda_i - \lambda_j$.

The load balancing algorithm is [95]:
(1) Find the average load and the vector $b$.
(2) Solve the linear system of equations $L\lambda = b$ for $\lambda$.
(3) The load to be transferred from processor $i$ to processor $j$ is $\lambda_i - \lambda_j$.

3.2 Algorithm discussion

The algorithm from [95] was originally designed for the parallel finite element solution of PDEs based on unstructured meshes. The big advantages of this algorithm are the simplicity of implementation and the possibility of a parallel implementation on the same parallel machine which is used for the simulation.

The authors assume a parallel machine with sparse interprocessor connections. This means that the degree of each vertex in the processor graph is small. In distributed logic simulation any two processors can usually exchange data in approximately equal time. Hence the processor graph has an edge for each pair of vertices. We have studied the properties of the algorithm in this case. Let us consider the full graph with 5 vertex-processors (see fig. 1a) and the processor loads $l = (40, 10, 50, 70, 30)$.
Fig. 1 Example of the algorithm's work on the full graph: a) the initial graph with the processor loads; b) data migration and new processor loads obtained using the algorithm; c) an alternative solution

Using the algorithm we have found $\lambda = (-2.811, -6.811, 1.189, 7.189, -2.811)$. The data migration in this case is shown in fig. 1b. We see that the load is balanced. The number of transferred subproblems is 4+4+4+4+6+8+10+14 = 54, or E = 920 (the sum of the squared transfers, with each edge counted in both directions). But another possible solution exists (see fig. 1c) which gives the same load balance but transfers only 10+10+10 = 30 subproblems (E = 600). Therefore, this algorithm does not give the best possible solution for distributed systems.

Logic simulation differs from the finite element method in one more respect: the number of subproblems may exceed the number of processors only slightly. Let us consider one more example to illustrate the work of the algorithm in that case. Fig. 2a presents a processor graph with small processor loads. Figures 2b and 2c show the subproblem migration schedules obtained with the algorithm and manually. We can see that the algorithm's balance is not optimal, and it needs to transfer 4 subproblems (E = 8). The alternative solution gives the optimal balance and also transfers 4 subproblems, but with E = 12. Since achieving the optimal load balance is more important, the alternative solution, which achieves it while transferring the same number of subproblems, is better.

Given these disadvantages of the algorithm [95] for load balancing, we need a new algorithm that avoids these drawbacks.

Fig. 2 Example of the algorithm's work with small load: a) the initial graph with the processor loads; b) data migration and new processor loads obtained using the algorithm; c) an alternative solution
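The three steps of Section 3.1 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [95]: the function name `diffusion_schedule`, the edge-list input and the choice to pin the last component of $\lambda$ to zero (the system is singular, so $\lambda$ is only defined up to a constant) are assumptions of this sketch. It is run on a complete 5-processor graph with the assumed loads (40, 10, 50, 70, 30), so its numbers are those of the sketch and need not coincide with fig. 1b.

```python
from fractions import Fraction
from itertools import combinations

def diffusion_schedule(loads, edges):
    """Solve L*lam = b and return {(i, j): amount moved from i to j}."""
    p = len(loads)
    avg = Fraction(sum(loads), p)
    b = [Fraction(l) - avg for l in loads]            # step (1)
    # Graph Laplacian: vertex degree on the diagonal, -1 for every edge.
    L = [[Fraction(0)] * p for _ in range(p)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    # Step (2): pin lam[p-1] = 0 and solve the remaining (p-1) equations
    # by Gauss-Jordan elimination (valid for a connected graph).
    n = p - 1
    a = [L[r][:n] + [b[r]] for r in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[piv] = a[piv], a[col]
        for r in range(n):
            if r != col and a[r][col] != 0:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    lam = [a[r][n] / a[r][r] for r in range(n)] + [Fraction(0)]
    # Step (3): delta_ij = lam_i - lam_j, stored in the direction of flow.
    schedule = {}
    for i, j in edges:
        d = lam[i] - lam[j]
        if d > 0:
            schedule[(i, j)] = d
        elif d < 0:
            schedule[(j, i)] = -d
    return schedule

loads = [40, 10, 50, 70, 30]                           # assumed example loads
edges = list(combinations(range(len(loads)), 2))       # complete graph
schedule = diffusion_schedule(loads, edges)
print(sum(schedule.values()))                          # 56 for these loads
```

For these loads the total surplus is only 10 + 30 = 40 subproblems, so a schedule that moves 56 illustrates the indirect migration discussed above.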
4 Transport-problem-based algorithm

The main problems of load balancing are determining the number of subproblems to be transferred from each processor and a target processor for each transferred subproblem.

4.1 Main idea of the algorithm

We have some processors with the load $l_i - \bar{l} \ge 1$; each of these processors has to give away $l_i - \bar{l}$ subproblems. We also have some processors with the load $l_i < \bar{l}$; each of these processors may receive $\bar{l} - l_i$ subproblems. If we define a distance between any pair of processors, we can solve this problem as a transport problem using the method of potentials.

If a subproblem is transferred from one processor to another through a third one, it will be transferred twice (see fig. 3). Hence, a subproblem should always be transferred along the shortest path in the graph. The shortest paths between all pairs of vertices in the graph can be found using the Floyd algorithm. We use the method of potentials [ME78] and the Floyd algorithm [Fl62] as subtasks of the new algorithm.

Fig. 3 Data migration route: a) the processor graph; b) the optimal but prohibited migration schedule (the dotted line is the prohibited way of migration); c) the migration schedule using the intermediate processor

The main idea of the algorithm is to find the shortest paths in the graph between all pairs of processors, form the input data for the transport problem, and solve it using the method of potentials. Then the results of the transport problem should be interpreted in terms of load balancing.

4.2 Algorithm implementation

The transport-problem-based algorithm for dynamic load balancing is given in fig. 4. This algorithm returns a matrix $R$ of size $P \times P$, where $R[i,j]$ is the number of subproblems to be transferred from processor $i$ to processor $j$. The initial values of $R[i,j]$ are assumed to be zero.

On the first step of this algorithm the average load is evaluated. Then the sets $O$ and $I$ of processors which have to pass and to receive subproblems, respectively, are calculated. The Floyd algorithm is used on the fourth step.
The results of the Floyd algorithm are the array D of the path lengths and the array W, which contains information about the paths.
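As an illustration of the two arrays, the following Python sketch computes D and W for a small graph and shows how W can be walked to restore a path. It assumes unit edge lengths and an undirected, connected graph, and it stores in W[i][j] the next vertex after i on a shortest path to j; these choices and the function names are assumptions of the sketch, not details confirmed by the paper.

```python
import math

def floyd_with_next_hop(p, edges):
    """All-pairs shortest paths for an undirected graph with unit edges.

    Returns D (path lengths) and W (W[i][j] = vertex following i on a
    shortest path from i to j, or None if no path is known).
    """
    D = [[0 if i == j else math.inf for j in range(p)] for i in range(p)]
    W = [[None] * p for _ in range(p)]
    for i, j in edges:
        D[i][j] = D[j][i] = 1
        W[i][j], W[j][i] = j, i
    for via in range(p):
        for i in range(p):
            for j in range(p):
                if D[i][via] + D[via][j] < D[i][j]:
                    D[i][j] = D[i][via] + D[via][j]
                    W[i][j] = W[i][via]   # first hop is the hop towards `via`
    return D, W

def restore_path(W, src, dst):
    """Follow next-hop entries from src until dst is reached."""
    path, c = [src], src
    while c != dst:
        c = W[c][dst]
        if c is None:
            raise ValueError("no path between the two vertices")
        path.append(c)
    return path

# Three processors in a chain, as in the situation of fig. 3:
# there is no direct edge between 0 and 2.
D, W = floyd_with_next_hop(3, [(0, 1), (1, 2)])
print(D[0][2], restore_path(W, 0, 2))     # 2 [0, 1, 2]
```

Storing the next hop rather than the predecessor lets a path be restored front to back, which matches a loop that starts at the giving processor and stops at the receiving one.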
Input data for the transport problem are formed on steps 6-8: the array T of the path lengths between the processors which pass their subproblems and the processors which receive them. Then the potential method is used on step 9 to solve the transport problem. It uses the array T of the path lengths between processors, the set $l_O$ of amounts of subproblems to be given away by processors, and the set $l_I$ of amounts of subproblems to be received by processors. The result of the potential method is an array TS of size $|O| \times |I|$; each element TS[k,m] is the amount of subproblems to be transferred from the element O[k] to the element I[m].

 1. $\bar{l} \leftarrow \frac{1}{P} \sum_{i=1}^{P} l_i$
 2. $O \leftarrow \{ i \mid l_i - \bar{l} \ge 1 \}$
 3. $I \leftarrow \{ i \mid l_i < \bar{l} \}$
 4. D, W ← Floyd($G_\pi(\pi, E_\pi)$)
 5. T ← CreateMatrix($|O| \times |I|$)
 6. foreach k ∈ O
 7.   foreach m ∈ I
 8.     T[k,m] ← D[k,m]
      endfor
    endfor
 9. TS ← SolveTransport(T, $l_O$, $l_I$)
10. foreach k ∈ O
11.   foreach m ∈ I
12.     c := k
13.     repeat
14.       R[c,W[c,m]] ← R[c,W[c,m]] + TS[k,m]
15.       c ← W[c,m]
16.     until c = m
      endfor
    endfor

Fig. 4 Transport-based algorithm for dynamic load balancing

After that the results of the method of potentials must be interpreted. For each pair of giving and receiving processors (steps 10-11) the shortest path should be restored (steps 12-16). The variable c, which is initialized on step 12, is the current vertex (processor) of the path. The amount of subproblems to be transferred between processors k and m is then added to each edge of the path in the graph, as shown in fig. 3. In the algorithm implementation this amount is TS[k,m], and it is added to the element R[c,W[c,m]] of the return matrix, where W[c,m] is the next processor on the path. Operations are performed on steps 13-16 until the variable c, which represents the current processor, equals the target processor m, i.e. the path is over.

4.3 Algorithm discussion

The new algorithm avoids cycles in data transfer. This is a big advantage compared to the [95] algorithm. It should avoid cyclic or indirect (see fig. 1b) data migration. Moreover, if a vertex has a surplus load it will only give load away, and if a vertex lacks load it will only receive load, with one exception: if there is no edge between the vertices, the data will migrate along the shortest path. We assume that this will significantly decrease the migration level, which is critical for distributed systems.

The new algorithm also takes away all surplus load from each processor, so it should create migration schedules that always minimize the maximum load in the best way. This is important for logic simulation because it speeds up the simulation process. For the example problems from figs. 1 and 2 the new algorithm gives an optimal solution (as in figs. 1c and 2c).

The new algorithm has a disadvantage. While the [95] algorithm solves a linear system of equations, which can be done in parallel, the new algorithm solves a transport problem and cannot be parallelized as effectively.

5 Experimental results

The algorithm from [95] and the transport-problem-based algorithm were implemented in Delphi. Experiments on random graphs were performed. Problem graphs were generated using the following scheme:
(1) Create the $P$ parts of the graph; each part contains a random number of vertices between $l_{min}$ and $l_{max}$. The overall number of vertices is $|V|$.
(2) Create $|V| - 1$ edges to obtain a connected problem graph. The first edge is added between two randomly selected vertices. Each next edge is added between a randomly selected vertex from the already connected vertices and another vertex from the not yet connected vertices.
(3) Create all other edges at random until the needed graph density is reached.
(4) The processor graph is assumed to be a fully connected graph with $P$ vertices.
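The generation scheme above can be sketched in Python as follows. This is only an illustration under stated assumptions: the original experiments were implemented in Delphi, and the function name `generate_problem_graph`, its parameters (in particular expressing the needed density as a target edge count `n_edges`) and the use of a seeded `random.Random` are choices of this sketch.

```python
import random

def generate_problem_graph(p, l_min, l_max, n_edges, seed=0):
    """Return (parts, problem_edges, processor_edges) following steps (1)-(4)."""
    rng = random.Random(seed)
    # Step (1): p parts, each with a random number of vertices in [l_min, l_max].
    parts, next_vertex = [], 0
    for _ in range(p):
        size = rng.randint(l_min, l_max)
        parts.append(list(range(next_vertex, next_vertex + size)))
        next_vertex += size
    n = next_vertex
    # Step (2): n - 1 edges forming a random tree. Visiting the vertices in
    # shuffled order attaches each new vertex to a random connected one.
    order = list(range(n))
    rng.shuffle(order)
    edges = set()
    connected = [order[0]]
    for v in order[1:]:
        u = rng.choice(connected)
        edges.add((min(u, v), max(u, v)))
        connected.append(v)
    # Step (3): add random extra edges until the target count is reached.
    target = min(max(n_edges, n - 1), n * (n - 1) // 2)
    while len(edges) < target:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    # Step (4): complete processor graph on p vertices.
    processor_edges = [(i, j) for i in range(p) for j in range(i + 1, p)]
    return parts, sorted(edges), processor_edges

# Parameters in the spirit of Table 1: 10 processors, about 1000 vertices.
parts, edges, processor_edges = generate_problem_graph(10, 95, 104, 5000, seed=1)
print(sum(len(part) for part in parts), len(edges), len(processor_edges))
```

Building the tree first guarantees connectivity before any density-driven edges are added, which is why step (2) precedes step (3) in the scheme.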
   Input graph      |    Minimum load      |    Maximum load      |  Migration
  G    |V|    |E|   | initial   TP   [95]  | initial   TP   [95]  |   TP   [95]
  1    997   4970   |    96     97    99   |   104    100   101   |   10     9
  2   1004   5040   |    96     96    99   |   104    101   102   |    8     8
  3   1001   5010   |    95     97    99   |   104    101   101   |    9    12
  4    999   4989   |    95     99    99   |   104    100   101   |   15    16
  5   1002   5020   |    95     96    99   |   104    101   101   |    6     8
  6    998   4980   |    96     99    98   |   103    100   101   |   11     7
  7    990   4900   |    96     99    98   |   104     99   101   |   12    12
  8    989   4890   |    95     98    98   |   103     99   100   |   11     8
  9   1000   4999   |    96    100    99   |   104    100   101   |   12     9
 10   1000   4999   |    95    100    99   |   104    100   101   |   11     8

Table 1 Results for problem graphs with around 1000 vertices and 5000 edges on 10 processors

The experimental results are presented in Tables 1-3. Here G is the sequence number of the graph in the table, |V| is the number of vertices in the problem graph, and |E| is the number of edges in the problem graph. Under "Minimum load" and "Maximum load", the column "initial" gives the load before balancing, "TP" the load if the transport-problem-based algorithm was used, and "[95]" the load if the [95] algorithm was used. Under "Migration", "TP" is the number of subproblems transferred by the transport-problem-based algorithm and "[95]" is the number of subproblems transferred by the [95] algorithm, i.e. the data migration norm for each algorithm.

   Input graph       |    Minimum load      |    Maximum load      |  Migration
  G    |V|     |E|   | initial   TP   [95]  | initial   TP   [95]  |   TP   [95]
  1  10001   11002   |   425    482   498   |   570    501   502   |  386   510
  2  10264   11588   |   431    498   511   |   570    514   515   |  353   485
  3  10104   11230   |   427    490   503   |   574    506   507   |  414   558
  4  10155   11343   |   445    503   506   |   565    508   509   |  273   385
  5   9668   10281   |   429    476   481   |   573    484   486   |  319   427
  6  10126   11278   |   428    493   504   |   566    507   508   |  383   514
  7   9777   10514   |   428    486   487   |   558    489   492   |  417   503
  8  10124   11274   |   430    491   504   |   572    507   509   |  370   504
  9  10131   11290   |   444    498   504   |   569    507   509   |  359   457
 10  10059   11130   |   427    502   501   |   566    503   505   |  391   517

Table 2 Results for problem graphs with around 10000 vertices and 11000 edges on 20 processors

   Input graph       |    Minimum load      |    Maximum load      |  Migration
  G    |V|     |E|   | initial   TP   [95]  | initial   TP   [95]  |   TP   [95]
  1   9986   49860   |   432    489   497   |   574    500   501   |  419   536
  2   9990   49900   |   425    490   498   |   550    500   500   |  324   441
  3   9803   48049   |   425    474   489   |   572    491   492   |  419   553
  4   9479   44925   |   431    473   471   |   565    474   476   |  308   424
  5  10067   50672   |   430    491   502   |   572    504   505   |  347   472
  6  10138   51389   |   427    505   506   |   574    507   509   |  382   509
  7  10182   51836   |   431    492   507   |   573    510   510   |  444   561
  8  10368   53747   |   429    509   517   |   571    519   520   |  396   514
  9   9865   48659   |   429    479   491   |   571    494   497   |  300   436
 10   9964   49640   |   429    483   497   |   569    499   501   |  369   478

Table 3 Results for problem graphs with around 10000 vertices and 50000 edges on 20 processors

In all cases (see tables 1-3) the maximum load $M_3(\pi) = l_{max}$ after using the transport-problem-based algorithm is less than or equal to the maximum load after using the [95] algorithm. Thus a logic simulation will be performed faster if the transport-problem-based algorithm is used. But the index $M_1(\pi) = l_{max} - l_{min}$ is worse for the transport-problem-based algorithm.

Fig. 5 The migration level for the transport-problem-based algorithm and the known algorithm for each graph with 10000 vertices and 11000 edges on 20 processors (see table 2); vertical axis: migration level (250 to 600), horizontal axis: graph number G (1 to 10)

Fig. 6 The migration level for the transport-problem-based algorithm and the known algorithm for each graph with 10000 vertices and 50000 edges on 20 processors (see table 3); vertical axis: migration level (250 to 600), horizontal axis: graph number G (1 to 10)

For small graphs (see table 1), the number of subproblems transferred by the transport-problem-based algorithm can be bigger, but only when its maximum load is strictly less than the maximum load of the [95] algorithm. For bigger graphs (see tables 2 and 3) the number of subproblems transferred by the transport-problem-based algorithm is always smaller; it decreases the number of transferred subproblems by 24.7% on average against the [95] algorithm. The charts in figs. 5 and 6 show the difference between the migration levels of the transport-problem-based and [95] algorithms for each graph. The 24.7% gain from the new algorithm is almost constant.

6 Conclusions

Analysis of the load balancing algorithm which minimizes the Euclidean norm of data migration reveals some drawbacks of using the algorithm for distributed logic simulation. This algorithm does not give good solutions if the processor graph has high density or the load of each processor is low.

A new load balancing algorithm has been developed. It uses the Floyd algorithm to find the shortest paths between processors and then solves a transport problem using the potential method to minimize the data migration between processors and the lengths of the paths for data transfer between processors.

Computational experiments show that the new algorithm decreases the number of transferred subproblems by 24.7% on average against the existing algorithm. The new algorithm also improves the load balance for big problems.

In the future we plan to improve the transport-problem-based algorithm by considering interprocessor communication as the third optimization criterion.

Bibliography

[AT95] H. Avril, C. Tropper, Clustered time warp and logic simulation // 9th Workshop on Parallel and Distributed Simulation (PADS'95), 1995, pages 112-119.
[AT96] H. Avril, C. Tropper, The Dynamic Load Balancing of Clustered Time Warp for Logic Simulations // Workshop on Parallel and Distributed Simulation, 1996, pages 20-27.
[De05] K. Devine et al., New challenges in dynamic load balancing // Applied Numerical Mathematics, v. 52, 2005, pp. 133-152.
[DS98] E. Deelman, B. Szymanski, Dynamic load balancing in parallel discrete event simulation for spatially explicit problems // PADS'98, IEEE CS Press, 1998, pages 46-53.
[Fl62] R.W. Floyd, Algorithm 97: Shortest Path. Communications of the ACM, volume 5 (issue 6), 1962, p. 345.
[95] Y.F. Hu, R.J. Blake, An optimal dynamic load balancing algorithm. Preprint DL-P-95-011, Daresbury Laboratory, Warrington, WA4 4AD, UK. (To be published in Concurrency: Practice & Experience), 1995.
[JSL94] M-R. Jiang, S-P. Shieh, C-L. Liu, Dynamic load balancing in parallel simulation using time warp mechanism // ICPADS, 1994, pp. 222-229.
[KK95] G. Karypis, V. Kumar, Analysis of multilevel graph partitioning. Tech. Report 95-035, Computer Science Department, University of Minnesota, 1995.
[ME78] Handbook of Operation Research: Foundation and Fundamentals // Edited by J.J. Moder, S.E. Elmaghraby. Litton Education Publishing, 1978.
[P00] G-B Ping et al., Load balancing for conservative simulation on shared memory multiprocessor systems // 14th Workshop on Parallel and Distributed Simulation, May 28-31, 2000.
[Po96] A. Pothen, Graph partitioning algorithms with applications to scientific computing // Parallel Numerical Algorithms, Kluwer Academic Press, 1996.