A Distributed Dynamic Load Balancer for Iterative Applications




A Distributed Dynamic Load Balancer for Iterative Applications
Harshitha Menon, Laxmikant Kalé
Department of Computer Science, University of Illinois at Urbana-Champaign
{gplkrsh2,kale}@illinois.edu

ABSTRACT

For many applications, computation load varies over time. Such applications require dynamic load balancing to improve performance. Centralized load balancing schemes, which perform the load balancing decisions at a central location, are not scalable. In contrast, fully distributed strategies are scalable but typically do not produce a balanced work distribution, as they tend to consider only local information. This paper describes a fully distributed algorithm for load balancing that uses partial information about the global state of the system to perform load balancing. This algorithm, referred to as GrapevineLB, consists of two stages: global information propagation using a lightweight algorithm inspired by epidemic [2] algorithms, and work unit transfer using a randomized algorithm. We provide an analysis of the algorithm along with detailed simulation and performance comparison with other load balancing strategies. We demonstrate the effectiveness of GrapevineLB for adaptive mesh refinement and molecular dynamics on up to 131,072 cores of BlueGene/Q.

General Terms: Algorithms, Performance

Keywords: load balancing, distributed load balancer, epidemic algorithm

1. INTRODUCTION

Load imbalance is an insidious factor that can reduce the performance of a parallel application significantly. For some applications, such as basic stencil codes for structured grids, the load is easy to predict and does not vary dynamically. However, for a significant class of applications, the load presented by pieces of computation varies over time and may be harder to predict. This is becoming increasingly prevalent with the emergence of sophisticated applications.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. SC13, November 17-22, 2013, Denver, Colorado, USA. Copyright 2013 ACM 978-1-4503-2378-9/13/11...$15.00. http://dx.doi.org/10.1145/2503210.2503284

For example, atoms moving in a molecular dynamics simulation will lead to (almost) no imbalance when they are distributed statically to processors. But they create imbalance when spatial partitioning of atoms is performed for more sophisticated and efficient force evaluation algorithms. The presence of moisture and clouds in weather simulations, elements turning from elastic to plastic in structural dynamics simulations, and dynamic adaptive mesh refinement are all examples of sophisticated applications which have a strong tendency for load imbalance. All the examples above are iterative applications: the program executes a series of time-steps, or iterations, leading to convergence of some error metric. Consecutive iterations have relatively similar patterns of communication and computation. There is another class of applications, such as combinatorial search, that involves dynamic creation of work and therefore has a tendency for imbalance. This class of applications has distinct characteristics and load balancing needs, and has been addressed by much past work such as work stealing [25, 3, 32]. This paper does not focus on such applications but instead on iterative applications, which are predominant in science and engineering. We also do not focus on approaches that partition fine-grained application data.
For example, in unstructured-mesh-based applications, the entire mesh (consisting of billions of elements) may be partitioned by a library such as METIS [3]. This approach is expensive and not widely applicable; instead we focus on scenarios where the application work has already been partitioned into coarser work units. For iterative applications, the basic scheme we follow is: the application is assumed to consist of a large number of migratable units (for example, these could be chunks of meshes in an adaptive mesh refinement application). The application pauses after every so many iterations, and the load balancer decides whether to migrate some of these units to restore balance. Load balancing is expensive in these scenarios and is performed infrequently or whenever significant imbalance is detected. Note that a reactive strategy such as work stealing, which is triggered when a processor is idle, is almost infeasible here (e.g., communication to existing tasks must be redirected on the fly). Schemes for arriving at a distributed consensus on when and how often to balance load [28], and for avoiding the pause (carrying out load balancing asynchronously with the application), have been addressed in the past. In this paper we focus on a synchronous load balancer. Since scientific applications have synchronizations at various points, this can be used without the extra overhead of synchronization.

Various strategies have been proposed to address the load balancing problem. Many applications employ centralized load balancing strategies, where load information is collected on a single processor and the decision algorithm is run sequentially. Such strategies have been shown to be effective for a few hundred to a thousand processors, because the total number of work units is relatively small (on the order of ten to a hundred per processor). However, they present a clear performance bottleneck beyond a few thousand processors, and may become infeasible due to the memory capacity bottleneck on a single processor. An alternative to centralized strategies are distributed strategies that use local information, e.g. diffusion based [9]. In a distributed strategy, each processor makes autonomous decisions based on its local view of the system. The local view typically consists of the load of its neighboring processors. Such strategies are scalable, but tend to yield poor load balance due to the limited local information []. Hierarchical strategies [35, 23, ] overcome some of the aforementioned disadvantages. They create subgroups of processors and collect information at the root of each subgroup. Higher levels in the hierarchy only receive aggregate information and deliver decisions in aggregate terms. Although effective in reducing memory costs and ensuring good balance, these strategies may suffer from excessive data collection at the lowest level of the hierarchy and from work being done at multiple levels. We propose a fully distributed strategy, GrapevineLB, designed to overcome the drawbacks of other distributed strategies by obtaining a partial representation of the global state of the system and basing the load balancing decisions on it. We describe a lightweight information propagation algorithm based on the epidemic algorithm [2] (also known as the gossip protocol []) to propagate the load information about the underloaded processors in the system to the overloaded processors.
This spreads the information in the same fashion as gossip spreads through the grapevine in a society. Based on this information, GrapevineLB makes probabilistic transfers of work units to obtain a good load distribution. The proposed algorithm is scalable and can be tuned to optimize for either cost or performance. The primary contributions of this paper are:

- GrapevineLB, a fully distributed load balancing algorithm that attains a load balancing quality comparable to centralized strategies while incurring significantly less overhead.
- Analysis of the propagation algorithm used by GrapevineLB, which leads to the interesting observation that good load balance can be achieved with significantly less information about the underloaded processors in the system.
- Detailed evaluations that experimentally demonstrate the scalability and quality of GrapevineLB using simulation.
- Demonstration of its effectiveness in comparison to several other load balancing strategies for adaptive mesh refinement and molecular dynamics on up to 131,072 cores of BlueGene/Q.

2. BACKGROUND

Load characteristics in dynamic applications can change over time. Therefore, such applications require periodic load balancing to maintain good system utilization. To enable load balancing, a popular approach is overdecomposition. The application writer exposes parallelism by overdecomposing the computation into tasks or objects. The problem is decomposed into communicating objects, and the run-time system can assign these objects to processors and perform rebalancing. The load balancing problem in our context can be summarized as: given a distributed collection of work units, each with a load estimate, decide which work units should be moved to which processors to reduce the load imbalance. The load balancer needs information about the loads presented by each work unit. This can be based on a model (simple examples being associating a fixed amount of work with each grid point, or particle).

Table 1: Choice of load imbalance metric.
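The contrast Table 1 draws between the standard deviation and the maximum-to-average load ratio can be sketched numerically. The load vectors below are hypothetical illustrative values (not the entries of Table 1): they are chosen so that both cases share the same mean and the same standard deviation, yet differ in maximum load.

```python
# Two hypothetical six-processor load vectors (illustrative values, not the
# exact entries of Table 1): both have mean 20 and identical standard
# deviation, but their maximum loads -- and hence completion times -- differ.
def imbalance(loads):
    """Ratio of maximum to average load; 1.0 means perfect balance."""
    return max(loads) / (sum(loads) / len(loads))

def stddev(loads):
    """Population standard deviation of a load vector."""
    mean = sum(loads) / len(loads)
    return (sum((x - mean) ** 2 for x in loads) / len(loads)) ** 0.5

d = 10 / 2 ** 0.5                                  # chosen so both cases share sigma
case1 = [30, 10, 20, 20, 20, 20]                   # one badly overloaded processor
case2 = [20 + d, 20 - d, 20 + d, 20 - d, 20, 20]   # imbalance spread more evenly
```

Here σ is identical for both cases (≈ 5.77), but case 1 has a max/avg ratio of 1.5 while case 2 is about 1.35: the ratio tracks the worst-loaded processor, which is what determines the iteration time.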
But for many applications, another metric turns out to be more accurate. For these applications, a heuristic called the principle of persistence [2] holds, which allows us to use recent instrumented history as a guide to predicting load in near-future iterations. The load balancing strategy we describe can be used with either model-based or persistence-based load predictions. In a persistence-based load balancer, statistics about the load of each task on a processor are collected at that processor. The database containing the task information is used by the load balancers to produce a new mapping, and the run-time system then migrates the tasks based on this mapping.

It is important to choose the right metric to quantify load imbalance in the system. Using standard deviation to measure load imbalance may seem appropriate, but consider the two scenarios shown in Table 1. In both cases, the average load of the system is 20. If we consider the standard deviation, σ, to be a measure of imbalance, then we find that case 1 and case 2 yield the same σ of 6, whereas the utilization and the total application times differ. A better indicator of load imbalance in the system is the ratio of maximum load to average load. More formally, load imbalance (I) can be measured using

    I = L_max / L_avg    (1)

In case 1, I is 1.5, and in case 2 it takes a different value, correctly distinguishing the two scenarios. We use this metric of load imbalance as one of the evaluation criteria to measure the performance of the load balancing strategy. Notice that this criterion is dominated by the load of a single processor, viz. the most overloaded processor, because of the max operator. This is appropriate, since the execution time is determined by the worst-loaded processor and the others must wait for it to complete its step. Apart from how well the load balancer can balance load,

it is important to incur low overhead due to load balancing; otherwise, the benefit of load balancing is lost to its overhead. Therefore, we evaluate the quality of load balance, the cost of the load balancing strategy, and the total application time.

3. RELATED WORK

Load balancing has been studied extensively in the literature. For applications with regular load, static load balancing can be performed, where load balance is achieved by carefully mapping the data onto processors. Numerous algorithms have been developed for statically partitioning a computational mesh [2, 5, 6, 6]. These model the computation as a graph and use graph partitioning algorithms to divide the graph among processors. Graph and hypergraph partitioning techniques have been used to map tasks onto processors to balance load while considering locality. They are generally used as a pre-processing step and tend to be expensive. Our algorithm is employed where the application work has already been partitioned, and it balances the computational load imbalance that arises as the application progresses. Our algorithm also takes into consideration the existing mapping and moves tasks only if a processor is overloaded. For irregular applications, work stealing is employed in task scheduling and is part of runtime systems such as Cilk [3]. Work stealing is traditionally used for task parallelism of the kind seen in combinatorial search or divide-and-conquer applications, where tasks are generated continuously. A recent work by Dinan et al. [] scales work stealing to 8192 processors using the PGAS programming model and RDMA. In work that followed, a hierarchical technique described as retentive work stealing was employed to scale work stealing to over 150K cores by exploiting the principle of persistence to iteratively refine the load balance of task-based applications [23]. CHAOS [3] provides an inspector-executor approach to load balancing for irregular applications, in which the data and the associated computational balance are evaluated at runtime, before the start of the first iteration, to rebalance.
The proposed strategy is more focused towards iterative computational science applications, where computational tasks tend to be persistent. Dynamic load balancing algorithms for iterative applications can be broadly classified as centralized, distributed, and hierarchical. Centralized strategies [7, 29] tend to yield good load balance but exhibit poor scalability. Alternatively, several distributed algorithms have been proposed in which processors autonomously make load balancing decisions based on localized workload information. Popular nearest-neighbor algorithms are dimension-exchange [34] and the diffusion methods. The dimension-exchange method is performed in an iterative fashion and is described in terms of a hypercube architecture: a processor performs load balancing with its neighbor in each dimension of the hypercube. Diffusion-based load balancing algorithms were first proposed by Cybenko [9] and independently by Boillat [4]; this approach suffers from slow convergence to the balanced state. Hu and Blake [7] proposed a non-local method to determine the flow which is minimal in the l2-norm but requires global communication. The token distribution problem was studied by Peleg and Upfal [3], where the load is considered to be a token. Several diffusive load balancing policies, such as direct neighborhood and average neighborhood, have been proposed in [8, 4, 9]. In [33], a sender-initiated model is compared with a receiver-initiated one in an asynchronous setting; it also compares the Gradient Method [24], the Hierarchical Method, and DEM (dimension exchange). Diffusion-based load balancers are incremental and scale well with the number of processors, but they can only be invoked to improve load balance rather than to obtain global balance; if global balance is required, multiple iterations might be needed to converge [5]. To overcome the disadvantages of centralized and distributed schemes, hierarchical strategies [35, 23, ] have been proposed, which provide good performance and scaling.
In our proposed algorithm, global information is spread using a variant of the gossip protocol []. Probabilistic gossip-based protocols have been used as robust and scalable methods for information dissemination. Demers et al. use a gossip-based protocol to resolve inconsistencies among the Clearinghouse database servers []. Birman et al. [2] employ a gossip-based scheme for bimodal multicast, which they show to be reliable and scalable. Apart from these, gossip-based protocols have been adapted to implement failure detection, garbage collection, aggregate computation, etc.

4. GRAPEVINE LOAD BALANCER

Our distributed load balancing strategy, referred to as GrapevineLB, can conceptually be thought of as having two stages. 1) Propagation: construction of a local representation of the global state at each processor. 2) Transfer: load redistribution based on the local representation.

At the beginning of the load balancing step, the average load is calculated in parallel using an efficient tree-based all-reduce. This is followed by the propagation stage, where the information about the underloaded processors in the system is spread to the overloaded processors. Only the processor ID and load of the underloaded processors are propagated. An underloaded processor starts the propagation by randomly selecting other processors to send its information to. The receiving processors further spread the information in a similar manner. Once the overloaded processors have received the information about the underloaded processors, they autonomously make decisions about the transfer of work units. Since the processors do not coordinate at this stage, the transfer has to happen such that the probability of an underloaded processor becoming overloaded is low. We propose a randomized algorithm that meets this goal. We elaborate on these two stages in the following sections.
4.1 Information Propagation

To propagate the information about the underloaded processors in the system, GrapevineLB follows a protocol inspired by the epidemic algorithm [2] (also known as the gossip protocol []). In our case, the goal is to spread the information about the underloaded processors such that every overloaded processor receives it with high probability. An underloaded processor starts the infection by sending its information to a randomly chosen subset of processors. The size of this subset is called the fanout, f. An infected processor further spreads the infection by forwarding all the information it has to another set of f randomly selected processors. Here, each processor makes an independent random selection of peers to send the information to. We show that the number of rounds required for all processors to receive the information with high probability is O(log_f n), where n is the number of processors (see Section 5).

Algorithm 1: Informed selection at each processor P_i ∈ P
Input: f - fanout; L_avg - average load of the system; k - target number of rounds; L_i - load of this processor
 1: S ← ∅                                  ▷ set of underloaded processors
 2: L ← ∅                                  ▷ loads of underloaded processors
 3: if L_i < L_avg then
 4:     S ← {P_i}; L ← {L_i}
 5:     randomly sample {P_1, ..., P_f} ⊂ P
 6:     send (S, L) to {P_1, ..., P_f}
 7: end if
 8: for round = 2 to k do
 9:     if a message was received in the previous round then
10:         R ← P \ S                      ▷ informed selection
11:         randomly sample {P_1, ..., P_f} ⊂ R
12:         send (S, L) to {P_1, ..., P_f}
13:     end if
14: end for
15: when (S_new, L_new) is received:       ▷ new message
16:     S ← S ∪ S_new; L ← L ∪ L_new       ▷ merge information

We propose two randomized strategies of peer selection, described below. Note that although we discuss the strategies in terms of rounds for the sake of clarity, there is no explicit synchronization for rounds in our implementation.

Naive Selection: In this selection strategy, each underloaded processor independently initiates the propagation by sending its information to a randomly selected set of f peers. A receiving processor updates its knowledge with the new information. It then randomly selects f processors, out of the total of n processors, and forwards its current knowledge. This selection may include other underloaded processors.

Informed Selection: This strategy is similar to the naive strategy except that the selection of peers to send the information to incorporates the current knowledge. Since the current knowledge includes a partial list of underloaded processors, the selection process is biased to exclude these processors. This helps propagate information to the overloaded processors in a smaller number of rounds. This strategy is depicted in Algorithm 1.
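As a sanity check on the propagation stage, the following is a minimal synchronous-round simulation of naive selection. It is an illustrative sketch, not the paper's implementation: the function name, its parameters, and the simplification that every informed processor forwards in every round (rather than only those that received a message in the previous round) are assumptions made for the demo.

```python
import random

def propagate(n, underloaded, fanout, rounds, seed=0):
    """Toy synchronous-round simulation of naive gossip: every underloaded
    processor starts with its own id; each informed processor forwards its
    current knowledge to `fanout` uniformly chosen peers per round.  (For
    simplicity, every informed processor forwards every round -- a slight
    strengthening of the protocol in Algorithm 1.)  Returns the set of
    processors that have heard of at least one underloaded processor."""
    rng = random.Random(seed)
    knowledge = {p: {p} for p in underloaded}   # each source knows itself
    for _ in range(rounds):
        updates = {}
        for sender, known in list(knowledge.items()):
            for peer in rng.sample(range(n), fanout):
                updates.setdefault(peer, set()).update(known)
        for peer, info in updates.items():      # deliver after the round
            knowledge.setdefault(peer, set()).update(info)
    return set(knowledge)
```

With n = 256 and a single source, a fanout of 2 informs essentially every processor within roughly log_2 n plus a few extra rounds, consistent with the O(log_f n) bound derived in Section 5.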
4.2 Probabilistic Transfer of Load

In our distributed scheme, the decision making for the transfer of load is decentralized. Every processor needs to make these decisions in isolation, given the information from the propagation stage. We propose two randomized schemes to transfer load.

Naive Transfer: The simplest strategy is to select processors uniformly at random from the list of underloaded processors. An overloaded processor transfers load until its load is below a specified threshold. The value of the threshold indicates how much imbalance is acceptable. As one would expect, this random selection tends to overload processors whose load is close to the average. This is illustrated in Figure 1 and described in detail in Section 7.1.

Informed Transfer: A more informed transfer can be made by randomly selecting underloaded processors based on their initial load.

Algorithm 2: Informed transfer at each processor P_i ∈ P
Input: O - set of objects on this processor; S - set of underloaded processors; T - threshold to transfer; L_i - load of this processor; L_avg - average load of the system
 1: compute p_j ∀ P_j ∈ S                 ▷ using eq. (2)
 2: compute F_j = Σ_{k<j} p_k             ▷ using eq. (3)
 3: while L_i > T × L_avg do
 4:     select an object O_i ∈ O
 5:     randomly sample X ∈ S using F     ▷ using eq. (4)
 6:     if L_X + load(O_i) < L_avg then
 7:         L_X ← L_X + load(O_i)
 8:         L_i ← L_i − load(O_i)
 9:         O ← O \ {O_i}
10:     end if
11: end while

We achieve this by assigning to each processor a probability that decreases with its load, in the following manner:

    p_i = (1/Z) (1 − L_i / L_avg)                    (2a)
    Z = Σ_i (1 − L_i / L_avg) = N − (Σ_i L_i) / L_avg    (2b)

Here p_i is the probability assigned to the i-th processor, L_i its load, L_avg the average load of the system, and Z a normalization constant. To select processors according to this distribution, we use the inversion method for generating samples from a probability distribution. More formally, if p(x) is a probability density function, then the cumulative distribution function F(y) is defined as:

    F(y) = p(x < y) = ∫_{−∞}^{y} p(x) dx    (3)

Given a uniformly distributed random sample r_s ∈ [0, 1], a sample from the target distribution can be computed by:

    y_s = F^{−1}(r_s)    (4)

Using the above, we randomly select processors according to p_i for transferring load. This is summarized in Algorithm 2. Figure 1 illustrates the results.
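The inversion method of eqs. (2)-(4), specialized to the discrete case used here, can be sketched as follows. Function names and load values are hypothetical examples (a system average of 20 is assumed); this is not code from the paper.

```python
import bisect
import random

def transfer_probabilities(loads, l_avg):
    """Eq. (2): weight each underloaded processor by (1 - L_i / L_avg),
    then normalize so the probabilities sum to one."""
    weights = [1.0 - l / l_avg for l in loads]
    z = sum(weights)                      # normalization constant Z
    return [w / z for w in weights]

def pick_receiver(cdf, rng):
    """Inversion method, eqs. (3)-(4): push a uniform sample through the
    inverse of the discrete cumulative distribution."""
    return bisect.bisect_left(cdf, rng.random())

# Hypothetical underloaded processors; the system average load is 20.
loads = [5.0, 10.0, 15.0]
probs = transfer_probabilities(loads, 20.0)

# Discrete CDF for inversion sampling.
cdf, acc = [], 0.0
for p in probs:
    acc += p
    cdf.append(acc)
```

The least-loaded processor (load 5) receives probability 1/2, three times that of the most-loaded underloaded processor (load 15, probability 1/6), so transfers favor the processors with the most headroom.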
4.3 Partial Propagation

An interesting question to ask is what happens if the overloaded processors have incomplete information. This may happen with high probability if the propagation stage is terminated earlier than log_f n rounds. We hypothesize that to obtain good load balance, information about all the underloaded processors is not necessary: an overloaded processor can know only a partial set of the underloaded processors and still achieve good balance. We empirically confirm this hypothesis with a set of experiments in Section 7.1.

4.4 Grapevine+

Even though the scheme in which every processor makes autonomous decisions for randomized transfer of work is less

likely to cause underloaded processors to become overloaded, this may still happen. To guarantee that none of the underloaded processors become overloaded after the transfer, we propose an improvement over the original GrapevineLB strategy. In the improved scheme, referred to as Grapevine+LB, we employ a negative-acknowledgement based mechanism that allows an underloaded processor to reject a transfer of a work unit. For every potential work unit transfer, the sender initially sends a message to the receiver containing details about the load of the work unit. The receiver, depending on its current load, chooses to either accept or reject. If accepting the work unit would make the receiver overloaded, it rejects with a Nack (negative acknowledgement). A sender, on receiving a Nack, will try to find another processor from the list of underloaded processors. This trial is carried out a limited number of times, after which the processor gives up. This scheme ensures that no underloaded processor becomes overloaded. Although it requires exchanging additional messages, the cost is not significant, as the communication is overlapped with the decision-making process.

Figure 1 (naive transfer, top row, vs. informed transfer, bottom row): (a) initial load of the underloaded processors, (b) probabilities assigned to each of the processors, (c) work units transferred to each underloaded processor, (d) final load of the underloaded processors after transfer.
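A minimal sketch of the Grapevine+ Nack handshake described above, with hypothetical names; the real implementation exchanges asynchronous messages, which is condensed here into a synchronous loop over candidate receivers.

```python
def try_transfer(unit_load, receiver_loads, candidates, l_avg, max_tries=3):
    """Offer a work unit to candidate receivers in order.  A receiver
    Nacks (rejects) the offer if accepting would push its load over the
    system average; after max_tries rejections the sender gives up.
    Returns the id of the accepting receiver, or None."""
    for attempt, r in enumerate(candidates):
        if attempt >= max_tries:
            break  # sender gives up after a limited number of tries
        if receiver_loads[r] + unit_load < l_avg:
            receiver_loads[r] += unit_load  # receiver accepts (Ack)
            return r
        # else: receiver replies with a Nack; try the next candidate
    return None
```

Because the receiver checks each offer against its own current load, no underloaded processor can be pushed over the average, at the price of one extra message round-trip per offer.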
5. ANALYSIS OF THE ALGORITHM

This section presents an analysis of the information propagation algorithm. We consider a system of n processors and, for simplicity, assume that the processors communicate in synchronous rounds with a fanout f. Note that in practice the communication is asynchronous (Section 6). We show that the expected number of rounds required to propagate information to all the processors in the system with high probability is O(log_f n). Although we analyze the case of a single sender, the results are the same for multiple senders, since they communicate concurrently and independently.

In round r = 1, one processor initiates the information propagation by sending out f messages. In all successive rounds, each processor that received a message in the previous round sends out f messages. We are interested in the probability, p_s, that any processor P_i has received the message by the end of round s. We can compute it as p_s = 1 − q_s, where q_s is the probability that processor P_i did not receive any message by the end of round s. The probability that processor P_i does not receive a particular message sent by some other processor is (1 − 1/n). Further, the number of messages sent out in round r is f^r, since the fanout is f. Clearly,

    q_1 = (1 − 1/n)^f    (5)

Therefore, the probability that P_i did not receive any message in any of the rounds r ∈ {1, ..., s} is

    q_s = ∏_{r=1}^{s} (1 − 1/n)^{f^r}
        = (1 − 1/n)^{f + f^2 + f^3 + ... + f^s}
        = (1 − 1/n)^{(f/(f−1))(f^s − 1)}
        ≈ (1 − 1/n)^{γ f^s},  where γ = f/(f−1),

since f^s − 1 ≈ f^s for large f^s. Taking the log of both sides,

    log q_s ≈ γ f^s log(1 − 1/n) ≈ −γ f^s / n
    q_s ≈ exp(−γ f^s / n)

Approximating by the first two terms of the Taylor expansion of e^{−x},

    q_s ≈ 1 − γ f^s / n

Since we want to ensure that the probability that a processor P_i did not receive any message in s rounds is very low, i.e. q_s ≤ ε, substituting this in the above yields

    γ f^s / n ≥ 1 − ε
    s ≥ log_f ( n (1 − ε) / γ ) = O(log_f n)

Our simulation results shown in Figure 3 concur with the above analysis. It is evident that increasing the fanout results in a significant reduction of the number of rounds required to propagate the information.
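The closed form above can be sanity-checked numerically. The sketch below (with assumed function and parameter names, not code from the paper) finds the smallest s for which q_s ≈ (1 − 1/n)^{γ f^s} drops below a tolerance ε, and exhibits the O(log_f n) behavior: doubling n at f = 2 costs one extra round, while a larger fanout shrinks the round count.

```python
def expected_rounds(n, f, eps=0.01):
    """Smallest s with q_s = (1 - 1/n) ** (gamma * f**s) <= eps, where
    gamma = f / (f - 1), following the closed form derived above."""
    gamma = f / (f - 1)
    s = 1
    while (1.0 - 1.0 / n) ** (gamma * f ** s) > eps:
        s += 1
    return s
```

For ε = 0.01, going from n = 8192 to n = 16384 at f = 2 adds exactly one round, and f = 4 roughly halves the round count relative to f = 2, matching the trend in Figure 3.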

6. IMPLEMENTATION

We provide an implementation of the proposed algorithm as a load balancing strategy in Charm++. Charm++ is a parallel programming model with message-driven parallel objects, called chares, which can be migrated from one processor to another. Chares are the basic units of parallel computation in Charm++ and are initially mapped onto processors using a default mapping or any custom mapping. The load balancing framework in Charm++ supports instrumenting the load information of work units from the recent past and using it as a guideline for the near future. The key advantage of this approach is that it is application independent, and it has been shown to be effective for a large class of applications, such as NAMD [27] and ChaNGa [8]. Charm++ has a user-friendly interface for obtaining dynamic measurements about chares. The load balancers, which are pluggable modules in Charm++, can use this instrumented load information to make the load balancing decisions; based on these decisions, the Charm++ RTS migrates the chares. Since the Charm++ RTS stores information about chares and processors in a distributed database, it is compatible with GrapevineLB's implementation requirements.

Although we have described the GrapevineLB algorithm in terms of rounds, an implementation using barriers to enforce the rounds would incur considerable overhead. Therefore, we take an asynchronous approach in our implementation. Such an approach, however, poses the challenge of limiting the number of messages in the system. We overcome this by using a TTL (time to live) based mechanism, which prevents information from circulating forever. It is implemented as a counter embedded in the messages being propagated. The first message initiated by an underloaded processor is initialized with a TTL equal to the desired number of rounds before being sent. A receiving processor incorporates the information and sends out a new message with updated information and a decremented TTL. A message with TTL = 0 is not forwarded and is considered expired. The key challenge that remains is to detect quiescence, i.e.
when all the messages have expired. To this end, we use a distributed termination detection algorithm [26].

7. EVALUATION
We evaluate the various stages of GrapevineLB with simulations using real data, and compare it with alternative strategies using real-world applications.

7.1 Evaluation using Simulation
We first present results of simulating the GrapevineLB strategy using real data on a single processor. This simulation allows us to demonstrate the effect of the various choices made in the different stages of the algorithm. For the simulations, the system model is a set of 8192 processors, initialized with load from a real run of an adaptive mesh refinement application on the same number of cores of IBM BG/Q. This application was decomposed into 253,45 work units. Figure 2 shows the load distribution for this application when the load balancer was invoked. The average load of the system is 35 and the maximum load is 66; therefore I, the metric for imbalance from Equation 1, is 0.88. Note that a value of I = 0 indicates perfect balance in the system. Among the 8192 processors, 4095 are overloaded and 4097 are either underloaded or have their load close to the average.

Figure 2: Load distribution for a run of AMR used in simulation. Counts of processors for various loads are depicted.

Figure 3: Expected number of rounds taken to spread information from one source to 99% of the overloaded processors for different system sizes and fan-outs.

We perform a step-by-step analysis of all the stages of the proposed algorithm based on this system model. It is to be noted that we have simulated synchronous rounds. The experiments were run 5 times, and we report the results as the mean along with its standard deviation.

Number of Rounds and Fan-out: Figure 3 illustrates the dependence of the expected number of rounds required to spread information on the system size. Here we consider only one source initiating the propagation and report when 99% of the processors have received the information.
As the system size (n) increases, the expected number of rounds increases logarithmically, O(log n), for a fixed fan-out. This is in accordance with our analysis in Section 5. Note that the number of rounds decreases with an increase in the fan-out used for the information propagation. At a system size of 16K, a fan-out of 2 requires 17 rounds to propagate the information to 99% of the processors, whereas a fan-out of 4 takes 8 rounds.

Naive vs. Informed Propagation: Figure 4 compares the expected number of rounds taken to propagate information using the Naive and Informed propagation schemes. Although the expected number of rounds for both schemes is on the order of O(log n), the Informed scheme takes one less round to propagate the information. This directly results in a reduction of the number of messages, as most of the messages are sent in the later rounds. We can also choose to vary the fan-out adaptively to reduce

Figure 4: Expected number of rounds taken to spread information from one source to 99% of the overloaded processors using the Naive and Informed schemes for different system sizes. Here f = 2 and 50% of the system size is underloaded.

Figure 5: Evaluation of the load balancer with partial information. Max load (left) and imbalance (right) decrease as more information about underloaded processors is available. It is evident that complete information is not necessary to obtain good performance.

the number of rounds required, while not increasing the number of messages significantly. Instead of having a fixed fan-out, we increase the fan-out in the later stages. This is based on the observation that messages in the initial stages do not carry a lot of information. We evaluated this for a system of 4096 processors where 50% were overloaded. Information propagation without the adaptive variation requires 13 rounds with a total of 796 messages, while an adaptive fan-out strategy, where we use a fan-out of 2 initially, increase the fan-out to 3 beyond 5 rounds and further increase it to 4 beyond 7 rounds, helps reduce the number of rounds to 10 with a total of 864 messages.

Naive vs. Informed Transfer: We compare the performance of the two randomized strategies for transfer given in Section 4. Figure 10 shows the Naive scheme for the transfer of load, where an underloaded processor is selected uniformly at random. Figure 11 shows the probability distribution over the underloaded processors for the Informed transfer strategy, computed using Equation 2, and the transfer of load that follows this distribution. Each figure shows the initial load distribution of the underloaded processors, the probability assigned to each processor (uniform in the Naive case), the number of transfers based on that probability distribution, and the final load of the underloaded processors.
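The Informed transfer samples destinations from a probability distribution over the underloaded processors. Since Equation 2 is not reproduced in this excerpt, the sketch below assumes a probability proportional to each underloaded processor's spare capacity (average minus load); that assumption matches the qualitative behavior described, but the paper's exact distribution may differ.

```python
import random

def pick_targets(loads, k, seed=0):
    """Pick k transfer destinations among underloaded processors, with
    probability proportional to spare capacity (avg - load). The weighting
    is an illustrative stand-in for Equation 2."""
    rng = random.Random(seed)
    avg = sum(loads) / len(loads)
    under = [p for p, l in enumerate(loads) if l < avg]
    weights = [avg - loads[p] for p in under]
    return [rng.choices(under, weights=weights)[0] for _ in range(k)]

loads = [50, 20, 35, 10, 40]   # average 31: processors 1 and 3 are underloaded
targets = pick_targets(loads, 1000, seed=42)
# Processor 3 has more spare capacity, so it should attract more transfers.
print(targets.count(1), targets.count(3))
```

Weighting by spare capacity is what keeps lightly loaded processors from all receiving the same number of transfers, which is the failure mode of the uniform (Naive) scheme.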
Under the Naive scheme, it can be seen that the maximum load of the initially underloaded processors is 44, while the average is 35. Comparison with Figure 10 clearly shows that the final distribution of load under the Informed scheme is much more reasonable. Further, the maximum load of the underloaded processors is 38, while the system average is 35.

Evaluation of a Pathological Case: We evaluate the behavior of the proposed algorithm under the pathological case where just one out of 8192 processors is significantly overloaded (I is 6.8). The analysis in Section 5 shows that q_s decreases rapidly with the rounds for a particular source. Since all underloaded processors will initiate information propagation, this scenario shouldn't be any worse in expectation. We experimentally verify this and find that, for a fan-out value of 2 and using the Naive strategy for information propagation, it takes a maximum of 14 rounds to propagate the information, which is similar to the case where many processors are overloaded. Once the information is available at the overloaded processor, it randomly transfers its work units, reducing I from 6.8 to 0.1.

Evaluation of Quality of Balancing: To answer the question posed in the earlier section as to what happens if the overloaded processors have incomplete information, we simulate this scenario by providing information about only a partial subset of the underloaded processors to the overloaded processors. The subset of underloaded processors for each processor is selected uniformly at random from the set of underloaded processors, and the probabilistic transfer of load is then carried out based on this partial information. The quality is evaluated based on the metric I given by Equation 1. Figure 5 shows the expected maximum load of the system along with its standard deviation, σ, and the value of the I metric. It can be seen that, on the one hand, having less information (about 5 underloaded processors) yields a considerable improvement in load balance, although not the optimum possible. On the other hand, having complete information is also not necessary to obtain a good load balance.
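The partial-information experiment can be mimicked with a toy sketch (an assumed setup, not the paper's code): each overloaded processor learns only k random underloaded processors, sheds its excess unit by unit to the least loaded one it knows, and we measure the imbalance. The metric is assumed to be I = max/avg - 1, which is consistent with the values quoted above (max 66, average 35, I = 0.88).

```python
import random

def imbalance(loads):
    # Assumed form of Equation 1: I = max/avg - 1 (I = 0 is perfect balance).
    avg = sum(loads) / len(loads)
    return max(loads) / avg - 1

def balance_with_partial_info(loads, k, seed=0):
    """Each overloaded processor sees k random underloaded processors and
    sends unit loads to the least loaded among them (toy transfer rule)."""
    rng = random.Random(seed)
    loads = loads[:]
    avg = sum(loads) / len(loads)
    under = [p for p, l in enumerate(loads) if l < avg]
    for p, l in enumerate(loads):
        if l <= avg:
            continue
        view = rng.sample(under, min(k, len(under)))   # partial knowledge
        while loads[p] > avg + 1:
            dest = min(view, key=lambda q: loads[q])
            loads[dest] += 1
            loads[p] -= 1
    return loads

random.seed(7)
loads = [random.randint(0, 100) for _ in range(512)]
for k in (1, 8, 64):
    print(k, round(imbalance(balance_with_partial_info(loads, k)), 3))
```

Even a small view removes most of the imbalance, while full knowledge only buys a marginal further improvement, mirroring Figure 5.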
Therefore, this gives us an opportunity to trade off between the overhead incurred and the load balance achieved.

Evaluation of Information Propagation: Based on the earlier experiment, it is evident that complete information about the underloaded processors is not required for a good load balance. Therefore, we evaluate the expected number of rounds taken to propagate partial information about the underloaded processors to all the overloaded processors. Figure 6 shows the percentage of overloaded processors that have received the information as the rounds progress, for a fan-out of 2. The x-axis is the number of rounds and the y-axis is the percentage of overloaded processors that received the information. We plot the number of rounds required to propagate information about 2, 4, 2048 and 4097 underloaded processors to all the overloaded processors. In the case of propagating information about at least 2 underloaded processors in the system, 100% of the overloaded processors receive information about at least 2 underloaded processors in 12 rounds, and 99.8% receive it in 9 rounds. It took 18 rounds to propagate information about all the underloaded processors in the system to all the overloaded processors. This clearly indicates that if we require only partial information, the total number of rounds can be reduced, which results in a reduction of the load balancing cost.
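This reduction in rounds is exactly what the TTL cap in the implementation (Section 6) exploits: instead of barriers, messages carry a counter and simply stop being forwarded when it runs out. A toy asynchronous sketch (assumed message model and function names, not the Charm++ code):

```python
import random
from collections import deque

def propagate_with_ttl(n, sources, ttl, fanout=2, seed=0):
    """Asynchronous TTL-bounded gossip: each message carries a set of known
    underloaded processors plus a TTL; a receiver merges the set and, while
    the TTL is still positive, forwards its updated knowledge to `fanout`
    random targets with the TTL decremented. An empty queue is quiescence."""
    rng = random.Random(seed)
    known = {p: set() for p in range(n)}
    queue = deque()
    for s in sources:                         # underloaded processors initiate
        known[s].add(s)
        for _ in range(fanout):
            queue.append((rng.randrange(n), {s}, ttl))
    messages = 0
    while queue:                              # empty queue == all expired
        target, info, t = queue.popleft()
        messages += 1
        known[target] |= info
        if t > 1:                             # a TTL = 0 message is never sent
            for _ in range(fanout):
                queue.append((rng.randrange(n), set(known[target]), t - 1))
    covered = sum(1 for p in range(n) if known[p]) / n
    return covered, messages

covered, messages = propagate_with_ttl(256, range(8), ttl=6)
print(covered, messages)
```

The TTL bounds the total message count to sources × (fanout + fanout² + … + fanout^TTL), so choosing a TTL below log n caps the overhead while, per the experiments above, still delivering enough partial information.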

Figure 6: Percentage of processors having various amounts of partial information as rounds progress. There are a total of 4096 underloaded processors. 99% receive information about 4 processors by the 8th round, while it takes 12 rounds for all the 4096 underloaded processors.

From the above experiments, it is evident that a good load balance can be attained with partial information. This is particularly useful because propagating partial information takes fewer rounds and incurs less overhead. We utilize this observation to choose a value of TTL much lower than log n for the comparison with other strategies on real applications.

7.2 Evaluation using Applications
We evaluate our GrapevineLB load balancing strategy on two applications, LeanMD and adaptive mesh refinement (AMR), by comparing it against various load balancing strategies. We use GrapevineLB with a fixed set of configurations, {f = 2, TTL = 0.4 log2 n, Informed Propagation, Informed Transfer}, and focus on comparing with the other load balancing strategies. The results presented here are obtained from experiments run on Mira, a 49,152-node IBM Blue Gene/Q installation at the ALCF. Each node consists of 16 64-bit PowerPC A2 cores running at 1.6 GHz. The interconnect in this system is a 5D torus. In the following sections, we first provide details about the applications and the load balancers, and then present our evaluation results.

7.2.1 Applications
Adaptive Mesh Refinement: AMR is an efficient technique used to perform simulations on very large meshes which would otherwise be difficult to simulate even on modern-day supercomputers. This application simulates a popular yet simple partial differential equation called advection. It uses a first-order upwind method in 2D space for solving the advection equation. The simulation begins on a coarse-grained structured grid of uniform size. As the simulation progresses, individual grids are either refined or coarsened.
This leads to a slowly growing load imbalance, which requires frequent load balancing to maintain high efficiency of the system. This application has been implemented using the object-based decomposition approach in Charm++ [22].

LeanMD: LeanMD is a molecular dynamics simulation program written in Charm++ that simulates the behavior of atoms based on the Lennard-Jones potential. The computations performed in this code are similar to the short-range non-bonded force calculation in NAMD [27], an application that has won the Gordon Bell award. The three-dimensional simulation space consisting of atoms is divided into cells. In each iteration, force calculations are done for all pairs of atoms that are within a specified cutoff distance. For a pair of cells, the force calculation is assigned to a set of objects called computes. After the force calculation is performed by the computes, the cells update the acceleration, velocity and position of the atoms within their space. The load imbalance in LeanMD is primarily due to the variable number of atoms in a cell. The load on the computes is proportional to the number of atoms in the cells, which changes over time as the atoms move based on the force calculation. We present a simulation of LeanMD for a 2.8 million atom system. The load imbalance grows gradually, and therefore load balancing is performed infrequently.

7.2.2 Load Balancers
We compare the performance of GrapevineLB against several other strategies, including centralized, distributed and hierarchical strategies. The load balancing strategies are:

GreedyLB: A centralized strategy that uses a greedy heuristic to iteratively assign the heaviest tasks to the least loaded processors. This strategy does not take into consideration the current assignment of tasks to processors.

AmrLB: A centralized strategy that performs refinement-based load balancing, taking into account the current distribution of work units. It is tuned for the AMR application [22].

HierchLB: A hierarchical strategy [35] in which processors are divided into independent groups and the groups are organized in a hierarchical manner.
At each level of the hierarchy, the root node performs the load balancing for the processors in its sub-tree. This strategy can use different load balancing algorithms at different levels. It is an optimized implementation that has been used in strong scaling NAMD to more than 200K cores.

DiffusLB: A neighborhood averaging diffusion strategy [8, 33] in which each processor sends information to its neighbors in a domain, and load is exchanged based on this information. A domain consists of a node and all its neighbors, where the neighborhood is determined by the physical topology. On receiving the load information from all its neighbors, a node computes the average load of the domain and determines the amount of work units to be transferred to each of its neighbors. This is a two-phase algorithm: in the first phase tokens are sent, and in the second phase the actual movement of work units is performed. There are multiple iterations of token exchange, and termination is detected via quiescence [26].

We use the following metrics to evaluate the performance of the various load balancing strategies: 1) Execution time per step of the application, which indicates the quality of the load balancing strategy. 2) Load balancing overhead, which is the time taken by the load balancing strategy. 3) Total application time, which includes the time for each iteration as well as the time for the load balancing strategy.

7.2.3 Evaluation with AMR
We present an evaluation of the different load balancing strategies on the AMR application on BG/Q, ranging from 4096 to 131072 cores. AMR requires frequent load balancing to run efficiently because coarsening and refinement of the mesh introduce dynamic load imbalance.

Time per Iteration: First we compare the execution

Figure 7: Comparison of time per step (excluding load balancing time) for various load balancing strategies for AMR on Mira (IBM BG/Q). GV+ achieves quality similar to the other best performing strategies. Note that the axes are log scale.

Table 2: Average cost (in seconds) per load balancing step of various strategies for AMR.

Table 3: Total application time (in seconds) for AMR on BG/Q. The proposed strategies Gv and Gv+ perform the best across all scales.

time per iteration of the application to evaluate the quality of the load balancers. This directly relates to the I metric given in Equation 1, because as I approaches 0, the maximum load of the system approaches the average load, resulting in the least time per iteration. Figure 7 shows, on a logarithmic scale, the time taken per iteration with the various load balancing strategies. The base run was made without any load balancing and is referred to as NoLB. It is evident that with NoLB the efficiency of the application reduces as it is scaled to a higher number of cores. The Grapevine+LB load balancer (shown as GV+ LB) reduces the iteration time by 22% on 4K cores and 50% on 131K cores. AmrLB and HierchLB also show comparable performance on this metric. We see an increase in the gain because on a larger number of cores the load imbalance becomes more significant: the number of work units per processor decreases, and the chance that a processor becomes overloaded increases. DiffusLB also shows some improvement, but much less than the aforementioned ones at larger scale.
For 131K cores, it reduces the time per step by 22%, while the others (AmrLB, HierchLB and Grapevine+LB) reduce it by 50%. An interesting thing to note here is that Grapevine+LB performs better than GrapevineLB (shown as GV LB) for core counts of more than 32K. This is due to the fact that Grapevine+LB ensures that no underloaded processor becomes overloaded, using a Nack mechanism. From this it is evident that the quality of load balance performed by Grapevine+LB is at par with the quality of the centralized and hierarchical strategies.

Overhead: Table 2 shows the overhead incurred by the various load balancers in one load balancing step for different system sizes. The overhead (load balancing cost) includes the time for finding the new assignment of objects to processors and the time for migrating the objects. The overhead incurred by AmrLB is 2.0 s for 4K cores and increases with the system size to a maximum of 12.4 s for 131K cores. HierchLB incurs an overhead of 5.5 s for 8K cores, and thereafter the cost reduces to a minimum of 0.29 s for 131K cores. This is due to the fact that as the number of processors increases, the number of sub-groups also increases, resulting in a reduction of work units per group. Hence, the time taken by the root to carry out the load balancing strategy reduces. The distributed load balancing strategies, GrapevineLB and DiffusLB, incur considerably less overhead in comparison to the other strategies.

Total Application Time: The total application time using the various strategies is given in Table 3. This application requires frequent load balancing. The overhead of the centralized strategies diminishes the benefit of load balancing: AmrLB does not improve the total application time because of its load balancing overhead. This is true for the hierarchical strategy as well. DiffusLB results in a reduction of the execution time by 28% for 16K cores and 24.8% for 131K cores, whereas GrapevineLB gives reductions of 35% and 49.6% respectively.
GrapevineLB provides a large performance gain by achieving a better load balance and incurring less overhead. It enables more frequent load balancing to improve the efficiency. A future direction would be to use MetaBalancer [28] to choose the ideal load balancing period.

7.2.4 Evaluation with LeanMD
We evaluate LeanMD by executing 100 iterations, invoking the load balancer for the first time at the 10th iteration and periodically every 30 iterations thereafter.

Execution time per iteration: We compare the execution time per iteration of the application to evaluate the quality of the load balancers. From 4K to 16K cores, the centralized, hierarchical and GrapevineLB strategies improve the balance by up to 42%. The diffusion-based strategy improves the balance by only 35% at 8K cores, and thereafter it shows diminishing gains. GrapevineLB, on the other hand, performs at par with the centralized load balancer up to 32K cores. At 131K cores, it gives an improvement of only 25%, in comparison to the 36% given by the centralized scheme. This reduction is because the number of tasks per processor decreases to 4 at 131K cores, causing refinement-based load balancers to perform suboptimally. GrapevineLB is consistently better than DiffusLB because it has a representation of the global state of the system, which helps it make better load balancing decisions.

Figure 8: Comparison of time per step (excluding load balancing time) for various load balancing strategies for LeanMD on Mira (IBM BG/Q). Note that the axes are log scale.

Table 4: Average cost per load balancing step (in seconds) of various strategies for LeanMD.

Table 5: Total application time (in seconds) for LeanMD on BG/Q.

Overhead: Table 4 presents a comparison of the overhead incurred by the various strategies for a single load balancing step. The load balancing cost of the centralized strategy is very high, on the order of tens of seconds. The high overhead of GreedyLB is due to the statistics collection, the making of the decision at a central location, and the migration cost. The hierarchical strategy, HierchLB, incurs less overhead. It takes 3.7 s for 4K cores, decreasing to 0.26 s as the system size increases to 131K cores. The overhead of DiffusLB is 0.8 s for 4K cores and decreases thereafter. This is because the number of work units per core decreases as the number of cores increases. Finally, we observe that GrapevineLB has an overhead of 0.7 s for 4K cores, which decreases with an increase in system size to 0.3 s for 16K cores and thereafter increases to 0.8 s for 131K cores. The load balancing cost for GrapevineLB includes the time for information propagation and the transfer of work units. At 4K cores the load balancing time is dominated by the transfer of work units. As the system size increases, the number of work units per processor decreases. This results in the cost being dominated by the information propagation.
Total Application Time: Table 5 shows the total application time for LeanMD. The centralized strategy improves the total application time, but only for core counts up to 16K. Beyond 16K cores, the overhead due to load balancing exceeds the gains and results in an increase in the total application time. DiffusLB incurs less overhead in comparison to the centralized and hierarchical strategies, but it does not show substantial gains because the quality of its load balance is not as good. At 32K cores, it gives a reduction of 2% in total execution time, while GrapevineLB gives 34% and HierchLB gives 33%. HierchLB incurs less overhead in comparison to the centralized strategies. It reduces the total execution time by 37% for 8K cores, while GrapevineLB reduces it by 42%. GrapevineLB consistently gives better performance than the other load balancing strategies. Grapevine+LB gives the maximum performance benefit, reducing the total application time by 20% for 131K cores, 40% for 16K cores, and around 42% for 4K and 8K cores. Thus, GrapevineLB and Grapevine+LB provide an improvement in performance by achieving a high quality load balance with significantly less overhead.

8. CONCLUSION
We have presented GrapevineLB, a novel algorithm for distributed load balancing. It includes a lightweight information propagation stage based on a gossip protocol to obtain partial information about the global state of the system. Exploiting this information, GrapevineLB probabilistically transfers work units to obtain a high quality load distribution. We have demonstrated the performance gains of GrapevineLB by comparing it against various centralized, distributed and hierarchical load balancing strategies for molecular dynamics simulation and adaptive mesh refinement. GrapevineLB is shown to match the quality of the centralized strategies, in terms of the time per iteration, while avoiding the associated bottlenecks. Our experiments demonstrate that it significantly reduces the total application time in comparison to the other load balancing strategies, as it achieves a good load distribution while incurring less overhead.
Acknowledgment
The authors would like to thank Phil Miller, Jonathan Lifflander and Nikhil Jain for their valuable help in proofreading. This research was supported in part by the US Department of Energy under grant DOE DE-SC845 and by NSF ITR-HECURA-83388. This research also used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. Experiments for this work were performed on Mira and Vesta, IBM Blue Gene/Q installations at Argonne National Laboratory. The authors would like to acknowledge the PEACEndStation and PARTS projects for the machine allocations provided by them.

9. REFERENCES
[1] I. Ahmad and A. Ghafoor. A semi-distributed task allocation strategy for large hypercube supercomputers. In Conference on Supercomputing, 1990.

[2] K. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal multicast. ACM Transactions on Computer Systems (TOCS), 1999.
[3] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In PPoPP, 1995.
[4] J. E. Boillat. Load balancing and Poisson equation in a graph. Concurrency: Practice and Experience, 2(4):289–313, 1990.
[5] U. Catalyurek, E. Boman, K. Devine, D. Bozdag, R. Heaphy, and L. Riesen. Hypergraph-based dynamic load balancing for adaptive scientific computations. In Proc. of 21st International Parallel and Distributed Processing Symposium (IPDPS'07). IEEE, 2007. Best Algorithms Paper Award.
[6] C. Chevalier, F. Pellegrini, I. Futurs, and U. B. I. Improvement of the efficiency of genetic algorithms for scalable parallel graph partitioning in a multi-level framework. In Proceedings of Euro-Par 2006, LNCS, pages 243–252, 2006.
[7] Y.-C. Chow and W. H. Kohler. Models for dynamic load balancing in homogeneous multiple processor systems. In IEEE Transactions on Computers, 1982.
[8] A. Corradi, L. Leonardi, and F. Zambonelli. Diffusive load balancing policies for dynamic applications. IEEE Concurrency, 7(1):22–31, 1999.
[9] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. Journal of Parallel and Distributed Computing, 7(2):279–301, 1989.
[10] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In ACM Symposium on Principles of Distributed Computing, 1987.
[11] J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. Scalable work stealing. In Conference on High Performance Computing Networking, Storage and Analysis, 2009.
[12] George Karypis and Vipin Kumar. A coarse-grain parallel formulation of multilevel k-way graph partitioning algorithm. In Proc. of the 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997.
[13] George Karypis and Vipin Kumar. Multilevel k-way Partitioning Scheme for Irregular Graphs.
Journal of Parallel and Distributed Computing, 48:96–129, 1998.
[14] A. Hać and X. Jin. Dynamic load balancing in a distributed system using a decentralized algorithm. In Intl. Conf. on Distributed Computing Systems, 1987.
[15] B. Hendrickson and K. Devine. Dynamic load balancing in computational mechanics. Computer Methods in Applied Mechanics and Engineering, 184(2):485–500, 2000.
[16] B. Hendrickson and R. Leland. The Chaco user's guide. Technical Report SAND 93-2339, Sandia National Laboratories, Albuquerque, NM, Oct. 1993.
[17] Y. Hu and R. Blake. An optimal dynamic load balancing algorithm. Technical report, Daresbury Laboratory, 1995.
[18] P. Jetley, F. Gioachin, C. Mendes, L. V. Kale, and T. R. Quinn. Massively parallel cosmological simulations with ChaNGa. In IPDPS, 2008.
[19] L. V. Kalé. Comparing the performance of two dynamic load distribution methods. In Proceedings of the 1988 International Conference on Parallel Processing, St. Charles, IL, August 1988.
[20] L. V. Kalé. The virtualization model of parallel programming: Runtime optimizations and the state of art. In LACSI 2002, Albuquerque, October 2002.
[21] W. Kermack and A. McKendrick. Contributions to the mathematical theory of epidemics. II. The problem of endemicity. Proceedings of the Royal Society of London. Series A, 138(834):55–83, 1932.
[22] A. Langer, J. Lifflander, P. Miller, K.-C. Pan, L. V. Kale, and P. Ricker. Scalable Algorithms for Distributed-Memory Adaptive Mesh Refinement. In SBAC-PAD 2012, New York, USA, October 2012.
[23] J. Lifflander, S. Krishnamoorthy, and L. V. Kale. Work stealing and persistence-based load balancers for iterative overdecomposed applications. In HPDC, 2012.
[24] F. C. H. Lin and R. M. Keller. The gradient model load balancing method. Software Engineering, IEEE Transactions on, 13(1):32–38, 1987.
[25] Y.-J. Lin and V. Kumar. And-parallel execution of logic programs on a shared-memory multiprocessor. J. Log. Program., 10(1/2/3&4):155–178, 1991.
[26] F. Mattern. Algorithms for distributed termination detection. Distributed Computing, 2(3):161–175, 1987.
[27] C. Mei, L. V. Kalé, et al.
Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime. In Proceedings of the 2011 ACM/IEEE Conference on Supercomputing.
[28] H. Menon, N. Jain, G. Zheng, and L. V. Kalé. Automated load balancing invocation based on application characteristics. In IEEE Cluster, 2012.
[29] L. M. Ni and K. Hwang. Optimal load balancing in a multiple processor system with many job classes. In IEEE Trans. on Software Eng., volume SE-11, 1985.
[30] D. Peleg and E. Upfal. The token distribution problem. SIAM Journal on Computing, 18(2):229–243, 1989.
[31] S. Sharma, R. Ponnusamy, B. Moon, Y. Hwang, R. Das, and J. Saltz. Run-time and compile-time support for adaptive irregular problems. In Proceedings of Supercomputing 1994, Nov. 1994.
[32] Y. Sun, G. Zheng, P. Jetley, and L. V. Kale. An Adaptive Framework for Large-scale State Space Search. In IPDPS, 2011.
[33] M. H. Willebeek-LeMair and A. P. Reeves. Strategies for dynamic load balancing on highly parallel computers. In IEEE Transactions on Parallel and Distributed Systems, September 1993.
[34] C. Xu, F. C. M. Lau, and R. Diekmann. Decentralized remapping of data parallel applications in distributed memory multiprocessors. Concurrency: Practice and Experience, 9(12):1351–1376, 1997.
[35] G. Zheng, A. Bhatele, E. Meneses, and L. V. Kale. Periodic Hierarchical Load Balancing for Large Supercomputers. IJHPCA, March 2011.