DDE: A Modified Dimension Exchange Method for Load Balancing in k-ary n-cubes

Min-You Wu and Wei Shu
Department of Computer Science
State University of New York at Buffalo
Buffalo, NY 14260
wu,shu@cs.buffalo.edu

Abstract

The dimension exchange method (DEM) was initially proposed as a load-balancing algorithm for the hypercube structure. It has been generalized to k-ary n-cubes. However, the k-ary n-cube algorithm must take many iterations to converge to a balanced state. In this paper, we propose a direct method to modify DEM. The new algorithm, the Direct Dimension Exchange (DDE) method, takes the load average in every dimension to eliminate unnecessary load exchange. It balances the load directly without iteratively exchanging the load. It is able to balance the load more accurately and much faster.

1. Introduction

The dimension exchange method (DEM) was initially proposed as a fully load-balancing algorithm for the hypercube structure [5, 1]. It balances the load for independent tasks on distributed memory machines. The experiments carried out by Willebeek-LeMair and Reeves confirmed that DEM is superior to other scheduling methods [7]. DEM for the hypercube network is a simple algorithm. Load balancing is performed iteratively in each of the $\log N$ dimensions, in which only node pairs exchange their load information and attempt to average the number of tasks. After a sweep ($\log N$ iterations), the load is balanced.

Unfortunately, when DEM is applied to other structures, such as the mesh or the k-ary n-cube, it takes many sweeps to converge to the balanced load. Hosseini et al. extended it to arbitrary structures using the technique of edge-coloring of graphs [3]. Xu and Lau proposed the generalized dimension exchange (GDE) method [9]. The GDE method was extended to the k-ary n-cube network [10]. Because a node exchanges workload with only one of its neighbors at a time, GDE is not able to reach the balanced state in one sweep. The number of sweeps for convergence is linearly proportional to the number of nodes in a chain, and hence to the dimension order k of the k-ary n-cube structure.
We present a direct method for the k-ary n-cube, called the Direct Dimension Exchange (DDE) method. Unlike iterative algorithms, this direct method can balance the load in one sweep. The load in a chain is fully balanced by utilizing the total number of tasks, which can be easily obtained by a sum reduction. Each node in the chain knows whether it is overloaded or underloaded and subsequently exchanges workload with other nodes. The DDE method can be applied to two or more dimensions to balance the load for the mesh, the torus, and the k-ary n-cube.

This paper is organized as follows. Section 2 briefly reviews the DEM and the GDE algorithms. Then, the direct method for the chain and the ring structures is described in Sections 3 and 4, respectively. The algorithm for the k-ary n-cube is presented in Section 5. In Section 6, the direct method is compared to the GDE method. Section 7 concludes the paper.

2. The DEM and GDE Algorithms

The goal of load balancing is to schedule work so that each processor has the same workload. To achieve this goal, an estimation of the task execution time is needed, which can be done either by a programmer or by a compiler. Sometimes the estimation is application-specific, and sometimes it is impossible to obtain such an estimation. Due to these difficulties, each task is presumed to require equal execution time, and the goal of the algorithm is to schedule tasks so that each processor has the same number of tasks.

The scheduling problem can be described as follows. In a parallel or distributed system, $N$ computing nodes are connected by a given topology. Each node $i$ has $w_i$ tasks. A scheduling algorithm redistributes tasks so that the number of tasks in each node is equal. Assume the sum of $w_i$ over all nodes can be evenly divided by $N$. The average number of tasks $w_{avg}$ is calculated as

$$w_{avg} = \frac{\sum_{i=0}^{N-1} w_i}{N}.$$

Each node should have $w_{avg}$ tasks after scheduling.

DEM was designed for the hypercube structure. In DEM, small domains are balanced first and then combined to form larger domains until ultimately the entire system is balanced. The "integer version" of DEM is described in Figure 1. All node pairs in the first dimension, whose addresses differ in only the least significant bit, balance the load between themselves. Next, all node pairs in the second dimension balance the load between themselves, and so forth, until each node has balanced its load with each of its neighbors. After execution of the DEM algorithm, the load difference

$$D = \max(w_i) - \min(w_i)$$

is bounded by $n$, the dimension of the hypercube [3]. The number of communication steps of the DEM algorithm is $3n$ [7].

    DEM
    for l = 0 to n-1
        node i exchanges with node j the current values of $w_i$ and $w_j$, where $j = i \oplus 2^l$
        if $(w_i - w_j) > 1$, send $\lfloor (w_i - w_j)/2 \rfloor$ tasks to node j
        if $(w_j - w_i) > 1$, receive $\lfloor (w_j - w_i)/2 \rfloor$ tasks from node j
        $w_i = \lceil (w_i + w_j)/2 \rceil$ if $w_i > w_j$; $w_i = \lfloor (w_i + w_j)/2 \rfloor$ otherwise

Figure 1: The DEM algorithm for the hypercube.

The GDE algorithm operates on color graphs derived from edge-coloring of the given system graph. The "integer version" of the algorithm is shown in Figure 2. A node finishes a complete sweep after $c$ consecutive exchange operations, where $c$ is the number of colors. In k-ary n-cubes, $c = 2n$ if $k$ is an even number. The termination condition is that the difference in the number of tasks between neighboring nodes is less than or equal to one. The convergence rate depends on the exchange parameter $\lambda$. Its value varies for different topologies and different network sizes. For the hypercube, the optimal $\lambda = 1/2$, and GDE is equivalent to the original DEM algorithm. For other topologies, $\lambda$ is to be optimized to maximize the convergence rate. For the k-ary n-cube, the load difference between any pair of nodes is bounded by $nk/2$. The convergence rate decreases when the dimension order $k$ increases. There is no communication conflict in this algorithm.

    GDE
    while (not terminate)
        for l = 1 to c
            for edge colored l connecting nodes i and j
                node i exchanges with node j the current values of $w_i$ and $w_j$
                if $(w_i - w_j) > 1$, send $\lfloor \lambda (w_i - w_j) \rfloor$ tasks to node j
                if $(w_j - w_i) > 1$, receive $\lfloor \lambda (w_j - w_i) \rfloor$ tasks from node j
                $w_i = \lceil (1-\lambda) w_i + \lambda w_j \rceil$ if $w_i > w_j$; $w_i = \lfloor (1-\lambda) w_i + \lambda w_j \rfloor$ otherwise

Figure 2: The GDE algorithm for the k-ary n-cube.

Willebeek-LeMair and Reeves suggested another approach to extend DEM to an $M \times M$ mesh topology by "folding" the mesh in each dimension $\lceil \log M \rceil$ times [7]. This method could be applied
to k-ary n-cubes too. The load difference is bounded by $n \lceil \log k \rceil$. However, in this approach, node pairs would no longer be directly linked to one another and communications would conflict.

3. The DDE Method for the Chain

Instead of using the GDE method, which balances the load iteratively, we propose a direct method. The workload in a chain can be balanced directly. The basic idea is to calculate the total number of tasks in the chain and the average number of tasks per node. Thus, nodes in the chain can exchange tasks to balance the load.

    DDE-chain
    Let $w_i$ be the number of tasks in node i, where $i = 0, 1, \ldots, k-1$.
    1. Global Information Collection: Perform the scan with sum operation of $w_i$:
           $W_i = \sum_{l=i}^{k-1} w_l$
    2. Average Load Calculation: $T = W_0$, $w_{avg} = \lfloor T/k \rfloor$, and $R = T \bmod k$, where $T$ is the total number of tasks.
    3. Quota Calculation: The quota of each node $q_i$ is computed:
           $q_i = w_{avg} + 1$ if $i < R$; $q_i = w_{avg}$ otherwise
       Also, an accumulation quota for each node is computed:
           $Q_i = \sum_{l=i}^{k-1} q_l$
    4. Flow Calculation: $x_{i-1,i} = Q_i - W_i$, for $i = 1, 2, \ldots, k-1$, where $x_{i,j}$ is the flow on edge $(i, j)$.

Figure 3: The DDE algorithm for the chain.

The DDE algorithm for the chain shown in Figure 3 is its "integer version." It takes as input the node weight $w_i$ ($i = 0, 1, \ldots, k-1$) and outputs the calculated flow $x_{i-1,i}$ ($i = 1, 2, \ldots, k-1$) for every edge in the chain. The first step is to obtain the total number of tasks in the chain by using the scan with sum operation from node $k-1$ to node 0, where $k$ is the length of the chain. Each node records a partial sum $W_i = \sum_{l=i}^{k-1} w_l$. The second step calculates the average number of tasks per node at node 0. If the number of tasks cannot be evenly divided by $k$, the remaining
$R$ tasks are distributed to the first $R$ nodes so that they have one more task than the others. The values of $w_{avg}$ and $R$ are broadcast to every node. In the third step, each node calculates its quota. The accumulation quota $Q_i$ can be calculated directly as follows:

$$Q_i = w_{avg}(k - i) + \max(0, R - i).$$

Each node keeps records of $Q_i$, $W_i$, $Q_j$, and $W_j$, where $j = i + 1$. In the fourth step, the flow is calculated by taking the difference between $Q_i$ and $W_i$. Node $i$ calculates $x_{i-1,i}$ and $x_{i,i+1}$. When the flow is available, the workload is exchanged so that each node has the same number of tasks as its quota.

Example 1:

An example is shown in Figure 4. At the beginning of scheduling, each node has $w_i$ tasks ready to be scheduled. Values of $W_i$ are calculated in step 1. Node 0 calculates the values of $w_{avg}$ and $R$:

$$w_{avg} = 4; \quad R = 5.$$

Then, each node calculates the value of $Q_i$ in step 3. The values of $w_i$, $W_i$, $Q_i$, and $x_{i-1,i}$ are as shown below:

    i    w_i    W_i    Q_i    x_{i-1,i}
    0     9     37     37        -
    1     7     28     32        4
    2     4     21     27        6
    3     1     17     22        5
    4     4     16     17        1
    5     6     12     12        0
    6     1      6      8        2
    7     5      5      4       -1

After task exchange, nodes 0 to 4 have five tasks each, and nodes 5 to 7 have four tasks each.

[Figure 4: Example for DDE-chain. An eight-node chain (i = 0 to 7) annotated with the loads $w_i$ and the edge flows $x_{i-1,i}$ listed in the table above.]

Lemma 1: After execution of DDE and task exchange, the number of tasks in each node is equal to its quota.
Proof: After execution of DDE and task exchange, the number of tasks in node $i$ is

$$w'_i = w_i + x_{i-1,i} - x_{i,i+1}.$$

Because $x_{i-1,i} = Q_i - W_i$, $x_{i,i+1} = Q_{i+1} - W_{i+1}$, $W_{i+1} = W_i - w_i$, and $Q_{i+1} = Q_i - q_i$,

$$w'_i = w_i + (Q_i - W_i) - (Q_{i+1} - W_{i+1}) = Q_i - Q_{i+1} = q_i. \qquad \Box$$

In this algorithm, steps 1 and 2 take $2k$ communication steps. The number of communication steps in step 4 is at most $k$. Therefore, the total number of communication steps of this algorithm is no more than $3k$. This algorithm can be further improved by selecting node $k/2$ as the root and applying the TWA algorithm in [6]. Thus, the total number of communication steps of this algorithm can be reduced to $2k$. When $T$ is evenly divided by $k$, this algorithm minimizes the total number of task transfers and the total number of communications. This algorithm also maximizes locality. That is, it minimizes the number of tasks that are migrated to other nodes.

The workload is exchanged according to the flow generated by DDE. There are two task-exchange algorithms. The first one, called receive-before-send, is shown in Figure 5.

    Receive-before-send
    For node i
    1. if $i > 0$ and $x_{i-1,i} > 0$, wait to receive $x_{i-1,i}$ tasks from node i-1
    2. if $i < k-1$ and $x_{i,i+1} < 0$, wait to receive $|x_{i,i+1}|$ tasks from node i+1
    3. if $i > 0$ and $x_{i-1,i} < 0$, send $|x_{i-1,i}|$ tasks to node i-1
    4. if $i < k-1$ and $x_{i,i+1} > 0$, send $x_{i,i+1}$ tasks to node i+1

Figure 5: Task exchange: receive-before-send.

Using the receive-before-send algorithm, the load exchange in Example 1 takes four communication steps to finish:

    (1) node 0 to node 1, node 5 to node 6, node 7 to node 6
    (2) node 1 to node 2
    (3) node 2 to node 3
    (4) node 3 to node 4
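The flow calculation of Figure 3 and the receive-before-send schedule of Figure 5 can be traced on Example 1 with a short Python sketch. The function names (`dde_chain`, `receive_before_send_rounds`) are illustrative and not from the paper, and the sketch assumes that all tasks on one edge move in a single communication step and that every node whose receives are complete sends in the same step.

```python
def dde_chain(w):
    """Quotas and edge flows for a chain; flows[i] is the flow on edge (i, i+1)."""
    k, total = len(w), sum(w)
    w_avg, r = divmod(total, k)
    quota = [w_avg + 1 if i < r else w_avg for i in range(k)]
    flows = []
    W = Q = total  # W_0 and Q_0 both equal the total T
    for i in range(k - 1):
        W -= w[i]            # W_{i+1} = W_i - w_i
        Q -= quota[i]        # Q_{i+1} = Q_i - q_i
        flows.append(Q - W)  # x_{i,i+1} = Q_{i+1} - W_{i+1}
    return quota, flows


def receive_before_send_rounds(flows):
    """Group transfers into steps; a node sends only after all its receives."""
    transfers = set()
    for i, x in enumerate(flows):
        if x > 0:
            transfers.add((i, i + 1))   # sender i, receiver i+1
        elif x < 0:
            transfers.add((i + 1, i))   # sender i+1, receiver i
    done, rounds = set(), []
    while len(done) < len(transfers):
        ready = [
            e for e in transfers
            if e not in done
            and all(t in done for t in transfers if t[1] == e[0])
        ]
        if not ready:
            raise RuntimeError("cyclic wait; not a valid chain flow")
        rounds.append(sorted(ready))
        done.update(ready)
    return rounds


loads = [9, 7, 4, 1, 4, 6, 1, 5]  # the loads of Example 1
quota, flows = dde_chain(loads)
rounds = receive_before_send_rounds(flows)
for step, moves in enumerate(rounds, start=1):
    print(step, moves)
```

Under these assumptions the sketch reproduces the flow column of Example 1 and groups the transfers into the same four steps listed above.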
In the receive-before-send algorithm, each node must receive an incoming message, if any, before sending out messages. By relaxing this constraint, a send-before-receive algorithm is obtained, shown in Figure 6. In this algorithm, a node can start sending messages out before it has received an incoming message. The communication time and processor idle time can be reduced. It takes only two communication steps for Example 1:

    1) node 0 to node 1, node 1 to node 2, node 3 to node 4, node 5 to node 6, node 7 to node 6
    2) node 2 to node 3

    Send-before-receive
    For node i
    let $a_i = x_{i-1,i}$; $b_i = x_{i,i+1}$
    while ($a_i \neq 0$ or $b_i \neq 0$)
        1. if $i > 0$ and ($w_i \geq -a_i > 0$), send $|a_i|$ tasks to node i-1, and let $w_i = w_i + a_i$, $a_i = 0$
        2. if $i < k-1$ and ($w_i \geq b_i > 0$), send $b_i$ tasks to node i+1, and let $w_i = w_i - b_i$, $b_i = 0$
        3. if $i > 0$ and $a_i > 0$ and received $a_i$ tasks from node i-1, let $w_i = w_i + a_i$, $a_i = 0$
        4. if $i < k-1$ and $b_i < 0$ and received $|b_i|$ tasks from node i+1, let $w_i = w_i - b_i$, $b_i = 0$

Figure 6: Task exchange: send-before-receive.

The send-before-receive algorithm may have some negative impact on locality. In the receive-before-send algorithm, a node can keep the maximum number of local tasks and send non-local tasks to other nodes. But in the send-before-receive algorithm, a node may send local tasks to other nodes and then receive tasks from others. Therefore, the choice between the receive-before-send and send-before-receive algorithms is a trade-off between communication time and locality.

Most massively parallel computers use wormhole routing, with which the effect of path length on communication time can often be ignored. The recursive doubling algorithm [2] can take advantage of the pipeline effect of wormhole routing while avoiding channel contention. This algorithm organizes the nodes in a chain into a tree. An example of eight nodes is shown in Figure 7. Applying the TWA algorithm in [6] to the tree, the load can be balanced within $4 \log k$ communication steps.
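The send-before-receive exchange can be traced in the same way. The following Python sketch is a round-based simulation, not the authors' implementation: the name `send_before_receive_rounds` is illustrative, the flows are hard-coded from Example 1, and it assumes a node may send on an edge as soon as it currently holds at least as many tasks as the edge requires.

```python
def send_before_receive_rounds(loads, flows):
    """Simulate send-before-receive on a chain; flows[i] is the flow on edge (i, i+1)."""
    w = list(loads)
    pending = {}
    for i, x in enumerate(flows):
        if x > 0:
            pending[(i, i + 1)] = x    # node i owes x tasks to node i+1
        elif x < 0:
            pending[(i + 1, i)] = -x   # node i+1 owes |x| tasks to node i
    rounds = []
    while pending:
        budget = list(w)  # tasks held at the start of this step
        sent = []
        for (src, dst), amount in sorted(pending.items()):
            if budget[src] >= amount:  # enough tasks on hand to send now
                budget[src] -= amount
                sent.append((src, dst))
        if not sent:
            raise RuntimeError("no node can send; flows are infeasible")
        for src, dst in sent:
            amount = pending.pop((src, dst))
            w[src] -= amount
            w[dst] += amount
        rounds.append(sent)
    return rounds, w


loads = [9, 7, 4, 1, 4, 6, 1, 5]   # Example 1
flows = [4, 6, 5, 1, 0, 2, -1]     # x_{i,i+1} from Example 1
rounds, final = send_before_receive_rounds(loads, flows)
for step, moves in enumerate(rounds, start=1):
    print(step, moves)
print(final)
```

In this trace node 2 holds only four tasks at the start and must wait for the six tasks from node 1 before it can forward five to node 3, which is why a second step is needed.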
[Figure 7: The tree for recursive doubling (eight nodes, i = 0 to 7).]

4. The DDE Algorithm for the Ring

A ring can be obtained by adding an end-round connection to a chain. The DDE-chain algorithm can be applied to the ring by ignoring the end-round edge. The load can be balanced; however, the communication may not be minimal. By utilizing the end-round edge, communication can be reduced. We describe an algorithm to minimize the total number of tasks transferred. The algorithm is derived from the minimum cost flow algorithm [4] and is shown in Figure 8. In this algorithm, an initial solution is obtained by using DDE-chain without considering the end-round edge. Then, an augmentation is applied to obtain an optimal solution. The complexity of this algorithm is $O(k \log k)$.

    DDE-ring
    Apply DDE-chain to the ring without considering the end-round edge $(k-1, 0)$ to obtain
    $x_{0,1}, x_{1,2}, \ldots, x_{k-2,k-1}$, where $x_{i,j}$ is the flow on edge $(i, j)$. Let $x_{k-1,0}$ be 0.
    If the flow is clockwise, $x_{i,j}$ is positive; otherwise, it is negative.
    Let $n_p$ be the number of edges with $x_{i,j} > 0$, $n_n$ the number of edges with $x_{i,j} < 0$, and $n_z$ the number of edges with $x_{i,j} = 0$.
    1. If $n_n + n_z - n_p < 0$, let $x_m$ be the m-th largest $x_{i,j}$ from all $x_{i,j} > 0$; and if $n_p + n_z - n_n < 0$, let $x_m$ be the m-th smallest $x_{i,j}$ from all $x_{i,j} < 0$, where $m = \lceil k/2 \rceil$.
    2. For each edge, $x_{i,j} = x_{i,j} - x_m$.

Figure 8: The DDE algorithm for the ring.

We can use either the receive-before-send or the send-before-receive algorithm for task exchange of DDE-ring. Here, we let $x_{-1,0} = x_{k-1,0}$.

The following lemma shows that this algorithm minimizes the total cost of flow, that is, the number of tasks transferred.

Lemma 2: After execution of DDE-ring, the total network flow is of minimum cost.

Proof: If $n_p + n_z - n_n \geq 0$ and $n_n + n_z - n_p \geq 0$, there is no flow augmenting cycle with negative cost. Therefore, the network flow is of minimum cost [4].

If $n_n + n_z - n_p < 0$, after the modification $x_{i,j} = x_{i,j} - x_m$, we have

$$n'_z + n'_p \geq m.$$

Note that $n'_n + n'_z + n'_p = k$. Then,

$$n'_z + n'_p - n'_n \geq m - n'_n = m - (k - n'_z - n'_p) \geq 2m - k.$$

Because $m = \lceil k/2 \rceil$,

$$n'_z + n'_p - n'_n \geq 2m - k \geq 0.$$

We also have

$$n'_n + n'_z \geq k - m + 1.$$

Then,

$$n'_n + n'_z - n'_p \geq k - m + 1 - n'_p = k - m + 1 - (k - n'_n - n'_z) = 1 - m + n'_n + n'_z \geq 1 - m + k - m + 1 = k - 2m + 2.$$

Because $m = \lceil k/2 \rceil$,

$$n'_n + n'_z - n'_p \geq k - 2m + 2 \geq 0.$$

Thus, the network flow is of minimum cost.

The case of $n_p + n_z - n_n < 0$ can be proved similarly. Thus, the network flow is of minimum cost in all cases. $\Box$

An example is shown in Figure 9. An end-round edge is added to the chain in Figure 4 to construct a ring. Applying the DDE-chain algorithm to the ring without considering the
end-round edge, the flow is shown in Figure 9(a). The number of tasks transferred is 19. The augmentation is applied to this flow: $n_p = 1$; $n_z = 2$; $n_n = 5$. Because $n_p + n_z - n_n < 0$ and the 4th smallest $x_{i,j}$ is $-2$, $-2$ is subtracted from every $x_{i,j}$. The result is shown in Figure 9(b). The number of tasks transferred is reduced to 17.

[Figure 9: Example for DDE-ring. (a) Flow obtained by DDE-chain without the end-round edge. (b) Flow after augmentation.]

5. The DDE Method for the k-ary n-cube

With the DDE-ring algorithm, it is not difficult to compose a DDE algorithm for the k-ary n-cube. The algorithm is shown in Figure 10. In iteration $l$ of the DDE algorithm, subcube $S^l_m$ is divided into $k$ partitions $S^{l+1}_{km+b}$, where $m = 0, 1, \ldots, k^l - 1$ and $b = 0, 1, \ldots, k-1$. $S^l_m$ has $k^{n-l}$ nodes. The nodes in each ring exchange their load, and then each node $i$ has $w^{l+1}_i$ tasks. After executing DDE, node $i$ will have $w^n_i$ tasks. The task exchange step can use either the receive-before-send or the send-before-receive algorithm.

This algorithm can be applied to the n-dimensional torus, which allows a different number of nodes in different dimensions. Stripping a torus of all its end-round connections yields a mesh. The algorithm can be applied to the mesh by performing the DDE-chain algorithm instead of DDE-ring in each step.

The following theorem shows that the load difference of DDE is bounded by $n$, the dimension of the k-ary n-cube.
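The per-dimension procedure of this section, built on the ring augmentation of Section 4, can be sketched in Python for the two-dimensional case. This is a minimal sketch under stated assumptions rather than the authors' implementation: the helper names are illustrative, a flow is taken as positive when it moves from node i to node (i+1) mod k (which may be the opposite of the clockwise convention in Figure 9), only a k x k torus is handled, and the 4 x 4 load grid is made up for demonstration.

```python
import math


def chain_flows(w):
    """DDE-chain: flows[i] is the flow on edge (i, i+1), for i = 0..k-2."""
    k, total = len(w), sum(w)
    w_avg, r = divmod(total, k)
    flows, W, Q = [], total, total
    for i in range(k - 1):
        W -= w[i]                         # W_{i+1}
        Q -= w_avg + (1 if i < r else 0)  # Q_{i+1}
        flows.append(Q - W)
    return flows


def ring_flows(w):
    """DDE-ring: chain flows plus a zero end-round edge, then augmentation."""
    k = len(w)
    x = chain_flows(w) + [0]  # x[k-1] is the end-round edge (k-1, 0)
    n_p = sum(v > 0 for v in x)
    n_n = sum(v < 0 for v in x)
    n_z = k - n_p - n_n
    m = math.ceil(k / 2)
    if n_n + n_z - n_p < 0:
        x_m = sorted((v for v in x if v > 0), reverse=True)[m - 1]
    elif n_p + n_z - n_n < 0:
        x_m = sorted(v for v in x if v < 0)[m - 1]
    else:
        x_m = 0  # already of minimum cost
    return [v - x_m for v in x]


def balance_ring(w):
    """Apply the ring flows: w'_i = w_i + x_{i-1,i} - x_{i,i+1}."""
    x = ring_flows(w)
    return [w[i] + x[i - 1] - x[i] for i in range(len(w))]  # x[-1] is the end-round edge


def dde_torus_2d(load):
    """One DDE sweep on a k x k torus: rings along rows, then along columns."""
    k = len(load)
    rows = [balance_ring(list(r)) for r in load]
    cols = [balance_ring([rows[i][j] for i in range(k)]) for j in range(k)]
    return [[cols[j][i] for j in range(k)] for i in range(k)]


example_ring = [9, 7, 4, 1, 4, 6, 1, 5]  # the ring built from Example 1
ring_cost = sum(abs(v) for v in ring_flows(example_ring))

grid = [[17, 3, 0, 15], [6, 12, 6, 16], [2, 13, 5, 5], [6, 8, 5, 16]]  # made-up loads
balanced = dde_torus_2d(grid)
spread = max(map(max, balanced)) - min(map(min, balanced))
print(ring_cost, spread)
```

Subtracting the same value from every edge of a ring leaves each node's net inflow unchanged, which is why the augmentation changes only the transfer cost and not the final loads.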
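    DDE for k-ary n-cube
    Assume a k-ary n-cube $S^0$, the number of nodes is $k^n$, and node i has $w^0_i$ tasks.
    for l = 0 to n-1
        apply the DDE-ring algorithm to $k^{n-1}$ rings in the l-th dimension independently,
            where every ring has k nodes $(a_0, a_1, \ldots, a_l, \ldots, a_{n-1})$ and $a_l = 0, 1, \ldots, k-1$
        exchange tasks according to the flow
        each node updates its weight $w^{l+1}_i = w^l_i + x_{i-1,i} - x_{i,i+1}$

Figure 10: The DDE algorithm for the k-ary n-cube.

Theorem 1: After execution of DDE, the load difference

$$D = \max(w^n_i) - \min(w^n_i)$$

is bounded by $n$.

Proof: In the $l$-th step of DDE, a k-ary $(n-l)$-cube is partitioned into $k$ k-ary $(n-l-1)$-cubes. The difference in the number of tasks between two partitions is maximal when, in each ring, the node in the first partition, say $S^{l+1}_{km}$, has one more task than the nodes in the other partitions $S^{l+1}_{km+b}$, where $b = 1, 2, \ldots, k-1$. Thus

$$\sum_{j \in S^{l+1}_{km}} w^{l+1}_j = \sum_{j \in S^{l+1}_{km+k-1}} w^{l+1}_j + |S^{l+1}_{km}| = \frac{1}{k-1}\Big(\sum_{j \in S^l_m} w^l_j - \sum_{j \in S^{l+1}_{km}} w^{l+1}_j\Big) + k^{n-l-1},$$

where $|S^{l+1}_{km}|$ denotes the number of nodes in subcube $S^{l+1}_{km}$, which is $k^{n-l-1}$. Therefore,

$$\sum_{j \in S^{l+1}_{km}} w^{l+1}_j = \frac{1}{k} \sum_{j \in S^l_m} w^l_j + (k-1) k^{n-l-2}.$$

Similarly,

$$\sum_{j \in S^{l+1}_{km+k-1}} w^{l+1}_j = \sum_{j \in S^{l+1}_{km}} w^{l+1}_j - |S^{l+1}_{km}| = \Big(\sum_{j \in S^l_m} w^l_j - (k-1) \sum_{j \in S^{l+1}_{km+k-1}} w^{l+1}_j\Big) - k^{n-l-1}.$$

Therefore,

$$\sum_{j \in S^{l+1}_{km+k-1}} w^{l+1}_j = \frac{1}{k} \sum_{j \in S^l_m} w^l_j - k^{n-l-2}.$$

Let

$$A^l_{max} = \max_{0 \leq m < k^l} \sum_{j \in S^l_m} w^l_j$$

and

$$A^l_{min} = \min_{0 \leq m < k^l} \sum_{j \in S^l_m} w^l_j.$$

When $l = 0$,

$$A^0_{max} = A^0_{min} = \sum_{j \in S^0} w^0_j = \sum_{0 \leq j < k^n} w^0_j = T,$$

where $T$ is the total number of tasks. Thus,

$$A^l_{max} = \begin{cases} T & \text{if } l = 0 \\ \frac{1}{k} A^{l-1}_{max} + (k-1) k^{n-l-1} & \text{otherwise} \end{cases} \qquad (1)$$

Similarly,

$$A^l_{min} = \begin{cases} T & \text{if } l = 0 \\ \frac{1}{k} A^{l-1}_{min} - k^{n-l-1} & \text{otherwise} \end{cases} \qquad (2)$$

The solution to the above recurrences is given by

$$A^l_{max} = \frac{T}{k^l} + (k-1)\, l\, k^{n-l-1} \qquad (3)$$

$$A^l_{min} = \frac{T}{k^l} - l\, k^{n-l-1} \qquad (4)$$

These clearly satisfy (1) and (2) for the basis, $l = 0$. If (3) satisfies (1) for $l = m$, then

$$A^{m+1}_{max} = \frac{1}{k} A^{(m+1)-1}_{max} + (k-1) k^{n-(m+1)-1} = \frac{1}{k}\Big(\frac{T}{k^m} + (k-1)\, m\, k^{n-m-1}\Big) + (k-1) k^{n-(m+1)-1} = \frac{T}{k^{m+1}} + (m+1)(k-1) k^{n-(m+1)-1}.$$

Therefore, it satisfies (1) for $l = m + 1$. Thus, by induction on $l$ we have shown that (3) satisfies (1) whenever $l \geq 0$. Similarly, it can be shown that (4) satisfies (2) whenever $l \geq 0$.

Let $l = n$:

$$A^n_{max} = \max_{0 \leq j < k^n} w^n_j = \frac{T}{k^n} + \frac{k-1}{k}\, n,$$

$$A^n_{min} = \min_{0 \leq j < k^n} w^n_j = \frac{T}{k^n} - \frac{1}{k}\, n.$$

Because $D = A^n_{max} - A^n_{min} = \big(\frac{k-1}{k} + \frac{1}{k}\big) n = n$, the number of tasks in any two processors differs by at most $n$. $\Box$

An example is shown in Figure 11. This is a 4-ary 2-cube (i.e., a $4 \times 4$ torus). Figure 11(a) shows the DDE-ring algorithm applied to each ring in the first dimension. Then, DDE-ring is applied to each ring in the second dimension, as shown in Figure 11(b). The resultant load distribution is shown in Figure 11(c). The maximum load difference is 2.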
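The closed forms (3) and (4) can also be checked numerically against the recurrences (1) and (2). The short Python sketch below does this with exact rational arithmetic; the function name `check_recurrence` and the sample values of T, k, and n are arbitrary choices for illustration, not values from the paper.

```python
from fractions import Fraction


def check_recurrence(T, k, n):
    """Iterate (1) and (2), compare with (3) and (4), and return A^n_max - A^n_min."""
    a_max = a_min = Fraction(T)  # basis l = 0
    for l in range(1, n + 1):
        step = Fraction(k) ** (n - l - 1)      # k^(n-l-1), may be fractional
        a_max = a_max / k + (k - 1) * step     # recurrence (1)
        a_min = a_min / k - step               # recurrence (2)
        closed_max = Fraction(T, k ** l) + (k - 1) * l * step  # closed form (3)
        closed_min = Fraction(T, k ** l) - l * step            # closed form (4)
        assert a_max == closed_max
        assert a_min == closed_min
    return a_max - a_min


for T, k, n in [(1000, 4, 3), (37, 8, 1), (123456, 16, 2)]:
    print(T, k, n, check_recurrence(T, k, n))
```

For each sample the returned difference equals n, in line with the bound of Theorem 1.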
[Figure 11: Example for DDE (4-ary 2-cube). (a) DDE-ring applied in the first dimension. (b) DDE-ring applied in the second dimension. (c) Resultant load distribution.]

6. Experimental Results

In this section, we compare the performance of GDE and DDE. We consider a test set of load distributions in which the load at each processor is randomly selected with the mean equal to a specified value. In this simulation experiment, the average number of tasks (average weight) per processor is 1,000. Each result is the average of 100 test cases. We tested an $8 \times 8$ mesh, a $16 \times 16$ torus, an $8 \times 8 \times 8$ 3D-mesh, and a $16 \times 16 \times 16$ 3D-torus. For these networks, the optimal value of $\lambda$ for GDE is 0.723 [10].

First, we compare the load imbalance of GDE and DDE. The load difference of DDE is bounded by $n$, whereas that of GDE is bounded by $n(k-1)$ for the mesh and $nk/2$ for the torus. Figure 12 shows the average load difference in different networks. Here, the load difference of GDE is four to six times larger than that of DDE.
[Figure 12: Load difference of GDE and DDE for the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.]

DDE completes load balancing in one sweep, but GDE needs many sweeps. Figure 13 shows the number of sweeps $s$ for different networks. The value of $s$ is proportional to $k$ [10]. Moreover, $s$ increases with the average weight. Table I shows the relationship between the number of sweeps and the average number of tasks, measured on an $8 \times 8$ mesh.

[Figure 13: The number of sweeps of GDE and DDE for the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.]

Next, we compare the number of communication steps of GDE and DDE. For GDE, each sweep has $c$ iterations, where $c$ is the number of colors. For an even $k$, $c = 2n$. Each iteration has three communications, two for exchanging load information and one for load balancing. Therefore, the total number of communications of $s$ sweeps is $3sc = 6sn$. For DDE, there are $k$ communication steps in each dimension for collection and broadcasting of load information. Load balancing needs at most $k-1$ and $k/2$ communication steps for the mesh and the torus, respectively. Therefore, $2kn$ or $\frac{3}{2}kn$ communication steps in total are required. DDE can reduce the number of communication steps significantly. The analysis has been confirmed by the experiment, as shown in Figure 14.
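As a rough worked instance of these formulas, the Python sketch below evaluates $6sn$ for GDE and $2kn$ or $\frac{3}{2}kn$ for DDE. The function names are illustrative; the sweep count 11.08 is the Table I entry for an average of 1,000 tasks on the $8 \times 8$ mesh, and the mesh case rounds $k-1$ up to $k$ as the text does.

```python
def gde_steps(sweeps, n):
    """3 communications per iteration and c = 2n iterations per sweep: 3*s*c = 6*s*n."""
    return 6 * sweeps * n


def dde_steps(k, n, torus):
    """k steps per dimension for collection and broadcast, plus the exchange bound."""
    exchange = k / 2 if torus else k  # the text rounds k - 1 up to k for the mesh
    return (k + exchange) * n


# 8x8 mesh (k = 8, n = 2) with the Table I sweep count for 1,000 tasks per node.
print("GDE, 8x8 mesh:", gde_steps(11.08, 2))
print("DDE, 8x8 mesh:", dde_steps(8, 2, torus=False))
print("DDE, 16x16 torus:", dde_steps(16, 2, torus=True))
```

The figures produced this way are analytical estimates only and are not the measured values plotted in Figure 14.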
Table I: The Relationship Between the Number of Sweeps and the Average Weight

    Average Number of Tasks     100     300    1,000    3,000    10,000
    Average Number of Sweeps    7.28    9.20   11.08    13.02    14.67

[Figure 14: The number of communication steps of GDE and DDE for the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.]

Figure 15 shows the normalized communication cost of GDE and DDE. The normalized communication cost is defined as the total number of tasks transferred divided by the total number of tasks:

$$\frac{\sum_j e_j}{\sum_i w_i},$$

where $e_j$ is the number of tasks transmitted through edge $j$. The communication cost of GDE is about 50% larger than that of DDE. This is because GDE transfers tasks unnecessarily.

Finally, DDE has better locality than GDE. Figure 16 shows the percentage of local tasks that are not migrated to other nodes. DDE keeps 20% to 50% more tasks local.
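The normalized communication cost can be illustrated on the small ring of Sections 3 and 4. In the Python sketch below the function name `normalized_cost` is illustrative, and the edge flows are hard-coded from Example 1 (chain) and from the augmented ring flow, written with the sign convention that a positive value moves tasks from node i to node i+1.

```python
def normalized_cost(edge_flows, loads):
    """Total tasks moved over all edges divided by the total number of tasks."""
    return sum(abs(e) for e in edge_flows) / sum(loads)


loads = [9, 7, 4, 1, 4, 6, 1, 5]              # Example 1, T = 37
chain_flow = [4, 6, 5, 1, 0, 2, -1]           # DDE-chain, 19 tasks moved
ring_flow = [2, 4, 3, -1, -2, 0, -3, -2]      # DDE-ring, 17 tasks moved

chain_cost = normalized_cost(chain_flow, loads)
ring_cost = normalized_cost(ring_flow, loads)
print(round(chain_cost, 3), round(ring_cost, 3))
```

These two values describe only this eight-node example; they are unrelated to the simulation results in Figure 15.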
[Figure 15: Normalized communication cost of GDE and DDE for the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.]

[Figure 16: The percentage of local tasks for GDE and DDE for the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.]
7. Conclusion

This paper proposed a direct method for load balancing. It extended the DEM algorithm to the k-ary n-cube. Compared to the GDE algorithm, which also extended DEM to the k-ary n-cube, DDE is faster, balances the load well, reduces communications, and keeps better locality.

DDE can be further improved for a more balanced load and fewer communications by extending the mesh walking algorithm [8]. However, DDE retains its simplicity of implementation and delivers satisfactory performance at the same time.

References

[1] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. J. of Parallel Distrib. Comput., 7:279-301, 1989.

[2] M. Barnett et al. Broadcasting on meshes with wormhole routing. Technical Report TR-93-24, Univ. Texas at Austin, 1993.

[3] S. H. Hosseini, B. Litow, M. Malkawi, J. McPherson, and K. Vairavan. Analysis of a graph coloring based distributed load balancing algorithm. Journal of Parallel and Distributed Computing, 10:160-166, 1990.

[4] E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.

[5] S. Ranka, Y. Won, and S. Sahni. Programming a hypercube multicomputer. IEEE Software, pages 69-77, September 1988.

[6] W. Shu and M. Y. Wu. Runtime parallel scheduling for distributed memory computers. In Int'l Conf. on Parallel Processing, pages II.143-150, August 1995.

[7] Marc Willebeek-LeMair and Anthony P. Reeves. Strategies for dynamic load balancing on highly parallel computers. IEEE Trans. Parallel and Distributed Systems, 4(9):979-993, September 1993.

[8] M. Y. Wu and W. Shu. High-performance incremental scheduling on massively parallel computers: a global approach. In Supercomputing '95, December 1995.

[9] C. Z. Xu and F. C. M. Lau. Analysis of the generalized dimension exchange method for dynamic load balancing. Journal of Parallel and Distributed Computing, 16(4):385-393, December 1992.

[10] C. Z. Xu and F. C. M. Lau. The generalized dimension exchange method for load balancing in k-ary n-cubes and variants. Journal of Parallel and Distributed Computing, 24(1):72-85, January 1995.