Nearest neighbor algorithms for load balancing in parallel computers

Chengzhong Xu
Department of Electrical and Computer Engg., Wayne State University, Detroit, MI 48202
czxu@ece.eng.wayne.edu

Burkhard Monien, Reinhard Lüling
Department of Computer Science, University of Paderborn, Germany
{bm,rl}@uni-paderborn.de

Francis C. M. Lau
Department of Computer Science, The University of Hong Kong, Hong Kong
fcmlau@csd.hku.hk

Abstract

With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on localized workload information and manages workload migrations within its neighborhood. This paper compares a couple of fairly well-known nearest neighbor algorithms, the dimension-exchange (DE, for short) and the diffusion (DF, for short) methods, and their several variants: the average dimension-exchange (ADE), the optimally-tuned dimension-exchange (ODE), the local average diffusion (ADF) and the optimally-tuned diffusion (ODF). The measures of interest are their efficiency in driving any initial workload distribution to a uniform distribution and their ability in controlling the growth of the variance among the processors' workloads. The comparison is made with respect to both one-port and all-port communication architectures and in consideration of various implementation strategies including synchronous/asynchronous invocation policies and static/dynamic random workload behaviors. It turns out that the dimension-exchange method outperforms the diffusion method in the one-port communication model. In particular, the ODE algorithm is best suited for statically synchronous implementations of a load balancing process regardless of its underlying communication models. The strength of the diffusion method is in asynchronous implementations in the all-port communication model; the ODF algorithm performs best in that case. The underlying communication networks considered assume the most popular topologies, the mesh and the torus, and their special cases: the hypercube and the k-ary n-cube.
1 Introduction

Massively parallel computers have been shown to be very efficient in solving problems that can be partitioned into tasks with static computation and communication patterns. However, there exists a large class of problems that have unpredictable computational requirements or irregular communication patterns. To solve this kind of problem efficiently in parallel computers, it is necessary to perform load balancing operations at run-time.

The execution of a load balancing procedure requires some means of maintaining a global view of the system and some negotiation mechanism for workload migrations across processors to take place. Every load balancing strategy has to resolve the issues of when to invoke a balancing operation, who makes load balancing decisions according to what information, and how to manage workload migrations between processors. Combining different answers to the above yields a large space of possible designs of load balancing algorithms with widely varying characteristics. Nearest neighbor algorithms are such a class of methods in which processors make decisions based on local information in a decentralized manner and manage workload migrations within the immediate neighborhood [1,2,3,4,5]. Since they would only spread local workload to nearest neighbors, these algorithms can be easily scaled to operate in massively parallel computers of any size, and would tend to preserve the communication locality inherent in the underlying computations. In general, these algorithms are executed iteratively, with the expectation that successive invocations of local load balancing would eventually bring about a global balanced state; hence, they give the flexibility of controlling the balance quality over a spectrum of possibilities, from load sharing (no idle processors coexist with busy processors) to the global balanced state.

Nearest neighbor load balancing algorithms rely on successive approximations to a global uniform distribution, and hence at each operation need only be concerned with the direction of workload migration and the issue of how to apportion excess workloads. Among existing load balancing methods that are characterized by different choices of the direction of workload migration [6], we are interested in the diffusion and the dimension-exchange methods. These two methods have drawn a fair amount of attention in recent years. With the diffusion method, a heavily or lightly loaded processor balances its workload with all of its nearest neighbors simultaneously in a load balancing operation [7,8]. With the dimension-exchange method, a processor in need of load balancing balances its workload successively with its neighbors one at a time, and each time a new workload index is computed, which will be used in the subsequent pairwise balancing [8,5,9]. These two methods are closely related, and they lend themselves particularly well to implementation in two basic communication architectures, the all-port and the one-port models, respectively. The all-port model allows a processor to exchange messages with all its direct neighbors simultaneously in one communication step, while the one-port model restricts a processor to exchange messages with at most one direct neighbor at one time. Both of these models were assumed in much recent research on communication algorithms [10,11]. Although the latest designs of message-passing processors tend to support all-port simultaneous communications, the restrictive one-port model is still valid in existing real parallel computer systems. Since the cost of setting up a communication is fixed, the total time spent in sending d messages to d different ports, assuming the best possible overlapping in time, is still largely determined by d unless the messages are rather long.

The all-port and one-port models favor the diffusion and the dimension-exchange methods, respectively. In a system that supports all-port communications, a load balancing operation using the diffusion method can be completed in one communication step while that using the dimension-exchange method would take d steps. It appears that the diffusion method has an advantage over the dimension-exchange method as far as exploiting the communication bandwidth is concerned. A natural but interesting question is whether the advantage translates into real performance benefits in load balancing or not. The performance of a load balancing algorithm is determined by two measures. One is efficiency, which is reflected by the number of communication steps required by the algorithm to drive an initial workload distribution into a uniform distribution. This measure alone is sufficient for those kinds of problems that need global balancing at run time. However, for the other kinds of applications that need to achieve load sharing rather than global balancing, we need another measure, the balance quality, to reflect the ability of the algorithm in bounding the variance of processors' workloads after performing one or more load balancing operations. The objective of this study is to answer the question concerning the performance of the diffusion and the dimension-exchange methods in different communication models.

In the literature, the diffusion and the dimension-exchange methods have received a lot of attention from both theoretical and experimental researchers. The diffusion method was first modeled using linear system theory by Cybenko [8], and Bertsekas and Tsitsiklis [7]. Cybenko showed that the diffusion method will eventually coerce any initial workload distribution into a global uniform distribution in static situations in which no workloads are generated or consumed during load balancing, and presented an asymptotic bound for the variance of any workload distribution during load balancing in the dynamic situation. Similar convergence results in the static situation were obtained independently by Boillat [12]. Boillat also proved that the diffusion load balancing will converge to a global balanced state in polynomial time. The diffusion method in the dynamic situation was studied by Hong et al. [13], and Qian and Yang [14], as well. They presented a constant bound for the variance of workload distribution when applying the method to some specific structures. The diffusion method is characterized by a parameter which determines the portion of excess workload to be diffused away. Xu and Lau analyzed the effects of the parameter on the efficiency of the diffusion method, and derived its optimal values for the mesh and the torus networks [15].

The dimension-exchange method was conceptually designed for hypercube-structured parallel computers, in which balancing proceeds iteratively in dimensions. At each dimension, a processor balances its workload with that of its neighbor belonging to the dimension. Cybenko showed that regardless of the order of dimensions considered, this simple load balancing method yields a uniform distribution from any initial workload distribution after a round of balancing operations [8]. He also revealed the superiority of the dimension-exchange method over the diffusion method in terms of their efficiencies and balance qualities.

The dimension-exchange method is not limited to hypercube structures. Hosseini et al. applied this method to arbitrary structures based on edge-coloring [16]. Furthermore, Xu and Lau showed that "equal splitting" of the workload in a pairwise balancing operation might not lead to maximum efficiency in most popular structures, such as the mesh and the torus, although it performs best in the hypercube [5,9]. Through introducing an exchange parameter to govern the splitting of workload at every step, they derived the optimal values in closed form for the n-d mesh and torus structures.

The theoretical study of the diffusion and the dimension-exchange methods established their sound mathematical foundation. On the practical side, the benefits of the diffusion method were demonstrated in the context of distributed computations of branch-and-bound algorithms [17,4], and the dimension-exchange method was experimented with in parallel graph partitioning [18] and periodic re-mapping of data parallel computations [19]. Also, Willebeek-LeMair and Reeves [4] compared the results of these two methods in the distributed computation of branch-and-bound algorithms on a hypercube-structured iPSC/2. Their experiments concluded that the speedup due to the dimension-exchange method is better than the speedup due to the diffusion method. This is in agreement with Cybenko's result.

Although the results of both theoretical and experimental study point to the superiority of the dimension-exchange method in hypercubes, it might not be the case for other popular networks. On the other hand, previous theoretical studies of these two methods were mostly on their synchronous implementations in which all processors participate in load balancing operations simultaneously and each processor cannot proceed into the next step until the workload migrations demanded by the current operation have completed. Relatively few results have been obtained on the asynchronous implementations of these methods. Bertsekas and Tsitsiklis proved the convergence property of an asynchronous implementation of the diffusion method [7], and Song extended the result to the case of the total workload being too small to be divided infinitely [20]. Lüling and Monien considered a randomized version of the diffusion method in which a processor in need of load balancing activates an operation among a number of randomly chosen neighbors, and showed that the algorithm will keep the workload difference between any two processors bounded [21]. However, none of these works addressed both the issues of efficiency and balance quality together.

In this paper, we make a comprehensive comparison between the diffusion and the dimension-exchange methods in terms of their efficiency and balancing quality when they are implemented in both one-port and all-port communication models, using synchronous/asynchronous invocation policies, and with static/dynamic random workload behaviors. The communication networks to be considered include the structures of n-d torus and mesh, and their special cases: the ring, the chain, the hypercube and the k-ary n-cube. The mesh and the torus allow different numbers of nodes in different dimensions. A k-ary n-cube is a special case of the n-d torus in that it has k nodes in each dimension [22,23]. The hypercube is a special case of both the n-d mesh and the k-ary n-cube. A hypercube is an n-d mesh having two nodes in each dimension, that is, a 2-ary n-cube. We limit our scope to these structures because they are the most popular choices of topologies in commercial parallel computers [23,24].

Both the dimension-exchange and the diffusion methods are parameterized methods. Their performance is largely influenced by the choice of the parameter values. We focus on two choices of the parameter value in each method: the average dimension-exchange (ADE), the optimally-tuned dimension-exchange (ODE), the local average diffusion (ADF), and the optimally-tuned diffusion (ODF). The optimality here is in terms of the efficiency in static synchronous implementations among various choices of the dimension-exchange and the diffusion parameters. The average versions are the most original versions when the methods were first proposed and are still being employed in real applications today. Our main results are that the dimension-exchange method outperforms the diffusion method in the one-port communication model; in particular, the ODE algorithm is found to be best suited for synchronous implementation in the static situation; and that the dimension-exchange method is superior in synchronous load balancing even under the all-port communication model; the strength of the diffusion method is in asynchronous implementation under the all-port communication model; the ODF algorithm performs best in this case.

The rest of the paper is organized as follows. We first present a generic model of load balancing in Section 2, which provides a framework for the comparison of the load balancing algorithms. Section 3 describes the load balancing algorithms in a unified form. In both Section 4 and Section 5, the algorithms are compared with respect to their implementation using asynchronous and synchronous invocation policies, respectively. Section 6 gives the results from simulations, which verify our theoretical results as well as provide further information on these load balancing algorithms. We conclude in Section 7 with a summary of the results of the comparison between the dimension-exchange and the diffusion methods.

2 A generic model of load balancing

We consider a class of parallel computers which are composed of a finite set of homogeneous processors interconnected by a direct communication network. Processors communicate through passing messages. The communication channels are assumed to be full duplex so that a pair of directly connected (nearest neighbor) processors can send/receive messages simultaneously to/from each other. In addition, we assume that the operations of sending and receiving messages through a channel can take place instantaneously. We represent such a system by a simple connected graph $G = (V, E)$, where $V$ is a set of processors labeled 1 through $N$, and $E \subseteq V \times V$ is a set of edges. Every edge $(i, j) \in E$ corresponds to the communication channel between processors $i$ and $j$. Let $A(i)$ denote the set of nearest neighbors of processor $i$, $d(i) = |A(i)|$ be the degree of processor $i$, and $d$ be the maximum of $d(i)$ for $1 \le i \le N$.

The underlying parallel computation is assumed to comprise a large number of independent processes, which are the basic units of workload. The total number of processes is assumed to be large enough so that the workload of a processor is infinitely divisible. Processes may be dynamically generated, consumed, or migrated due to imbalance as the computation proceeds. We classify the operations into two types: the computational operation and the balancing operation. At any time, a processor can perform a computational operation, a balancing operation, or both operations simultaneously. The concurrent execution of these two operations is possible when processors are capable of multiprogramming or the balancing operation is done in the background by special coprocessors. The workload of processors can be either fixed or varying with time during the load balancing operation, which we refer to as the static and the dynamic situations, respectively.

Let $t$ be a time variable, representing global real time. We quantify the workload of processor $i$ at time $t$ by $w_i^t$ in terms of the number of residing processes. We use integer time to simplify the presentation. The results can be extended readily to continuous time. Let $I(t)$ denote the set of processors performing balancing operations at time $t$. The change of workload of a processor at time $t$ can be modeled by the following equation in the static situation

\[
w_i^{t+1} =
\begin{cases}
f_i(w_i^t, w_j^t \mid j \in A(i)) & i \in I(t) \\
w_i^t & i \notin I(t)
\end{cases}
\tag{1}
\]

and the following equation in the dynamic situation

\[
w_i^{t+1} =
\begin{cases}
f_i(w_i^t, w_j^t \mid j \in A(i)) + \phi_i^{t+1} & i \in I(t) \\
w_i^t + \phi_i^{t+1} & i \notin I(t)
\end{cases}
\tag{2}
\]

where $\phi_i^{t+1}$ denotes the amount of workload generated or finished from time $t$ to $t+1$, and $f_i()$ represents a load balancing operator.
This model is generic because the load balance operator $f_i()$ and the set of processors in load balancing at any time $t$, $I(t)$, are left unspecified. The operator $f_i()$ can be any nearest neighbor load balancing algorithm, including the diffusion and the dimension-exchange methods. The set $I(t)$ is determined by the invocation policy of the load balancing. The choice of $I(t)$ is orthogonal to the load balancing algorithm in that any invocation policy can be used in combination with any load balancing algorithm in implementation. Since a load balancing operation incurs non-negligible overheads, different applications require different invocation policies for a better tradeoff between performance benefits and overheads. In parallel computations using domain decomposition techniques, for example, the computational requirement associated with each portion of a problem domain may change as the computation proceeds. An effective way to reduce the penalty due to load imbalances is to periodically redecompose the problem domain with the aim of achieving a global uniform distribution across the processors. To this end, all processors are required to perform load balancing operations synchronously for a short time period. That is, $I(t) = \{1, 2, \ldots, N\}$ for $t \ge t_0$, where $t_0$ is the instant when the global system state satisfies certain conditions such as those set in [25]. By contrast, the parallel execution of dynamic tree-structured computations usually requires only load sharing, that is, assuring that no idle processors exist while there are other busy processors. Thus, each processor is allowed to invoke a load balancing operation at any time in an asynchronous manner according to its own local workload distribution. A simple policy is to activate a load balancing operation once a processor's workload drops below a preset threshold, $w_{underload}$, i.e., $I(t) = \{i \mid w_i^t < w_{underload}\}$. More sophisticated invocation policies were discussed in [2,4]. In short, we make a distinction between synchronous and asynchronous implementations of load balancing according to their invocation policies. Figure 1 presents one example of these two implementation models in a system of five processors. The dots and the triangles represent the computational operations and the load balancing operations, respectively.

[Figure 1: An illustration of generic models of load balancing. (a) Asynchronous implementation; (b) Synchronous implementation. Each panel plots processors 1 to 5 against time from t to t+15.]

Assume $t = 0$ when processors invoke a synchronous or asynchronous load balancing procedure. We are concerned with subsequent workload distributions resulting from different load balancing algorithms. Denote the overall workload distribution at certain time $t$ by a vector $W^t = (w_1^t, w_2^t, \ldots, w_N^t)$. Denote its corresponding uniform distribution by a vector $\bar{W}^t = (\bar{w}^t, \bar{w}^t, \ldots, \bar{w}^t)$, where $\bar{w}^t = \sum_{i=1}^{N} w_i^t / N$. We define the workload variance, denoted by $\nu^t$, as the deviation of $W^t$ from $\bar{W}^t$; that is,

\[
\nu^t = \|W^t - \bar{W}^t\|^2 = \sum_{i=1}^{N} (w_i^t - \bar{w}^t)^2 .
\]

With the workload variance $\nu^t$, we define the efficiency of a load balancing algorithm as the number of load balancing steps required to reduce the variance of the initial state to a tolerable level in the static situation; and define the balance quality as the bound for the variance which is to be guaranteed by the load balancing procedure in the dynamic situation. Load balancing algorithms will be compared in terms of these two measures under the following assumption. Throughout the paper, $E[\cdot]$ denotes the expected value of a random variable.

Assumption 2.1 Initially, processors' workloads, $w_i^0$, $1 \le i \le N$, are $N$ independent and identically distributed (i.i.d.) random variables with expectation $\mu_0$ and variance $\sigma_0^2$. At any time $t$, $t \ge 0$, processors' workload generation/consumption amounts, $\phi_i^t$, $1 \le i \le N$, are zero in the static situation or i.i.d. random variables with expectation $\mu$ and variance $\sigma^2$ in the dynamic situation.
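As a concrete reading of the definitions in this section, the following Python sketch computes the workload variance and builds the invoker set I(t) for the two invocation policies described above. The function names, the 0-based processor indices and the example numbers are illustrative assumptions, not part of the original model.

```python
def workload_variance(w):
    """Squared deviation of the workload vector from its uniform distribution."""
    mean = sum(w) / len(w)
    return sum((x - mean) ** 2 for x in w)


def synchronous_invokers(w):
    """Synchronous policy: every processor takes part in the balancing step."""
    return set(range(len(w)))


def threshold_invokers(w, w_underload):
    """Asynchronous policy: processors whose workload is below the threshold."""
    return {i for i, x in enumerate(w) if x < w_underload}


# Hypothetical five-processor example (indices are 0-based here).
workloads = [8.0, 1.0, 5.0, 0.0, 6.0]
variance = workload_variance(workloads)
underloaded = threshold_invokers(workloads, w_underload=2.0)
everyone = synchronous_invokers(workloads)
```

With these numbers the mean workload is 4, so the variance is the sum of squared deviations from 4, and only the two processors holding fewer than 2 units would invoke balancing under the threshold policy.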
3 The dimension-exchange and the diffusion methods

This section briefly describes the dimension-exchange and the diffusion methods. Both of them are parameterized load balancing algorithms. We present several instances of these two methods based on different choices of values for their parameters.

3.1 The dimension-exchange method

With the dimension-exchange method, any processor which invokes a load balancing operation balances its workload with its neighbors successively. For a processor $i$, it works in the following way

\[
f_i(): \quad \text{for } (c = 1;\ c \le d(i);\ c{+}{+}) \qquad w_i = w_i + \lambda (w_{j_c} - w_i)
\tag{3}
\]

where $j_c \in A(i)$; and $0 < \lambda < 1$, called the dimension-exchange parameter, is given a fixed value beforehand which determines the fraction of excess workload to be migrated between a pair of processors. The formula says that a balancing operation in the dimension-exchange method comprises $d(i)$ pairwise balancing steps for processor $i$. At each step, processor $i$ balances its workload with one of its neighbors, and uses the new result for the subsequent balancing. Because of the sequential nature of the balancing steps, a load balancing operation requires $d(i)$ communication steps in both the all-port and the one-port communication models.

The efficiency of the dimension-exchange method is determined by the dimension-exchange parameter. A dimension-exchange operation with different choices of the parameter will reduce the workload variance of the system by different degrees. In the following, we present two choices of the parameter which have been suggested as rational choices in the literature.

1. Average dimension-exchange (ADE) equally splits the total workload of a pair of processors; that is, $\lambda = 1/2$. It is a straightforward choice for local balancing at each pairwise operation, and has been favored in hypercube-structured systems [8,26,27].

2. Optimally-tuned dimension-exchange (ODE) takes certain specific parameter values that have the effect of maximizing efficiency in static and synchronous balancing [5,9]. The optimal parameter depends on the topology and the size of the underlying communication network. Let $k = \max\{k_i, 1 \le i \le n\}$ in the $k_1 \times k_2 \times \cdots \times k_n$ mesh and torus. Then, their optimal parameter values were shown, in [9], to be

   $\lambda = 1/(1 + \sin(\pi/k))$ in the mesh,
   $\lambda = 1/(1 + \sin(2\pi/k))$ in the torus.

The dimension-exchange method can be implemented without difficulty in cases where only a few processors that are not close to each other are in need of load balancing at the same time. However, its synchronous implementation requires processors to be coordinated in order to parallelize balancing operations along different communication channels as well as to avoid communication collisions. The potential of parallel efficiency is due to the fact that the execution order of pairwise balancing steps in the operation $f()$ of Eq. (2) is left undefined. The parallelization of pairwise balancing operations can be realized by partitioning the set of edges into a number of subsets such that no two adjoining edges are in the same subset. The pairwise balancing steps along the channels in the same subset can then be performed concurrently without collisions. Such graph partition is equivalent to the problem of edge coloring of graphs [28]. Figure 2 shows examples of color graphs of a mesh and a torus. The numbers in parentheses are the assigned chromatic indices. An alternative approach to parallelizing load balancing operations is random matching, which was used in [29].

[Figure 2: Examples of colored mesh and torus. (a) Colored mesh; (b) Colored torus. Each edge is labeled with its chromatic index, 1 to 4.]

3.2 The diffusion method

With the diffusion method, any processor which invokes a load balancing operation compares its workload with those of its nearest neighbors, and then gives away or takes in a certain amount of workload with respect to each of its nearest neighbors. The diffusion operator in a processor $i$ can be written in the form

\[
f_i() = w_i + \sum_{j \in A(i)} \alpha_{ij} (w_j - w_i)
\tag{4}
\]

where $0 < \alpha_{ij} < 1$, called the diffusion parameter, is predefined to dictate the portion to be migrated between any two processors. Processor $i$ apportions excess workload $\alpha_{ij} |w_i - w_j|$ to processor $j$ if $w_i > w_j$, or fetches some workload from processor $j$ otherwise. Clearly, a load balancing operation with the diffusion method requires only one communication step in the all-port communication model, but $d(i)$ steps in the one-port communication model.

As in the dimension-exchange method, the efficiency of the diffusion method is determined by the diffusion parameter. Following are two common choices of the parameter.

1. Local average diffusion (ADF) takes an average of the workload of neighboring processors by setting $\alpha_{ij} = 1/(1 + d(i))$ [2,3,4]; the torus is regular in that all processors have the same degree. The mesh is approximately regular when its size is large. For simplicity, we use a single value $\alpha = 1/(1 + d)$ to cover all communication channels in the mesh and in the torus.

2. Optimally-tuned diffusion (ODF) takes certain specific parameter values for maximizing efficiency in static and synchronous balancing [8]. As in the dimension-exchange method, the optimal diffusion parameter depends on the topology and the size of the underlying network. Let $k = \max\{k_1, k_2, \ldots, k_n\}$ in the $k_1 \times k_2 \times \cdots \times k_n$ mesh and torus. Then, their optimal choices were shown, in [8,15], to be

   $\alpha = 1/(2n)$ in the mesh,
   $\alpha = 1/(2n + 1 - \cos(2\pi/k))$ in the torus,
   $\alpha = 1/(n + 1)$ in the n-d hypercube.

4 Asynchronous implementations

In an asynchronous implementation of load balancing, processors perform balancing operations discretely based on their own local workload distributions and invocation policies. Since load balancing algorithms can be treated as orthogonal to invocation policies, we consider the load balancing operations of the processors in one time step so as to isolate their effects on the workload variance from the effects of invocation policies. We focus on the static situation of load balancing in which the underlying computation in a processor is suspended while the processor is performing load balancing operations. The dynamic situation presents only a few relatively minor differences to the analysis of the effects of load balancing.

Let $\nu^0$ be the original system workload variance when $t = 0$, and $\nu$ be the system workload variance when $t = 1$. Our comparison will be made between $\nu_{ade}$, $\nu_{ode}$, $\nu_{adf}$, and $\nu_{odf}$, which are the results from various load balancing operations.

Theorem 4.1 Suppose processors are running an asynchronous load balancing process under Assumption 2.1. Then, $E[\nu_{ade}] \le E[\nu_{df}]$ in the one-port communication model, while $E[\nu_{df}] \le E[\nu_{ade}]$ in the all-port communication model. Moreover, $E[\nu_{adf}] \le E[\nu_{odf}]$ in chain and ring networks, but $E[\nu_{odf}] \le E[\nu_{adf}]$ in two- or higher-dimensional meshes and tori. In addition, $E[\nu_{ade}] \le E[\nu_{ode}]$ in the all-port communication model.
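Before the theorem is discussed, it may help to make the four algorithms of Section 3 concrete. The Python sketch below is an illustration under stated assumptions: the function names, the `topology` strings and the in-place list updates are hypothetical choices, and each operator models a single invoker in a singular balancing domain with one common parameter for all of its channels.

```python
import math


def ode_lambda(k, torus):
    """Optimally-tuned dimension-exchange parameter; k is the largest dimension order."""
    angle = 2 * math.pi / k if torus else math.pi / k
    return 1.0 / (1.0 + math.sin(angle))


def adf_alpha(d):
    """Local average diffusion parameter for a processor of degree d."""
    return 1.0 / (1 + d)


def odf_alpha(n, k, topology):
    """Optimally-tuned diffusion parameter for an n-dimensional network."""
    if topology == "mesh":
        return 1.0 / (2 * n)
    if topology == "torus":
        return 1.0 / (2 * n + 1 - math.cos(2 * math.pi / k))
    if topology == "hypercube":
        return 1.0 / (n + 1)
    raise ValueError("unknown topology: " + topology)


def dimension_exchange(w, invoker, neighbors, lam):
    """Eq. (3): the invoker balances with its neighbors one at a time (in place)."""
    for j in neighbors:
        moved = lam * (w[j] - w[invoker])  # uses the invoker's updated workload
        w[invoker] += moved
        w[j] -= moved
    return w


def diffusion(w, invoker, neighbors, alpha):
    """Eq. (4): the invoker balances with all neighbors at once (in place)."""
    moves = [alpha * (w[j] - w[invoker]) for j in neighbors]  # all from old values
    for j, moved in zip(neighbors, moves):
        w[invoker] += moved
        w[j] -= moved
    return w


ade_result = dimension_exchange([8.0, 0.0, 0.0], 0, [1, 2], lam=0.5)
adf_result = diffusion([9.0, 0.0, 0.0], 0, [1, 2], alpha=adf_alpha(2))
```

The only difference between the two operators is whether the invoker's intermediate workload feeds into the next exchange. Both conserve the total workload of the domain, since every unit added to one processor is removed from another.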
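This theorem says that the dimension-exchange and the diffusion methods are suitable for the one-port and the all-port communication models, respectively. More specifically, it reveals that the ODF algorithm outperforms the ADF algorithm in higher dimensional meshes and tori although the ODF was originally proposed for use in synchronous global balancing.

The theorem is proved through the derivation of the closed form of each variance $E[\nu]$. The calculation of $E[\nu]$ is based on a lemma concerning the sample variance of a combination of random variables in a sample set, which we present without proof. It can be easily shown using fundamental statistical theories.

Lemma 4.1 Suppose that $\xi_1, \xi_2, \ldots, \xi_N$ are $N$ i.i.d. random variables with variance $\sigma^2$, and $\bar{\xi} = \frac{1}{N}\sum_{i=1}^{N} \xi_i$. Then,

1. for any $k$, $1 \le k \le N$,

\[
E\Big(\Big|\sum_{i=1}^{k} a_i \xi_i - \bar{\xi}\Big|^2\Big) = \Big(\sum_{i=1}^{k} a_i^2 - \frac{1}{N}\Big)\sigma^2 ,
\tag{5}
\]

   where $0 < a_i < 1$ satisfies $\sum_{i=1}^{k} a_i = 1$; and the variance is minimized at $a_i = 1/k$ for a given $k$.

2. for any $k_1$ and $k_2$ with $1 \le k_1 \le k_2 \le N$,

\[
E\Big(\Big|\sum_{i=1}^{k_1} a_i \xi_i - \bar{\xi}\Big|^2\Big) \ge E\Big(\Big|\sum_{j=1}^{k_2} b_j \xi_j - \bar{\xi}\Big|^2\Big) ,
\tag{6}
\]

   where $0 < a_i < 1$ satisfies $\sum_{i=1}^{k_1} a_i = 1$ and $0 < b_j < 1$ satisfies $\sum_{j=1}^{k_2} b_j = 1$.

Proof of Theorem 4.1. At a certain time in an asynchronous load balancing process, there might be more than one processor invoking load balancing within their neighborhoods simultaneously. Let $\tilde{A}(i) = \{i\} \cup A(i)$ denote the balancing domain of an invoker processor $i$. The balancing domains of concurrent invokers may overlap or may be separated from each other. As a whole, those processors that are running load balancing processes are partitioned into a number of separated spheres, some of which are singular balancing domains and some are unions of overlapping domains. Processors in different spheres perform load balancing operations independently, while processors in the same sphere perform load balancing in a synchronous manner.

Suppose initially there are $m$ independent balancing spheres in the system, denoted by $B_1, B_2, \ldots, B_m$. Then, by the definition of the workload variance and Assumption 2.1, we have

\[
\begin{aligned}
E[\nu] &= E\Big(\sum_{i=1}^{N} |w_i - \bar{w}|^2\Big) = \sum_{i=1}^{N} E(|w_i - \bar{w}|^2) \\
&= \sum_{j=1}^{m} \sum_{i \in B_j} E(|w_i - \bar{w}|^2) + \sum_{i \notin \cup_{j=1}^{m} B_j} E(|w_i - \bar{w}|^2) \\
&= \sum_{j=1}^{m} \sum_{i \in B_j} E(|w_i - \bar{w}|^2) + (N - N_0)\Big(1 - \frac{1}{N}\Big)(\sigma_0^2 + \sigma^2),
\end{aligned}
\tag{7}
\]

where $N_0 = |\cup_{i=1}^{m} B_i|$ is the number of processors involved in load balancing. The last term of (7) is due to the underlying computational operations. It is a constant for a given $N_0$ and independent of the topological relationships among the $N_0$ processors. The first term of (7) is due to load balancing operations in all separated balancing spheres. It is a simple arithmetic sum of the workload variance of each sphere, $\sum_{i \in B_j} E(|w_i - \bar{w}|^2)$. As a whole, Eq. (7) implies that the expected value of the system workload variance is influenced independently by load balancing operations within different balancing spheres. Therefore, it suffices to compare the effects of load balancing algorithms within different spheres using Lemma 4.1.

Case 1: load balancing in a singular balancing domain

We first consider load balancing in spheres of singular balancing domains. Suppose $B$ is such a sphere, and without loss of generality, $B = \tilde{A}(1) = \{1, 2, 3, \ldots, d+1\}$. That is, processor 1 invokes a load balancing operation within its $d$ neighbors which are labeled from 2 to $d+1$. Let $X = \sum_{i=1}^{d+1} E(|w_i - \bar{w}|^2)$, denoting the expected value of the workload variance of sphere $B$. Then, with the diffusion algorithm, the workloads of processors at the end of a diffusion operation are given, according to Eq. (4), by

\[
w_i =
\begin{cases}
(1 - d\alpha) w_1^0 + \alpha \sum_{j=2}^{d+1} w_j^0 & \text{if } i = 1; \\
\alpha w_1^0 + (1 - \alpha) w_i^0 & \text{if } 2 \le i \le d+1.
\end{cases}
\tag{8}
\]

Invoking Lemma 4.1 on each component $w_i$, we have that

\[
\begin{aligned}
X_{df} &= \sum_{i=1}^{d+1} E(|w_i - \bar{w}|^2) \\
&= E\Big(\Big|(1 - d\alpha) w_1^0 + \alpha \sum_{i=2}^{d+1} w_i^0 - \bar{w}^0\Big|^2\Big) + \sum_{i=2}^{d+1} E(|\alpha w_1^0 + (1 - \alpha) w_i^0 - \bar{w}^0|^2) \\
&= [d\alpha^2 + (1 - d\alpha)^2 - 1/N]\,\sigma_0^2 + d\,[\alpha^2 + (1 - \alpha)^2 - 1/N]\,\sigma_0^2 \\
&= \Big(d^2\alpha^2 + 3d\alpha^2 - 4d\alpha + d + 1 - \frac{d+1}{N}\Big)\sigma_0^2 .
\end{aligned}
\tag{9}
\]

It can be easily verified that $X_{df}$, as a convex function of $\alpha$, is minimized at $\alpha = 2/(3+d)$. Let $\alpha_{opt} = 2/(3+d)$. We replace $\alpha$ by $\alpha_{opt}$ in the expression of $X_{df}$, and obtain

\[
X_{df}(\alpha_{opt}) = \Big(\frac{d^2 + 3}{d + 3} - \frac{d+1}{N}\Big)\sigma_0^2 .
\tag{10}
\]

Recall that $\alpha_{adf} = 1/(d+1)$, and that $\alpha_{odf} = 1/d$ in a mesh and $\alpha_{odf} = 1/(d + 1 - \cos(2\pi/k))$ in a torus, where $k$ is the maximum dimensional order of the torus. It follows that

- in the case of a chain (i.e., the 1-d mesh) where $d = 2$, $\alpha_{adf} < 2/(3+d) < \alpha_{odf}$ and $|\alpha_{adf} - \alpha_{opt}| < |\alpha_{odf} - \alpha_{opt}|$;
- in the case of a ring (i.e., the 1-d torus) where $d = 2$, $\alpha_{adf} < \alpha_{opt} < \alpha_{odf}$ and $|\alpha_{adf} - \alpha_{opt}| < |\alpha_{odf} - \alpha_{opt}|$, for $k \ge 12$;
- in the case of higher dimensional meshes and tori where $d \ge 4$, $\alpha_{adf} < \alpha_{odf} < \alpha_{opt}$.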
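Consequently, with the diffusion method,

\[
\begin{cases}
X_{odf} \ge X_{adf} \ge X_{df}(\alpha_{opt}) & \text{if } d = 2 \text{ and } k \ge 12; \\
X_{adf} \ge X_{odf} \ge X_{df}(\alpha_{opt}) & \text{if } d \ge 4.
\end{cases}
\tag{11}
\]

With the dimension-exchange method, processor 1 is assumed to perform pairwise load balancing with processors $2, 3, \ldots, d+1$ in turn in a dimension-exchange load balancing operation. Assume the underlying system is in the one-port communication model. Then, the workload generation/consumption ratio in a round of pairwise balancing steps has the same statistical characteristics as those in a diffusion operation. Consequently, the processors' workloads at the end of a dimension-exchange operation are given, according to Eq. (3), by

\[
w_i =
\begin{cases}
(1-\lambda)^d w_1^0 + \lambda \sum_{j=0}^{d-1} (1-\lambda)^j w_{d-j+1}^0 & \text{if } i = 1; \\
(1-\lambda) w_2^0 + \lambda w_1^0 & \text{if } i = 2; \\
(1-\lambda) w_i^0 + \lambda (1-\lambda)^{i-2} w_1^0 + \lambda^2 \sum_{j=0}^{i-3} (1-\lambda)^j w_{i-1-j}^0 & \text{if } 3 \le i \le d+1.
\end{cases}
\tag{12}
\]

Invoking Lemma 4.1 on each component $w_i$, we have that

\[
\begin{aligned}
X_{de} &= \sum_{i=1}^{d+1} E(|w_i - \bar{w}|^2) \\
&= \Big[(1-\lambda)^{2d} + \lambda^2 \sum_{j=0}^{d-1} (1-\lambda)^{2j} - 1/N\Big]\sigma_0^2 + [(1-\lambda)^2 + \lambda^2 - 1/N]\,\sigma_0^2 \\
&\quad + \sum_{i=3}^{d+1} \Big[(1-\lambda)^2 + \lambda^2 (1-\lambda)^{2(i-2)} + \lambda^4 \sum_{j=0}^{i-3} (1-\lambda)^{2j} - 1/N\Big]\sigma_0^2 \\
&= \Big[d(1-\lambda)^2 + 2\lambda^2\,\frac{1 - (1-\lambda)^{2d}}{1 - (1-\lambda)^2} + \frac{\lambda^4}{1 - (1-\lambda)^2}\Big(d - \frac{1 - (1-\lambda)^{2d}}{1 - (1-\lambda)^2}\Big) + (1-\lambda)^{2d} - \frac{d+1}{N}\Big]\sigma_0^2 .
\end{aligned}
\]

In particular, substituting $1/2$ for the $\lambda$ in the expression of $X_{de}$, calling it $X_{ade}$, leads to

\[
X_{ade} = \Big(\frac{3d + 5 + 2^{2-2d}}{9} - \frac{d+1}{N}\Big)\sigma_0^2 .
\tag{13}
\]

From (10) and (13), it is known that $X_{ade} \le X_{df}(\alpha_{opt})$. It is thus proved that in the one-port communication model,

\[
X_{ade} \le X_{df} .
\tag{14}
\]

In the all-port communication model, a dimension-exchange pairwise balancing step takes as much time as a diffusion load balancing operation. That is, in a time step of the diffusion method, a processor balances with only one of its neighbors with the dimension-exchange method. It results in

\[
X_{de} = 2\,[(1-\lambda)^2 + \lambda^2 - 1/N]\,\sigma_0^2 + (d-1)(1 - 1/N)\,\sigma_0^2 .
\]

Consequently, $X_{ade}$ is less than $X_{ode}$ but larger than $X_{df}$.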
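The closed forms of Case 1 can be checked numerically. The sketch below is an illustration rather than part of the proof: it drops the common $-(d+1)/N$ term and the factor $\sigma_0^2$, applies Eq. (3) to coefficient vectors instead of workloads, and uses Lemma 4.1(1), so the variance of a sphere is the sum of squared coefficients. All function names are assumptions introduced for the example.

```python
import math


def x_df(alpha, d):
    """Eq. (9) in units of sigma_0^2, without the -(d+1)/N term."""
    return d * d * alpha ** 2 + 3 * d * alpha ** 2 - 4 * d * alpha + d + 1


def x_ade_closed_form(d):
    """Eq. (13) in units of sigma_0^2, without the -(d+1)/N term."""
    return (3 * d + 5 + 2.0 ** (2 - 2 * d)) / 9


def x_de_from_weights(d, lam):
    """Run Eq. (3) on coefficient rows and sum the squared coefficients."""
    n = d + 1
    rows = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for j in range(1, n):  # invoker is row 0, neighbors are rows 1..d
        a, b = rows[0], rows[j]
        rows[0] = [(1 - lam) * x + lam * y for x, y in zip(a, b)]
        rows[j] = [(1 - lam) * y + lam * x for x, y in zip(a, b)]
    return sum(x * x for row in rows for x in row)


def smallest_ring_order(d=2):
    """Smallest k for which ADF is closer than ODF to alpha_opt on a ring."""
    a_adf, a_opt = 1 / (d + 1), 2 / (3 + d)
    k = 3
    while True:
        a_odf = 1 / (d + 1 - math.cos(2 * math.pi / k))
        if a_adf < a_opt < a_odf and abs(a_adf - a_opt) < abs(a_odf - a_opt):
            return k
        k += 1


ring_k = smallest_ring_order()
gaps = [x_df(2 / (3 + d), d) - x_ade_closed_form(d) for d in range(2, 9)]
```

Every entry of `gaps` is positive, which is the one-port inequality (14) for degrees 2 through 8, and the ring threshold found by the search agrees with the condition on $k$ stated for the ring above.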
Case2:loadbalancinginaunionofoverlappingdomains Wenowconsiderloadbalancinginsphereswhichareunionsofoverlappingbalancingdomains. Abalancingspherecanbeaunionofanynumberofoverlappingdomains.Inconsideration ofthelikelihoodthatfewprocessorswillbeinvokingloadbalancingsimultaneouslyinasynchronousimplementations,wefocusontheunionoftwobalancingdomainsonly.figurepingbalancingdomainsin2-dmeshesandtori.thetrianglesareinvokersofloadbalancing illustratesthreetopologicalrelationshipsbetweenapairofprocessorswhichhaveoverlap- processesandthedotsareprocessorsbeinginvolvedinloadbalancing. Figure3:Illustrationsofoverlappingbalancingdomains wj2).supposeprocessorsjandj2havethesamenumberofdirectneighbors.then,inthe ea(j2).letydenotetheexpectedvalueoftheworkloadvarianceofb2,i.e.,y=pi2b2e(jwi? AssumeB2isaunionofbalancingdomainsofprocessorsjandj2.Thatis,B2=eA(j)[ (a) (b) (c) casethatprocessorsjandj2aredirectlyconnected,asinfigure3(a),wehavethatinthe andinthedimension-exchangemethod, Inparticular, Yde2Xde?2[(?)2+2(?)2(d?)+4?(?)2(d?) Yade2Xade?2(3+ 2?(?)2?N]20: (6) (orea(j2)nfjg)changesitsworkloadinthesamewayasinloadbalancingwithinasingular balancingdomainea(j)(orea(j2)).thereasonsoftheinequalityoftheydeinthedimensionexchangemethodareasfollows.withthedimension-exchangemethod,bothprocessorsjand TheequationofthediusionmethodisduetothefactthateachprocessorineA(j)nfj2g j2performpairwisebalancingoperationswiththeirneighborsinturnaccordingtoorderswhich as,processorj2asc+,andotherneighboringprocessorsofprocessorjas2tod+ inb2isthusinuencedbytheexecutionorderacrossthecommunicationchannels.suppose thechannel(j;j2)isindexedascth.withoutlossofgenerality,werelabeltheprocessorj arepresetthroughedge-coloringofthesystemgraph.thechangeoftheworkloaddistribution 4 diusionmethod, Ydf=2Xdf?2[(?)2+2?N]20; (5) 322d?N)20: (7)
excludingc+.then,itisclearthatprocessorsfrom2tocchangetheirworkloadsin thesamewayastheircounterpartsiftheyareperformingloadbalancingwithinasingular E(jwd+?wj2)E(jwi?wj2)fori>c,theboundofYdeishenceobtained. domainea(i)alone,whilethebehaviorsofotherprocessorswillbeinuencedbyprocessorsin workloadvariancee(jwi?wj2)inaunionofoverlappingdomainsthaninea(i)alone.since ea(j)nfig.fromlemma4.(2),itisalsoknownthateachprocessori,i>c,willpossessless theoptimalchoiceof.then,ydf(opt)=d?4d2?4d+ ItcanbeeasilyveriedthatYdfisminimizedat=(2d?)=(d2+3d?2).Letoptbe Asinthecaseofsingularbalancingdomain,itcanbeshownthat d2+3d?2: (8) inthecaseofachain(i.e.,-dmesh)whered=2,adf<opt<odfandjadf?optj< inthecaseofaring(i.e.,-dtorus)whered=2,adf<opt<odfandjadf?optj< jodf?optj; inthecaseofhigherdimensionalmeshesandtoriwhered4,adf<odf<opt. jodf?optj,fork6; Consequently,withthediusionmethod, (YodfYadfYdf(opt)ifd=2andk6; YadeYdf(opt). Ontheotherhand,thecomparisonbetweenYadeofEq.(7)andYdfj=optofEq.(8)yields YadfYodfYdf(opt)ifd4 (9) thereareatmosttwoprocessorsintheintersectoftheirbalancingdomainsinthemeshand torusnetworks.letsbethecardinalityoftheintersectea(i)\ea(j),s=or2.then,with thediusionmethod, Incasesthatprocessorsiandjarenonadjacent,asillustratedinFigure3(b)and3(c), Ydf=2Xdf?2s[(?)2+2?N]20+s[(?2)2+22?N]20 andwiththeadealgorithm, =2Xdf?s[(?22)?N]20; (20) Yade2Xade?s[3+ 322d?N]20: 2 YodfYadfincased=2,andYadfYodfincased4. Similarlytothecaseofsingularbalancingdomain,wehavetheresultthatYadeYdf, (2) ontheassumptionofone-portcommunicationmodel.intheall-portcommunicationmodel, Noticethattheprecedinganalysisofthedimension-exchangemethodisimplicitlybased 5
In the all-port communication model, a dimension-exchange pairwise balancing step corresponds to a diffusion balancing operation. Because two pairwise balancing operations in a union of two balancing domains can be performed concurrently, we thus have

$$Y_{de} = 4\left[(1-\lambda)^2 + \lambda^2 - \frac{1}{N}\right]\sigma_0^2 + (2d-4-s)\left(1-\frac{1}{N}\right)\sigma_0^2,$$

where $s = 1$ or $2$. Obviously, $Y_{ade}$ is less than $Y_{ode}$ but larger than $Y_{df}$. The theorem is then proved.

Note that even though the proof of the theorem assumes static workload behaviors, the theorem still holds in the dynamic situation. Consider processors in balancing sphere $B$. Since the workloads generated/consumed from time 0 to time $\tau$ in any processor $i$, $i \in B$, will not be considered in its load balancing operation at time step $\tau$, the operation in the dynamic situation then results in a workload variance $E(|w_i - \bar{w}|^2) + \tau\sigma^2$, where $E(|w_i - \bar{w}|^2)$ is the processor's workload variance in the static situation. As a whole, the accumulative workload variance of processors in balancing sphere $B$ in the dynamic situation is $\sum_{i \in B} E(|w_i - \bar{w}|^2) + N'\tau\sigma^2$, where $N' = |B|$. The added term is a constant for a given $N'$ and independent of the load balancing algorithm used. Hence, the arguments in the proof of the theorem are valid in the dynamic situation.

To conclude this section, we remark that a diffusion load balancing operation averages the workload of a processor, say processor 1, and its surrounding $d$ processors, labeled from 2 to $d+1$, in a simple way: processor 1 gives $\alpha(w_1 - w_i)$ loads to processor $i$ in the case of $w_1 > w_i$, and takes $\alpha(w_i - w_1)$ loads from processor $i$ otherwise ($2 \le i \le d+1$). In a singular balancing domain, there might be a variant of the ADF algorithm which strives for local load balancing. Specifically, processor 1 calculates the local average $\hat{w}$ as

$$\hat{w} = \frac{w_1 + \sum_{2 \le i \le d+1} w_i}{1+d},$$

and then gives or takes $|w_i - \hat{w}|$ loads to or from processor $i$ according to whether processor $i$ is deficient or not. After such an operation, each processor $i$, $2 \le i \le d+1$, ends up with the same workload as processor 1. Consequently, the expected workload variance of the domain, $\sum_{i=1}^{d+1} E(|w_i - \bar{w}|^2)$, becomes

$$\left(1 - \frac{d+1}{N}\right)\sigma_0^2,$$

which is obviously smaller than that of the ADE method even in the one-port communication model. Although it incurs more overheads than an ODF or ADF operation, such a variant of the diffusion operation is preferred in singular balancing domains. However, it may not be effective in balancing spheres where a number of balancing domains overlap with each other, because processors in such a sphere are unable to balance their workloads with all the processors in such an operation.
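The variance quoted for the local-averaging variant can be checked exactly, without simulation, by tracking how each processor's workload is composed of the initial workloads. The sketch below is only an illustration and is not part of the original analysis. It assumes $N$ independent initial workloads with a common variance `sigma0_sq`, so that a workload with coefficients $c_k$ over the initial workloads has expected squared deviation $\sigma_0^2 \sum_k (c_k - 1/N)^2$ from the system average. The function names are hypothetical, and the comparison is made against a single diffusion operation with the ADF parameter rather than against the ADE method.

```python
from fractions import Fraction


def domain_variance(coeff_rows, n, sigma0_sq=1):
    """Sum over the rows of E|w_i - system average|^2, where row i holds the
    coefficients of processor i's workload over n i.i.d. initial workloads."""
    inv_n = Fraction(1, n)
    return sum(sigma0_sq * sum((c - inv_n) ** 2 for c in row)
               for row in coeff_rows)


def local_average_rows(d, n):
    """Coefficient rows of processors 1..d+1 after the local-averaging variant:
    each of them ends up holding the mean of the d+1 domain workloads."""
    row = [Fraction(1, d + 1)] * (d + 1) + [Fraction(0)] * (n - d - 1)
    return [row] * (d + 1)


def diffusion_rows(d, n, alpha):
    """Coefficient rows after one diffusion operation invoked by processor 1
    (index 0) with its d neighbours (indices 1..d)."""
    rows = [[1 - d * alpha] + [alpha] * d + [Fraction(0)] * (n - d - 1)]
    for i in range(1, d + 1):
        row = [Fraction(0)] * n
        row[0] = alpha          # share received from (or given to) the invoker
        row[i] = 1 - alpha      # share of the neighbour's own workload
        rows.append(row)
    return rows


d, n = 4, 64
print(domain_variance(local_average_rows(d, n), n))
print(domain_variance(diffusion_rows(d, n, Fraction(1, d + 1)), n))
```

With exact rational arithmetic the first value equals $1 - (d+1)/N$ in units of $\sigma_0^2$, and it is smaller than the value for one diffusion operation in the same domain.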
5 Synchronous implementations

In a synchronous implementation of load balancing, all processors perform load balancing operations concurrently and continuously for a time period in order to achieve a global balanced state in the static situation or to keep the varying system workload variance bounded in the dynamic situation.

The synchronous implementation of the diffusion and the dimension-exchange methods can be modeled by linear iterative processes, as illustrated in [12, 8, 5]. From Eq. (4), the change of the workload distribution at time $t$ in the diffusion method can be modeled by the equation

$$W^{t+1} = DW^t + \Phi^{t+1}, \qquad (22)$$

where $D$, called a diffusion matrix, is given by

$$D_{ij} = \begin{cases} \alpha & \text{if processors } i \text{ and } j \text{ are directly connected;} \\ 1 - \alpha\, d(i) & \text{if } i = j; \\ 0 & \text{otherwise.} \end{cases}$$

With the above formulation, the features of the synchronous implementation of the diffusion method are fully captured by the iterative process governed by $D$.

Let $\kappa$ be the minimum number of colors required for edge coloring of the system graph. Then, from Eq. (3), the change of the workload distribution at time $t$ in the dimension-exchange method can be modeled by the equation

$$W^{t+1} = MW^t + \Phi^{t+1},$$

where $M$, called the dimension-exchange matrix, is given by

$$M = M_\kappa M_{\kappa-1} \cdots M_1. \qquad (23)$$

Each $M_c$ ($1 \le c \le \kappa$) reflects the change of the workload distribution of the system at pairwise balancing step $c$ of time $t$. Thus, the features of the synchronous implementation of the dimension-exchange method are fully captured by the iterative process governed by $M$.

5.1 Static situation

In a static synchronous load balancing process, all processors are assumed to perform load balancing operations simultaneously and all computational operations are suspended. This situation is true of periodic load balancing, as experimented in [30, 31, 25, 19]. The efficiency of a load balancing algorithm in this situation is reflected by the number of communication steps required for arriving at a global balanced state from any initial load distribution.

Let $F$ be either the dimension-exchange matrix $M$ or the diffusion matrix $D$. Then, $W^t = F^t W^0$, where $F^t = \underbrace{F F \cdots F}_{t \text{ times}}$. Since $\bar{W}^t = \bar{W}^0$ in the static situation, and $F^t \bar{W}^t = \bar{W}^t$, it follows that

$$W^t - \bar{W}^t = F(W^{t-1} - \bar{W}^{t-1}) = F^t(W^0 - \bar{W}^0).$$
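As a concrete instance of the two iterative processes in the static situation (no workload generated or consumed), the following sketch builds $D$ and $M$ for a ring and applies them repeatedly to an initial distribution. It is an illustration under stated assumptions rather than the authors' simulator: the ring is assumed to have an even number of nodes so that two colors suffice for its edge coloring, a pairwise step is assumed to update a pair as $w_i \leftarrow (1-\lambda)w_i + \lambda w_j$, and all function names are hypothetical.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_vec(a, w):
    """Apply matrix a to the workload vector w."""
    return [sum(x * y for x, y in zip(row, w)) for row in a]


def ring_diffusion_matrix(n, alpha):
    """D for a ring of n >= 3 nodes: alpha on edges, 1 - 2*alpha on the diagonal."""
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 1.0 - 2.0 * alpha
        d[i][(i + 1) % n] = alpha
        d[i][(i - 1) % n] = alpha
    return d


def ring_exchange_matrix(n, lam):
    """M = M_2 M_1 for a ring with an even number n >= 4 of nodes.
    Color 1 pairs nodes (0,1),(2,3),...; color 2 pairs (1,2),...,(n-1,0)."""
    def pairwise_step(first):
        m = [[0.0] * n for _ in range(n)]
        for i in range(first, first + n, 2):
            p, q = i % n, (i + 1) % n
            m[p][p] = m[q][q] = 1.0 - lam
            m[p][q] = m[q][p] = lam
        return m
    # The color-1 step is applied to the workload vector first.
    return mat_mul(pairwise_step(1), pairwise_step(0))


def variance(w):
    """Sum of squared deviations from the average workload."""
    mean = sum(w) / len(w)
    return sum((x - mean) ** 2 for x in w)


w0 = [1000.0] + [0.0] * 7                  # all workload on one node of an 8-ring
D = ring_diffusion_matrix(8, 1.0 / 3.0)    # ADF choice: alpha = 1/(d+1), d = 2
M = ring_exchange_matrix(8, 0.5)           # ADE choice: lambda = 1/2
w_df, w_de = w0, w0
for _ in range(10):
    w_df = mat_vec(D, w_df)
    w_de = mat_vec(M, w_de)
print(variance(w0), variance(w_df), variance(w_de))
```

Both processes conserve the total workload, and after the same number of balancing operations the dimension-exchange iterate is closer to the uniform distribution than the diffusion iterate in this small example.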
Then, by the definition of the workload variance, we have

$$\nu^t = \|W^t - \bar{W}\|^2 = \|F^t(W^0 - \bar{W})\|^2 \le \gamma^{2t}(F)\,\nu^0,$$

where $\gamma(F)$ is the sub-dominant eigenvalue of $F$ in modulus. It says that the workload variance is reduced geometrically, and its scale factor is upper bounded by $\gamma(F)$. The bound is tight, and $\nu^t$ satisfies

$$\nu^t \simeq \gamma^{2t}(F)\,\nu^0 \quad \text{for large } t. \qquad (24)$$

Here, by $g(t) \simeq h(t)$ for large $t$, we mean that $g(t)/h(t) \to 1$ for large $t$. The sub-dominant eigenvalue in modulus $\gamma(F)$ is thus referred to also as the convergence factor of a load balancing algorithm.

Let $T$ be the number of iteration steps required to reduce the workload variance of an initial state to some prescribed bound $\epsilon$. Then, from Eq. (24), it follows that

$$T = \frac{\ln\epsilon - \ln\nu^0}{2\ln\gamma(F)}. \qquad (25)$$

Hence,

$$T = O(1/\ln\gamma(F)). \qquad (26)$$

The convergence factors of the dimension-exchange and the diffusion methods were evaluated by a number of researchers. In [12], Boillat presented the convergence factors of the ADF algorithm, $\gamma_{adf}$, when applied to a broad variety of structures. In [5, 15], Xu and Lau analyzed the effects of the dimension-exchange and the diffusion parameters on the efficiency of load balancing, and proposed the ODE and ODF algorithms by choosing the optimal values for the parameters $\lambda$ and $\alpha$. The corresponding convergence factors, $\gamma_{ode}$ and $\gamma_{odf}$, are hence readily available from their proofs. Also, the convergence factor $\gamma_{ade}$ can be derived easily from the work. We summarize the convergence factors in Table 1.

Table 1: Convergence factors of the dimension-exchange and the diffusion methods, where $k$ is the maximum number of nodes over all dimensions of an n-D network

$$\begin{array}{lcccc}
 & \text{ADE (DE method)} & \text{ODE (DE method)} & \text{ADF (diffusion method)} & \text{ODF (diffusion method)} \\ \hline
\text{mesh} & \cos^2(\pi/k) & \dfrac{1-\sin(\pi/k)}{1+\sin(\pi/k)} & \dfrac{2n-1+2\cos(\pi/k)}{2n+1} & \dfrac{n-1+\cos(\pi/k)}{n} \\
\text{torus} & \cos^2(2\pi/k) & \dfrac{1-\sin(2\pi/k)}{1+\sin(2\pi/k)} & \dfrac{2n-1+2\cos(2\pi/k)}{2n+1} & \dfrac{2n-1+\cos(2\pi/k)}{2n+1-\cos(2\pi/k)}
\end{array}$$

Notice that the convergence factor is in iteration steps, each of which is what we called a load balancing operation before. In the one-port communication model, such an operation in both the dimension-exchange and the diffusion methods requires $2n$ communication steps in an n-D network. In the all-port communication model, a diffusion load balancing operation requires only one communication step while a dimension-exchange operation still requires $2n$ steps. Therefore, Table 1 and Eq. (26) lead to Table 2. It presents the time complexities in communication steps necessary for various load balancing algorithms in both one-port and all-port communication models.

Table 2: Time complexities of the dimension-exchange and the diffusion methods, where $k$ is the maximum number of nodes over all dimensions in an n-D network and *-port means the all-port communication model

$$\begin{array}{lcccccccc}
 & \text{ADE, 1-port} & \text{ADE, *-port} & \text{ODE, 1-port} & \text{ODE, *-port} & \text{ADF, 1-port} & \text{ADF, *-port} & \text{ODF, 1-port} & \text{ODF, *-port} \\ \hline
\text{mesh} & O(nk^2) & O(nk^2) & O(nk) & O(nk) & O(n^2k^2) & O(nk^2) & O(n^2k^2) & O(nk^2) \\
\text{torus} & O(nk^2) & O(nk^2) & O(nk) & O(nk) & O(n^2k^2) & O(nk^2) & O(n^2k^2) & O(nk^2)
\end{array}$$

The time complexities given in Table 2 are inferred from the convergence factors. For example, the $O(nk)$ estimate for the ODE algorithm follows from the following derivation.

$$\ln\gamma_{ode} = \ln\frac{1-\sin(2\pi/k)}{1+\sin(2\pi/k)} = \ln\left(1 - \frac{2\sin(2\pi/k)}{1+\sin(2\pi/k)}\right) \simeq \ln\left(1 - \frac{4\pi}{k+2\pi}\right) \simeq -\frac{4\pi}{k+2\pi} \quad \text{for large } k.$$

From Eq. (26), we have $T_{ode} = O(k)$ in balancing operations. Since an ODE load balancing operation requires $O(n)$ communication steps in both communication models, the estimate $O(nk)$ is thus proved.

The entries of the table show the following.

Theorem 5.1 Suppose processors are running synchronous load balancing processes in the static situation. Then, both the ADE and the ODE algorithms converge asymptotically faster than the diffusion method in the one-port communication model. In the all-port communication model, the ODE algorithm converges also faster than the other three algorithms by a factor of $k$.

5.2 Dynamic situation

In dynamic synchronous implementations, all processors are performing load balancing and computation concurrently. Dynamic load balancing in this situation aims at keeping the variance of processors' workloads bounded as tightly as possible. The performance of the synchronous implementation of the diffusion method in the dynamic situation has been evaluated in [8, 13, 14]. In [8], Cybenko showed that the diffusion method keeps the asymptotic variance
bounded. He also proved that the asymptotic variance from the diffusion method is larger than the variance from the ADE algorithm when both are applied to the hypercube network. Given this, we are still unable to draw a conclusion regarding the superiority of the ADE algorithm in terms of the balance quality during load balancing. In [13], Hong, Tan and Chen reported a constant bound for the workload variance when the ADF algorithm runs in the hypercube network. This result was extended later by Qian and Yang to generalized hypercubes and mesh structures [14]. Although the bounds they derived are independent of time, they are too loose to be used for the comparison of balance qualities during load balancing. Also, the approaches used in [13, 14] are unsuitable for the analysis of the dimension-exchange method because of their different operational behaviors.

In this subsection, we develop a new approach for analyzing the balance qualities of different algorithms. We present a closed form of the workload variance when a load balancing process runs in the torus and the hypercube networks. The approach is not applicable to the case of the mesh networks as they are not regular networks. Nevertheless, since an n-D mesh has only a fraction of its processors whose degree is smaller than $2n$, our results can be applied to the mesh structure as a reasonable approximation as well; this is supported by our simulation results to be presented in the next section.

Throughout the subsection, we assume load balancing in an n-D torus network, and let $d = 2n$ be the degree of the network.

Lemma 5.1 Suppose processors are running a synchronous diffusion load balancing process under Assumption 2.1. Then, $E(W^t)$ is a uniform distribution at any time $t$ and

$$E[\nu^t_{df}] = \left(a^{t+1}\sigma_0^2 + \frac{1-a^{t+1}}{1-a}\sigma^2\right)N - (t+1)\sigma^2 - \sigma_0^2, \qquad (27)$$

where $a = (1-d\alpha)^2 + d\alpha^2$.

Proof. The uniform distribution of $E(W^t)$ resulting from the diffusion method can be easily shown. We omit its proof because it is also available as a special case in the proof of the uniform distribution of $E(W^t)$ resulting from the dimension-exchange method in the next lemma.

Consider the expected workload variance $E[\nu^t_{df}]$. By its definition and Assumption 2.1, we have

$$\begin{aligned}
E[\nu^t_{df}] &= E(\|W^t - \bar{W}^t\|^2) \\
&= E(\|DW^{t-1} - \bar{W}^{t-1}\|^2) + E(\|\Phi^t - \bar{\Phi}^t\|^2) \\
&= E(\|D^{t+1}W^0 - \bar{W}^0\|^2) + \sum_{i=0}^{t} E(\|D^i\Phi^{t+1-i} - \bar{\Phi}^{t+1-i}\|^2) \\
&= N\left(a^{t+1} - \frac{1}{N}\right)\sigma_0^2 + \sum_{i=0}^{t} N\left(a^i - \frac{1}{N}\right)\sigma^2 \\
&= \left(a^{t+1}\sigma_0^2 + \frac{1-a^{t+1}}{1-a}\sigma^2\right)N - (t+1)\sigma^2 - \sigma_0^2,
\end{aligned}$$
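where the fourth step is based on the following observations. An operation $D$ on the workload distribution changes each of its components to become a linear combination of the component's $d+1$ sub-components with coefficients $1-d\alpha, \alpha, \alpha, \ldots, \alpha$; and a sequence of operations $D^t$ changes each component of $\Phi$ to become a linear combination of its $(d+1)^t$ sub-components. From Lemma 4.1, it is known that the variance of a combination of random variables is determined only by their combinatorial coefficients. Therefore, we have $E(\|D\Phi - \bar{\Phi}\|^2) = N(a - 1/N)\sigma^2$, and $E(\|D^t\Phi - \bar{\Phi}\|^2) = N(a^t - 1/N)\sigma^2$, where $a = (1-d\alpha)^2 + d\alpha^2$.

Consider the term $a = (1-d\alpha)^2 + d\alpha^2$ in Lemma 5.1. It is minimized at $\alpha = 1/(d+1)$ over all possible choices of the parameter $\alpha$, which happens to be the choice of the ADF algorithm in n-D meshes and tori. Immediately, we obtain

$$E[\nu^t_{adf}] \le E[\nu^t_{odf}]. \qquad (28)$$

Next, we consider synchronous implementations of the dimension-exchange method. We present a companion to Lemma 5.1 in the following.

Lemma 5.2 Suppose processors are running a synchronous dimension-exchange load balancing process under Assumption 2.1, except that processors generate/consume $\phi_i$ workload at a pairwise balancing step. Then, $E(W^t)$ is a uniform distribution at any time $t$, and

$$E[\nu^t_{de}] = \left(s\,b^{td+d}\sigma_0^2 + s\,\frac{1-b^{td+d}}{1-b^d}\sigma^2\right)N - (t+1)d\sigma^2 - \sigma_0^2, \qquad (29)$$

where $b = (1-\lambda)^2 + \lambda^2$ and $s = 1 + b + b^2 + \cdots + b^{d-1}$.

Proof. Recall that $t$ is the index of load balancing operations in the dimension-exchange algorithm. A load balancing operation comprises $d$ pairwise balancing steps in both the torus and the mesh structures. To examine closely the dynamic behavior of the dimension-exchange algorithm at the level of pairwise operations, we introduce one more variable $t'$ to denote the index of pairwise steps. $t = 0$ if and only if $t' = 0$, and $t$ indexes the time instances $t'$ that are integer multiples of $d$. Then, at time $t'$ such that $t' = dt$,

$$\begin{aligned}
W^{t'} &= M_d W^{t'-1} + \Phi^{t'} \\
&= M_d M_{d-1} W^{t'-2} + M_d\Phi^{t'-1} + \Phi^{t'} \\
&= \prod_{c=d}^{1} M_c\, W^{t'-d} + \prod_{c=d}^{2} M_c\, \Phi^{t'-d+1} + \cdots + M_d\Phi^{t'-1} + \Phi^{t'} \\
&= MW^{t'-d} + \sum_{c=1}^{d}\left(\prod_{j=d}^{c+1} M_j\, \Phi^{t'-d+c}\right), \qquad (30)
\end{aligned}$$

where $\prod_{j=d}^{c} M_j = M_d M_{d-1} \cdots M_c$, and $\prod_{j=d}^{d+1} M_j = I$.

Let $\Psi^t = \sum_{c=1}^{d}\left(\prod_{j=d}^{c+1} M_j\, \Phi^{t'-d+c}\right)$ be the distribution of workloads which are generated/consumed from time $t'-d$ to $t'$, i.e., in the $t$-th dimension-exchange balancing operation. Using index $t$ instead of $t'$, Eq. (30) leads to

$$W^t = MW^{t-1} + \Psi^t = M^t W^0 + \sum_{j=1}^{t} M^{t-j}\Psi^j.$$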
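Using the linearity of the expectation operation, $E$, we obtain that

$$\begin{aligned}
E(W^t) &= E\left(M^t W^0 + \sum_{j=1}^{t} M^{t-j}\Psi^j\right) \\
&= M^t E(W^0) + \sum_{j=1}^{t} M^{t-j} E(\Psi^j) \\
&= \mu_0 u + dt\mu u,
\end{aligned}$$

where $u$ is a unitary vector of size $N$. It is a uniform distribution. The first part of the lemma is thus proved.

Next, we consider the workload variance at time $t$, $E[\nu^t_{de}]$. Let $\bar{\Psi}^t$ be the uniform distribution of workloads that are generated/consumed in the round $t$. Then, $\bar{\Psi}^t = \sum_{c=1}^{d} \bar{\Phi}^{t'-d+c}$, and $\bar{W}^t = \bar{W}^{t-1} + \bar{\Psi}^t$. By the definition of workload variance, we have

$$\begin{aligned}
E[\nu^t_{de}] &= E(\|W^t - \bar{W}^t\|^2) \\
&= E(\|MW^{t-1} - \bar{W}^{t-1}\|^2) + E(\|\Psi^t - \bar{\Psi}^t\|^2) \\
&= E(\|M^{t+1}W^0 - \bar{W}^0\|^2) + \sum_{i=0}^{t} E(\|M^i\Psi^{t+1-i} - \bar{\Psi}^{t+1-i}\|^2).
\end{aligned}$$

To prove the lemma regarding $E[\nu^t_{de}]$, it suffices to show that for $0 \le i \le t$,

$$E(\|M^i\Psi^{t+1-i} - \bar{\Psi}^{t+1-i}\|^2) = b^{id}sN\sigma^2 - d\sigma^2.$$

It can be shown by induction. We first consider $E(\|\Psi^t - \bar{\Psi}^t\|^2)$. It is the workload variance augmented in a round of dimension-exchange pairwise balancing operations. By the definition of $\Psi$, we have that

$$\begin{aligned}
E(\|\Psi^t - \bar{\Psi}^t\|^2) &= E\left(\Big\|\sum_{c=1}^{d}\Big(\prod_{j=d}^{c+1} M_j\,\Phi^{t'-d+c} - \bar{\Phi}^{t'-d+c}\Big)\Big\|^2\right) \\
&= \sum_{c=1}^{d}\left[E\left(\Big\|\prod_{j=d}^{c+1} M_j\,\Phi^{t'-d+c} - \bar{\Phi}^{t'-d+c}\Big\|^2\right)\right] \\
&= \sum_{c=1}^{d-1}(b^cN - 1)\sigma^2 + (N-1)\sigma^2 \\
&= sN\sigma^2 - d\sigma^2,
\end{aligned}$$

where the second step is due to the fact that $\prod_{j=d}^{c+1} M_j\,\Phi^{t'-d+c} - \bar{\Phi}^{t'-d+c}$ are zero mean independent random variables for $c \le d$, and the third step is due to the following reasons. Each component of $\prod_{j=d}^{c} M_j\,\Phi$ for any $c$, $c < d$, is recursively a combination of two components of $\prod_{j=d}^{c+1} M_j\,\Phi$ with coefficients $1-\lambda$ and $\lambda$. It can thus be inferred that a component of $\prod_{j=d}^{c} M_j\,\Phi$ is a combination of $2^{d-c+1}$ components of $\Phi$ with coefficients as follows.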
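$$\begin{array}{l|l|c}
\text{Operators} & \text{Combinatorial coefficients } a_i, \text{ where } \bar{\lambda} = 1-\lambda & \sum a_i^2 \\ \hline
M_d & \bar{\lambda},\ \lambda & b \\
M_d M_{d-1} & \bar{\lambda}^2,\ \bar{\lambda}\lambda,\ \lambda\bar{\lambda},\ \lambda^2 & b^2 \\
M_d M_{d-1} M_{d-2} & \bar{\lambda}^3,\ \bar{\lambda}^2\lambda,\ \ldots,\ \lambda^2\bar{\lambda},\ \lambda^3 & b^3 \\
\vdots & \vdots & \vdots \\
\prod_{j=d}^{1} M_j & \bar{\lambda}^d,\ \bar{\lambda}^{d-1}\lambda,\ \ldots,\ \lambda^{d-1}\bar{\lambda},\ \lambda^d & b^d
\end{array}$$

Consequently, from Lemma 4.1, it follows that $E(\|\prod_{j=d}^{c} M_j\,\Phi - \bar{\Phi}\|^2) = Nb^{d-c+1}\sigma^2 - \sigma^2$.

We proceed by induction on $i$. Assume $E(\|M^i\Psi^{t+1-i} - \bar{\Psi}^{t+1-i}\|^2) = b^{id}sN\sigma^2 - d\sigma^2$. We then consider $E(\|M^{i+1}\Psi^{t-i} - \bar{\Psi}^{t-i}\|^2)$. Since $\phi^t_i$ is assumed to be independent of time $t$, $\Psi^t$ is independent of $t$ as well. Then, $E(\|M^{i+1}\Psi^{t-i} - \bar{\Psi}^{t-i}\|^2) = E(\|(\prod_{j=d}^{1} M_j)M^i\Psi^{t-i+1} - \bar{\Psi}^{t-i+1}\|^2)$. From the argument in the preceding paragraph, it is known that a sequence of suffix operators $\prod_{j=d}^{1} M_j$ redistributes the workloads of $M^i\Psi^{t-i+1}$ in such a way that each of its components becomes a combination of its $2^d$ components with coefficients as in the last row of the table. Consequently, $E(\|M^{i+1}\Psi^{t-i} - \bar{\Psi}^{t-i}\|^2) = b^d b^{id}sN\sigma^2 - d\sigma^2 = b^{id+d}sN\sigma^2 - d\sigma^2$, which concludes the induction and proves the second part of the lemma.

From the lemma, it is evident that $E[\nu^t_{de}]$ is minimized at $\lambda = 1/2$ over all possible choices of the dimension-exchange parameter. Thus, we have

$$E[\nu^t_{ade}] \le E[\nu^t_{ode}]. \qquad (31)$$

We further compare $E[\nu^t_{ade}]$ with $E[\nu^t_{adf}]$ in both one-port and all-port communication models. Notice that Lemma 5.2 holds under the assumption that the workload generation/consumption ratios $\phi^t_i$ in each pairwise balancing step of a round of dimension-exchange operation have the same statistical characteristics as those in a diffusion operation. It is therefore fair to compare $E[\nu^t_{df}]$ of Eq. (27) with $E[\nu^t_{de}]$ of Eq. (29). Consider the all-port communication model. Substituting $\alpha = 1/(d+1)$ for $\alpha$ in $E[\nu^t_{df}]$ and $\lambda = 1/2$ for $\lambda$ in $E[\nu^t_{de}]$, we obtain

$$E[\nu^t_{ade}] = \left(2 - \frac{1}{2^{d-1}}\right)\frac{1 - 1/2^{td+d}}{1 - 1/2^d}N\sigma^2 - (t+1)d\sigma^2 + \left(2 - \frac{1}{2^{d-1}}\right)\frac{N}{2^{td+d}}\sigma_0^2 - \sigma_0^2,$$

$$E[\nu^t_{adf}] = \frac{d+1}{d}\left[1 - \left(\frac{1}{d+1}\right)^{t+1}\right]N\sigma^2 - (t+1)\sigma^2 + \frac{N}{(d+1)^{t+1}}\sigma_0^2 - \sigma_0^2.$$

It can be easily verified that the coefficient of $\sigma_0^2$ in $E[\nu^t_{ade}]$ is smaller than that in $E[\nu^t_{adf}]$, and that the coefficient of $\sigma^2$ in $E[\nu^t_{ade}]$ is smaller than that in $E[\nu^t_{adf}]$ when $t \ge N/d$. Hence, for $t \ge N/d$,

$$E[\nu^t_{ade}] \le E[\nu^t_{adf}].$$

Since $N \gg d$ in the mesh and the torus, the above relationship holds for any time instant of interest in practice.

In the case of the one-port communication model, the workload generated/consumed by a processor in a single diffusion operation is expected to be $d\mu$ with variance $d\sigma^2$. Then, $E[\nu^t_{df}]$ of Eq. (27) becomes

$$\left(a^{t+1}\sigma_0^2 + \frac{1-a^{t+1}}{1-a}d\sigma^2\right)N - (t+1)d\sigma^2 - \sigma_0^2.$$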
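For reference, the constants that enter the ADF and ADE closed forms can be worked out directly from the definitions $a = (1-d\alpha)^2 + d\alpha^2$, $b = (1-\lambda)^2 + \lambda^2$ and $s = 1 + b + \cdots + b^{d-1}$. The short derivation below is an added illustration that uses only these definitions:

$$\begin{aligned}
\frac{\partial a}{\partial \alpha} &= -2d(1-d\alpha) + 2d\alpha = 0 \;\Longrightarrow\; \alpha = \frac{1}{d+1}, &\qquad a\Big|_{\alpha = 1/(d+1)} &= \frac{1}{(d+1)^2} + \frac{d}{(d+1)^2} = \frac{1}{d+1}, \\
\frac{\partial b}{\partial \lambda} &= -2(1-\lambda) + 2\lambda = 0 \;\Longrightarrow\; \lambda = \frac{1}{2}, &\qquad b\Big|_{\lambda = 1/2} &= \frac{1}{2}, \\
s\Big|_{b = 1/2} &= \sum_{c=0}^{d-1} 2^{-c} = 2 - \frac{1}{2^{d-1}}, &\qquad \frac{s}{1-b^d}\Big|_{b = 1/2} &= \frac{2 - 2^{1-d}}{1 - 2^{-d}} = 2, \\
\frac{1}{1-a}\Big|_{a = 1/(d+1)} &= \frac{d+1}{d}.
\end{aligned}$$

Hence, as $t$ grows, the factor multiplying $N\sigma^2$ tends to $2$ for the ADE algorithm in either communication model, while for the ADF algorithm it tends to $(d+1)/d$ in the all-port form and to $d \cdot (d+1)/d = d+1$ in the one-port form, where $\sigma^2$ is replaced by $d\sigma^2$. Since $2 \le d+1$ for every $d \ge 1$, the one-port ADF factor is never smaller than the ADE factor.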
Clearly, $E[\nu^t_{ade}] \le E[\nu^t_{adf}]$ at any time $t$. Conclusively, we obtain the following theorem.

Theorem 5.2 Suppose processors are running synchronous dimension-exchange and diffusion load balancing processes under Assumption 2.1. Then, $E[\nu^t_{ade}] \le E[\nu^t_{ode}]$, $E[\nu^t_{adf}] \le E[\nu^t_{odf}]$, and $E[\nu^t_{ade}] \le E[\nu^t_{adf}]$ in both one-port and all-port communication models.

6 Numerical experiments

In the preceding two sections, we explored a number of relationships between the dimension-exchange and the diffusion methods with respect to their efficiencies and balancing qualities. In order to obtain an idea of the magnitude of their differences, we conducted a statistical simulation of these load balancing algorithms on various topologies and sizes of communication networks and using synthetic workload distributions. The experimental results also serve to verify the theoretical results.

The experiment includes three parts. They are a simulation of synchronous load balancing in a static workload situation, a simulation of asynchronous load balancing in the dynamic situation, and a simulation of synchronous load balancing in the dynamic situation. In each simulation, the initial workload distribution $W$ is assumed to be a random vector, each element $w$ of which is drawn independently from an identical uniform distribution in $[0, 1000]$. Each data point obtained in the experiment is the average of 20 runs, using different random initial workload distributions and different workload generation ratios. We also assume that the underlying system implements the all-port communication model so that a dimension-exchange balancing operation takes $2n$ diffusion operations in an n-D mesh or torus. A diffusion operation is taken as a basic time step in a load balancing process.
In the simulation of static synchronous load balancing processes, we measure the number of communication steps, denoted by $T$, necessary for arriving at a globally balanced state. In the simulation, we define the global balanced state to be the state in which the system workload variance is less than or equal to one. Figure 4 and Figure 5 plot the simulation results from different load balancing algorithms executed in the ring of sizes ($N$) varying from 2 to 128 nodes and in the 2-D mesh of sizes varying from 2 × 2 to 32 × 32, respectively. These two figures clearly indicate that the dimension-exchange method outperforms the diffusion method even in the all-port communication model. In particular, we see that the ODE algorithm does accelerate the dimension-exchange load balancing process significantly. In the ring of 64 nodes, for example, $T_{ode} = 98$ with the ODE algorithm while $T_{ade} \simeq T_{odf} = 1305$ and $T_{adf} = 1684$ with the others. Its improvement over the ADE algorithm reaches as high as 92.5%. In Figure 5, we also see that the number of communication steps $T$ in a 2-D mesh is dependent only on the size of its larger dimension and is insensitive to the size of its smaller dimension. This observation was proved to be true in both the mesh and the torus in [9]. Thus, an ODE load balancing process in a 64-ary 2-cube only requires about 196 communication steps for arriving at a global balanced state. It really puts forth the ODE algorithm as a practical method for dynamic global balancing in real multicomputers.
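The measured step counts can be set against the asymptotic estimate obtained by combining the convergence factors of Table 1 with Eq. (25). The sketch below is an added illustration, not the simulator used for the experiments. The function names are hypothetical, the initial variance `nu0` is an assumed value (here $N$ times the variance of a uniform distribution on $[0, 1000]$), and an all-port operation is assumed to cost one communication step for diffusion and $2n$ steps for dimension exchange.

```python
import math


def convergence_factor(algorithm, topology, n, k):
    """Convergence factors of Table 1 for an n-D mesh or torus whose largest
    dimension has k nodes."""
    x = math.pi / k if topology == "mesh" else 2 * math.pi / k
    if algorithm == "ADE":
        return math.cos(x) ** 2
    if algorithm == "ODE":
        return (1 - math.sin(x)) / (1 + math.sin(x))
    if algorithm == "ADF":
        return (2 * n - 1 + 2 * math.cos(x)) / (2 * n + 1)
    if algorithm == "ODF":
        if topology == "mesh":
            return (n - 1 + math.cos(x)) / n
        return (2 * n - 1 + math.cos(x)) / (2 * n + 1 - math.cos(x))
    raise ValueError("unknown algorithm: " + algorithm)


def communication_steps(algorithm, topology, n, k, nu0, eps=1.0, all_port=True):
    """Iterations from Eq. (25) times the communication steps per operation."""
    gamma = convergence_factor(algorithm, topology, n, k)
    iterations = (math.log(eps) - math.log(nu0)) / (2 * math.log(gamma))
    one_step = all_port and algorithm in ("ADF", "ODF")
    return iterations * (1 if one_step else 2 * n)


# Ring of 64 nodes = 1-D torus with k = 64; nu0 is an assumed initial variance.
nu0 = 64 * 1000.0 ** 2 / 12
estimate = {alg: communication_steps(alg, "torus", 1, 64, nu0)
            for alg in ("ADE", "ODE", "ADF", "ODF")}
for alg, steps in estimate.items():
    print(alg, round(steps))
```

Under these assumptions the estimate reproduces the ordering reported for the ring of 64 nodes, with ODE far ahead and ADE nearly equal to ODF. The absolute counts depend on the assumed `nu0` and on the asymptotic approximation, so they are not expected to match the measured values.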
Figure 4: The number of communication steps necessary for reaching a globally balanced state during a static synchronous load balancing process in the ring of sizes varying from 4 to 128 nodes (curves: ADE, ODE, ADF, ODF; horizontal axis: log2(N); vertical axis: log2(T))

Figure 5: The number of communication steps necessary for reaching a globally balanced state during a static synchronous load balancing process in the 2-D mesh of sizes varying from 2 × 2 to 32 × 32 (curves: ADE, ODE, ADF, ODF; horizontal axis: mesh size; vertical axis: log2(T))
Furthermore, in order to examine the effects of a single load balancing operation, we plot in Figure 6 the system workload variance in the first 100 steps of various load balancing processes in the ring of 32 nodes. The figure illustrates that the ODE algorithm pulls down the system workload variance sharply although its initial reduction ratio seems to be not as satisfactory as that of the ADE algorithm. The ODF and the ADF algorithms have the same relationship in their reduction ratios. This says that both the ODE and the ODF algorithms may not outperform their local average balancing counterparts in the short term.

Figure 6: Reduction of the workload variance during a static synchronous load balancing process in the ring of 32 nodes (curves: ADE, ODE, ADF, ODF)

In the dynamic situation, we assume that the expected workload generation ratio of a processor at each time step is 100 with the variance of 30 and the consumption ratio is a constant 100. In the simulation of asynchronous load balancing, we use a simple invocation policy such that once a processor's workload drops or rises beyond a pair of preset bounds, $w_{underload}$ and $w_{overload}$, the processor would activate a load balancing operation. Evidently, the pair of thresholds determines the degree of asynchronism of a load balancing process. Suppose $w_{underload}$ and $w_{overload}$ are symmetric with respect to the expected workload of a processor $E(w)$. We then measure the range between $w_{underload}$ and $w_{overload}$ by an index $range$. Since $E(w) = 500$ at any time, it follows that $w_{underload} = 500 - range/2$ and $w_{overload} = 500 + range/2$. Figures 7 and 8 plot the system workload variances resulting from different load balancing algorithms in a ring of 64 nodes and a mesh of size 16 × 16 for the case of $range = 600$.

From these two figures, it can be seen that the ADE algorithm reduces the initial system workload variance more rapidly than the diffusion method and keeps it bounded at a much lower level. It can also be observed that both the ODE and the ODF algorithms, the optimally tuned algorithms for global synchronous load balancing, do not gain significant benefits in asynchronous implementations.
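The invocation policy described above can be written down in a few lines. The following is a minimal sketch with hypothetical function names. It assumes, as in the text, that the two thresholds are symmetric about the expected workload, and it treats a workload exactly on a threshold as not yet beyond it, which is an assumption since the text does not say how ties are handled.

```python
def thresholds(expected_load, load_range):
    """Return (w_underload, w_overload), symmetric about the expected workload."""
    half = load_range / 2
    return expected_load - half, expected_load + half


def should_invoke(workload, expected_load, load_range):
    """True when the workload has dropped below w_underload or risen above
    w_overload, i.e. when the processor activates a balancing operation."""
    w_underload, w_overload = thresholds(expected_load, load_range)
    return workload < w_underload or workload > w_overload


# Setting of the asynchronous experiments: E(w) = 500 and range = 600.
print(thresholds(500, 600))
print([should_invoke(w, 500, 600) for w in (150, 500, 820)])
```

A smaller `load_range` makes processors invoke balancing more often; with a range of zero every processor whose workload differs from the expected value invokes it at once.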
Figure 7: Change of the workload variance in the first 200 steps of a dynamic asynchronous load balancing process in the ring of size 64 (curves: ADE, ODE, ADF, ODF)

Figure 8: Change of the workload variance in the first 200 steps of a dynamic asynchronous load balancing process in the mesh of size 16 × 16 (curves: ADE, ODE, ADF, ODF)
Figure 9: Change of the workload variances during a dynamic synchronous load balancing process in a 16 × 16 torus (curves: ADE, ODE, ADF, ODF)

Figure 10: Change of the system workload variance during a dynamic synchronous load balancing process in a 16 × 16 mesh (curves: ADE, ODE, ADF, ODF)

Synchronous implementations of load balancing are special cases of asynchronous implementations in which $range$ is set to zero so that all processors participate in load balancing simultaneously. The simulation of synchronous implementations in the static situation was conducted in the first experiment, and its results in a ring of 32 nodes are reported in Figure 6. Figures 9 and 10 present the simulation results of dynamic synchronous implementations in the 16 × 16 torus and the 16 × 16 mesh. In agreement with the findings from Figure 6, Figures 9 and 10 show that the superiority of the dimension-exchange method over the diffusion method
holdsunderthesynchronousinvocationpoliciesaswell,andthattheadealgorithmhasan 7advantageoverthediusionmethodinbothshortandlongterms. algorithms,thedimension-exchange(de)andthediusion(df)methods,withrespectto Inthispaper,wemadeacomparisonbetweentwoclassesofnearestneighborloadbalancing Conclusions theireciencyindrivinganyinitialworkloaddistributiontoauniformdistributionandtheir abilityincontrollingthegrowthofthevarianceamongtheprocessors'workloads.wefocused ontheirfourinstances,theade,theode,theadfandtheodf,whicharethemost synchronous/asynchronousinvocationpoliciesandstatic/dynamicrandomworkloadbehaviors. commonversionsinpractice.thecomparisonwasmadecomprehensivelyinbothone-port andall-portcommunicationmodelswithconsiderationofvariousimplementationstrategies: thataisapproximatelyequivalenttobinperformance.then,ourcomparativeresultscanbe summarizedasintables3and4. Let\ab"denotetherelationshipthataoutperformsb,and\ab"therelationship andn-dtori. Table3:Summaryofcomparativeresultsintheone-portcommunicationmodelinn-Dmeshes Synchronous ODEADEODFADF Staticloadbalancing Dynamicloadbalancing ADEfADF;ODFg ADEODE ADFODF ADEADF Asynchronous ADFODFincasen= ODFADFincasen2 sameasleft Table4:Summaryofcomparativeresultsinall-portcommunicationmodelinn-Dmeshesand n-dtori. Synchronous ODEADEODFADF Staticloadbalancing Dynamicloadbalancing fadf;odfgadeode ADEODE ADFODF ADEADF Asynchronous ADFODFincasen= ODFADFincasen2 sameasleft 29
Specifically, we showed that the dimension-exchange method outperforms the diffusion method in the one-port communication model. In particular, the ODE algorithm lends itself best to synchronous implementation in the static situation. We also revealed the superiority of the dimension-exchange method in synchronous load balancing even in the all-port communication model. The strength of the diffusion method is in asynchronous implementation in the all-port communication model. The ODF algorithm performs best in that case.

The comparative study not only provides an insight into nearest neighbor load balancing algorithms, but also offers practical guidelines to system developers in designing load balancing architectures for various parallel computational paradigms. We applied both the diffusion and the dimension-exchange methods in distributed branch-and-bound computations, and partly verified our comparative results in both static and dynamic asynchronous implementations on the platforms of Parsytec GCPP (PowerPC-based) and Parsytec GCel (transputer-based) multicomputers [17]. We also evaluated their synchronous performances in real applications in periodic re-mapping of data parallel computations in [19].

Acknowledgments

This work is supported in part by NSF MIP-9309489 and the DFG-Forschergruppe "Effiziente Nutzung massivparalleler Systems". We are grateful to H. L. Xie for his careful proofreading and the anonymous referees for their valuable comments.

References

[1] I. Ahmad, A. Ghafoor, and K. Mehrotra. Performance prediction for distributed load balancing on multicomputer systems. In Proceedings of Supercomputing'91, pages 830-839 (1991).
[2] L. V. Kale. Comparing the performance of two dynamic load distribution methods. In Proceedings of International Conference on Parallel Processing, pages 8-12 (1988).
[3] V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing, 22(1):60-79 (1994).
[4] M. Willebeek-LeMair and A. P. Reeves. Strategies for dynamic load balancing on highly parallel computers. IEEE Transactions on Parallel and Distributed Systems, 4(9):979-993 (1993).
[5] C.-Z. Xu and F. C. M. Lau. Analysis of the generalized dimension exchange method for dynamic load balancing. Journal of Parallel and Distributed Computing, 16(4):385-393 (1992).
[6] C.-Z. Xu and F. C. M. Lau. Iterative dynamic load balancing in multicomputers. Journal of Operational Research Society, 45(7):786-796 (1994).
[7] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed computation: Numerical methods. Prentice-Hall Inc. (1989).
[8] G. Cybenko. Load balancing for distributed memory multiprocessors. Journal of Parallel and Distributed Computing, 7:279-301 (1989).
[9] C.-Z. Xu and F. C. M. Lau. The generalized dimension exchange method for load balancing in k-ary n-cubes and variants. Journal of Parallel and Distributed Computing, 24(1):72-85 (1995).
[10] S. L. Johnsson and C.-T. Ho. Spanning graphs for optimum broadcasting and personalized communication in hypercubes. IEEE Transactions on Computers, 38(9):1249-1268 (1989).
[11] D. W. Krumme, G. Cybenko, and K. N. Venkataraman. Gossiping in minimal time. SIAM Journal on Computing, 21(1):111-139 (1992).
[12] J. B. Boillat. Load balancing and Poisson equation in a graph. Concurrency: Practice and Experience, 2(4):289-313 (1990).
[13] J.-W. Hong, X.-N. Tan, and M. Chen. From local to global: an analysis of nearest neighbor balancing on hypercube. In Proceedings of ACM-SIGMETRICS, pages 73-82 (1988).
[14] X.-S. Qian and Q. Yang. Load balancing on generalized hypercube and mesh multiprocessors with LAL. In Proceedings of 11th International Conference on Distributed Computing Systems, pages 402-409 (1991).
[15] C.-Z. Xu and F. C. M. Lau. Optimal parameters for load balancing with the diffusion method in mesh networks. Parallel Processing Letters, 4(2):139-147 (1994).
[16] S. H. Hosseini, B. Litow, M. Malkawi, J. Mcpherson, and K. Vairavan. Analysis of a graph coloring based distributed load balancing algorithm. Journal of Parallel and Distributed Computing, 10:160-166 (1990).
[17] C.-Z. Xu, S. Tschoeke, and B. Monien. Performance evaluation of load distribution strategies in parallel branch-and-bound computations. Technical report, Dept. of Electrical and Computer Engg., Wayne State University (1995).
[18] R. Diekmann, D. Meyer, and B. Monien. Parallel decomposition of unstructured FEM-meshes. Technical report, Dept. of Mathematics and Computer Science, University of Paderborn, Germany (1995).
[19] C.-Z. Xu and F. C. M. Lau. Decentralized remapping of data-parallel computations with the generalized dimension exchange method. In Proceedings of 1994 Scalable High Performance Computing Conference, pages 414-421. IEEE Computer Society Press (1994).
[20] J. Song. A partially asynchronous and iterative algorithm for distributed load balancing. Parallel Computing, 20(6):853-868 (1994).
[21] R. Luling and B. Monien. A dynamic distributed load balancing algorithm with provable good performance. In Proceedings of 5th ACM Symposium on Parallel Algorithms and Architectures, pages 164-172 (1993).
[22] W. J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 39(6):775-785 (1990).
[23] L. M. Ni and P. K. McKinley. A survey of wormhole routing techniques in direct networks. IEEE Computer, 26:62-76 (1993).
[24] G. Ramanathan and J. Oren. Survey of commercial parallel machines. ACM Computer Architecture News, 21(3):13-33 (1993).
[25] D. M. Nicol and J. H. Saltz. Dynamic remapping of parallel computations with varying resource demands. IEEE Transactions on Computers, 37(9):1073-1087 (1988).
[26] S. Ranka, Y. Won, and S. Sahni. Programming a hypercube multicomputer. IEEE Software, 5:69-77 (1988).
[27] Y. Shih and J. Fier. Hypercube systems and key applications. In K. Hwang and D. Degroot, editors, Parallel Processing for Supercomputers and Artificial Intelligence, pages 203-243. McGraw-Hill Publishing Co. (1989).
[28] S. Fiorini and R. J. Wilson. Edge-coloring of graphs. In L. W. Beineke and R. J. Wilson, editors, Selected Topics in Graph Theory, pages 103-125. Academic Press (1978).
[29] B. Ghosh and S. Muthukrishnan. Dynamic load balancing in distributed networks by random matchings. In Proceedings of 6th ACM Symposium on Parallel Algorithms and Architectures (1994).
[30] A. N. Choudhary, B. Narahari, and R. Krishnamurti. An efficient heuristic scheme for dynamic remapping of parallel computations. Parallel Computing, 19:621-632 (1993).
[31] J. De Keyser and D. Roose. Load balancing data parallel programs on distributed memory computers. Parallel Computing, 19:1199-1219 (1993).