To appear in J. of Parallel and Distributed Processing.

The Generalized Dimension Exchange Method for Load Balancing in k-ary n-cubes and Variants




Francis C. M. Lau, Department of Computer Science, The University of Hong Kong, Hong Kong
Cheng-Zhong Xu, Department of Computer Science, Shantou University, P.R. China

Abstract

The Generalized Dimension Exchange (GDE) method is a fully distributed load balancing method that operates in a relaxation fashion for multicomputers with a direct communication network. It is parameterized by an exchange parameter $\lambda$ that governs the splitting of load between a pair of directly connected processors during load balancing. An optimal $\lambda$ would lead to the fastest convergence of the balancing process. Previous work has resulted in the optimal $\lambda$ for the binary n-cubes. In this paper, we derive the optimal $\lambda$'s for the k-ary n-cube network and its variants: the ring, the torus, the chain, and the mesh. We establish the relationships between the optimal convergence rates of the method when applied to these structures, and conclude that the GDE method favors high dimensional k-ary n-cubes. We also reveal the superiority of the GDE method to another relaxation-based method, the diffusion method. We further show through statistical simulations that the optimal $\lambda$'s do speed up the GDE balancing procedure significantly. Because of its simplicity, the method is readily implementable. We report on the implementation of the method in two data-parallel computations in which the improvement in performance due to GDE balancing is substantial.

Keywords. dimension exchange method, load balancing, k-ary n-cube networks, message passing multicomputers, circulant matrices.

Running head. Generalized Dimension Exchange Method

Correspondence: Dr F. C. M. Lau, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong. Fax: (852) 559 8447, email: fcmlau@csd.hku.hk

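List of Symbols

$G = (V, E)$: system graph
$G_\kappa = (V, E_\kappa)$: $\kappa$-color graph; $\kappa$ is the chromatic index of the graph $G$
$\Delta(G)$: degree of the graph $G$
$D(\alpha)$: diffusion matrix
$M(\lambda)$: generalized dimension exchange matrix
$P_{ij}$: set of distinct color paths from vertex $i$ to vertex $j$, with typical element $p_{ij}$
$W^t$: workload distribution at time $t$
$w_i^t$: workload of node $p_i$ at time $t$
$\alpha_{opt}$: optimal diffusion parameter
$\lambda_{opt}$: optimal exchange parameter
$R_\infty(M(\lambda))$: asymptotic convergence rate of the sequence $\{M^t\}$
$\gamma(M)$: convergence factor of the GDE method
$\gamma(D)$: convergence factor of the diffusion method
$MC_k$: GDE matrix of the color chain of size $k$
$MR_k$: GDE matrix of the color ring of size $k$
$MM_{k_1,k_2,\ldots,k_n}$: GDE matrix of the color mesh of size $k_1 \times k_2 \times \cdots \times k_n$
$MT_{k_1,k_2,\ldots,k_n}$: GDE matrix of the color torus of size $k_1 \times k_2 \times \cdots \times k_n$
$DC_k$: diffusion matrix of the chain of size $k$
$DR_k$: diffusion matrix of the ring of size $k$
$DM_{k_1,k_2,\ldots,k_n}$: diffusion matrix of the color mesh of size $k_1 \times k_2 \times \cdots \times k_n$
$DT_{k_1,k_2,\ldots,k_n}$: diffusion matrix of the color torus of size $k_1 \times k_2 \times \cdots \times k_n$
(symbol not recoverable): improvement due to remapping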

1 Introduction

We consider the problem of dynamic load balancing in multicomputers. Multicomputers are a class of parallel machines that are composed of many autonomous processors interconnected by a communication network [1]. The processors do not share any memory and they communicate among themselves via message passing. From time to time, the workload that is spread across the processors is found to be in an unbalanced state; load balancing is then initiated to balance the workload. Dimension exchange (DE) is one of the few distributed load balancing methods that operate in a relaxation fashion for point-to-point networks (a detailed survey can be found in [20]). With the DE method, an instance of load balancing is carried out as a sequence of "sweeps". During each sweep, a processor successively compares its workload with that of each of its nearest neighbors; following each such comparison, an exchange operation is executed to equalize the workload between this node and the neighboring node concerned. Alternatively, instead of exchanging workloads on the fly, the load balancing procedure can be divided into two phases: in the first phase, DE is employed to work out the revised load "indices" that correspond to a balanced state; then in the second phase, the actual load migrations take place. This makes the method more applicable to situations in which the workload involves large amounts of data.

The DE method was initially studied intensively in hypercube-structured multicomputers [14, 15, 5]. Since the set of neighbors of a processor corresponds exactly to the dimensions of the hypercube, a sweep of the iterative process is equal to going through all the dimensions once. Cybenko proved that regardless of the order in which these dimensions are considered in a sweep, this simple load balancing method yields a uniform distribution from any initial workload distribution after one sweep [5]. He also revealed the superiority of the DE method to another relaxation-based method, the diffusion method [3, 5], when applied to hypercubes. This theoretical result was supported in part by the experiment carried out by Willebeek-LeMair and Reeves [18].
The DE method is not limited to hypercube structures. Hosseini et al. analyzed the method as applied to arbitrary structures based on edge-coloring of undirected graphs [8]. With edge-coloring, the edges of a given graph are colored with some minimum number of colors such that no two adjoining edges are of the same color. A "dimension" is then defined to be the collection of all edges of the same color. Obviously, an n-dimensional hypercube can be colored with a minimum of n colors. During each iteration sweep, all dimensions (colors) are considered in turn. Since no two adjoining edges have the same color, each processor needs to deal with only one neighbor at a time during a sweep. Clearly, for an arbitrary structure, the DE method can no longer yield a uniform workload distribution in a single sweep. Nonetheless, Hosseini et al. showed that given any arbitrary structure, the DE method converges eventually to a uniform distribution [8].

The DE method is characterized by "equal splitting" of workload between a pair of neighboring processors at every comparison. Our analysis in [21] showed that equal splitting is optimal in hypercube structures but not necessarily so in other structures. In that paper, we generalized the DE method by adding an exchange parameter to govern the amount of workload (instead of always half) exchanged at every step. This method is called the Generalized Dimension Exchange (GDE) method. We modeled this generalized DE method using a matrix iterative approach and derived the necessary and sufficient condition for its convergence.

In this paper, we continue our analysis of the GDE method as applied to the family of k-ary n-cubes, which includes the ring, the chain, the torus, and the mesh. A k-ary n-cube is a structure with n dimensions and k nodes in each dimension [6, 10]. The ring and the hypercube are special cases of the k-ary n-cube. A ring of k nodes is a k-ary 1-cube, and an n-dimensional hypercube is a 2-ary n-cube. The n-dimensional torus is a generalization of the k-ary n-cube which allows different numbers of nodes in different dimensions. If we take a ring and a torus and strip them of all the end-round connections, we get a chain and a mesh respectively. We limit our scope to these structures because they are the most popular choices of topologies in commercial parallel computers [10, 13, 16]. Examples include the hypercube-structured Intel iPSC/860 and NCUBE/2, and the mesh-structured Intel Paragon, Intel Touchstone Delta, iWarp, and Ametek 2010.

The main contribution of this paper is the derivation of the optimal exchange parameters in closed form for the family of k-ary n-cubes. The optimal solutions for these structures are of considerable value because practical situations call for an exchange parameter that leads to the fastest convergence of the balancing procedure. A preview of these optimal parameters without proofs has been included in our previous paper [21]. A subset of the proofs, which are based on circulant matrix theory [7], will be presented in this paper. We capitalize on the modeling power of circulant matrices, which is most evident in cases in which the structures concerned can be recursively defined.

The other important contributions of this paper include the establishment of the relationships between the convergence rates of these structures and a proof of the superiority of the GDE method to the diffusion method. The latter is with respect to the convergence rates of the two methods when applied to the family of k-ary n-cubes. The matrix analysis reveals the asymptotic convergence rates but sheds little light on the exact number of sweeps needed for balancing. Therefore, we use statistical simulations to obtain the actual number of sweeps required by GDE balancing in these structures. This number turns out to be encouragingly small in all the cases we tried when using the optimal parameters we derived. This adds a lot of weight to the practicality of the GDE method. In fact, the method has been employed in the implementations of two realistic data-parallel computations, and the improvement over the versions without load balancing is substantial. We give a brief report on these implementations.
The rest of the paper is organized as follows. In Section 2, we review the GDE method and its convergence properties for the general case. In Section 3, we analyze the GDE method for the k-ary n-cube and its variants, and derive their optimal exchange parameters. In Section 4, we make a comparison between the GDE method and the diffusion method. Section 5 reports on the results of a statistical simulation concerning the number of iteration sweeps as well as findings from practical implementations. We conclude in Section 6 with a summary of the results and discussion of further work.

2 The GDE method

The model of the underlying system and computation assumed in this study is similar to that in [3, 5, 8, 21]. Specifically, the multicomputer we consider consists of a finite set of homogeneous processors interconnected as a point-to-point network. The communication links are bi-directional and the processors interact synchronously with one another. We represent such a system by a simple connected graph $G = (V, E)$, where $V$ is a set of processors labeled 1 through $N$, and $E \subseteq V \times V$ is a set of edges. Every edge $(i, j) \in E$ corresponds to the communication link between processors $i$ and $j$. The underlying parallel program is assumed to comprise a large number of independent processes which are the basic units of workload. One or more processes may be running in a processor at any time. The total workload is assumed fixed, i.e., no processes are created or killed during the execution of the load balancing procedure. We quantify the workload distribution by a vector $W = (w_1, w_2, \ldots, w_N)^T$, where $w_i$ denotes the workload of processor $i$ in terms of the number of residing processes. We assume the number of processes is large so that the workload of a node is an infinitely divisible real quantity. It is not difficult to see that without this assumption (resulting in the "integer version" which will be discussed in Section 5) our results still hold. The load balancing task is to redistribute the system workload such that each node ends up with the same $\bar{w} = \sum_i w_i / N$, $i = 1, 2, \ldots, N$.

Note that the load balancing problem resembles in certain ways another distributed decision problem, the agreement problem [2]. This latter problem requires the nodes of a system to reach an agreement on a common scalar value, such as the average, the maximum, or the minimum, based on their own values. The load balancing problem, however, requires the nodes not only to reach an agreement on the average load, but also to adjust their workloads accordingly in an automatic and efficient manner.

The computation model just described might seem restrictive, yet it is applicable to many practical problems. The static workload assumption is valid in cases where the computation is temporarily suspended for load balancing and resumed after load balancing. This is why tuning the efficiency of the load balancing is of top priority. Examples of such cases can be easily found in dynamic remapping of multi-phase data-parallel computations [12, 11]. The assumption of independent processes is also reasonable in this kind of computation because the processing nodes alternate between execution and communication in each phase and the performance is dominated by the execution time. The practical applicability of iterative load balancing will be demonstrated through implementation of data-parallel computations in Section 5.

The GDE method is based on edge-coloring of $G$. The edges of $G$ are supposed to be colored beforehand with the least number of colors ($\kappa$, say), and no two adjoining edges are assigned the same color. We index the colors with integers from 1 to $\kappa$, and represent the $\kappa$-color graph as $G_\kappa = (V, E_\kappa)$, of which $E_\kappa$ is a set of 3-tuples of the form $(i, j, c)$; $(i, j, c) \in E_\kappa$ if and only if $c$ is the chromatic index of the edge $(i, j) \in E$. Figure 1 shows examples of color graphs of rings and chains. The numbers in parentheses are the assigned chromatic indices.

Figure 1: Examples of colored graphs

With the GDE method, a processor exchanges load with each of its neighbors in turn according to the order in which chromatic indices are considered in each sweep of the iterative process. A sweep corresponds to going through all the chromatic indices once; this is why we need to use the least number of colors. For processor $i$, the exchange of workload with a neighbor $j$ is executed as

\[ w_i = (1 - \lambda)\, w_i + \lambda\, w_j \]

where $w_i$ and $w_j$ are the current workloads of processors $i$ and $j$ respectively, and $\lambda$ is the exchange parameter chosen beforehand for the given network. Note that when $\lambda = 1/2$, the GDE method is reduced to the DE method in [5, 8, 15].

For a $\kappa$-color graph, a sweep of the GDE algorithm comprises $\kappa$ steps which cover all the neighbors of every node for workload exchange. Let $t$ be the sweep index, $t = 0, 1, 2, 3, \ldots$, and $w_i^t$ ($1 \le i \le N$) be the local workload of processor $i$ at sweep $t$. Then the overall workload distribution at sweep $t$ is denoted by the vector $W^t = (w_1^t, w_2^t, \ldots, w_N^t)^T$. Suppose $W^0$ is the initial workload distribution. Then the change of the workload distribution in the system at sweep $t$ can be modeled by the equation

\[ W^{t+1} = M(\lambda)\, W^t \]

where $M(\lambda)$ is called the GDE matrix of the $\kappa$-color graph $G_\kappa$, and $M(\lambda) = M_\kappa(\lambda)\, M_{\kappa-1}(\lambda) \cdots M_1(\lambda)$. Each $M_c(\lambda)$ ($1 \le c \le \kappa$) reflects the change of the workload distribution of the system at step $c$ of sweep $t$.

To make dynamic load balancing work, there are two main issues. One is the termination condition; the other is efficiency, i.e., the time needed for the system to arrive at its terminated, load-balanced state. The former concerns the convergence of the sequence $\{M^t(\lambda)\}$, and the latter is reflected by the asymptotic convergence rate, $R_\infty(M(\lambda))$.

Given the fact that $M(\lambda)$ is non-negative and doubly stochastic when $0 \le \lambda \le 1$ and primitive when $0 < \lambda < 1$, it can then be shown that $0 < \lambda < 1$ is a necessary and sufficient condition for the termination of the GDE method from any initial load distribution (Theorem 3.1 of [21]).

Regarding the convergence rate, the eigenvalues of $M(\lambda)$ play a fundamental role. Let $\mu_j(M(\lambda))$ ($1 \le j \le N$) be the eigenvalues of $M(\lambda)$, and $\rho(M(\lambda))$ and $\gamma(M(\lambda))$ be the dominant and subdominant eigenvalues respectively of $M(\lambda)$ in modulus. Since $\rho(M(\lambda))$ is unique and equal to 1, it follows that $R_\infty(M(\lambda)) = -\ln \gamma(M(\lambda))$. $\gamma(M(\lambda))$ is also referred to as the convergence factor of the GDE method in the corresponding $\kappa$-color graph. Thus, the task here is to choose a $\lambda$ so that $\gamma(M(\lambda))$ is as close to 0 as possible, i.e., $R_\infty(M(\lambda))$ is as large as possible.

To find the minimum $\gamma(M(\lambda))$, we have to first construct the GDE matrix $M(\lambda)$. The representation of each element $m_{ij}$ in $M(\lambda)$ is based on the concept of color path. A color path of length $l$ from vertex $i$ to vertex $j$ is a sequence of edges of the form

\[ (i = i_0, i_1, c_1),\ (i_1, i_2, c_2),\ \ldots,\ (i_{l-1}, i_l = j, c_l) \]

where all intermediate vertices $i_s$ ($1 \le s \le l-1$) are distinct and $c_1 > c_2 > \cdots > c_l \ge 1$. It indicates that processor $i$ will receive some workload from processor $j$ along the path in an iteration sweep. Two color paths from $i$ to $j$ in $G_\kappa$ are said to be distinct if their intermediate vertices do not coincide at all. All the distinct color paths from $i$ to $j$ comprise a set $P_{ij}$. Since the computation formulas for the elements of $M(\lambda)$ will be referred to frequently in the remainder of this paper, we reproduce below the lemma that defines them. Examples of GDE matrices can be found in later sections of this paper.

Lemma 2.1 (Lemma 3.2 of [21]) Let $M(\lambda)$ be the GDE matrix of a $\kappa$-color graph $G_\kappa$. If $0 < \lambda < 1$, then for $1 \le i, j \le N$

\[ m_{ij} = \sum_{p \in P_{ij}} (1-\lambda)^{r_p} \lambda^{l_p}, \qquad i \ne j \]
\[ m_{ii} = (1-\lambda)^{\Delta(i)} + \sum_{p \in P_{ii}} (1-\lambda)^{r_p} \lambda^{l_p} \]

where $l_p$ is the length of the color path $p \in P_{ij}$ of the form $(i = i_0, i_1, c_1), (i_1, i_2, c_2), \ldots, (i_{l_p - 1}, i_{l_p} = j, c_{l_p})$; and $r_p = \sum_{s=0}^{l_p} n_s$, where $n_0$ is the number of incident edges of $i$ whose chromatic index is larger than $c_1$; $n_{l_p}$ is the number of incident edges of $j$ whose chromatic index is smaller than $c_{l_p}$; and $n_s$ ($1 \le s \le l_p - 1$) is the number of incident edges of $i_s$ whose chromatic index is larger than $c_{s+1}$ and smaller than $c_s$.

Our objective is to determine the optimal exchange parameter $\lambda_{opt}$ for a given GDE matrix, which would minimize $\gamma(M(\lambda))$ and maximize the convergence rate.
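The sweep described in this section can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions, not the authors' implementation: nodes are labeled 0 to k-1 rather than 1 to k, the function names `ring_colors`, `gde_sweep` and `sweeps_to_balance` are hypothetical, the ring is colored with two alternating colors, and the values 0.5 and 0.7 for the exchange parameter are arbitrary choices for comparison.

```python
def ring_colors(k):
    """Edge colouring of an even ring with nodes 0..k-1 (two colour classes)."""
    if k < 4 or k % 2:
        raise ValueError("this sketch assumes an even ring with k >= 4")
    first = [(i, i + 1) for i in range(0, k, 2)]         # (0,1), (2,3), ...
    second = [(i, (i + 1) % k) for i in range(1, k, 2)]  # (1,2), ..., (k-1,0)
    return [first, second]


def gde_sweep(w, colors, lam):
    """Apply one sweep: every colour in turn, every edge of that colour once."""
    w = list(w)
    for edges in colors:
        for i, j in edges:
            # Both endpoints apply w_i <- (1 - lam) * w_i + lam * w_j,
            # using the loads held before this step.
            w[i], w[j] = ((1 - lam) * w[i] + lam * w[j],
                          (1 - lam) * w[j] + lam * w[i])
    return w


def sweeps_to_balance(w0, colors, lam, tol=1e-6, max_sweeps=10000):
    """Count sweeps until every node is within tol of the average load."""
    avg = sum(w0) / len(w0)
    w = list(w0)
    for t in range(max_sweeps + 1):
        if max(abs(x - avg) for x in w) <= tol:
            return t
        w = gde_sweep(w, colors, lam)
    raise RuntimeError("did not balance within max_sweeps")


if __name__ == "__main__":
    k = 16
    colors = ring_colors(k)
    w0 = [float(k)] + [0.0] * (k - 1)   # all load initially on one node
    for lam in (0.5, 0.7):              # arbitrary illustrative values
        print("lam =", lam, "sweeps =", sweeps_to_balance(w0, colors, lam))
```

Because edges of the same color never share a node, updating them one after another inside a step gives the same result as updating them simultaneously, which is what the synchronous model assumes.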
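For arbitrary networks, this is an open problem in matrix theory [17]. For some networks with a regular topology, however, it is possible to analyze exactly the effect of $\lambda$ on the convergence rate as well as to derive the optimal exchange parameter. In [21], we proved that $\lambda = 1/2$ (equal splitting of workload between a pair of nearest neighbors), which was "built-in" in the DE method, is indeed the optimal choice for certain structures (the hypercube, for example). For other structures, such as the k-ary n-cube, however, $\lambda = 1/2$ is not the optimal choice, as we will show in the next section.

3 Analysis of the k-ary n-cube and variants

We begin with the ring structure, i.e., a k-ary 1-cube, and then generalize it to the n-dimensional $k_1 \times k_2 \times \cdots \times k_n$ torus. The analysis of the torus depends on the analysis of the ring as the former can be treated as an assembly of rings. The main result for the k-ary n-cube follows trivially from the results for the torus.

The modeling tool we use for the analysis is a special kind of matrix called the block circulant matrix. It happens that the GDE matrices of the "even" cases of the above structures (an even number of nodes in every dimension) are block circulant matrices. We concentrate on these even cases for the remainder of this paper, and make the remark here that the results for the even cases should be applicable (approximately) to the non-even cases. We give simulation results and a simple argument in Section 5 to support this. Nevertheless, the analysis of the non-even cases should still be an interesting theoretical problem to tackle.

For the subsequent analysis, we need to make use of the following two lemmas concerning block circulant matrices and direct products of matrices. We omit the proofs, which can be easily derived based on the theory of circulant matrices [7].

Let $A_1, A_2, \ldots, A_m$ be square matrices of order $r$. Then a block circulant matrix is a matrix of the form

\[
(A_1, A_2, \ldots, A_m) =
\begin{pmatrix}
A_1 & A_2 & \cdots & \cdots & A_m \\
A_m & A_1 & \cdots & \cdots & A_{m-1} \\
\vdots & \vdots & & & \vdots \\
A_2 & A_3 & \cdots & \cdots & A_1
\end{pmatrix}
\]

If $r = 1$, a block circulant matrix degenerates to a circulant matrix.

Lemma 3.1 Let matrix $A = (A_1, A_2, \ldots, A_m)$. Then the eigenvalues of the matrix $A$ are those of the matrices $A_1 + \omega^j A_2 + \cdots + \omega^{j(m-1)} A_m$, $j = 0, 1, \ldots, m-1$, where $\omega^j = \cos\frac{2\pi j}{m} + i \sin\frac{2\pi j}{m}$, $i = \sqrt{-1}$. In particular, if matrix $A = (A_1, A_2)$, then the eigenvalues of $A$ are those of $A_1 + A_2$, together with those of $A_1 - A_2$.

Let $A$ and $B$ be square matrices of order $m$ and $n$, respectively. Then the direct product of $A$ and $B$ is a matrix of order $mn$ defined by

\[
A \otimes B =
\begin{pmatrix}
a_{0,0} B & a_{0,1} B & \cdots & a_{0,m-1} B \\
a_{1,0} B & a_{1,1} B & \cdots & a_{1,m-1} B \\
\vdots & \vdots & & \vdots \\
a_{m-1,0} B & a_{m-1,1} B & \cdots & a_{m-1,m-1} B
\end{pmatrix}
\]

Lemma 3.2 Let $A$, $B$ be square matrices of order $m$ and $n$ with eigenvalues $\mu_i(A)$ and $\mu_j(B)$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$. Then the eigenvalues of $A \otimes B$ are $\mu_i(A)\,\mu_j(B)$.

For simplicity of notation, we let

\[
U = \begin{pmatrix} U_1 \\ U_2 \end{pmatrix} = \begin{pmatrix} \lambda(1-\lambda) & (1-\lambda)^2 \\ \lambda^2 & \lambda(1-\lambda) \end{pmatrix}, \quad
V = \begin{pmatrix} V_1 \\ V_2 \end{pmatrix} = \begin{pmatrix} \lambda(1-\lambda) & \lambda^2 \\ (1-\lambda)^2 & \lambda(1-\lambda) \end{pmatrix}, \quad
Q = \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix} = \begin{pmatrix} 1-\lambda & \lambda \\ \lambda & 1-\lambda \end{pmatrix}.
\]

3.1 The ring structure

The ring is a k-ary 1-cube structure whose nodes we label 1 through $k$. An even ring can be colored with 2 colors, as in Figure 1(b). This coloring is unique, i.e., there is only one way of coloring the edges without respect to the permutation of labels. The GDE matrix of an even ring is in block circulant form, as follows.

Lemma 3.3 Let $GR_k$ be a color ring of even order $k$, and $MR_k$ be its GDE matrix. Then $MR_k$ is a matrix of the form

\[
\begin{pmatrix}
V_2 & & & & U_2 \\
U & V & & & \\
 & U & V & & \\
 & & \ddots & \ddots & \\
 & & & U & V \\
V_1 & & & & U_1
\end{pmatrix}_{k \times k}
\tag{3}
\]

Proof. The proof is by induction on the order of $GR_k$. First, it is easy to verify that $MR_4$ is in the form of (3). We need to show that if the GDE matrix $MR_{2m}$ of $GR_{2m}$, $m \ge 2$, is in the form of (3), then the GDE matrix $MR_{2m+2}$ of $GR_{2m+2}$ is necessarily in that form as well. Let us view $GR_{2m+2}$ as an expansion of $GR_{2m}$ with two extra vertices, as illustrated in Figure 2. The vertex labeled $2m$ in $GR_{2m}$ is relabeled $2m+2$, and the newly added vertices are labeled $2m$ and $2m+1$. Then, according to the computation formulas in Lemma 2.1, we obtain the following.

Figure 2: An expansion of the ring $GR_{2m}$ into $GR_{2m+2}$

For $j = 2m+1, 2m+2$,
\[ (MR_{2m+2})_{1,j-2} = 0; \quad (MR_{2m+2})_{1,j} = (MR_{2m})_{1,j-2}; \quad (MR_{2m+2})_{2m+2,j} = (MR_{2m})_{2m,j-2}; \]
for $j = 1, 2$,
\[ (MR_{2m+2})_{2m+2,j} = (MR_{2m})_{2m,j}; \quad (MR_{2m+2})_{2m,j} = 0; \]
for $i = 2m, 2m+1$ and $2m-1 \le j \le 2m+2$,
\[ (MR_{2m+2})_{i,j} = (MR_{2m})_{i-2,j-2}. \]

Hence, $MR_{2m+2}$, and therefore $MR_k$, are in the form of (3). □

As an example, the GDE matrix of the ring of order 8, $MR_8$, is as in Figure 3.

Given this particular structure of the matrix $MR_k$, we can then derive the optimal exchange parameter and determine the effect of the ring order $k$ on the convergence rate.

Theorem 3.1 Let $GR_k$ be a color ring of even order $k$, $MR_k$ be the GDE matrix of $GR_k$, and $k = 2m$. Then the optimal exchange parameter $\lambda_{opt}(MR_k)$ is equal to $\frac{1}{1 + \sin(\pi/m)}$; and for a given $\lambda$, $R_\infty(MR_k) \le R_\infty(MR_{k-2})$.

Proof. Consider the particular form of $MR_k$ as shown in Lemma 3.3. It is easy to see that $MR_k$ can be represented by a block circulant matrix $(A_1, A_2, 0, \ldots, 0, A_m)$, where

\[
A_1 = \begin{pmatrix} (1-\lambda)^2 & \lambda(1-\lambda) \\ \lambda(1-\lambda) & (1-\lambda)^2 \end{pmatrix}, \quad
A_2 = \begin{pmatrix} 0 & 0 \\ \lambda(1-\lambda) & \lambda^2 \end{pmatrix}, \quad
A_m = \begin{pmatrix} \lambda^2 & \lambda(1-\lambda) \\ 0 & 0 \end{pmatrix}.
\]

For $j = 0, 1, \ldots, m-1$,

\[
A_1 + \omega^j A_2 + \omega^{j(m-1)} A_m =
\begin{pmatrix}
(1-\lambda)^2 + \omega^{m-j} \lambda^2 & \lambda(1-\lambda) + \omega^{m-j} \lambda(1-\lambda) \\
\lambda(1-\lambda) + \omega^{j} \lambda(1-\lambda) & (1-\lambda)^2 + \omega^{j} \lambda^2
\end{pmatrix}
\]

because $\omega^{j(m-1)} = \omega^{m-j}$. From Lemma 3.1, the eigenvalues of the matrix $MR_k$ are the roots of the equation

\[ \mu^2 - 2\mu \left( (1-\lambda)^2 + \lambda^2 \cos(2\pi j/m) \right) + (1 - 2\lambda)^2 = 0. \]

That is,

\[ \mu = (1-\lambda)^2 + \lambda^2 \cos\frac{2\pi j}{m} \pm \lambda \sqrt{\left(1 + \cos\frac{2\pi j}{m}\right)\left(\left(1 + \cos\frac{2\pi j}{m}\right)\lambda^2 - 4\lambda + 2\right)}, \]

where $j = 0, 1, \ldots, m-1$.

Clearly, the dominant eigenvalue of $MR_k$ in modulus, $\rho(MR_k)$, is equal to 1, and when $k \ne 4$, the subdominant eigenvalue of $MR_k$ in modulus, $\gamma(MR_k)$, is equal to

\[
\gamma(MR_k) =
\begin{cases}
2\lambda - 1 & \text{if } \dfrac{2 - \sqrt{2(1-e)}}{1+e} \le \lambda < 1 \\[2ex]
(1-\lambda)^2 + \lambda^2 e + \lambda\sqrt{(1+e)\left((1+e)\lambda^2 - 4\lambda + 2\right)} & \text{if } 0 < \lambda \le \dfrac{2 - \sqrt{2(1-e)}}{1+e}
\end{cases}
\]

where $e = \cos(2\pi/m)$. Therefore, for a given ring of order $k = 2m$, $k \ne 4$, its optimal exchange parameter is as follows:

\[ \lambda_{opt}(MR_k) = \frac{2 - \sqrt{2(1 - \cos(2\pi/m))}}{1 + \cos(2\pi/m)} = \frac{1}{1 + \sin(\pi/m)}. \]

When $k = 4$, we have a binary 2-cube (two-dimensional hypercube) for which $\lambda_{opt}(MR_4) = 1/2$ and $\gamma(MR_4) = 0$, which is in agreement with previous results for the hypercube [5]. Moreover, $\gamma(MR_k)$ increases with $m$ for a given $\lambda$. Hence, $R_\infty(MR_k)$ decreases as $m$ increases. That is, for a given $\lambda$, $R_\infty(MR_k) \le R_\infty(MR_{k-2})$. □

The above theorem says that for a given $\lambda$, the more vertices an even ring has, the slower its convergence rate.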
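The closed forms in Theorem 3.1 are easy to evaluate numerically. The following Python sketch simply transcribes the two formulas from the theorem and its proof; the function names `lambda_opt_ring` and `gamma_ring` are hypothetical, and the guard against slightly negative values under the square root is an assumption added to absorb floating-point rounding at the boundary between the two cases.

```python
import math


def lambda_opt_ring(k):
    """Optimal exchange parameter of an even ring of order k = 2m (Theorem 3.1)."""
    if k < 4 or k % 2:
        raise ValueError("this sketch assumes an even ring with k >= 4")
    m = k // 2
    return 1.0 / (1.0 + math.sin(math.pi / m))


def gamma_ring(k, lam):
    """Convergence factor of an even ring of order k != 4 for a given lam,
    transcribed from the piecewise expression in the proof of Theorem 3.1."""
    if k < 6 or k % 2:
        raise ValueError("the piecewise formula is stated for even k != 4")
    m = k // 2
    e = math.cos(2.0 * math.pi / m)
    threshold = (2.0 - math.sqrt(2.0 * (1.0 - e))) / (1.0 + e)
    if lam >= threshold:
        return 2.0 * lam - 1.0
    # max(..., 0.0) only guards against rounding just below the threshold.
    disc = max((1.0 + e) * ((1.0 + e) * lam * lam - 4.0 * lam + 2.0), 0.0)
    return (1.0 - lam) ** 2 + lam * lam * e + lam * math.sqrt(disc)


if __name__ == "__main__":
    for k in (6, 8, 16, 32):
        lam = lambda_opt_ring(k)
        print(k, round(lam, 4), round(gamma_ring(k, lam), 4),
              round(gamma_ring(k, 0.5), 4))
```

Each printed row shows the ring order, the optimal parameter, the convergence factor at that parameter, and the convergence factor of plain DE (exchange parameter 1/2) for comparison.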
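It also gives the formula for the optimal $\lambda$ for any even ring, which can be used in practice to compute the exact optimal value for the exchange parameter for a given ring. For example,

\[ \lambda_{opt}(MR_{16}) = \frac{2}{2 + \sqrt{2 - \sqrt{2}}} \approx 0.723. \]

\[
MR_8 =
\begin{pmatrix}
(1-\lambda)^2 & \lambda(1-\lambda) & 0 & 0 & 0 & 0 & \lambda^2 & \lambda(1-\lambda) \\
\lambda(1-\lambda) & (1-\lambda)^2 & \lambda(1-\lambda) & \lambda^2 & 0 & 0 & 0 & 0 \\
\lambda^2 & \lambda(1-\lambda) & (1-\lambda)^2 & \lambda(1-\lambda) & 0 & 0 & 0 & 0 \\
0 & 0 & \lambda(1-\lambda) & (1-\lambda)^2 & \lambda(1-\lambda) & \lambda^2 & 0 & 0 \\
0 & 0 & \lambda^2 & \lambda(1-\lambda) & (1-\lambda)^2 & \lambda(1-\lambda) & 0 & 0 \\
0 & 0 & 0 & 0 & \lambda(1-\lambda) & (1-\lambda)^2 & \lambda(1-\lambda) & \lambda^2 \\
0 & 0 & 0 & 0 & \lambda^2 & \lambda(1-\lambda) & (1-\lambda)^2 & \lambda(1-\lambda) \\
\lambda(1-\lambda) & \lambda^2 & 0 & 0 & 0 & 0 & \lambda(1-\lambda) & (1-\lambda)^2
\end{pmatrix}
\]

Figure 3: The GDE matrix of the ring of order 8, $MR_8$

3.2 The torus structure

The k-ary n-cube is a special case of the n-dimensional $k_1 \times k_2 \times \cdots \times k_n$ torus. In this case, $k_1 = k_2 = \cdots = k_n = k$. The general case appears to be more interesting in terms of its analysis. We first consider the two-dimensional $k_1 \times k_2$ torus with an even number of nodes in both dimensions. It can be viewed as a collection of vertical and horizontal even rings (see Figure 4), and hence the results in the previous section for the ring can be applied to the analysis here. To handle the degenerate case of $k_1$ or $k_2$ equal to 2, we use the GDE matrix for a chain of order 2. The reason for this is that a ring of two nodes is equivalent to a chain of two nodes as far as the dimension exchange operator is concerned. Therefore, we let

\[ MR_2 = \begin{pmatrix} 1-\lambda & \lambda \\ \lambda & 1-\lambda \end{pmatrix}. \]

Figure 4: Color tori of $4 \times 4$. (a) Snake-like row-major node-labeling, row-major edge-coloring. (b) Row-major node-labeling, column-major edge-coloring.

Figure 4 shows the two common ways of labeling the nodes of the torus: row-major and snake-like row-major labeling. As the spectrum of eigenvalues of the GDE matrix of a network is invariant for any permutation of the node labels, we arbitrarily choose the snake-like row-major labeling. Similarly, there are two ways of coloring the edges: row-major and column-major coloring, as shown in the figure. We arbitrarily assume that the torus is colored in the row-major way. Note that with this coloring, all horizontal edges are smaller than the vertical edges in chromatic indices. We will soon see that because of the particular structure of the GDE matrix as revealed by the following lemma, these two colorings have the same effect on the convergence rate.

Lemma 3.4 Let $MT_{k_1,k_2}$ be the GDE matrix of a $k_1 \times k_2$ even color torus $GT_{k_1,k_2}$. Then $MT_{k_1,k_2} = MR_{k_2} \otimes MR_{k_1}$.

Proof. Given an ordered pair of vertices $\langle i, j \rangle$ in $GT_{k_1,k_2}$,

1. if both $i$ and $j$ are in the same horizontal ring, i.e., $\lfloor i/k_1 \rfloor = \lfloor j/k_1 \rfloor$, then because of the coloring we use, in which all horizontal edges are smaller than the vertical edges in chromatic indices,

\[
(MT_{k_1,k_2})_{i,j} =
\begin{cases}
(1-\lambda)\,(MR_{k_1})_{i \bmod k_1,\ j \bmod k_1} & \text{if } k_2 = 2 \\
(1-\lambda)^2\,(MR_{k_1})_{i \bmod k_1,\ j \bmod k_1} & \text{otherwise}
\end{cases}
\]
\[ = (MR_{k_2})_{\lfloor i/k_1 \rfloor, \lfloor j/k_1 \rfloor}\,(MR_{k_1})_{i \bmod k_1,\ j \bmod k_1} \]

2. if $i$ and $j$ are in different horizontal rings, i.e., $\lfloor i/k_1 \rfloor \ne \lfloor j/k_1 \rfloor$, then there is a color path from $i$ to $j$, say $p_{ij}$, if and only if there exists a vertex $i_1$ such that $i_1 \bmod k_1 = i \bmod k_1$, $\lfloor i_1/k_1 \rfloor = \lfloor j/k_1 \rfloor$, and there exist two color paths $p_{i_1 j}$ and $p_{i i_1}$ from $i_1$ to $j$ and from $i$ to $i_1$, respectively, as illustrated in Figure 5.

Figure 5: Illustration of the color path from vertex $i$ to vertex $j$

Let $l_1$ and $l_2$ be the lengths of $p_{i_1 j}$ and $p_{i i_1}$, respectively; $r_1$ be the sum of the number of incident horizontal edges of $i_1$ that are between, in chromatic indices, the incident edges of $i_1$ along the path $p_{ij}$, and the number of incident horizontal edges of $i_1$ whose chromatic index is less than that of the last edge in $p_{ij}$; and $r_2$ be the sum of the number of incident vertical edges of $i_1$ that are between, in chromatic indices, the incident edges of $i_1$ along $p_{ij}$, and the number of incident vertical edges of $i_1$ whose chromatic index is larger than that of the first edge in $p_{ij}$. Then, according to the computation formulas in Lemma 2.1, we have

\[
\begin{aligned}
(MT_{k_1,k_2})_{i,j} &= \lambda^{l_1 + l_2} (1-\lambda)^{r_1 + r_2} \\
&= \lambda^{l_2} (1-\lambda)^{r_2} \cdot \lambda^{l_1} (1-\lambda)^{r_1} \\
&= (MR_{k_2})_{\lfloor i/k_1 \rfloor, \lfloor i_1/k_1 \rfloor}\,(MR_{k_1})_{i_1 \bmod k_1,\ j \bmod k_1} \\
&= (MR_{k_2})_{\lfloor i/k_1 \rfloor, \lfloor j/k_1 \rfloor}\,(MR_{k_1})_{i \bmod k_1,\ j \bmod k_1}
\end{aligned}
\]

By referring to the definition of direct product given before Lemma 3.2, the lemma is proved. □

If instead we use column-major coloring, then all horizontal edges are larger than the vertical edges in chromatic indices. By following the above steps, however, we would find that the resulting GDE matrix is the same as $MT_{k_1,k_2}$. We continue to assume row-major coloring in the following discussion.

We now turn to the convergence rate of GDE in the torus, and see how it is related to the convergence rate in the ring.

Theorem 3.2 Let $MT_{k_1,k_2}$ be the GDE matrix of a $k_1 \times k_2$ even color torus $GT_{k_1,k_2}$. Then, the optimal exchange parameter $\lambda_{opt}(MT_{k_1,k_2})$ is equal to $\lambda_{opt}(MR_k)$; and for a given $\lambda$, $R_\infty(MT_{k_1,k_2}) = R_\infty(MR_k)$, where $k = \max\{k_1, k_2\}$.

Proof. From Lemma 3.2, it is clear that

\[ \gamma(MT_{k_1,k_2}) = \max\{\gamma(MR_{k_1}), \gamma(MR_{k_2})\} \]

because $\rho(MR_{k_1}) = \rho(MR_{k_2}) = 1$. Moreover, from Theorem 3.1, $\gamma(MR_{k_1}) \ge \gamma(MR_{k_2})$ if and only if $k_1 \ge k_2$. Hence, $\gamma(MT_{k_1,k_2}) = \gamma(MR_k)$, where $k = \max\{k_1, k_2\}$; and $\gamma(MT_{k_1,k_2})$ is minimized when $\lambda = \lambda_{opt}(MR_k)$. In other words, $R_\infty(MT_{k_1,k_2}) = R_\infty(MR_k)$ and both $MR_k$ and $MT_{k_1,k_2}$ have the same optimal exchange parameter. □

As a result, one can compute the optimal exchange parameter $\lambda_{opt}(MT_{k_1,k_2})$ using the formula in Theorem 3.1.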
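The direct-product structure stated in Lemma 3.4 can be checked numerically on small tori. The sketch below is an illustration under assumptions that differ from the paper's setup in harmless ways: nodes are labeled from 0 in plain row-major order (node (row, col) maps to row * k1 + col) rather than snake-like, horizontal colors are applied before vertical ones, and all function names are hypothetical. It builds the sweep matrix directly from the exchange steps and compares it with the direct product of the two ring matrices.

```python
def ring_colors(k):
    """Colour classes of an even ring on 0..k-1; k = 2 is a two-node chain."""
    if k == 2:
        return [[(0, 1)]]
    return [[(i, i + 1) for i in range(0, k, 2)],
            [(i, (i + 1) % k) for i in range(1, k, 2)]]


def sweep_matrix(n, steps, lam):
    """Matrix of one sweep, built column by column from unit load vectors."""
    cols = []
    for s in range(n):
        w = [1.0 if i == s else 0.0 for i in range(n)]
        for edges in steps:
            for i, j in edges:
                w[i], w[j] = ((1 - lam) * w[i] + lam * w[j],
                              (1 - lam) * w[j] + lam * w[i])
        cols.append(w)  # cols[s] is the image of unit vector s
    return [[cols[c][r] for c in range(n)] for r in range(n)]


def ring_matrix(k, lam):
    """GDE matrix of an even ring of order k."""
    return sweep_matrix(k, ring_colors(k), lam)


def torus_matrix(k1, k2, lam):
    """GDE matrix of a k1 x k2 torus with k2 horizontal rings of length k1."""
    steps = []
    for edges in ring_colors(k1):   # horizontal colours first
        steps.append([(r * k1 + a, r * k1 + b)
                      for r in range(k2) for a, b in edges])
    for edges in ring_colors(k2):   # then vertical colours
        steps.append([(a * k1 + c, b * k1 + c)
                      for c in range(k1) for a, b in edges])
    return sweep_matrix(k1 * k2, steps, lam)


def kron(a, b):
    """Direct (Kronecker) product of two square matrices (lists of lists)."""
    n = len(b)
    size = len(a) * n
    return [[a[i // n][j // n] * b[i % n][j % n] for j in range(size)]
            for i in range(size)]


def max_abs_diff(x, y):
    """Largest entrywise difference between two matrices of equal shape."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


if __name__ == "__main__":
    lam = 0.7
    direct = torus_matrix(4, 2, lam)
    product = kron(ring_matrix(2, lam), ring_matrix(4, lam))
    print("max difference:", max_abs_diff(direct, product))
```

With this labeling the horizontal part of a sweep acts as the identity on rows combined with the ring matrix on columns, and the vertical part acts the other way round, so their product is the direct product in the lemma.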
Moreover, the above theorem shows that the convergence rate in two-dimensional torus structures depends only on the larger dimension. For example, the tori $GT_{16,j}$, $j = 2, 4, \ldots, 16$, all have the same convergence rate for a given $\lambda$ and share the same optimal exchange parameter

\[ \lambda_{opt}(MT_{16,j}) = \frac{2}{2 + \sqrt{2 - \sqrt{2}}} \approx 0.723. \]

The results for the two-dimensional torus shown above can be generalized to multi-dimensional cases. Consider a $k_1 \times k_2 \times \cdots \times k_n$ even torus and assume that this n-dimensional torus is colored in a way similar to that for the two-dimensional torus. Then, its GDE matrix can be expressed in terms of direct products of the GDE matrices of color rings.

Lemma 3.5 Let $MT_{k_1,k_2,\ldots,k_n}$ be the GDE matrix of an n-dimensional $k_1 \times k_2 \times \cdots \times k_n$ even color torus $GT_{k_1,k_2,\ldots,k_n}$. Then, $MT_{k_1,k_2,\ldots,k_n} = MR_{k_1} \otimes MR_{k_2} \otimes \cdots \otimes MR_{k_n}$.

We omit the tedious proof here, which is based on induction on the dimension $n$. From this lemma, the following result is immediately in order.

Theorem 3.3 Let $MT_{k_1,k_2,\ldots,k_n}$ be the GDE matrix of an n-dimensional $k_1 \times k_2 \times \cdots \times k_n$ even color torus $GT_{k_1,k_2,\ldots,k_n}$. Then, the optimal exchange parameter $\lambda_{opt}(MT_{k_1,k_2,\ldots,k_n})$ is equal to $\lambda_{opt}(MR_k)$, where $k = \max_{1 \le j \le n}\{k_j\}$; and for a given $\lambda$, $R_\infty(MT_{k_1,k_2,\ldots,k_n}) = R_\infty(MR_k)$.

Since a k-ary n-cube is a special case of an n-dimensional torus, we have the following major result for the k-ary n-cube.

Corollary 3.1 Let $MT_{k,n}$ be the GDE matrix of a color k-ary n-cube, $k$ even. Then the optimal exchange parameter $\lambda_{opt}(MT_{k,n})$ is equal to $\lambda_{opt}(MR_k)$; and for a given $\lambda$, $R_\infty(MT_{k,n}) = R_\infty(MR_k)$.

3.3 Summary of theoretical results

The last theorem and its corollary in the previous section equate the convergence rates of the ring, the torus and the k-ary n-cube, therefore leading to this important conclusion: given a fixed number of nodes, the best way to connect them as far as GDE load balancing is concerned is as a k-ary n-cube. Then from the above analysis, we find that the smaller the value of $k$, the better the convergence rate; hence, the binary n-cube is the best choice, which takes exactly one sweep to balance the workload. However, as Dally points out in [6], there are other practical reasons for which a k-ary n-cube with a bigger $k$ is preferable.

The analysis technique as exemplified in the above section can also be applied to the chain and the mesh, which can be viewed as variants of the ring and the torus obtained by deleting the end-round connections. The detailed proofs for the following two theorems can be found in [19].

Theorem 3.4 Let $GC_k$ be a color chain of even order $k$, and $MC_k$ be its GDE matrix. Then the optimal exchange parameter $\lambda_{opt}(MC_k)$ is equal to $\lambda_{opt}(MR_{2k})$; and for a given $\lambda$, $R_\infty(MC_k) = R_\infty(MR_{2k})$.

Theorem 3.5 Let $GM_{k_1,k_2,\ldots,k_n}$ be an n-dimensional even color mesh, and $MM_{k_1,k_2,\ldots,k_n}$ be its GDE matrix. Then the optimal exchange parameter $\lambda_{opt}(MM_{k_1,k_2,\ldots,k_n})$ is equal to $\lambda_{opt}(MC_k)$; and for a given $\lambda$, $R_\infty(MM_{k_1,k_2,\ldots,k_n}) = R_\infty(MC_k)$, where $k = \max_{1 \le j \le n}\{k_j\}$.

Based on these theorems, the optimal exchange parameters for even chains and meshes can be easily obtained.
For example, for $j = 2, 4, 6, 8$,

$$\lambda_{opt}(MM_{8,j}) = \lambda_{opt}(MC_8) = \lambda_{opt}(MR_{16}) \approx 0.723.$$

These theorems also show that the convergence rate of a mesh depends only on its largest dimension. Based on these theorems and those in the previous section, we can now establish the relationships between the convergence rates of the ring, the torus, the chain and the mesh. Here is a summary of the results. Note that the results for the k-ary n-cube are implicit in the results for the torus.

1. Suppose each $k_i$, $1 \le i \le n$, is even, and $k = \max\{k_i, 1 \le i \le n\}$. Then,
   $$\lambda_{opt}(MR_{2k}) = \lambda_{opt}(MT_{2k_1,2k_2,\ldots,2k_n}) = \lambda_{opt}(MC_k) = \lambda_{opt}(MM_{k_1,k_2,\ldots,k_n}),$$
   which is equal to $\frac{1}{1+\sin(\pi/k)}$.
2. For even rings and chains and a given $\lambda$, the more vertices the structure has, the slower the convergence rate.
3. For tori and meshes of even order (an even number of nodes in each dimension), the convergence rate depends only on the largest dimension.
4. The convergence rates of these four structures are related as follows:
   $$R_\infty(MR_{2k}) = R_\infty(MT_{2k_1,2k_2,\ldots,2k_n}) = R_\infty(MC_k) = R_\infty(MM_{k_1,k_2,\ldots,k_n}),$$
   where each $k_i$, $1 \le i \le n$, is even, and $k = \max\{k_i, 1 \le i \le n\}$.

4 Comparison with the diffusion method

We have derived the optimal exchange parameters leading to the fastest asymptotic convergence rate in the k-ary n-cube and its variants. Here, we would like to compare the GDE method with another relaxation-based method, the diffusion method [3, 5, 23]. The measure of interest is still the convergence rate.

With the diffusion method, a processor interacts with all its neighbors simultaneously at each step. For processor i, the change of workload is executed as

$$w_i = w_i + \sum_{j \in A(i)} \alpha (w_j - w_i),$$

where $A(i)$ is the set of nearest neighbors of processor i and $\alpha$ is the diffusion parameter, which determines the portion of the excess workload to be diffused away. As a whole, the change of the workload distribution at step t is modeled by the equation

$$W^{t+1} = D(\alpha) W^t \qquad (5)$$

where $D(\alpha)$ is the diffusion matrix [1], as given in [5].

The efficiency of the diffusion method is reflected by the asymptotic convergence rate, $R_\infty(D(\alpha))$, which is equal to $-\ln \gamma(D(\alpha))$. Here $\gamma(D(\alpha))$ is the subdominant eigenvalue of $D(\alpha)$, also referred to as the convergence factor of the diffusion method. We have previously derived the optimal diffusion parameters for the k-ary n-cube and its variants [23]. Table 1 summarizes the optimal parameter values and their corresponding convergence factors. For comparison, the results of the GDE method are also included. Clearly, $\gamma(M(\lambda_{opt})) < \gamma(D(\alpha_{opt}))$ in both the torus and the mesh structures.

In multicomputers, there are two basic communication models. One is the serial communication model, which restricts a node to communicating with at most one nearest neighbor at a time; the other is the parallel communication model, which allows a node to communicate with all its nearest neighbors simultaneously. Clearly, the serial communication model favors the dimension exchange method and the parallel model favors the diffusion method. Recall that the GDE matrix $M(\lambda)$ reflects a complete sweep, i.e., one iteration step per edge color in succession, each of which involves one I/O communication at a node. On the other hand, the diffusion matrix $D(\alpha)$ reflects a single iteration step, which involves as many I/O communications at a node as the maximum degree of the nodes. Hence, $R_\infty(M(\lambda_{opt})) > R_\infty(D(\alpha_{opt}))$ in the serial communication model.

[1] The same $\alpha$ is used for the entire network. It is possible to use different $\alpha$'s for different edges of the network, as discussed in [5].
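As a minimal sketch of the diffusion update described in this section, the following Python fragment applies one synchronous diffusion step, $w_i \leftarrow w_i + \sum_{j \in A(i)} \alpha (w_j - w_i)$, on a small ring. The names `diffusion_step` and `ring_neighbors`, the 4-node ring, and the sample workloads are illustrative assumptions and are not taken from the experiments reported here. Exact rational arithmetic is used so that conservation of the total workload can be checked without rounding error.

```python
from fractions import Fraction

def ring_neighbors(k):
    """Nearest neighbors A(i) of each node in a ring of k >= 3 nodes (assumed topology)."""
    return [[(i - 1) % k, (i + 1) % k] for i in range(k)]

def diffusion_step(w, neighbors, alpha):
    """One synchronous step: every node i moves to w_i + sum_j alpha * (w_j - w_i)."""
    return [wi + sum(alpha * (w[j] - wi) for j in neighbors[i])
            for i, wi in enumerate(w)]

# Hypothetical example: all load initially on node 0, alpha = 1/4.
w0 = [Fraction(8), Fraction(0), Fraction(0), Fraction(0)]
w1 = diffusion_step(w0, ring_neighbors(4), Fraction(1, 4))
print([int(x) for x in w1])      # [4, 2, 0, 2]
print(sum(w1) == sum(w0))        # True: total workload is conserved
```

Every node uses the values of the previous step, which matches the matrix form $W^{t+1} = D(\alpha) W^t$; an in-place update would instead mix old and new values.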
In the parallel communication model, an n-dimensional torus or mesh can be colored by 2n colors, and hence a dimension exchange sweep would take as much time as 2n diffusion steps; these 2n diffusion steps in fact correspond to the matrix $D^{2n}$. We should therefore compare $\gamma(D^{2n}(\alpha_{opt}))$ with $\gamma(M(\lambda_{opt}))$. Since the diffusion matrix is symmetric, it follows that $\gamma(D^{2n}(\alpha_{opt})) = \gamma^{2n}(D(\alpha_{opt}))$. Thus, in the case of the torus,

$$\gamma(D^{2n}(\alpha_{opt})) = \left(\frac{4n}{2n+1-\cos(2\pi/k)} - 1\right)^{2n} \ge \left(\frac{4}{3-\cos(2\pi/k)} - 1\right)^{2} > \gamma(M(\lambda_{opt})),$$

and in the case of the mesh,

$$\gamma(D^{2n}(\alpha_{opt})) = \left(1 - \frac{1}{n} + \frac{1}{n}\cos(\pi/k)\right)^{2n} \ge \left(\cos(\pi/k)\right)^{2} > \gamma(M(\lambda_{opt})).$$

We have thus proved the following, which is valid for both the serial and the parallel communication models.

Theorem 4.1 The GDE method converges asymptotically faster than the diffusion method when applied to the k-ary n-cube and its variants.

In [5], Cybenko compared the efficiencies of these two methods when applied to the binary n-cube structure, and revealed the superiority of the dimension exchange method in both communication models. The above theorem extends his result to the family of k-ary n-cube structures.
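The two inequalities above can be spot-checked numerically. The sketch below evaluates the closed-form convergence factors quoted in this section for a few even values of $k \ge 4$ and small $n$; the function names and the chosen sample values are assumptions made for illustration, and the check is a numerical sanity test, not a substitute for the proof.

```python
import math

def gde_factor(k, mesh=False):
    """gamma(M(lambda_opt)) = 2*lambda_opt - 1, with lambda_opt = 1/(1+sin(angle))."""
    angle = math.pi / k if mesh else 2.0 * math.pi / k
    lam_opt = 1.0 / (1.0 + math.sin(angle))
    return 2.0 * lam_opt - 1.0

def diffusion_factor(k, n, mesh=False):
    """gamma(D(alpha_opt)) for an n-dimensional mesh or torus with largest dimension k."""
    if mesh:
        return 1.0 - 1.0 / n + math.cos(math.pi / k) / n
    return 4.0 * n / (2.0 * n + 1.0 - math.cos(2.0 * math.pi / k)) - 1.0

def gde_beats_diffusion(k, n, mesh=False):
    """Compare 2n diffusion steps (gamma^(2n)) with one GDE sweep."""
    return diffusion_factor(k, n, mesh) ** (2 * n) > gde_factor(k, mesh)

# Hypothetical sample of structures: even k >= 4, dimensions n = 1..3.
for mesh in (False, True):
    for k in (4, 8, 16, 64):
        for n in (1, 2, 3):
            assert gde_beats_diffusion(k, n, mesh)

print(round(gde_factor(8, mesh=True), 4))   # 0.4465
```

The smaller the convergence factor, the faster the asymptotic convergence, so a `True` result means that one GDE sweep reduces the imbalance more than the 2n diffusion steps that take the same time in the parallel model.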

Table 1: Comparison between the convergence factors of the GDE method and the diffusion method in the $k_1 \times k_2 \times \cdots \times k_n$ torus and mesh; each $k_i$ is even, and $k = \max\{k_i\}$, $1 \le i \le n$.

                          Torus                                      Mesh
  GDE method
    $\lambda_{opt}$             $\frac{1}{1+\sin(2\pi/k)}$                 $\frac{1}{1+\sin(\pi/k)}$
    $\gamma(M(\lambda_{opt}))$  $\frac{2}{1+\sin(2\pi/k)} - 1$             $\frac{2}{1+\sin(\pi/k)} - 1$
  Diffusion method
    $\alpha_{opt}$              $\frac{1}{2n+1-\cos(2\pi/k)}$              $\frac{1}{2n}$
    $\gamma(D(\alpha_{opt}))$   $\frac{4n}{2n+1-\cos(2\pi/k)} - 1$         $1 - \frac{1}{n} + \frac{1}{n}\cos(\pi/k)$

5 Simulation and practical implementations

For practical applications, it would be of considerable value if the number of iteration sweeps required by the GDE procedure to balance the system's load could be obtained or estimated. To this end, we conducted statistical simulation experiments on a number of test cases. In addition to giving us information on the number of iteration sweeps, the experiments reveal in measurable terms the efficiency gains due to the optimal exchange parameters. The simulation results also confirmed our theoretical results concerning the equivalence of the various convergence rates and the optimality of the derived parameter values.

In the theoretical analysis, we represent the workload of a processor by a real number, which is reasonable under the assumption of very fine grain parallelism as exhibited by the computation. To cover medium and large grain parallelisms, which are more realistic and more common in practical parallel computing environments, one can treat the workloads of the processors more conveniently as non-negative integers, as is done in [8]. For example, in the WaTor simulation experiment which we will discuss shortly, the workload of a processor is an integer which corresponds to the number of fishes in the strip of the ocean for which the processor is responsible.

We used the integer version of the GDE method in our simulation experiments. All we need to do is to modify the exchange operator given in Section 2. During an exchange with a neighbor j, processor i would update its workload according to the revised formula

$$w_i = \begin{cases} \lceil (1-\lambda) w_i + \lambda w_j \rceil & \text{if } w_i \ge w_j \\ \lfloor (1-\lambda) w_i + \lambda w_j \rfloor & \text{otherwise} \end{cases} \qquad (6)$$

As discussed in [8], the integer version of the original DE method (i.e., $\lambda = 1/2$) is just a perturbation of its real counterpart and will converge to a nearly balanced state. Applying the perturbation theory to the real version of our GDE method verbatim, we can come to a similar conclusion.

Because of the use of integer workloads, we allow the load balancing procedure to end with a variance of some threshold value (in workload units) between neighboring processors. This threshold value can be tuned for satisfactory performance of the procedure in practice, as illustrated in [9]. In all our simulation experiments, this value is set to one workload unit, which is closest to total balancing. Then, it is clear that $0.5 \le \lambda < 1$, because a pair of neighboring processors with a variance of more than one workload unit would not balance their workloads any more when $\lambda < 0.5$. Since the termination condition of a processor is rather localized, we add a mechanism for global termination detection to the simulation [22]. We exclude the rather small delay for termination detection using this mechanism from our simulation results, and so the number of sweeps reported below reflects purely the efficiency of the GDE method. This termination detection might not be necessary if we can set a limit on the number of sweeps for a given structure beforehand. In fact, our simulation results below can help the users of this method to set such a limit.
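The integer exchange operator of formula (6) can be sketched in a few lines of Python. The function name `integer_exchange` and the sample workloads are assumptions for illustration; the comparison $w_i \ge w_j$ follows the reading of (6) in which the more heavily loaded processor rounds up and its neighbor rounds down. Rational arithmetic (`fractions.Fraction`) is used so that the ceiling and floor are not disturbed by floating-point error.

```python
from fractions import Fraction
from math import ceil, floor

def integer_exchange(w_i, w_j, lam):
    """Integer GDE exchange between neighbors i and j.

    Each side computes (1 - lam) * own + lam * other; the side with the
    larger (or equal) load rounds up and the other side rounds down, so
    the pair's total workload is preserved.
    """
    def update(own, other):
        value = (1 - lam) * own + lam * other
        return ceil(value) if own >= other else floor(value)
    return update(w_i, w_j), update(w_j, w_i)

# Hypothetical pairs illustrating the remark on lambda < 0.5.
print(integer_exchange(2, 0, Fraction(2, 5)))        # (2, 0): no progress
print(integer_exchange(2, 0, Fraction(1, 2)))        # (1, 1): balanced
print(integer_exchange(10, 0, Fraction(723, 1000)))  # (3, 7)
```

The first call shows why $\lambda < 0.5$ is excluded: a pair that differs by two workload units is left unchanged, so the one-unit termination threshold could never be reached.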

5.1 Number of iteration sweeps

The number of iteration sweeps, denoted by NS, is expected to depend upon such factors as the exchange parameter $\lambda$, the initial workload distribution, the topology and the size of the underlying system structure. The initial workload distribution is a random vector, each element of which is drawn independently from an identical uniform distribution in $[0, 2b]$, where b is a prescribed bound. The mean workload a processor gets (i.e., the expected workload) is thus equal to b. Table 2 displays the expected numbers of sweeps generated by the experiments for the structures of Ring 16, Chain 8, 16-ary 2-cube (i.e., Torus 16×16), and Mesh 8×4. The initial workload distribution has a mean of 128 units per processor, and $\lambda$ varies from 0.5 to 0.95 in steps of 0.05, with the theoretical optimum 0.723 added. Each data point is the average of 100 runs, each using a different random initial load distribution. The second column in the table shows the convergence factors, $\gamma(M(\lambda))$, of the GDE matrices of the various cases.

Table 2: Expected number of sweeps E(NS) for Ring 16, Chain 8, 16-ary 2-cube, and Mesh 8×4

  λ       γ(M(λ))   Ring 16   16-ary 2-cube   Chain 8   Mesh 8×4
  0.50    0.8536    21.33     16.69           19.97     15.78
  0.55    0.8206    20.30     16.22           19.08     15.05
  0.60    0.7777    16.79     13.91           15.87     12.61
  0.65    0.7170    15.17     12.95           14.28     11.37
  0.70    0.6112    10.76      9.19           10.22
  0.723   0.4465     9.82
  0.75    0.5000     8.55
  0.80    0.6000     9.68
  0.85    0.7000    11.63
  0.90    0.8000    15.88
  0.95    0.9000    25.42

  The remaining entries of the 16-ary 2-cube, Chain 8 and Mesh 8×4 columns appear in the source in the following order, with their row and column alignment lost: 11.72, 17.31, 7.69, 7.73, 9.14, 8.58, 11.54, 15.86, 25.56, 8.32, 9.44, 9.19, 10.14, 13.56, 20.80, 8.70, 7.67, 8.27, 8.25.

From the table, it is clear that the expected number of sweeps in each case rises and drops with the value of $\gamma(M(\lambda))$ as $\lambda$ varies, and that the optimal exchange parameter $\lambda_{opt}$ of each case is not equal to 0.5, but somewhere between 0.7 and 0.8, which is in agreement with the theoretical result of $\lambda_{opt} = 0.723$. It also appears that the absolute values of the expected number of sweeps for the various structures are very close to each other, especially near the optimum point, which is in line with our theoretical results on the equivalence of the convergence rates.

Furthermore, it is most encouraging to see that the optimal sweep numbers are rather small, in the neighborhood of 8 sweeps (even for a 256-processor 16-ary 2-cube!). We also tried different numbers of processors for each kind of topology. The results for chains of up to 128 processors are depicted in Figure 6. The figure shows that the optimal number of sweeps is linearly proportional to the number of processors for the chain, and hence to the dimension order k of the k-ary n-cube structure. This really puts forth the GDE method as a practical method for load balancing in real multicomputers.

[Figure 6: Expected number of sweeps E(NS) using $\lambda_{opt}$ for chains of various sizes. Axes: E(NS) versus the number of processors N.]

Notice that the convergence rates in the theoretical analysis are in terms of sweeps over time. A sweep of the GDE method may involve different numbers of nearest neighbor communications in various structures. In a k-ary n-cube (k even), a sweep comprises 2n communication steps (when $k > 2$) or n steps (when $k = 2$). Thus, for a given number of processors, a higher dimensional k-ary n-cube, even though it takes fewer sweeps to balance the load, requires more communication steps within a sweep in reality. However, from Figure 6, we point out that the minimal number of sweeps necessary for convergence would decrease at a logarithmic rate with the increase in the number of dimensions; this is because the dimension order decreases at the same rate as the increase in the number of dimensions for a given number of processors. As an example, consider a cluster of 4096 processors, which can be organized as a 64-ary 2-cube, a 16-ary 3-cube, an 8-ary 4-cube, or a 2-ary 12-cube. The minimal sweep numbers for these structures are about 35, 8, 4, and 1 sweep, respectively. Since the number of communication steps within a sweep grows only linearly with the number of dimensions (2n steps for n dimensions), it is justified to maintain that the GDE method is most effective in high dimensional k-ary n-cubes (in particular, the binary n-cube and the 4-ary n-cube).

5.2 The non-even cases

In addition to the even cases, we also simulated a few non-even cases, which the theoretical analysis has not been able to deal with because of their structural differences from the even cases. We approximated their optimal exchange parameters using the formulas for the even cases. Table 3 summarizes the simulation results. The numbers in parentheses in the bottom row of the table are the approximated optimal parameter and the resulting number of sweeps, respectively. In line with our remarks before, the odd or non-even cases do not behave differently from the even cases in terms of their convergence pattern and the optimal values for the exchange parameter, which can be seen by comparing Table 3 with Table 2. Hence, it is reasonable to conclude that the results for the even cases can be applied, as a close approximation, to the non-even cases.
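The approximation used for the non-even cases can be reproduced with a short sketch. It plugs the sizes of the four non-even structures of Table 3 into the closed forms derived for the even cases, $1/(1+\sin(2\pi/k))$ for a ring or torus and $1/(1+\sin(\pi/k))$ for a chain or mesh, with k the largest dimension. The function names are assumptions made for illustration, and applying these formulas to odd sizes is the approximation discussed above, not a proven result.

```python
import math

def lambda_opt_ring(k):
    """Closed form for an even ring of order k, applied here to any k."""
    return 1.0 / (1.0 + math.sin(2.0 * math.pi / k))

def lambda_opt_chain(k):
    """Closed form for an even chain of order k, applied here to any k."""
    return 1.0 / (1.0 + math.sin(math.pi / k))

def lambda_opt_torus(dims):
    """A torus behaves like a ring of its largest dimension."""
    return lambda_opt_ring(max(dims))

def lambda_opt_mesh(dims):
    """A mesh behaves like a chain of its largest dimension."""
    return lambda_opt_chain(max(dims))

print(f"Ring 15:    {lambda_opt_ring(15):.3f}")         # 0.711
print(f"Torus 16x5: {lambda_opt_torus((16, 5)):.3f}")   # 0.723
print(f"Chain 7:    {lambda_opt_chain(7):.3f}")         # 0.697
print(f"Mesh 8x5:   {lambda_opt_mesh((8, 5)):.3f}")     # 0.723
```

The four values agree with the approximated parameters quoted in parentheses in the bottom row of Table 3.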
Note that an odd ring has to be colored using three colors, as in Figure 1(a). That is, a sweep of the GDE method in an odd ring comprises three communication steps. As only two processors are involved in workload exchange in the third communication step, it seems that much communication bandwidth may be wasted. However, a close examination of the communication pattern of GDE balancing in the odd ring, as shown in Figure 7, reveals that the third chromatic index accounts for only a small fraction of the total communication overhead incurred in the balancing process.

[Figure 7: Communication pattern of the GDE method in a ring of 5 nodes]

In the figure, the five big dots in the centre represent processors that are connected as a ring. We attach a time axis t to each processor for easy viewing. Each dotted double-arrow represents a communication step between a pair of nearest neighbors for the exchange of their workloads. At time $t = 1$, all processors except the fifth are involved in communication. Then, the first processor becomes idle at time $t = 2$. While the first and fifth processors are busy executing the third communication step at time $t = 3$, the third and fourth processors are already executing the first communication of the next sweep. At this time, only the second processor is in the idle state. Continuing, we see that the GDE procedure finishes two complete iteration sweeps in five communication steps. In other words, GDE balancing in an odd ring of five nodes costs only one communication step more than an even ring of comparable size for two iteration sweeps. This example can be generalized to an odd ring of arbitrary size. In general, GDE balancing in an odd ring of $2k+1$ nodes costs one communication step more than that in the even ring of comparable size for k iteration sweeps. Based on the equivalence results between the ring and the other structures in previous sections, we can conclude that there is a negligibly small difference between the efficiencies of optimal GDE balancing in non-even and even cases of these structures.

Table 3: Expected number of sweeps E(NS) for Ring 15, Chain 7, Torus 16×5 and Mesh 8×5

  λ                          Ring 15         Torus 16×5      Chain 7         Mesh 8×5
  0.50                       20.20
  0.55                       19.24
  0.60                       15.74
  0.65                       14.23
  0.70                       10.05
  0.75                        8.26
  0.80                        9.88
  0.85                       11.76
  0.90                       16.16
  0.95                       25.50
  (approx. λ_opt, E(NS))     (0.711, 9.75)   (0.723, 8.70)   (0.697, 8.23)   (0.723, 8.05)

  The entries of the Torus 16×5, Chain 7 and Mesh 8×5 columns for λ = 0.50 to 0.95 appear in the source in the following order, with their row and column alignment lost: 16.11, 15.68, 13.28, 12.37, 15.91, 14.98, 12.24, 10.96, 15.45, 9.21, 8.10, 14.83, 7.95, 7.45, 12.63, 8.23, 9.29, 11.42, 10.11, 13.31, 19.75, 11.43, 15.79, 25.36, 10.01, 13.29, 20.11, 8.60, 7.54, 8.10.

5.3 Improvements due to the optimal parameters

It is clear from the simulation results that the optimal exchange parameter $\lambda_{opt}$ yields (much) better results than the choice of $\lambda = 1/2$ which is used in the original DE method. To further examine and quantify the benefits of our GDE method over the original DE method, we define a metric for measuring improvements, denoted $\eta$:

$$\eta = \frac{NS_h - NS_{opt}}{NS_h} \times 100\%$$

where $NS_h$ and $NS_{opt}$ are the expected numbers of sweeps from using $\lambda = 1/2$ and the optimal $\lambda_{opt}$, respectively. The improvement reflects the superiority of the optimal dimension exchange method (GDE) over the original one (DE). Tables 4 to 7 show the results for different structures; for each structure, different sizes of the structure and different workloads per processor are considered.

These results suggest that the improvement ($\eta$) increases as the average workload per processor increases. It is not so evident under light loads but becomes significant under heavy loads. For example, in a ring with 64 nodes, if each processor is loaded with 10 units, then $NS_h = 7.39$, $NS_{opt} = 6.91$ and $\eta = 6.5\%$; if each processor is loaded with 10,000 units, then $NS_h = 543$,
$NS_{opt} = 50$, and $\eta = 90.8\%$. The simulation results also show that the improvement is proportional to the size of a structure when the workload is heavy, i.e., the larger the system, the better the performance of our GDE method using optimal exchange parameters. In summary, dynamic load balancing using dimension exchange does benefit substantially from the optimal exchange parameter.

Table 4: Improvements (%) for rings of various sizes and workloads

            Mean workload per processor
  Size      10       100      1000     10000
  8         3.16     20.83    29.67    37.23
  16        19.21    51.79    61.97    66.49
  32        12.75    67.81    79.14    84.20
  64        6.50     72.80    88.43    90.80
  128       9.13     72.41    92.19    94.93

Table 5: Improvements (%) for tori of various sizes and workloads

            Mean workload per processor
  Size      10       100      1000     10000
  4×4       0        0        0        0
  8×8       3.43     16.36    26.22    32.83
  16×16     11.62    47.05    57.44    63.63
  32×32     5.01     60.39    75.97    80.58

Table 6: Improvements (%) for chains of various sizes and workloads

            Mean workload per processor
  Size      10       100      1000     10000
  8         10.68    48.20    61.88    66.33
  16        3.18     65.55    79.52    82.54
  32        2.00     71.25    86.93    90.40
  64        1.98     69.15    91.27    94.61
  128       8.38     62.55    92.22    96.95

Table 7: Improvements (%) for meshes of various sizes and workloads

            Mean workload per processor
  Size      10       100      1000     10000
  4×4       3.98     17.11    27.79    34.56
  8×8       11.60    48.01    59.01    64.39
  16×16     5.10     61.26    77.72    81.15
  32×32     6.82     61.30    85.57    89.57

5.4 Practical implementations

The simulation experiments shed light on the number of iteration sweeps, but ignored the overhead that might be incurred in an actual implementation of the method. Neither did they tell us anything about the expected performance gain when the method is used in real applications. We therefore took the GDE method and implemented it as the dynamic load balancer within two data parallel computations. It turns out that the overhead due to the periodic execution of the GDE method is very small, but the gain in performance through the balancing over the versions without GDE balancing is substantial.

In data parallel computations, the computational requirement associated with each portion of a problem domain may change as the computation proceeds. To reduce the penalty of load imbalances, an effective way is to periodically "remap" (re-decompose) the problem domain onto the processors; the goal of this remapping is to try to create a balanced workload across the processors for the next phase of the computation [11]. To allow us to evaluate the GDE method as well as to study GDE-based remapping, we implemented two major applications in a multicomputer (a transputer array): the WaTor simulation and the parallel thinning of images. The remapping mechanism in these applications comprises two components: the decision maker and the workload adjuster. The decision maker uses the GDE method to drive the processors into a consensus on the uniform workload distribution; the workload adjuster then carries out the actual workload redistribution according to the decisions just made. This remapping mechanism is invoked periodically.

In the WaTor simulation of a 256×256 toroidal ocean on a 16-transputer ring-structured network, it is found that frequent remapping (once every two simulation steps) leads to a 10 to 20% improvement in the total simulation time (for 100 steps of simulating the ocean). In parallel thinning of a 128×128 image (a popular image of a man's body) on an 8-transputer chain-structured network, frequent remapping using GDE yields a performance gain of 10% in thinning time even though the test image does not favor remapping. Details of and discussion about these experiments can be found in [19]. These results have led us to believe that the GDE method with the optimal exchange parameters is a viable tool for dynamic load balancing in practical implementations of data-parallel computations.

6 Concluding remarks

We have analyzed the GDE method for dynamic load balancing as applied to the k-ary n-cube and its variants: the ring, the torus, the chain, and the mesh. We have derived the optimal exchange parameters in closed form, which maximize the convergence rates of GDE balancing in these structures. We have shown that there exist close relationships between their convergence rates, and concluded that the GDE method favors high dimensional k-ary n-cubes for a given number of processors. We have also revealed the superiority of the GDE method to the diffusion method when both are applied to these structures. Through statistical simulation experiments, we have shown that the efficiency (in terms of the number of steps to convergence) of using the optimal exchange parameters is significantly better than that of the non-optimal cases such as the original DE method.

This paper has analyzed theoretically only the even cases of the various topologies. This is due to our using matrix partitioning and circulant matrices for the analysis. It is conceivable, however, that the odd cases would behave more or less the same as their even counterparts, especially when the number of nodes is large. We found this to be true for the non-even cases we simulated. We also presented an argument suggesting that the difference between the two in terms of efficiency should be negligibly small. Nevertheless, finding a different mathematical tool to analyze the odd cases as well would be an interesting theoretical pursuit. In addition, after having dealt with some of the most common regular structures, it is natural to think of arbitrary structures. Unfortunately, the derivation of the optimal exchange parameter for arbitrary structures requires a solution to the problem of specifying, in analytical form, the dependence of the subdominant eigenvalue in modulus of a matrix on the matrix elements. This is still open in mathematics [4].

Acknowledgements. We thank the anonymous referees for their constructive comments.

References

[1] W. C. Athas and C. L. Seitz. Multicomputers: message-passing concurrent computers. IEEE Computer, 21(8):9–24, August 1988.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed computation: numerical methods. Prentice-Hall Inc., 1989.
[3] J. B. Boillat. Load balancing and Poisson equation in a graph. Concurrency: Practice and Experience, 2:289–313, December 1990.
[4] A. Broder and E. Shamir. On the second eigenvalue of random regular graphs. In Proc. of 28th IEEE Foundations of Computer Science, pages 286–294, 1987.
[5] G. Cybenko. Load balancing for distributed memory multiprocessors. Journal of Parallel and Distributed Computing, 7:279–301, 1989.
[6] W. J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 39(6):775–785, June 1990.
[7] P. J. Davis. Circulant matrices. John Wiley and Sons Inc., 1979.
[8] S. H. Hosseini, B. Litow, M. Malkawi, J. McPherson, and K. Vairavan. Analysis of a graph coloring based distributed load balancing algorithm. Journal of Parallel and Distributed Computing, 10:160–166, 1990.
[9] R. Lüling and B. Monien. Load balancing for distributed branch and bound algorithm. In Proceedings of 6th International Parallel Processing Symposium, pages 543–5448, March 1992.
[10] L. M. Ni and P. K. McKinley. A survey of wormhole routing techniques in direct networks. IEEE Computer, 26:62–76, February 1993.
[11] D. M. Nicol and P. F. Reynolds. Optimal dynamic remapping of data parallel computation. IEEE Trans. on Computers, 39:206–219, February 1990.
[12] D. M. Nicol and J. H. Saltz. Dynamic remapping of parallel computations with varying resource demands. IEEE Transactions on Computers, 37(9):1073–1087, September 1988.
[13] G. Ramanathan and J. Oren. Survey of commercial parallel machines. ACM Computer Architecture News, 21(3):13–33, June 1993.
[14] S. Ranka, Y. Won, and S. Sahni. Programming a hypercube multicomputer. IEEE Software, 5:69–77, September 1988.
[15] Y. Shih and J. Fier. Hypercube systems and key applications. In K. Hwang and D. DeGroot, editors, Parallel Processing for Supercomputers and Artificial Intelligence, pages 203–243. McGraw-Hill Publishing Co., 1989.
[16] A. Trew and G. Wilson. Past, Present and Parallel: A Survey of Available Parallel Computer Systems. Springer-Verlag, 1991.
[17] R. S. Varga. Matrix iterative analysis. Prentice-Hall, 1962.
[18] M. Willebeek-LeMair and A. P. Reeves. Local vs. global strategies for dynamic load balancing. In Proceedings of International Conference on Parallel Processing, volume 1, pages 569–570, 1990.
[19] C.-Z. Xu. Iterative methods for dynamic load balancing in multicomputers. PhD thesis, Dept. of Computer Science, The University of Hong Kong, 1993.
[20] C.-Z. Xu and F. C. M. Lau. Iterative dynamic load balancing in multicomputers. Journal of Operational Research Society. To appear. (Available as Tech. Report TR-92-09, Dept. of Computer Science, The Univ. of Hong Kong, Sep. 1992).
[21] C.-Z. Xu and F. C. M. Lau. Analysis of the generalized dimension exchange method for dynamic load balancing. Journal of Parallel and Distributed Computing, 16:385–393, December 1992.
[22] C.-Z. Xu and F. C. M. Lau. Termination detection for loosely synchronized computations. In Proceedings of 4th IEEE Symposium on Parallel and Distributed Processing, pages 196–203, December 1992.
[23] C.-Z. Xu and F. C. M. Lau. Optimal parameters for load balancing using the diffusion method in the k-ary n-cube networks. Information Processing Letters, 47(5):181–187, September 1993. (A longer version appears as Tech. Report TR-93-03, Dept. of Computer Science, The Univ. of Hong Kong, Mar. 1993).