DDE: A Modified Dimension Exchange Method for Load Balancing in k-ary n-cubes

Min-You Wu and Wei Shu
Department of Computer Science
State University of New York at Buffalo
Buffalo, NY 14260
wu,shu@cs.buffalo.edu

Abstract

The dimension exchange method (DEM) was initially proposed as a load-balancing algorithm for the hypercube structure. It has been generalized to k-ary n-cubes. However, the k-ary n-cube algorithm must take many iterations to converge to a balanced state. In this paper, we propose a direct method to modify DEM. The new algorithm, the Direct Dimension Exchange (DDE) method, takes the load average in every dimension to eliminate unnecessary load exchange. It balances the load directly without iteratively exchanging the load. It is able to balance the load more accurately and much faster.

1. Introduction

The dimension exchange method (DEM) was initially proposed as a fully load-balancing algorithm for the hypercube structure [5, 1]. It balances the load for independent tasks on distributed memory machines. Experiments carried out by Willebeek-LeMair and Reeves confirmed that DEM is superior to other scheduling methods [7]. DEM for the hypercube network is a simple algorithm. Load balancing is performed iteratively in each of the $\log N$ dimensions, in which only node pairs exchange their load information and attempt to average the number of tasks. After a sweep ($\log N$ iterations), the load is balanced.

Unfortunately, when DEM is applied to other structures, such as the mesh or the k-ary n-cube, it takes many sweeps to converge to the balanced load. Hosseini et al. extended it to arbitrary structures using the technique of edge-coloring of graphs [3]. Xu and Lau proposed the generalized dimension exchange (GDE) method [9]. The GDE method was extended to the k-ary n-cube network [10]. Because a node exchanges workload with only one of its neighbors at a time, GDE is not able to reach the balanced state in one sweep. The number of sweeps for convergence is linearly proportional to the number of nodes in a chain, and hence to the dimension order $k$ of the k-ary n-cube structure.
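The hypercube DEM sweep described at the start of this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `dem_sweep`, the sample loads, and the convention that the heavier node of a pair keeps the ceiling of the pair average are assumptions made here.

```python
def dem_sweep(w):
    """One DEM sweep over a hypercube with len(w) = 2**n nodes (integer loads)."""
    w = list(w)
    n = len(w).bit_length() - 1
    assert len(w) == 1 << n, "number of nodes must be a power of two"
    for l in range(n):                    # one iteration per dimension
        for i in range(len(w)):
            j = i ^ (1 << l)              # neighbor of node i across dimension l
            if i < j:                     # handle each node pair once
                total = w[i] + w[j]
                hi, lo = (total + 1) // 2, total // 2
                # Assumed rounding: the heavier node keeps the ceiling.
                if w[i] > w[j]:
                    w[i], w[j] = hi, lo
                else:
                    w[i], w[j] = lo, hi
    return w


loads = [9, 7, 4, 1, 4, 6, 1, 5]          # hypothetical loads on a 3-cube
balanced = dem_sweep(loads)
print(balanced, max(balanced) - min(balanced))
```

With eight nodes the sweep consists of three pairwise averaging rounds, and the total number of tasks is conserved throughout.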

We present a direct method for the k-ary n-cube, called the Direct Dimension Exchange (DDE) method. Unlike iterative algorithms, this direct method can balance the load in one sweep. The load in a chain is fully balanced by utilizing information about the total number of tasks, which can be easily obtained by a sum reduction. Each node in the chain knows whether it is overloaded or underloaded and subsequently exchanges workload with other nodes. The DDE method can be applied to two or more dimensions to balance the load for the mesh, the torus, and the k-ary n-cube.

This paper is organized as follows. Section 2 briefly reviews the DEM and the GDE algorithms. Then, the direct method for the chain and the ring structures is described in Sections 3 and 4, respectively. The algorithm for the k-ary n-cube is presented in Section 5. In Section 6, the direct method is compared to the GDE method. Section 7 concludes the paper.

2. The DEM and GDE Algorithms

The goal of load balancing is to schedule work so that each processor has the same workload. To achieve this goal, an estimation of the task execution time is needed, which can be done either by a programmer or by a compiler. Sometimes the estimation can be application-specific, and sometimes it is impossible to obtain such an estimation. Due to these difficulties, each task is presumed to require equal execution time, and the goal of the algorithm is to schedule tasks so that each processor has the same number of tasks.

The scheduling problem can be described as follows. In a parallel or distributed system, $N$ computing nodes are connected by a given topology. Each node $i$ has $w_i$ tasks. A scheduling algorithm is to redistribute tasks so that the number of tasks in each node is equal. Assume the sum of $w_i$ over all nodes can be evenly divided by $N$. The average number of tasks $w_{avg}$ is calculated as

$$w_{avg} = \frac{\sum_{i=0}^{N-1} w_i}{N}.$$

Each node should have $w_{avg}$ tasks after scheduling.

DEM was designed for the hypercube structure. In DEM, small domains are balanced first and then combined to form larger domains until ultimately the entire system is balanced. The "integer version" of DEM is described in Figure 1. All node pairs in the first dimension, whose addresses differ in only the least significant bit, balance the load between themselves. Next, all node pairs in the second dimension balance the load between themselves, and so forth, until each node has balanced its load with each of its neighbors.

After execution of the DEM algorithm, the load difference

$$D = \max(w_i) - \min(w_i)$$

is bounded by $n$, the dimension of the hypercube [3]. The number of communication steps of the DEM algorithm is $3n$ [7].

DEM
    for $l = 0$ to $n-1$
        node $i$ exchanges with node $j$ the current values of $w_i$ and $w_j$, where $j = i \oplus 2^l$
        if $(w_i - w_j) > 1$, send $\lfloor (w_i - w_j)/2 \rfloor$ tasks to node $j$
        if $(w_j - w_i) > 1$, receive $\lfloor (w_j - w_i)/2 \rfloor$ tasks from node $j$
        $w_i = \lceil (w_i + w_j)/2 \rceil$ if $w_i > w_j$; $\lfloor (w_i + w_j)/2 \rfloor$ otherwise

Figure 1: The DEM algorithm for the hypercube.

GDE
    while (not terminate)
        for $l = 1$ to $c$
            for edge colored $l$ connecting nodes $i$ and $j$
                node $i$ exchanges with node $j$ the current values of $w_i$ and $w_j$
                if $(w_i - w_j) > 1$, send $\lfloor \lambda (w_i - w_j) \rfloor$ tasks to node $j$
                if $(w_j - w_i) > 1$, receive $\lfloor \lambda (w_j - w_i) \rfloor$ tasks from node $j$
                $w_i = \lceil (1-\lambda) w_i + \lambda w_j \rceil$ if $w_i > w_j$; $\lfloor (1-\lambda) w_i + \lambda w_j \rfloor$ otherwise

Figure 2: The GDE algorithm for the k-ary n-cube.

The GDE algorithm operates on color graphs derived from edge-coloring of the given system graph. The "integer version" of the algorithm is shown in Figure 2. A node finishes a complete sweep after $c$ consecutive exchange operations, where $c$ is the number of colors. In k-ary n-cubes, $c = 2n$ if $k$ is an even number. The termination condition is that the difference of the number of tasks between neighboring nodes is less than or equal to one. The convergence rate depends on $\lambda$, the exchange parameter. The value of $\lambda$ varies for different topologies and different network sizes. For the hypercube, the optimal $\lambda = 1/2$, and GDE is equivalent to the original DEM algorithm. For other topologies, $\lambda$ is to be optimized to maximize the convergence rate. For the k-ary n-cube, the load difference between any pair of nodes is bounded by $nk/2$. The convergence rate decreases when the dimension order $k$ increases. There is no communication conflict in this algorithm.

Willebeek-LeMair and Reeves suggested another approach to extend DEM to an $M \times M$ mesh topology by "folding" the mesh in each dimension $\lceil \log M \rceil$ times [7]. This method could be applied to k-ary n-cubes too. The load difference is bounded by $n \lceil \log k \rceil$. However, in this approach, node pairs would no longer be directly linked to one another and communications would conflict.

3. The DDE Method for the Chain

Instead of using the GDE method, which balances the load iteratively, we propose a direct method. The workload in a chain can be balanced directly. The basic idea is to calculate the total number of tasks in the chain and the average number of tasks per node. Thus, nodes in the chain can exchange tasks to balance the load.

DDE-chain
    Let $w_i$ be the number of tasks in node $i$, where $i = 0, 1, \ldots, k-1$.
    1. Global Information Collection: Perform the scan with sum operation of $w_i$:
       $W_i = \sum_{l=i}^{k-1} w_l$
    2. Average Load Calculation: $T = W_0$, $w_{avg} = \lfloor T/k \rfloor$, and $R = T \bmod k$, where $T$ is the total number of tasks.
    3. Quota Calculation: The quota of each node $q_i$ is computed:
       $q_i = w_{avg} + 1$ if $i < R$; $w_{avg}$ otherwise
       Also, an accumulation quota for each node is computed:
       $Q_i = \sum_{l=i}^{k-1} q_l$
    4. Flow Calculation: $x_{i-1,i} = Q_i - W_i$, for $i = 1, 2, \ldots, k-1$, where $x_{i,j}$ is the flow on edge $(i, j)$.

Figure 3: The DDE algorithm for the chain.

The DDE algorithm for the chain shown in Figure 3 is its "integer version." It takes as input the node weights $w_i$ ($i = 0, 1, \ldots, k-1$) and outputs the calculated flow $x_{i-1,i}$ ($i = 1, 2, \ldots, k-1$) for every edge in the chain. The first step is to obtain the total number of tasks in the chain by using the scan with sum operation from node $k-1$ to node 0, where $k$ is the length of the chain. Each node records a partial sum $W_i = \sum_{l=i}^{k-1} w_l$. The second step calculates the average number of tasks per node at node 0. If the number of tasks cannot be evenly divided by $k$, the remaining $R$ tasks are distributed to the first $R$ nodes so that they have one more task than the others. The values of $w_{avg}$ and $R$ are broadcast to every node. In the third step, each node calculates its quota. The accumulation quota $Q_i$ can be calculated directly as follows:

$$Q_i = w_{avg}(k - i) + \max(0, R - i).$$

Each node keeps records of $Q_i$, $W_i$, $Q_j$, and $W_j$, where $j = i+1$. In the fourth step, the flow is calculated by taking the difference between $Q_i$ and $W_i$. Node $i$ calculates $x_{i-1,i}$ and $x_{i,i+1}$. When the flow is available, the workload is exchanged so that each node has the same number of tasks as its quota.

Example 1:

An example is shown in Figure 4. At the beginning of scheduling, each node has $w_i$ tasks ready to be scheduled. Values of $W_i$ are calculated in step 1. Node 0 calculates the values of $w_{avg}$ and $R$:

$$w_{avg} = 4, \quad R = 5.$$

Then, each node calculates the value of $Q_i$ in step 3. The values of $w_i$, $W_i$, $Q_i$, and $x_{i-1,i}$ are as shown below:

    i    w_i    W_i    Q_i    x_{i-1,i}
    0     9     37     37       -
    1     7     28     32       4
    2     4     21     27       6
    3     1     17     22       5
    4     4     16     17       1
    5     6     12     12       0
    6     1      6      8       2
    7     5      5      4      -1

After task exchange, nodes 0 to 4 have five tasks each, and nodes 5 to 7 have four tasks each.

Figure 4: Example for DDE-chain.

Lemma 1: After execution of DDE and task exchange, the number of tasks in each node is equal to its quota.

Proof: After execution of DDE and task exchange, the number of tasks in node $i$ is

$$w'_i = w_i + x_{i-1,i} - x_{i,i+1}.$$

Because $x_{i-1,i} = Q_i - W_i$, $x_{i,i+1} = Q_{i+1} - W_{i+1}$, $W_{i+1} = W_i - w_i$, and $Q_{i+1} = Q_i - q_i$,

$$w'_i = w_i + (Q_i - W_i) - (Q_{i+1} - W_{i+1}) = Q_i - Q_{i+1} = q_i. \qquad \Box$$

In this algorithm, steps 1 and 2 take $2k$ communication steps. The number of communication steps in step 4 is at most $k$. Therefore, the total number of communication steps of this algorithm is no more than $3k$. This algorithm can be further improved by selecting node $k/2$ as the root and applying the TWA algorithm in [6]. Thus, the total number of communication steps of this algorithm can be reduced to $2k$. When $T$ is evenly divided by $k$, this algorithm minimizes the total number of task transfers and the total number of communications. This algorithm also maximizes locality. That is, it minimizes the number of tasks that are migrated to other nodes.

The workload is exchanged according to the flow generated by DDE. There are two task-exchange algorithms. The first one, called receive-before-send, is shown in Figure 5.

Receive-before-send
For node $i$
    1. if $i > 0$ and $x_{i-1,i} > 0$, wait to receive $x_{i-1,i}$ tasks from node $i-1$
    2. if $i < k-1$ and $x_{i,i+1} < 0$, wait to receive $|x_{i,i+1}|$ tasks from node $i+1$
    3. if $i > 0$ and $x_{i-1,i} < 0$, send $|x_{i-1,i}|$ tasks to node $i-1$
    4. if $i < k-1$ and $x_{i,i+1} > 0$, send $x_{i,i+1}$ tasks to node $i+1$

Figure 5: Task exchange: receive-before-send.

Using the receive-before-send algorithm, the load exchange in Example 1 takes four communication steps to finish:

    (1) node 0 to node 1, node 5 to node 6, node 7 to node 6
    (2) node 1 to node 2
    (3) node 2 to node 3
    (4) node 3 to node 4
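The flow calculation of DDE-chain and the receive-before-send schedule can be sketched as below. This is an illustrative, sequential simulation under stated assumptions rather than the authors' message-passing code: the function names `dde_chain` and `receive_before_send_steps` are hypothetical, and one "communication step" is modelled as a synchronous round in which every node that has finished receiving sends.

```python
def dde_chain(w):
    """Return (quota, flow); flow[i] is x_{i,i+1}, positive from node i to node i+1."""
    k = len(w)
    w_avg, r = divmod(sum(w), k)          # step 2: average load and remainder R
    quota = [w_avg + 1 if i < r else w_avg for i in range(k)]   # step 3
    # Steps 1 and 3: suffix sums W_i and Q_i, scanned from node k-1 down to node 0.
    W = [0] * (k + 1)
    Q = [0] * (k + 1)
    for i in range(k - 1, -1, -1):
        W[i] = W[i + 1] + w[i]
        Q[i] = Q[i + 1] + quota[i]
    # Step 4: x_{i,i+1} = Q_{i+1} - W_{i+1}.
    flow = [Q[i + 1] - W[i + 1] for i in range(k - 1)]
    return quota, flow


def receive_before_send_steps(flow):
    """Count synchronous rounds when every node finishes receiving before it sends."""
    k = len(flow) + 1

    def incoming(node):
        edges = []
        if node > 0 and flow[node - 1] > 0:
            edges.append(node - 1)        # tasks arrive from node - 1
        if node < k - 1 and flow[node] < 0:
            edges.append(node)            # tasks arrive from node + 1
        return edges

    pending = {e for e, x in enumerate(flow) if x != 0}
    done = set()
    steps = 0
    while pending:
        ready = set()
        for e in pending:
            sender = e if flow[e] > 0 else e + 1
            if all(d in done for d in incoming(sender)):
                ready.add(e)
        done |= ready
        pending -= ready
        steps += 1
    return steps


w = [9, 7, 4, 1, 4, 6, 1, 5]              # loads of Example 1
quota, flow = dde_chain(w)
steps = receive_before_send_steps(flow)
print(quota, flow, steps)
```

On the loads of Example 1 the sketch yields the flows listed in the table and the four-round schedule given above.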

In the receive-before-send algorithm, each node must receive an incoming message, if any, before sending out messages. By relaxing this constraint, a send-before-receive algorithm is obtained, as shown in Figure 6. In this algorithm, a node can start sending messages out before it has received an incoming message. The communication time and processor idle time can be reduced.

Send-before-receive
For node $i$
    let $a_i = x_{i-1,i}$; $b_i = x_{i,i+1}$
    while ($a_i \neq 0$ or $b_i \neq 0$)
        1. if $i > 0$ and ($w_i \ge -a_i > 0$), send $|a_i|$ tasks to node $i-1$, and let $w_i = w_i + a_i$, $a_i = 0$
        2. if $i < k-1$ and ($w_i \ge b_i > 0$), send $b_i$ tasks to node $i+1$, and let $w_i = w_i - b_i$, $b_i = 0$
        3. if $i > 0$ and $a_i > 0$ and received $a_i$ tasks from node $i-1$, let $w_i = w_i + a_i$, $a_i = 0$
        4. if $i < k-1$ and $b_i < 0$ and received $|b_i|$ tasks from node $i+1$, let $w_i = w_i - b_i$, $b_i = 0$

Figure 6: Task exchange: send-before-receive.

It takes only two communication steps for Example 1:

    1) node 0 to node 1, node 1 to node 2, node 3 to node 4, node 5 to node 6, node 7 to node 6
    2) node 2 to node 3

The send-before-receive algorithm may have some negative impact on locality. In the receive-before-send algorithm, a node can keep the maximum number of local tasks and send non-local tasks to other nodes. But in the send-before-receive algorithm, a node may send local tasks to other nodes and then receive tasks from others. Therefore, the choice between the receive-before-send and send-before-receive algorithms is a trade-off between communication time and locality.

Most massively parallel computers use wormhole routing, with which the effect of path length on communication time can often be ignored. The recursive doubling algorithm [2] can take advantage of the pipeline effect of wormhole routing while avoiding channel contention. This algorithm organizes the nodes in a chain into a tree. An example of eight nodes is shown in Figure 7. Applying the TWA algorithm in [6] to the tree, the load can be balanced within $4 \log k$ communication steps.
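The send-before-receive exchange discussed earlier in this section can be simulated in the same spirit. The sketch below is an assumption-laden illustration: the function name `send_before_receive_steps` is hypothetical, rounds are synchronous, and a node is taken to send as soon as it holds at least the required number of tasks (this boundary case is assumed here because it reproduces the two-step schedule of Example 1). The loads and flows are those of Example 1.

```python
def send_before_receive_steps(w, flow):
    """Simulate synchronous rounds; flow[e] is x_{e,e+1}, positive from node e to e+1."""
    w = list(w)
    remaining = list(flow)                # tasks still to be moved on each edge
    steps = 0
    while any(remaining):
        arrivals = [0] * len(w)
        progressed = False
        for e, x in enumerate(remaining):
            if x == 0:
                continue
            src, dst = (e, e + 1) if x > 0 else (e + 1, e)
            if w[src] >= abs(x):          # enough tasks on hand: send right away
                w[src] -= abs(x)
                arrivals[dst] += abs(x)
                remaining[e] = 0
                progressed = True
        if not progressed:
            raise RuntimeError("no node can send; the flow is not feasible")
        for node, received in enumerate(arrivals):
            w[node] += received           # tasks sent in this round arrive at its end
        steps += 1
    return steps, w


w = [9, 7, 4, 1, 4, 6, 1, 5]              # loads of Example 1
flow = [4, 6, 5, 1, 0, 2, -1]             # x_{i,i+1} from Example 1
steps, final = send_before_receive_steps(w, flow)
print(steps, final)
```

In the first round every node except node 2 can send; node 2 must wait for the six tasks from node 1 before it can forward five tasks to node 3.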

Figure 7: The tree for recursive doubling.

4. The DDE Algorithm for the Ring

A ring can be obtained by adding an end-round connection to a chain. The DDE-chain algorithm can be applied to the ring by ignoring the end-round edge. The load can be balanced; however, the communication may not be minimal. By utilizing the end-round edge, communication could be reduced. We describe an algorithm to minimize the total number of tasks transferred. The algorithm is derived from the minimum cost flow algorithm [4] and shown in Figure 8. In this algorithm, an initial solution is obtained by using DDE-chain without considering the end-round edge. Then, an augmentation is applied to obtain an optimal solution. The complexity of this algorithm is $O(k \log k)$.

DDE-ring
    Apply DDE-chain to the ring without considering the end-round edge $(k-1, 0)$ to obtain $x_{0,1}, x_{1,2}, \ldots, x_{k-2,k-1}$, where $x_{i,j}$ is the flow on edge $(i, j)$. Let $x_{k-1,0}$ be 0.
    If the flow is clockwise, $x_{i,j}$ is positive; otherwise, it is negative.
    Let $n_p$ be the number of edges with $x_{i,j} > 0$, $n_n$ the number of edges with $x_{i,j} < 0$, and $n_z$ the number of edges with $x_{i,j} = 0$.
    1. If $n_n + n_z - n_p < 0$, let $x_m$ be the $m$th largest $x_{i,j}$ from all $x_{i,j} > 0$; and if $n_p + n_z - n_n < 0$, let $x_m$ be the $m$th smallest $x_{i,j}$ from all $x_{i,j} < 0$, where $m = \lceil k/2 \rceil$.
    2. For each edge, $x_{i,j} = x_{i,j} - x_m$.

Figure 8: The DDE algorithm for the ring.

We can use either the receive-before-send or the send-before-receive algorithm for the task exchange of DDE-ring. Here, we let $x_{-1,0} = x_{k-1,0}$.

The following lemma shows that this algorithm minimizes the total cost of flow, that is, the number of tasks transferred.

Lemma 2: After execution of DDE-ring, the total network flow is of minimum cost.

Proof: If $n_p + n_z - n_n \ge 0$ and $n_n + n_z - n_p \ge 0$, there is no flow augmenting cycle with negative cost. Therefore, the network flow is of minimum cost [4].

If $n_n + n_z - n_p < 0$, after the modification $x_{i,j} = x_{i,j} - x_m$, we have

$$n'_z + n'_p \ge m.$$

Note that $n'_n + n'_z + n'_p = k$. Then,

$$n'_z + n'_p - n'_n \ge m - n'_n = m - (k - n'_z - n'_p) \ge 2m - k.$$

Because $m = \lceil k/2 \rceil$,

$$n'_z + n'_p - n'_n \ge 2m - k \ge 0.$$

We also have

$$n'_n + n'_z \ge k - m + 1.$$

Then,

$$n'_n + n'_z - n'_p \ge k - m + 1 - n'_p = k - m + 1 - (k - n'_n - n'_z) = 1 - m + n'_n + n'_z \ge 1 - m + k - m + 1 = k - 2m + 2.$$

Because $m = \lceil k/2 \rceil$,

$$n'_n + n'_z - n'_p \ge k - 2m + 2 \ge 0.$$

Thus, the network flow is of minimum cost.

The case of $n_p + n_z - n_n < 0$ can be proved similarly. Thus, the network flow is of minimum cost in all cases. $\Box$

An example is shown in Figure 9. An end-round edge is added to the chain in Figure 4 to construct a ring. Applying the DDE-chain algorithm to the ring without considering the end-round edge, the flow is shown in Figure 9(a). The number of tasks transferred is 19. The augmentation is applied to this flow:

$$n_p = 1, \quad n_z = 2, \quad n_n = 5.$$

These counts imply that, under the clockwise-positive convention of Figure 9, the five nonzero flows directed from node $i$ to node $i+1$ are negative and the single flow from node 7 to node 6 is positive. Because $n_p + n_z - n_n < 0$ and the 4th smallest $x_{i,j}$ is $-2$, $-2$ is subtracted from every $x_{i,j}$. The result is shown in Figure 9(b). The number of tasks transferred is reduced to 17.

Figure 9: Example for DDE-ring: (a) flow from DDE-chain; (b) flow after augmentation.

5. The DDE Method for the k-ary n-cube

With the DDE-ring algorithm, it is not difficult to compose a DDE algorithm for the k-ary n-cube. The algorithm is shown in Figure 10. In iteration $l$ of the DDE algorithm, subcube $S^l_m$ is divided into $k$ partitions $S^{l+1}_{km+b}$, where $m = 0, 1, \ldots, k^l - 1$ and $b = 0, 1, \ldots, k-1$. $S^l_m$ has $k^{n-l}$ nodes. The nodes in each ring exchange their load, and then each node $i$ has $w^{l+1}_i$ tasks. After executing DDE, node $i$ will have $w^n_i$ tasks. The task exchange step can use either the receive-before-send or the send-before-receive algorithm.

This algorithm can be applied to the n-dimensional torus, which allows different numbers of nodes in different dimensions. Taking a torus and stripping it of all the end-round connections, we get a mesh. This algorithm can be applied to the mesh by performing the DDE-chain algorithm instead of DDE-ring in each step.

The following theorem shows that the load difference of DDE is bounded by $n$, the dimension of the k-ary n-cube.

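DDE for k-ary n-cube
    Assume a k-ary n-cube $S^0$, the number of nodes is $k^n$, and node $i$ has $w^0_i$ tasks.
    for $l = 0$ to $n-1$
        apply the DDE-ring algorithm to the $k^{n-1}$ rings in the $l$th dimension independently, where every ring has $k$ nodes $(a_0, a_1, \ldots, a_l, \ldots, a_{n-1})$ and $a_l = 0, 1, \ldots, k-1$
        exchange tasks according to the flow
        each node updates its weight $w^{l+1}_i = w^l_i + x_{i-1,i} - x_{i,i+1}$

Figure 10: The DDE algorithm for the k-ary n-cube.

Theorem 1: After execution of DDE, the load difference

$$D = \max(w^n_i) - \min(w^n_i)$$

is bounded by $n$.

Proof: In the $l$th step of DDE, a k-ary $(n-l)$-cube is partitioned into $k$ k-ary $(n-l-1)$-cubes. The difference of the number of tasks between two partitions is maximal when, in each ring, every node in the first partition, say $S^{l+1}_{km}$, has one more task than the nodes in the other partitions $S^{l+1}_{km+b}$, where $b = 1, 2, \ldots, k-1$. Thus

$$\sum_{j \in S^{l+1}_{km}} w^{l+1}_j = \sum_{j \in S^{l+1}_{km+k-1}} w^{l+1}_j + |S^{l+1}_{km}| = \frac{1}{k-1}\Big(\sum_{j \in S^l_m} w^l_j - \sum_{j \in S^{l+1}_{km}} w^{l+1}_j\Big) + k^{n-l-1},$$

where $|S^{l+1}_{km}|$ denotes the number of nodes in subcube $S^{l+1}_{km}$, which is $k^{n-l-1}$. Therefore,

$$\sum_{j \in S^{l+1}_{km}} w^{l+1}_j = \frac{1}{k} \sum_{j \in S^l_m} w^l_j + (k-1) k^{n-l-2}.$$

Similarly,

$$\sum_{j \in S^{l+1}_{km+k-1}} w^{l+1}_j = \sum_{j \in S^{l+1}_{km}} w^{l+1}_j - |S^{l+1}_{km}| = \Big(\sum_{j \in S^l_m} w^l_j - (k-1) \sum_{j \in S^{l+1}_{km+k-1}} w^{l+1}_j\Big) - k^{n-l-1}.$$

Therefore,

$$\sum_{j \in S^{l+1}_{km+k-1}} w^{l+1}_j = \frac{1}{k} \sum_{j \in S^l_m} w^l_j - k^{n-l-2}.$$

Let

$$A^l_{max} = \max_{0 \le m < k^l} \sum_{j \in S^l_m} w^l_j$$

and

$$A^l_{min} = \min_{0 \le m < k^l} \sum_{j \in S^l_m} w^l_j.$$

When $l = 0$,

$$A^0_{max} = A^0_{min} = \sum_{j \in S^0} w^0_j = \sum_{0 \le j < k^n} w^0_j = T,$$

where $T$ is the total number of tasks. Thus,

$$A^l_{max} = \begin{cases} T & \text{if } l = 0 \\ \frac{1}{k} A^{l-1}_{max} + (k-1) k^{n-l-1} & \text{otherwise} \end{cases} \qquad (1)$$

Similarly,

$$A^l_{min} = \begin{cases} T & \text{if } l = 0 \\ \frac{1}{k} A^{l-1}_{min} - k^{n-l-1} & \text{otherwise} \end{cases} \qquad (2)$$

The solution to the above recurrences is given by

$$A^l_{max} = \frac{T}{k^l} + (k-1)\, l\, k^{n-l-1} \qquad (3)$$

$$A^l_{min} = \frac{T}{k^l} - l\, k^{n-l-1} \qquad (4)$$

The solution clearly satisfies (1) and (2) for the basis, $l = 0$. If (3) satisfies (1) for $l = m$, then

$$A^{m+1}_{max} = \frac{T}{k^{m+1}} + (m+1)(k-1) k^{n-(m+1)-1} = \frac{1}{k}\Big(\frac{T}{k^m} + (k-1)\, m\, k^{n-m-1}\Big) + (k-1) k^{n-(m+1)-1} = \frac{1}{k} A^{(m+1)-1}_{max} + (k-1) k^{n-(m+1)-1}.$$

Therefore, it satisfies (1) for $l = m+1$. Thus, by induction on $l$ we have shown that (3) satisfies (1) whenever $l \ge 0$. Similarly, it can be shown that (4) satisfies (2) whenever $l \ge 0$.

Let $l = n$:

$$A^n_{max} = \max_{0 \le j < k^n} w^n_j = \frac{T}{k^n} + \frac{k-1}{k}\, n,$$

$$A^n_{min} = \min_{0 \le j < k^n} w^n_j = \frac{T}{k^n} - \frac{1}{k}\, n.$$

Because $D = A^n_{max} - A^n_{min} = \big(\frac{k-1}{k} + \frac{1}{k}\big) n = n$, the numbers of tasks in any two processors differ by at most $n$. $\Box$

An example is shown in Figure 11. This is a 4-ary 2-cube (i.e., a $4 \times 4$ torus). Figure 11(a) shows the DDE-ring algorithm applied to each ring in the first dimension. Then, DDE-ring is applied to each ring in the second dimension, as shown in Figure 11(b). The resultant load distribution is shown in Figure 11(c). The maximum load difference is 2.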
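The ring augmentation of Section 4 and the dimension-by-dimension sweep of this section can be sketched together as follows. This is a minimal sequential sketch, not the authors' implementation. The function names and the sample $4 \times 4$ grid are hypothetical (the grid is not the data of Figure 11), and flows use the chain sign convention (positive from node $i$ to node $i+1$), which mirrors the clockwise convention of the Figure 9 example without changing the number of tasks moved.

```python
def chain_flow(w):
    """DDE-chain flows: x[i] is the flow on edge (i, i+1), positive toward i+1."""
    k = len(w)
    w_avg, r = divmod(sum(w), k)
    x, surplus = [], 0
    for i in range(k - 1):
        quota = w_avg + 1 if i < r else w_avg
        surplus += w[i] - quota           # equals Q_{i+1} - W_{i+1}
        x.append(surplus)
    return x


def ring_flow(w):
    """DDE-ring: chain flows plus the end-round edge (k-1, 0), then one augmentation."""
    k = len(w)
    x = chain_flow(w) + [0]               # x[k-1] is the end-round edge, initially 0
    m = (k + 1) // 2                      # m = ceil(k / 2)
    pos = sorted((v for v in x if v > 0), reverse=True)
    neg = sorted(v for v in x if v < 0)
    n_p, n_n = len(pos), len(neg)
    n_z = k - n_p - n_n
    if n_n + n_z - n_p < 0:
        x_m = pos[m - 1]                  # m-th largest positive flow
    elif n_p + n_z - n_n < 0:
        x_m = neg[m - 1]                  # m-th smallest negative flow
    else:
        x_m = 0                           # already of minimum cost
    return [v - x_m for v in x]


def dde_torus(grid):
    """Apply DDE-ring along the rows, then along the columns, of a k x k torus."""
    k = len(grid)
    g = [row[:] for row in grid]
    for dim in range(2):
        for r in range(k):
            ring = g[r] if dim == 0 else [g[i][r] for i in range(k)]
            x = ring_flow(ring)
            # w_i <- w_i + x_{i-1,i} - x_{i,i+1}; x[-1] is the end-round edge.
            new = [ring[i] + x[i - 1] - x[i] for i in range(k)]
            for i in range(k):
                if dim == 0:
                    g[r][i] = new[i]
                else:
                    g[i][r] = new[i]
    return g


ring = ring_flow([9, 7, 4, 1, 4, 6, 1, 5])        # ring built from Example 1
grid = [[12, 0, 7, 3], [5, 9, 1, 14], [8, 2, 6, 4], [0, 11, 3, 10]]
flat = [v for row in dde_torus(grid) for v in row]
print(ring, sum(abs(v) for v in ring), max(flat) - min(flat))
```

For the ring built from Example 1, the augmentation lowers the number of transferred tasks from 19 to 17, in line with the Figure 9 example. Selecting the $m$th largest (or smallest) flow amounts to shifting all edge flows by a median value, which is why a sort dominates the cost.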

Figure 11: Example for DDE (4-ary 2-cube): (a) DDE-ring in the first dimension; (b) DDE-ring in the second dimension; (c) resultant load distribution.

6. Experimental Results

In this section, we compare the performance of GDE and DDE. We consider a test set of load distributions, in which the load at each processor is randomly selected with the mean equal to a specified value. In this simulation experiment, the average number of tasks (average weight) per processor is 1,000. Each result is the average of 100 test cases. We tested an $8 \times 8$ mesh, a $16 \times 16$ torus, an $8 \times 8 \times 8$ 3D-mesh, and a $16 \times 16 \times 16$ 3D-torus. For these networks, the optimal value of $\lambda$ for GDE is 0.723 [10].

First, we compare the load imbalance of GDE and DDE. The load difference of DDE is bounded by $n$, whereas that of GDE is bounded by $n(k-1)$ for the mesh and $nk/2$ for the torus. Figure 12 shows its average in different networks. Here, the load difference of GDE is four to six times larger than that of DDE.

Figure 12: Load difference of GDE and DDE for the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.

DDE completes load balancing in one sweep, but GDE needs many sweeps. Figure 13 shows the number of sweeps $s$ for different networks. The value of $s$ is proportional to $k$ [10]. Moreover, $s$ increases with the average weight. Table I shows the relationship between the number of sweeps and the average number of tasks, measured on an $8 \times 8$ mesh.

Figure 13: The number of sweeps of GDE and DDE for the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.

Next, we compare the number of communication steps of GDE and DDE. For GDE, each sweep has $c$ iterations, where $c$ is the number of colors. For an even $k$, $c = 2n$. Each iteration has three communications, two for exchanging load information and one for load balancing. Therefore, the total number of communications of $s$ sweeps is $3sc = 6sn$. For DDE, there are $k$ communication steps in each dimension for collection and broadcasting of load information. Load balancing needs at most $k-1$ and $k/2$ communication steps for the mesh and the torus, respectively. Therefore, $2kn$ or $\frac{3}{2}kn$ communication steps in total are required. DDE can reduce the number of communication steps significantly. The analysis has been confirmed by the experiment, as shown in Figure 14.
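As a rough worked instance of these counts, take the $8 \times 8$ mesh ($n = 2$, $k = 8$) together with the sweep count $s \approx 11.08$ that Table I reports for an average weight of 1,000. Plugging that measured average into the formula is an assumption made here only for illustration.

```latex
\[
\text{GDE: } 3sc = 6sn \approx 6 \times 11.08 \times 2 \approx 133,
\qquad
\text{DDE (mesh): } 2kn = 2 \times 8 \times 2 = 32 .
\]
\[
\text{DDE (} 16 \times 16 \text{ torus): } \tfrac{3}{2}kn = \tfrac{3}{2} \times 16 \times 2 = 48 .
\]
```

Under these formulas the DDE count depends only on $k$ and $n$, whereas the GDE count grows with the number of sweeps $s$.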

Table I: The Relationship Between the Number of Sweeps and the Average Weight

    Average Number of Tasks     100     300     1,000    3,000    10,000
    Average Number of Sweeps    7.28    9.20    11.08    13.02    14.67

Figure 14: The number of communication steps of GDE and DDE for the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.

Figure 15 shows the normalized communication cost of GDE and DDE. The normalized communication cost is defined as the total number of tasks transferred divided by the total number of tasks:

$$\frac{\sum_j e_j}{\sum_i w_i},$$

where $e_j$ is the number of tasks transmitted through edge $j$. The communication cost of GDE is about 50% larger than that of DDE. This is because GDE transfers tasks unnecessarily.

Finally, DDE has better locality than GDE. Figure 16 shows the percentage of local tasks that are not migrated to other nodes. DDE keeps 20% to 50% more tasks local.
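As a small worked instance of the normalized communication cost, consider the eight-node example of Sections 3 and 4, where the loads sum to $T = 37$, the DDE-chain flow transfers 19 tasks, and the DDE-ring flow transfers 17. These single-ring values only illustrate the ratio. They are not the experimental results of Figure 15.

```latex
\[
\left.\frac{\sum_j e_j}{\sum_i w_i}\right|_{\text{chain}} = \frac{19}{37} \approx 0.51,
\qquad
\left.\frac{\sum_j e_j}{\sum_i w_i}\right|_{\text{ring}} = \frac{17}{37} \approx 0.46 .
\]
```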

Figure 15: Normalized communication cost of GDE and DDE for the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.

Figure 16: The percentage of local tasks for GDE and DDE on the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.

7. Conclusion

This paper proposed a direct method for load balancing. It extended the DEM algorithm to the k-ary n-cube. Compared to the GDE algorithm, which also extended DEM to the k-ary n-cube, DDE is faster, balances the load well, reduces communications, and keeps better locality.

DDE can be further improved for a more balanced load and less communication by extending the mesh walking algorithm [8]. However, DDE retains its simplicity of implementation and can deliver satisfactory performance at the same time.

References

[1] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. J. of Parallel Distrib. Comput., 7:279-301, 1989.

[2] M. Barnett et al. Broadcasting on meshes with wormhole routing. Technical Report TR-93-24, Univ. Texas at Austin, 1993.

[3] S. H. Hosseini, B. Litow, M. Malkawi, J. McPherson, and K. Vairavan. Analysis of a graph coloring based distributed load balancing algorithm. Journal of Parallel and Distributed Computing, 10:160-166, 1990.

[4] E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.

[5] S. Ranka, Y. Won, and S. Sahni. Programming a hypercube multicomputer. IEEE Software, pages 69-77, September 1988.

[6] W. Shu and M. Y. Wu. Runtime parallel scheduling for distributed memory computers. In Int'l Conf. on Parallel Processing, pages II.143-150, August 1995.

[7] Marc Willebeek-LeMair and Anthony P. Reeves. Strategies for dynamic load balancing on highly parallel computers. IEEE Trans. Parallel and Distributed System, 9(4):979-993, September 1993.

[8] M. Y. Wu and W. Shu. High-performance incremental scheduling on massively parallel computers - a global approach. In Supercomputing '95, December 1995.

[9] C. Z. Xu and F. C. M. Lau. Analysis of the generalized dimension exchange method for dynamic load balancing. Journal of Parallel and Distributed Computing, 16(4):385-393, December 1992.

[10] C. Z. Xu and F. C. M. Lau. The generalized dimension exchange method for load balancing in k-ary n-cubes and variants. Journal of Parallel and Distributed Computing, 24(1):72-85, January 1995.