



DDE: A Modified Dimension Exchange Method for Load Balancing in k-ary n-cubes

Min-You Wu and Wei Shu
Department of Computer Science
State University of New York at Buffalo
Buffalo, NY 14260
{wu,shu}@cs.buffalo.edu

Abstract

The dimension exchange method (DEM) was initially proposed as a load-balancing algorithm for the hypercube structure. It has been generalized to k-ary n-cubes. However, the k-ary n-cube algorithm must take many iterations to converge to a balanced state. In this paper, we propose a direct method to modify DEM. The new algorithm, the Direct Dimension Exchange (DDE) method, takes the load average in every dimension to eliminate unnecessary load exchange. It balances the load directly without iteratively exchanging the load. It is able to balance the load more accurately and much faster.

1. Introduction

The dimension exchange method (DEM) was initially proposed as a fully load-balancing algorithm for the hypercube structure [5, 1]. It balances the load for independent tasks on distributed memory machines. The experiment carried out by Willebeek-LeMair and Reeves confirmed that DEM is superior to other scheduling methods [7]. DEM for the hypercube network is a simple algorithm. Load balancing is performed iteratively in each of the $\log N$ dimensions, in which only node pairs exchange their load information and attempt to average the number of tasks. After a sweep ($\log N$ iterations), the load is balanced.

Unfortunately, when DEM is applied to other structures, such as the mesh or the k-ary n-cube, it takes many sweeps to converge to the balanced load. Hosseini et al. extended it to arbitrary structures using the technique of edge-coloring of graphs [3]. Xu and Lau proposed the generalized dimension exchange (GDE) method [9]. The GDE method was extended to the k-ary n-cube network [10]. Because a node exchanges workload with only one of its neighbors at a time, GDE is not able to reach the balanced state in one sweep. The number of sweeps for convergence is linearly proportional to the number of nodes in a chain, and hence to the dimension order $k$ of the k-ary n-cube structure.
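The hypercube sweep described above can be sketched in a few lines. This is a minimal, sequential simulation rather than the authors' implementation: the function name `dem_sweep` and the sample loads are illustrative assumptions, and the pairwise rule (the more loaded node keeps the ceiling of the average) follows the usual integer version of DEM.

```python
def dem_sweep(w):
    """Simulate one DEM sweep on a hypercube with len(w) = 2**n nodes.

    w[i] is the task count of node i. In dimension l, node i is paired
    with node j = i XOR 2**l, and the pair averages its load.
    """
    n = (len(w) - 1).bit_length()
    assert len(w) == 1 << n, "the number of nodes must be a power of two"
    w = list(w)
    for l in range(n):                      # one iteration per dimension
        for i in range(len(w)):
            j = i ^ (1 << l)
            if i < j:                       # handle each pair once
                total = w[i] + w[j]
                high, low = (total + 1) // 2, total // 2
                # The more heavily loaded node keeps the ceiling.
                if w[i] > w[j]:
                    w[i], w[j] = high, low
                else:
                    w[i], w[j] = low, high
    return w


loads = [9, 7, 4, 1, 4, 6, 1, 5]            # illustrative loads, 8 nodes (n = 3)
balanced = dem_sweep(loads)
print(balanced, max(balanced) - min(balanced))
```

For these particular loads a single sweep already leaves a difference of one task between the most and least loaded nodes.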

We present a direct method for the k-ary n-cube, called the Direct Dimension Exchange (DDE) method. Unlike iterative algorithms, this direct method can balance the load in one sweep. The load in a chain is fully balanced by using the total number of tasks, which can easily be obtained by a sum reduction. Each node in the chain then knows whether it is overloaded or underloaded and exchanges workload with other nodes accordingly. The DDE method can be applied to two or more dimensions to balance the load for the mesh, the torus, and the k-ary n-cube.

This paper is organized as follows. Section 2 briefly reviews the DEM and the GDE algorithms. Then, the direct method for the chain and the ring structures is described in Sections 3 and 4, respectively. The algorithm for the k-ary n-cube is presented in Section 5. In Section 6, the direct method is compared to the GDE method. Section 7 concludes the paper.

2. The DEM and GDE Algorithms

The goal of load balancing is to schedule work so that each processor has the same workload. To achieve this goal, an estimate of the task execution time is needed, which can be provided either by a programmer or by a compiler. Sometimes the estimate is application-specific, and sometimes it is impossible to obtain. Due to these difficulties, each task is presumed to require the same execution time, and the goal of the algorithm is to schedule tasks so that each processor has the same number of tasks.

The scheduling problem can be described as follows. In a parallel or distributed system, $N$ computing nodes are connected by a given topology. Each node $i$ has $w_i$ tasks. A scheduling algorithm redistributes tasks so that the number of tasks in each node is equal. Assume the sum of $w_i$ over all nodes can be evenly divided by $N$. The average number of tasks $w_{avg}$ is calculated as

\[ w_{avg} = \frac{\sum_{i=0}^{N-1} w_i}{N}. \]

Each node should have $w_{avg}$ tasks after scheduling.

DEM was designed for the hypercube structure. In DEM, small domains are balanced first and then combined to form larger domains until ultimately the entire system is balanced. The "integer version" of DEM is described in Figure 1. All node pairs in the first dimension, whose addresses differ in only the least significant bit, balance the load between themselves. Next, all node pairs in the second dimension balance the load between themselves, and so forth, until each node has balanced its load with each of its neighbors.

    DEM
    for l = 0 to n - 1
        node i exchanges with node j the current values of $w_i$ and $w_j$, where $j = i \oplus 2^l$
        if $(w_i - w_j) > 1$, send $\lfloor (w_i - w_j)/2 \rfloor$ tasks to node j
        if $(w_j - w_i) > 1$, receive $\lfloor (w_j - w_i)/2 \rfloor$ tasks from node j
        $w_i = \lceil (w_i + w_j)/2 \rceil$ if $w_i > w_j$; $w_i = \lfloor (w_i + w_j)/2 \rfloor$ otherwise

Figure 1: The DEM algorithm for the hypercube.

After execution of the DEM algorithm, the load difference

\[ D = \max(w_i) - \min(w_i) \]

is bounded by $n$, the dimension of the hypercube [3]. The number of communication steps of the DEM algorithm is $3n$ [7].

The GDE algorithm operates on color graphs derived from edge-coloring of the given system graph. The "integer version" of the algorithm is shown in Figure 2. A node finishes a complete sweep after $c$ consecutive exchange operations, where $c$ is the number of colors. In k-ary n-cubes, $c = 2n$ if $k$ is an even number. The termination condition is that the difference of the number of tasks between neighboring nodes is less than or equal to one. The convergence rate depends on the exchange parameter $\lambda$. The value varies for different topologies and different network sizes. For the hypercube, the optimal $\lambda = 1/2$, and GDE is equivalent to the original DEM algorithm. For other topologies, $\lambda$ is to be optimized to maximize the convergence rate. For the k-ary n-cube, the load difference between any pair of nodes is bounded by $nk/2$. The convergence rate decreases when the dimension order $k$ increases. There is no communication conflict in this algorithm.

    GDE
    while (not terminate)
        for l = 1 to c
            for edge colored l connecting nodes i and j
                node i exchanges with node j the current values of $w_i$ and $w_j$
                if $(w_i - w_j) > 1$, send $\lfloor \lambda (w_i - w_j) \rfloor$ tasks to node j
                if $(w_j - w_i) > 1$, receive $\lfloor \lambda (w_j - w_i) \rfloor$ tasks from node j
                $w_i = \lceil (1 - \lambda) w_i + \lambda w_j \rceil$ if $w_i > w_j$; $w_i = \lfloor (1 - \lambda) w_i + \lambda w_j \rfloor$ otherwise

Figure 2: The GDE algorithm for the k-ary n-cube.

Willebeek-LeMair and Reeves suggested another approach to extend DEM to an $M \times M$ mesh topology by "folding" the mesh in each dimension $\lceil \log M \rceil$ times [7]. This method could be applied to k-ary n-cubes too. The load difference is bounded by $n \lceil \log k \rceil$. However, in this approach, node pairs would no longer be directly linked to one another and communications would conflict.

3. The DDE Method for the Chain

Instead of using the GDE method, which balances the load iteratively, we propose a direct method. The workload in a chain can be balanced directly. The basic idea is to calculate the total number of tasks in the chain and the average number of tasks per node. Thus, nodes in the chain can exchange tasks to balance the load.

    DDE-chain
    Let $w_i$ be the number of tasks in node i, where i = 0, 1, ..., k - 1.
    1. Global Information Collection: Perform the scan with sum operation of $w_i$:
           $W_i = \sum_{l=i}^{k-1} w_l$
    2. Average Load Calculation: $T = W_0$, $w_{avg} = \lfloor T/k \rfloor$, and $R = T \bmod k$,
       where T is the total number of tasks.
    3. Quota Calculation: The quota $q_i$ of each node is computed:
           $q_i = w_{avg} + 1$ if $i < R$; $q_i = w_{avg}$ otherwise
       Also, an accumulation quota for each node is computed:
           $Q_i = \sum_{l=i}^{k-1} q_l$
    4. Flow Calculation: $x_{i-1,i} = Q_i - W_i$, for i = 1, 2, ..., k - 1,
       where $x_{i,j}$ is the flow on edge (i, j).

Figure 3: The DDE algorithm for the chain.

The DDE algorithm for the chain shown in Figure 3 is its "integer version." It takes as input the node weights $w_i$ ($i = 0, 1, \ldots, k-1$) and outputs the calculated flow $x_{i-1,i}$ ($i = 1, 2, \ldots, k-1$) for every edge in the chain. The first step is to obtain the total number of tasks in the chain by using the scan with sum operation from node $k-1$ to node 0, where $k$ is the length of the chain. Each node records a partial sum $W_i = \sum_{l=i}^{k-1} w_l$. The second step calculates the average number of tasks per node at node 0. If the number of tasks cannot be evenly divided by $k$, the remaining $R$ tasks are distributed to the first $R$ nodes so that they have one more task than the others. The values of $w_{avg}$ and $R$ are broadcast to every node. In the third step, each node calculates its quota. The accumulation quota $Q_i$ can be calculated directly as follows:

\[ Q_i = w_{avg}(k - i) + \max(0, R - i). \]

Each node keeps records of $Q_i$, $W_i$, $Q_j$, and $W_j$, where $j = i + 1$. In the fourth step, the flow is calculated by taking the difference between $Q_i$ and $W_i$. Node $i$ calculates $x_{i-1,i}$ and $x_{i,i+1}$. When the flow is available, the workload is exchanged so that each node has the same number of tasks as its quota.

Example 1:

An example is shown in Figure 4. At the beginning of scheduling, each node has $w_i$ tasks ready to be scheduled. Values of $W_i$ are calculated in step 1. Node 0 calculates the values of $w_{avg}$ and $R$:

\[ w_{avg} = 4, \quad R = 5. \]

Then, each node calculates the value of $Q_i$ in step 3. The values of $w_i$, $W_i$, $Q_i$, and $x_{i-1,i}$ are as shown below:

    i    w_i    W_i    Q_i    x_{i-1,i}
    0     9     37     37        -
    1     7     28     32        4
    2     4     21     27        6
    3     1     17     22        5
    4     4     16     17        1
    5     6     12     12        0
    6     1      6      8        2
    7     5      5      4       -1

After task exchange, nodes 0 to 4 have five tasks each, and nodes 5 to 7 have four tasks each.

Figure 4: Example for DDE-chain. [Chain of nodes i = 0, ..., 7 with loads 9, 7, 4, 1, 4, 6, 1, 5 and the edge flows listed in the table above.]

Lemma 1: After execution of DDE and task exchange, the number of tasks in each node is equal to its quota.

Proof: After execution of DDE and task exchange, the number of tasks in node $i$ is

\[ w'_i = w_i + x_{i-1,i} - x_{i,i+1}. \]

Because $x_{i-1,i} = Q_i - W_i$, $x_{i,i+1} = Q_{i+1} - W_{i+1}$, $W_{i+1} = W_i - w_i$, and $Q_{i+1} = Q_i - q_i$,

\[ w'_i = w_i + (Q_i - W_i) - (Q_{i+1} - W_{i+1}) = Q_i - Q_{i+1} = q_i. \]

□

In this algorithm, steps 1 and 2 take $2k$ communication steps. The number of communication steps in step 4 is at most $k$. Therefore, the total number of communication steps of this algorithm is no more than $3k$. This algorithm can be further improved by selecting node $k/2$ as the root and applying the TWA algorithm in [6]. Thus, the total number of communication steps of this algorithm can be reduced to $2k$. When $T$ is evenly divided by $k$, this algorithm minimizes the total number of task transfers and the total number of communications. This algorithm also maximizes locality. That is, it minimizes the number of tasks that are migrated to other nodes.

The workload is exchanged according to the flow generated by DDE. There are two task-exchange algorithms. The first one, called receive-before-send, is shown in Figure 5.

    Receive-before-send
    For node i
    1. if $i > 0$ and $x_{i-1,i} > 0$, wait to receive $x_{i-1,i}$ tasks from node i - 1
    2. if $i < k-1$ and $x_{i,i+1} < 0$, wait to receive $|x_{i,i+1}|$ tasks from node i + 1
    3. if $i > 0$ and $x_{i-1,i} < 0$, send $|x_{i-1,i}|$ tasks to node i - 1
    4. if $i < k-1$ and $x_{i,i+1} > 0$, send $x_{i,i+1}$ tasks to node i + 1

Figure 5: Task exchange: receive-before-send.

Using the receive-before-send algorithm, the load exchange in Example 1 takes four communication steps to finish:

    (1) node 0 to node 1, node 5 to node 6, node 7 to node 6
    (2) node 1 to node 2
    (3) node 2 to node 3
    (4) node 3 to node 4
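The four steps of DDE-chain and the receive-before-send schedule can be checked on Example 1 with a short simulation. This is a sketch under stated assumptions, not the authors' code: the function names are hypothetical, the chain is simulated sequentially in one process, and the step count assumes that each transfer takes one communication step and that a node sending in both directions performs rule 3 before rule 4.

```python
from itertools import accumulate


def dde_chain(w):
    """Return (quotas q, flows x), where x[i] is the flow on edge (i, i+1)."""
    k = len(w)
    # Step 1: suffix sums W_i = w_i + ... + w_{k-1} (scan from node k-1 to 0).
    W = list(accumulate(reversed(w)))[::-1]
    # Step 2: average load and remainder, as computed at node 0.
    w_avg, r = divmod(W[0], k)
    # Step 3: quotas and accumulation quotas Q_i = q_i + ... + q_{k-1}.
    q = [w_avg + 1 if i < r else w_avg for i in range(k)]
    Q = list(accumulate(reversed(q)))[::-1]
    # Step 4: the flow on edge (i-1, i) is Q_i - W_i for i = 1..k-1.
    x = [Q[i] - W[i] for i in range(1, k)]
    return q, x


def apply_flows(w, x):
    """New load of node i: w_i + x_{i-1,i} - x_{i,i+1}."""
    inflow, outflow = [0] + x, x + [0]
    return [w[i] + inflow[i] - outflow[i] for i in range(len(w))]


def receive_before_send_steps(x):
    """Number of communication steps when every node receives before sending."""
    k = len(x) + 1
    done = {}                               # edge index -> step of its transfer

    def finish(e):
        if x[e] == 0:
            return 0                        # nothing moves on this edge
        if e not in done:
            s = e if x[e] > 0 else e + 1    # node that sends on edge e
            ready = 0
            if s > 0 and x[s - 1] > 0:      # rule 1: wait for the left neighbor
                ready = max(ready, finish(s - 1))
            if s < k - 1 and x[s] < 0:      # rule 2: wait for the right neighbor
                ready = max(ready, finish(s))
            # Assumption: a send to the right follows a send to the left.
            after_left_send = x[e] > 0 and s > 0 and x[s - 1] < 0
            done[e] = ready + 1 + (1 if after_left_send else 0)
        return done[e]

    return max((finish(e) for e in range(k - 1)), default=0)


w = [9, 7, 4, 1, 4, 6, 1, 5]                # loads of Example 1
q, x = dde_chain(w)
print(x, apply_flows(w, x), receive_before_send_steps(x))
```

Under these assumptions the flows match the table of Example 1, every node ends with its quota as Lemma 1 states, and the schedule finishes in four steps.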

In the receive-before-send algorithm, each node must receive an incoming message, if any, before sending out messages. By relaxing this constraint, a send-before-receive algorithm is obtained, as shown in Figure 6. In this algorithm, a node can start sending messages out before it has received an incoming message. The communication time and processor idle time can be reduced. It takes only two communication steps for Example 1:

    (1) node 0 to node 1, node 1 to node 2, node 3 to node 4, node 5 to node 6, node 7 to node 6
    (2) node 2 to node 3

    Send-before-receive
    For node i
    let $a_i = x_{i-1,i}$; $b_i = x_{i,i+1}$
    while ($a_i \neq 0$ or $b_i \neq 0$)
    1. if $i > 0$ and ($w_i > -a_i > 0$), send $|a_i|$ tasks to node i - 1, and let $w_i = w_i + a_i$, $a_i = 0$
    2. if $i < k-1$ and ($w_i > b_i > 0$), send $b_i$ tasks to node i + 1, and let $w_i = w_i - b_i$, $b_i = 0$
    3. if $i > 0$ and $a_i > 0$ and received $a_i$ tasks from node i - 1, let $w_i = w_i + a_i$, $a_i = 0$
    4. if $i < k-1$ and $b_i < 0$ and received $|b_i|$ tasks from node i + 1, let $w_i = w_i - b_i$, $b_i = 0$

Figure 6: Task exchange: send-before-receive.

The send-before-receive algorithm may have some negative impact on locality. In the receive-before-send algorithm, a node can keep the maximum number of local tasks and send non-local tasks to other nodes. But in the send-before-receive algorithm, a node may send local tasks to other nodes and then receive tasks from others. Therefore, the choice between the receive-before-send and send-before-receive algorithms is a trade-off between communication time and locality.

Most massively parallel computers use wormhole routing, with which the effect of path length on communication time can often be ignored. The recursive doubling algorithm [2] can take advantage of the pipeline effect of wormhole routing while avoiding channel contention. This algorithm organizes the nodes in a chain into a tree. An example of eight nodes is shown in Figure 7. Applying the TWA algorithm in [6] to the tree, the load can be balanced within $4 \log k$ communication steps.

Figure 7: The tree for recursive doubling. [Tree over the chain nodes i = 0, ..., 7.]

4. The DDE Algorithm for the Ring

A ring can be obtained by adding an end-round connection to a chain. The DDE-chain algorithm can be applied to the ring by ignoring the end-round edge. The load can be balanced; however, the communication may not be minimal. By utilizing the end-round edge, communication can be reduced. We describe an algorithm to minimize the total number of tasks transferred. The algorithm is derived from the minimum cost flow algorithm [4] and shown in Figure 8. In this algorithm, an initial solution is obtained by using DDE-chain without considering the end-round edge. Then, an augmentation is applied to obtain an optimal solution. The complexity of this algorithm is $O(k \log k)$.

    DDE-ring
    Apply DDE-chain to the ring without considering the end-round edge (k - 1, 0) to obtain
    $x_{0,1}, x_{1,2}, \ldots, x_{k-2,k-1}$, where $x_{i,j}$ is the flow on edge (i, j). Let $x_{k-1,0}$ be 0.
    If the flow is clockwise, $x_{i,j}$ is positive; otherwise, it is negative.
    Let $n_p$ be the number of edges with $x_{i,j} > 0$, $n_n$ the number of edges with $x_{i,j} < 0$,
    and $n_z$ the number of edges with $x_{i,j} = 0$.
    1. If $n_n + n_z - n_p < 0$, let $x_m$ be the m-th largest $x_{i,j}$ from all $x_{i,j} > 0$; and if
       $n_p + n_z - n_n < 0$, let $x_m$ be the m-th smallest $x_{i,j}$ from all $x_{i,j} < 0$, where $m = \lceil k/2 \rceil$.
    2. If $x_m$ was selected in step 1, then for each edge, $x_{i,j} = x_{i,j} - x_m$.

Figure 8: The DDE algorithm for the ring.

We can use either the receive-before-send or the send-before-receive algorithm for task exchange of DDE-ring. Here, we let $x_{-1,0} = x_{k-1,0}$.

The following lemma shows that this algorithm minimizes the total cost of flow, that is, the number of tasks transferred.

Lemma 2: After execution of DDE-ring, the total network flow is of minimum cost.

Proof: If $n_p + n_z - n_n \geq 0$ and $n_n + n_z - n_p \geq 0$, there is no flow augmenting cycle with negative cost. Therefore, the network flow is of minimum cost [4].

If $n_n + n_z - n_p < 0$, after the modification $x_{i,j} = x_{i,j} - x_m$, let $n'_p$, $n'_n$, and $n'_z$ denote the new counts. We have

\[ n'_z + n'_p \geq m. \]

Note that $n'_n + n'_z + n'_p = k$. Then,

\[ n'_z + n'_p - n'_n \geq m - n'_n = m - (k - n'_z - n'_p) \geq 2m - k. \]

Because $m = \lceil k/2 \rceil$,

\[ n'_z + n'_p - n'_n \geq 2m - k \geq 0. \]

We also have

\[ n'_n + n'_z \geq k - m + 1. \]

Then,

\[ n'_n + n'_z - n'_p \geq k - m + 1 - n'_p = k - m + 1 - (k - n'_n - n'_z) = 1 - m + n'_n + n'_z \geq 1 - m + k - m + 1 = k - 2m + 2. \]

Because $m = \lceil k/2 \rceil$,

\[ n'_n + n'_z - n'_p \geq k - 2m + 2 \geq 0. \]

Thus, the network flow is of minimum cost.

The case of $n_p + n_z - n_n < 0$ can be proved similarly. Thus, the network flow is of minimum cost in all cases. □

An example is shown in Figure 9. An end-round edge is added to the chain in Figure 4 to construct a ring. Applying the DDE-chain algorithm to the ring without considering the end-round edge, the flow is shown in Figure 9(a). The number of tasks transferred is 19. The augmentation is applied to this flow:

\[ n_p = 1, \quad n_z = 2, \quad n_n = 5. \]

Because $n_p + n_z - n_n < 0$ and the 4th smallest $x_{i,j}$ is $-2$, $-2$ is subtracted from every $x_{i,j}$. The result is shown in Figure 9(b). The number of tasks transferred is reduced to 17.

Figure 9: Example for DDE-ring. [(a) The ring of nodes i = 0, ..., 7 with the DDE-chain flows; (b) the same ring after augmentation.]

5. The DDE Method for the k-ary n-cube

With the DDE-ring algorithm, it is not difficult to compose a DDE algorithm for the k-ary n-cube. The algorithm is shown in Figure 10. In iteration $l$ of the DDE algorithm, subcube $S^l_m$ is divided into $k$ partitions $S^{l+1}_{km+b}$, where $m = 0, 1, \ldots, k^l - 1$ and $b = 0, 1, \ldots, k - 1$. $S^l_m$ has $k^{n-l}$ nodes. The nodes in each ring exchange their load, and then each node $i$ has $w^{l+1}_i$ tasks. After executing DDE, node $i$ will have $w^n_i$ tasks. The task exchange step can use either the receive-before-send or the send-before-receive algorithm.

This algorithm can be applied to the n-dimensional torus, which allows a different number of nodes in each dimension. Stripping a torus of all its end-round connections yields a mesh. The algorithm can be applied to the mesh by performing the DDE-chain algorithm instead of DDE-ring in each step.

The following theorem shows that the load difference of DDE is bounded by $n$, the dimension of the k-ary n-cube.
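Because every iteration of the k-ary n-cube algorithm relies on the DDE-ring step of Section 4, it may help to see that step in isolation. The sketch below applies the augmentation of Figure 8 to the flows of the Figure 9 example. The function name is hypothetical, and the sign convention is an assumption inferred from the counts quoted in the example: flow from node i to node i + 1 is taken as counterclockwise, hence negative.

```python
def dde_ring_augment(x):
    """Apply the DDE-ring augmentation to the flows x of all k ring edges.

    x includes the end-round edge; clockwise flow is positive.
    """
    k = len(x)
    m = -(-k // 2)                                  # m = ceil(k / 2)
    pos = sorted((v for v in x if v > 0), reverse=True)
    neg = sorted(v for v in x if v < 0)
    n_p, n_n = len(pos), len(neg)
    n_z = k - n_p - n_n
    if n_n + n_z - n_p < 0:
        x_m = pos[m - 1]                            # m-th largest positive flow
    elif n_p + n_z - n_n < 0:
        x_m = neg[m - 1]                            # m-th smallest negative flow
    else:
        return list(x)                              # already of minimum cost
    return [v - x_m for v in x]


chain_flows = [4, 6, 5, 1, 0, 2, -1]                # DDE-chain flows of Example 1
ring = [-f for f in chain_flows] + [0]              # assumed orientation, plus x_{k-1,0} = 0
augmented = dde_ring_augment(ring)

cost_before = sum(abs(v) for v in ring)
cost_after = sum(abs(v) for v in augmented)
print(cost_before, cost_after, augmented)
```

With this orientation the counts are one positive, two zero, and five negative edges, the selected value is -2, and the total number of tasks transferred drops from 19 to 17, as in the example. Shifting every edge by the same amount leaves each node's net inflow unchanged, so the final loads are the same as with DDE-chain.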

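    DDE for k-ary n-cube
    Assume a k-ary n-cube $S^0$, the number of nodes is $k^n$, and node i has $w^0_i$ tasks.
    for l = 0 to n - 1
        apply the DDE-ring algorithm to the $k^{n-1}$ rings in the l-th dimension independently,
            where every ring has k nodes $(a_0, a_1, \ldots, a_l, \ldots, a_{n-1})$ and $a_l = 0, 1, \ldots, k-1$
        exchange tasks according to the flow
        each node updates its weight $w^{l+1}_i = w^l_i + x_{i-1,i} - x_{i,i+1}$

Figure 10: The DDE algorithm for the k-ary n-cube.

Theorem 1: After execution of DDE, the load difference

\[ D = \max(w^n_i) - \min(w^n_i) \]

is bounded by $n$.

Proof: In the $l$-th step of DDE, a k-ary $(n-l)$-cube is partitioned into $k$ k-ary $(n-l-1)$-cubes. The difference of the number of tasks between two partitions is maximal when in each ring every node in the first partition, say $S^{l+1}_{km}$, has one more task than the corresponding node in each of the other partitions $S^{l+1}_{km+b}$, where $b = 1, 2, \ldots, k-1$. Thus

\[ \sum_{j \in S^{l+1}_{km}} w^{l+1}_j = \sum_{j \in S^{l+1}_{km+k-1}} w^{l+1}_j + |S^{l+1}_{km}| = \frac{1}{k-1}\Big(\sum_{j \in S^l_m} w^l_j - \sum_{j \in S^{l+1}_{km}} w^{l+1}_j\Big) + k^{n-l-1}, \]

where $|S^{l+1}_{km}|$ denotes the number of nodes in subcube $S^{l+1}_{km}$, which is $k^{n-l-1}$. Therefore,

\[ \sum_{j \in S^{l+1}_{km}} w^{l+1}_j = \frac{1}{k} \sum_{j \in S^l_m} w^l_j + (k-1) k^{n-l-2}. \]

Similarly,

\[ \sum_{j \in S^{l+1}_{km+k-1}} w^{l+1}_j = \sum_{j \in S^{l+1}_{km}} w^{l+1}_j - |S^{l+1}_{km}| = \Big(\sum_{j \in S^l_m} w^l_j - (k-1) \sum_{j \in S^{l+1}_{km+k-1}} w^{l+1}_j\Big) - k^{n-l-1}. \]

Therefore,

\[ \sum_{j \in S^{l+1}_{km+k-1}} w^{l+1}_j = \frac{1}{k} \sum_{j \in S^l_m} w^l_j - k^{n-l-2}. \]

Let

\[ A^l_{max} = \max_{0 \leq m < k^l} \sum_{j \in S^l_m} w^l_j \]

and

\[ A^l_{min} = \min_{0 \leq m < k^l} \sum_{j \in S^l_m} w^l_j. \]

When $l = 0$,

\[ A^0_{max} = A^0_{min} = \sum_{j \in S^0} w^0_j = \sum_{0 \leq j < k^n} w^0_j = T, \]

where $T$ is the total number of tasks. Thus,

\[ A^l_{max} = \begin{cases} T & \text{if } l = 0 \\ \frac{1}{k} A^{l-1}_{max} + (k-1) k^{n-l-1} & \text{otherwise} \end{cases} \qquad (1) \]

Similarly,

\[ A^l_{min} = \begin{cases} T & \text{if } l = 0 \\ \frac{1}{k} A^{l-1}_{min} - k^{n-l-1} & \text{otherwise} \end{cases} \qquad (2) \]

The solution to the above recurrences is given by

\[ A^l_{max} = \frac{T}{k^l} + (k-1)\, l\, k^{n-l-1} \qquad (3) \]

\[ A^l_{min} = \frac{T}{k^l} - l\, k^{n-l-1} \qquad (4) \]

These clearly satisfy (1) and (2) for the basis, $l = 0$. If (3) satisfies (1) for $l = m$, then

\[ A^{m+1}_{max} = \frac{1}{k} A^{(m+1)-1}_{max} + (k-1) k^{n-(m+1)-1} = \frac{1}{k}\Big(\frac{T}{k^m} + (k-1)\, m\, k^{n-m-1}\Big) + (k-1) k^{n-(m+1)-1} = \frac{T}{k^{m+1}} + (m+1)(k-1) k^{n-(m+1)-1}. \]

Therefore, it satisfies (1) for $l = m + 1$. Thus, by induction on $l$ we have shown that (3) satisfies (1) whenever $l \geq 0$. Similarly, it can be shown that (4) satisfies (2) whenever $l \geq 0$.

Let $l = n$:

\[ A^n_{max} = \max_{0 \leq j < k^n} w^n_j = \frac{T}{k^n} + \frac{k-1}{k}\, n, \]

\[ A^n_{min} = \min_{0 \leq j < k^n} w^n_j = \frac{T}{k^n} - \frac{1}{k}\, n. \]

Because $D = A^n_{max} - A^n_{min} = \big(\frac{k-1}{k} + \frac{1}{k}\big) n = n$, the number of tasks in any two processors differs by at most $n$. □

An example is shown in Figure 11. This is a 4-ary 2-cube (i.e., a $4 \times 4$ torus). Figure 11(a) shows the DDE-ring algorithm applied to each ring in the first dimension. Then, DDE-ring is applied to each ring in the second dimension, as shown in Figure 11(b). The resultant load distribution is shown in Figure 11(c). The maximum load difference is 2.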
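The dimension-by-dimension structure of Figure 10 can be simulated without tracking individual flows, because Lemma 1 guarantees that after each ring exchange every node holds its quota. The sketch below uses that shortcut to compute the final load distribution. The function names and the initial loads are hypothetical (they are not the values of Figure 11), and the node addresses are assumed to be tuples $(a_0, \ldots, a_{n-1})$.

```python
from itertools import product


def ring_quotas(loads):
    """Quotas of DDE-chain steps 2 and 3: the first R nodes get w_avg + 1."""
    k = len(loads)
    w_avg, r = divmod(sum(loads), k)
    return [w_avg + 1 if i < r else w_avg for i in range(k)]


def dde_final_loads(w, k, n):
    """Final loads after DDE on a k-ary n-cube.

    w maps each address tuple (a_0, ..., a_{n-1}) to its task count.
    """
    w = dict(w)
    for l in range(n):                              # one dimension per iteration
        # Fixing the other n-1 coordinates selects one of the k**(n-1) rings.
        for rest in product(range(k), repeat=n - 1):
            ring = [rest[:l] + (a,) + rest[l:] for a in range(k)]
            quotas = ring_quotas([w[node] for node in ring])
            for node, q in zip(ring, quotas):
                w[node] = q                         # Lemma 1: load equals quota
    return w


k, n = 4, 2
# Hypothetical initial loads for a 4 x 4 torus.
initial = {(i, j): (7 * i + 13 * j * j + 3) % 23 for i in range(k) for j in range(k)}
final = dde_final_loads(initial, k, n)
difference = max(final.values()) - min(final.values())
print(sum(initial.values()), sum(final.values()), difference)
```

The total number of tasks is conserved, and for this two-dimensional case the resulting load difference stays within the bound $n = 2$ of Theorem 1.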

Figure 11: Example for DDE (4-ary 2-cube). [(a) DDE-ring applied in the first dimension; (b) DDE-ring applied in the second dimension; (c) resultant load distribution.]

6. Experimental Results

In this section, we compare the performance of GDE and DDE. We consider a test set of load distributions, in which the load at each processor is randomly selected with the mean equal to a specified value. In this simulation experiment, the average number of tasks (average weight) per processor is 1,000. Each result is the average of 100 test cases. We tested an $8 \times 8$ mesh, a $16 \times 16$ torus, an $8 \times 8 \times 8$ 3D-mesh, and a $16 \times 16 \times 16$ 3D-torus. For these networks, the optimal value of $\lambda$ for GDE is 0.723 [10].

First, we compare the load imbalance of GDE and DDE. The load difference of DDE is bounded by $n$, whereas that of GDE is bounded by $n(k-1)$ for the mesh and $nk/2$ for the torus. Figure 12 shows its average in different networks. Here, the load difference of GDE is four to six times larger than that of DDE.

Figure 12: Load difference. [Bar chart comparing GDE and DDE on the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.]

DDE completes load balancing in one sweep, whereas GDE needs many sweeps. Figure 13 shows the number of sweeps $s$ for different networks. The value of $s$ is proportional to $k$ [10]. Moreover, $s$ increases with the average weight. Table I shows the relationship between the number of sweeps and the average number of tasks, measured on an $8 \times 8$ mesh.

Figure 13: The number of sweeps. [Bar chart comparing GDE and DDE on the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.]

Next, we compare the number of communication steps of GDE and DDE. For GDE, each sweep has $c$ iterations, where $c$ is the number of colors. For an even number $k$, $c = 2n$. Each iteration has three communications, two for exchanging load information and one for load balancing. Therefore, the total number of communications of $s$ sweeps is $3sc = 6sn$. For DDE, there are $k$ communication steps in each dimension for collection and broadcasting of load information. Load balancing needs at most $k - 1$ and $k/2$ communication steps for the mesh and the torus, respectively. Therefore, $2kn$ or $\frac{3}{2}kn$ communication steps in total are required. DDE can reduce the number of communication steps significantly. The analysis has been confirmed by the experiment, as shown in Figure 14.
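The step counts above are simple enough to evaluate directly. The following sketch only restates the formulas of this section as code. The function names are hypothetical, the sweep count $s$ is an input, and the value 11.08 is the Table I entry for an average weight of 1,000 on the 8x8 mesh.

```python
def gde_steps(sweeps, n):
    """GDE: 3 communications per iteration, c = 2n iterations per sweep (even k)."""
    colors = 2 * n
    return 3 * sweeps * colors


def dde_steps(k, n, torus):
    """DDE upper bound: k steps per dimension for load information, plus at
    most k/2 (torus) or k - 1 (mesh) steps per dimension for load balancing."""
    exchange = k // 2 if torus else k - 1
    return n * (k + exchange)


mesh_dde = dde_steps(k=8, n=2, torus=False)     # 8x8 mesh
torus_dde = dde_steps(k=16, n=2, torus=True)    # 16x16 torus
mesh_gde = gde_steps(sweeps=11.08, n=2)         # s taken from Table I
print(mesh_dde, torus_dde, round(mesh_gde, 2))
```

For the mesh the count is $n(2k - 1)$, which the text rounds up to $2kn$. For the torus it equals $\frac{3}{2}kn$ when $k$ is even.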

Table I: The Relationship Between the Number of Sweeps and the Average Weight

    Average Number of Tasks      100     300    1,000    3,000    10,000
    Average Number of Sweeps    7.28    9.20    11.08    13.02     14.67

Figure 14: The number of communication steps. [Bar chart comparing GDE and DDE on the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.]

Figure 15 shows the normalized communication cost of GDE and DDE. The normalized communication cost is defined as the total number of tasks transferred divided by the total number of tasks:

\[ \frac{\sum_j e_j}{\sum_i w_i}, \]

where $e_j$ is the number of tasks transmitted through edge $j$. The communication cost of GDE is about 50% larger than that of DDE. This is because GDE transfers tasks unnecessarily.

Finally, DDE has better locality than GDE. Figure 16 shows the percentage of local tasks that are not migrated to other nodes. DDE keeps 20% to 50% more tasks local.
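As a small worked instance of the normalized communication cost, the sketch below evaluates it for the chain of Example 1 and for the augmented ring of Section 4. The function name is hypothetical, and the edge transfers are the flow magnitudes quoted in those examples (19 and 17 tasks in total for 37 tasks overall).

```python
def normalized_cost(edge_transfers, loads):
    """Total number of tasks moved over all edges divided by the total number of tasks."""
    return sum(abs(e) for e in edge_transfers) / sum(loads)


loads = [9, 7, 4, 1, 4, 6, 1, 5]                    # Example 1, T = 37
chain_transfers = [4, 6, 5, 1, 0, 2, 1]             # DDE-chain, 19 tasks moved
ring_transfers = [2, 4, 3, 1, 2, 0, 3, 2]           # DDE-ring, 17 tasks moved

chain_cost = normalized_cost(chain_transfers, loads)
ring_cost = normalized_cost(ring_transfers, loads)
print(round(chain_cost, 3), round(ring_cost, 3))
```

These two values concern a single small chain and ring only. They are not the experimental averages plotted in Figure 15.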

Figure 15: Normalized communication cost. [Bar chart comparing GDE and DDE on the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.]

Figure 16: The percentage of local tasks. [Bar chart comparing GDE and DDE on the 8x8 mesh, 8x8x8 mesh, 16x16 torus, and 16x16x16 torus.]

7. Conclusion

This paper proposed a direct method for load balancing. It extended the DEM algorithm to the k-ary n-cube. Compared to the GDE algorithm, which also extended DEM to the k-ary n-cube, DDE is faster, balances the load well, reduces communications, and keeps better locality.

DDE can be further improved for a more balanced load and fewer communications by extending the mesh walking algorithm [8]. However, DDE retains its simplicity of implementation and can deliver satisfactory performance at the same time.

References

[1] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. J. of Parallel Distrib. Comput., 7:279-301, 1989.

[2] M. Barnett et al. Broadcasting on meshes with wormhole routing. Technical Report TR-93-24, Univ. Texas at Austin, 1993.

[3] S. H. Hosseini, B. Litow, M. Malkawi, J. McPherson, and K. Vairavan. Analysis of a graph coloring based distributed load balancing algorithm. Journal of Parallel and Distributed Computing, 10:160-166, 1990.

[4] E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.

[5] S. Ranka, Y. Won, and S. Sahni. Programming a hypercube multicomputer. IEEE Software, pages 69-77, September 1988.

[6] W. Shu and M. Y. Wu. Runtime parallel scheduling for distributed memory computers. In Int'l Conf. on Parallel Processing, pages II.143-150, August 1995.

[7] Marc Willebeek-LeMair and Anthony P. Reeves. Strategies for dynamic load balancing on highly parallel computers. IEEE Trans. Parallel and Distributed Systems, 4(9):979-993, September 1993.

[8] M. Y. Wu and W. Shu. High-performance incremental scheduling on massively parallel computers: a global approach. In Supercomputing '95, December 1995.

[9] C. Z. Xu and F. C. M. Lau. Analysis of the generalized dimension exchange method for dynamic load balancing. Journal of Parallel and Distributed Computing, 16(4):385-393, December 1992.

[10] C. Z. Xu and F. C. M. Lau. The generalized dimension exchange method for load balancing in k-ary n-cubes and variants. Journal of Parallel and Distributed Computing, 24(1):72-85, January 1995.