Nearest neighbor algorithms for load balancing in parallel computers

Chengzhong Xu
Department of Electrical and Computer Engg., Wayne State University, Detroit, MI 48202
czxu@ece.eng.wayne.edu

Burkhard Monien, Reinhard Lüling
Department of Computer Science, University of Paderborn, Germany
{bm,rl}@uni-paderborn.de

Francis C. M. Lau
Department of Computer Science, The University of Hong Kong, Hong Kong
fcmlau@csd.hku.hk

Abstract

With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on localized workload information and manages workload migrations within its neighborhood. This paper compares a couple of fairly well-known nearest neighbor algorithms, the dimension-exchange (DE, for short) and the diffusion (DF, for short) methods, and their several variants: the average dimension-exchange (ADE), the optimally-tuned dimension-exchange (ODE), the local average diffusion (ADF) and the optimally-tuned diffusion (ODF). The measures of interest are their efficiency in driving any initial workload distribution to a uniform distribution and their ability in controlling the growth of the variance among the processors' workloads. The comparison is made with respect to both one-port and all-port communication architectures and in consideration of various implementation strategies including synchronous/asynchronous invocation policies and static/dynamic random workload behaviors. It turns out that the dimension-exchange method outperforms the diffusion method in the one-port communication model. In particular, the ODE algorithm is best suited for statically synchronous implementations of a load balancing process regardless of its underlying communication models. The strength of the diffusion method is in asynchronous implementations in the all-port communication model; the ODF algorithm performs best in that case. The underlying communication networks considered assume the most popular topologies, the mesh and the torus and their special cases: the hypercube and the k-ary n-cube.

1 Introduction

Massively parallel computers have been shown to be very efficient in solving problems that can be partitioned into tasks with static computation and communication patterns. However, there exists a large class of problems that have unpredictable computational requirements or irregular communication patterns. To solve problems of this kind efficiently in parallel computers, it is necessary to perform load balancing operations at run-time.

The execution of a load balancing procedure requires some means of maintaining a global view of the system and some negotiation mechanism for workload migrations across processors to take place. Every load balancing strategy has to resolve the issues of when to invoke a balancing operation, who makes load balancing decisions according to what information, and how to manage workload migrations between processors. Combining different answers to the above yields a large space of possible designs of load balancing algorithms with widely varying characteristics. Nearest neighbor algorithms are such a class of methods in which processors make decisions based on local information in a decentralized manner and manage workload migrations within the immediate neighborhood [1,2,3,4,5]. Since they would only spread local workload to nearest neighbors, these algorithms can be easily scaled to operate in massively parallel computers of any size, and would tend to preserve the communication locality inherent in the underlying computations. In general, these algorithms are executed iteratively, with the expectation that successive invocations of local load balancing would eventually bring about a global balanced state; hence, they give the flexibility of controlling the balance quality over a spectrum of possibilities, from load sharing (no idle processors coexist with busy processors) to the global balanced state.

Nearest neighbor load balancing algorithms rely on successive approximations to a global uniform distribution, and hence at each operation, need only be concerned with the direction of workload migration and the issue of how to apportion excess workloads. Among existing load balancing methods that are characterized by different choices of the direction of workload migration [6], we are interested in the diffusion and the dimension-exchange methods. These two methods have drawn a fair amount of attention in recent years. With the diffusion method, a heavily or lightly loaded processor balances its workload with all of its nearest neighbors simultaneously in a load balancing operation [7,8]. With the dimension-exchange method, a processor in need of load balancing balances its workload successively with its neighbors one at a time, and each time a new workload index is computed, which will be used in the subsequent pairwise balancing [8,5,9]. These two methods are closely related, and they lend themselves particularly well to implementation in two basic communication architectures, the all-port and the one-port models, respectively. The all-port model allows a processor to exchange messages with all its direct neighbors simultaneously in one communication step, while the one-port model restricts a processor to exchange messages with at most one direct neighbor at one time. Both of these two models were assumed in many recent studies on communication algorithms [10,11]. Although the latest designs of message-passing processors tend to support all-port simultaneous communications, the restrictive one-port model is still
valid in existing real parallel computer systems. Since the cost in setting up a communication is fixed, the total time spent in sending $d$ messages to $d$ different ports, assuming the best possible overlapping in time, is still largely determined by $d$ unless the messages are rather long.

The all-port and one-port models favor the diffusion and the dimension-exchange methods, respectively. In a system that supports all-port communications, a load balancing operation using the diffusion method can be completed in one communication step while that using the dimension-exchange method would take $d$ steps. It appears that the diffusion method has an advantage over the dimension-exchange method as far as exploiting the communication bandwidth is concerned. A natural but interesting question is whether the advantage translates into real performance benefits in load balancing or not. The performance of a load balancing algorithm is determined by two measures. One is efficiency, which is reflected by the number of communication steps required by the algorithm to drive an initial workload distribution into a uniform distribution. This measure alone is sufficient for those kinds of problems that need global balancing at run-time. However, for the other kinds of applications that need to achieve load sharing rather than global balancing, we need another measure, the balance quality, to reflect the ability of the algorithm in bounding the variance of processors' workloads after performing one or more load balancing operations. The objective of this study is to answer the question concerning the performance of the diffusion and the dimension-exchange methods in different communication models.

In the literature, the diffusion and the dimension-exchange methods have received a lot of attention from both theoretical and experimental researchers. The diffusion method was first modeled using linear system theory by Cybenko [8], and Bertsekas and Tsitsiklis [7]. Cybenko showed that the diffusion method will eventually coerce any initial workload distribution into a global uniform distribution in static situations in which no workloads are generated or consumed during load balancing, and presented an asymptotic bound for the variance of any workload distribution during load balancing in the dynamic situation. Similar convergence results in the static situation were obtained independently by Boillat [2]. Boillat also proved that the diffusion load balancing will converge to a global balanced state in polynomial time. The diffusion method in the dynamic situation was studied by Hong et al. [3], and Qian and Yang [4], as well. They presented a constant bound for the variance of workload distribution when applying the method to some specific structures. The diffusion method is characterized by a parameter which determines the portion of excess workload to be diffused away. Xu and Lau analyzed the effects of the parameter on the efficiency of the diffusion method, and derived its optimal values for the mesh and the torus networks [5].

The dimension-exchange method was conceptually designed for hypercube-structured parallel computers, in which balancing proceeds iteratively in dimensions. At each dimension, a processor balances its workload with that of its neighbor belonging to the dimension. Cybenko showed that regardless of the order of dimensions considered, this simple load balancing method yields a uniform distribution from any initial workload distribution after a round of balancing operations [8]. He also revealed the superiority of the dimension-exchange method
over the diffusion method in terms of their efficiencies and balance qualities.

The dimension-exchange method is not limited to hypercube structures. Hosseini et al. applied this method to arbitrary structures based on edge-coloring [6]. Furthermore, Xu and Lau showed that "equal splitting" of the workload in a pairwise balancing operation might not lead to maximum efficiency in most popular structures, such as the mesh and the torus, although it performs best in the hypercube [5,9]. Through introducing an exchange parameter to govern the splitting of workload at every step, they derived the optimal values in closed form for the n-D mesh and torus structures.

The theoretical study of the diffusion and the dimension-exchange methods established their sound mathematical foundation. On the practical side, the benefits of the diffusion method were demonstrated in the context of distributed computations of branch-and-bound algorithms [7,4], and the dimension-exchange method was experimented with in parallel graph partitioning [8] and periodic re-mapping of data parallel computations [9]. Also, Willebeek-LeMair and Reeves [4] compared the results of these two methods in the distributed computation of branch-and-bound algorithms on a hypercube-structured iPSC/2. Their experiments concluded that the speedup due to the dimension-exchange method is better than the speedup due to the diffusion method. This is in agreement with Cybenko's result.

Although the results of both theoretical and experimental study point to the superiority of the dimension-exchange method in hypercubes, it might not be the case for other popular networks. On the other hand, previous theoretical studies of these two methods were mostly on their synchronous implementations, in which all processors participate in load balancing operations simultaneously and each processor cannot proceed into the next step until the workload migrations demanded by the current operation have completed. Relatively few results have been obtained on the asynchronous implementations of these methods. Bertsekas and Tsitsiklis proved the convergence property of an asynchronous implementation of the diffusion method [7], and Song extended the result to the case of the total workload being too small to be divided infinitely [20]. Lüling and Monien considered a randomized version of the diffusion method in which a processor in need of load balancing activates an operation among a number of randomly chosen neighbors, and showed that the algorithm will keep the workload difference between any two processors bounded [2]. However, none of these works addressed both the issues of efficiency and balance quality together.

In this paper, we make a comprehensive comparison between the diffusion and the dimension-exchange methods in terms of their efficiency and balancing quality when they are implemented in both one-port and all-port communication models, using synchronous/asynchronous invocation policies, and with static/dynamic random workload behaviors. The communication networks to be considered include the structures of n-D torus and mesh, and their special cases: the ring, the chain, the hypercube and the k-ary n-cube. The mesh and the torus allow different numbers of nodes in different dimensions. A k-ary n-cube is a special case of the n-D torus in that it has k nodes in each dimension [22,23]. The hypercube is a special case of both the n-D mesh and the k-ary n-cube. A hypercube is an n-D mesh having two nodes in each
dimension, that is, a 2-ary n-cube. We limit our scope to these structures because they are the most popular choices of topologies in commercial parallel computers [23,24].

Both the dimension-exchange and the diffusion methods are parameterized methods. Their performance is largely influenced by the choice of the parameter values. We focus on two choices of the parameter value in each method: the average dimension-exchange (ADE), the optimally-tuned dimension-exchange (ODE), the local average diffusion (ADF), and the optimally-tuned diffusion (ODF). The optimality here is in terms of the efficiency in static synchronous implementations among various choices of the dimension-exchange and the diffusion parameters. The average versions are the original versions in which the methods were first proposed, and they are still being employed in real applications today. Our main results are that the dimension-exchange method outperforms the diffusion method in the one-port communication model; in particular, the ODE algorithm is found to be best suited for synchronous implementation in the static situation; and that the dimension-exchange method is superior in synchronous load balancing even under the all-port communication model; the strength of the diffusion method is in asynchronous implementation under the all-port communication model; the ODF algorithm performs best in this case.

The rest of the paper is organized as follows. We first present a generic model of load balancing in Section 2, which provides a framework for the comparison of the load balancing algorithms. Section 3 describes the load balancing algorithms in a unified form. In both Section 4 and Section 5, the algorithms are compared with respect to their implementation using asynchronous and synchronous invocation policies, respectively. Section 6 gives the results from simulations, which verify our theoretical results as well as provide further information on these load balancing algorithms. We conclude in Section 7 with a summary of the results of the comparison between the dimension-exchange and the diffusion methods.

2 A generic model of load balancing

We consider a class of parallel computers which are composed of a finite set of homogeneous processors interconnected by a direct communication network. Processors communicate through passing messages. The communication channels are assumed to be full duplex so that a pair of directly connected (nearest neighbor) processors can send/receive messages simultaneously to/from each other. In addition, we assume that the operations of sending and receiving messages through a channel can take place instantaneously. We represent such a system by a simple connected graph $G = (V, E)$, where $V$ is a set of processors labeled 1 through $N$, and $E \subseteq V \times V$ is a set of edges. Every edge $(i, j) \in E$ corresponds to the communication channel between processors $i$ and $j$. Let $A(i)$ denote the set of nearest neighbors of processor $i$, $d(i) = |A(i)|$ be the degree of processor $i$, and $d$ be the maximum of $d(i)$ for $1 \le i \le N$.

The underlying parallel computation is assumed to comprise a large number of independent processes, which are the basic units of workload. The total number of processes is assumed to be large enough so that the workload of a processor is infinitely divisible. Processes may be
dynamically generated, consumed, or migrated due to imbalance as the computation proceeds. We classify the operations into two types: the computational operation and the balancing operation. At any time, a processor can perform a computational operation, a balancing operation, or both operations simultaneously. The concurrent execution of these two operations is possible when processors are capable of multiprogramming or the balancing operation is done in the background by special coprocessors. The workload of processors can be either fixed or varying with time during the load balancing operation, which we refer to as the static and the dynamic situations, respectively.

Let $t$ be a time variable, representing global real time. We quantify the workload of processor $i$ at time $t$ by $w_i^t$ in terms of the number of residing processes. We use integer time to simplify the presentation. The results can be extended readily to continuous time. Let $I(t)$ denote the set of processors performing balancing operations at time $t$. The change of workload of a processor at time $t$ can be modeled by the following equation in the static situation

$$ w_i^{t+1} = \begin{cases} f_i(w_i^t, w_j^t \mid j \in A(i)) & i \in I(t) \\ w_i^t & i \notin I(t) \end{cases} \tag{1} $$

and the following equation in the dynamic situation

$$ w_i^{t+1} = \begin{cases} f_i(w_i^t, w_j^t \mid j \in A(i)) + \phi_i^{t+1} & i \in I(t) \\ w_i^t + \phi_i^{t+1} & i \notin I(t) \end{cases} \tag{2} $$

where $\phi_i^{t+1}$ denotes the amount of workload generated or finished from time $t$ to $t+1$, and $f_i()$ represents a load balancing operator.

This model is generic because the load balance operator $f_i()$ and the set of processors in load balancing at any time $t$, $I(t)$, are left unspecified. The operator $f_i()$ can be any nearest neighbor load balancing algorithm, including the diffusion and the dimension-exchange methods. The set $I(t)$ is determined by the invocation policy of the load balancing. The choice of $I(t)$ is orthogonal to the load balancing algorithm in that any invocation policy can be used in combination with any load balancing algorithm in implementation. Since a load balancing operation incurs non-negligible overheads, different applications require different invocation policies for a better tradeoff between performance benefits and overheads. In parallel computations using domain decomposition techniques, for example, the computational requirement associated with each portion of a problem domain may change as the computation proceeds. An effective way to reduce the penalty due to load imbalances is to periodically redecompose the problem domain with the aim of achieving a global uniform distribution across the processors. To this end, all processors are required to perform load balancing operations synchronously for a short time period. That is, $I(t) = \{1, 2, \ldots, N\}$ for $t \ge t_0$, where $t_0$ is the instant when the global system state satisfies certain conditions such as those set in [25]. By contrast, the parallel execution of dynamic tree-structured computations usually requires only load sharing, that is, assuring that no idle processors exist while there are other busy processors. Thus, each processor is allowed to invoke a load balancing operation at any time in an asynchronous manner according to its own local workload distribution. A simple policy is to
activate a load balancing operation once a processor's workload drops below a preset threshold, $w_{underload}$, i.e., $I(t) = \{ i \mid w_i^t < w_{underload} \}$. More sophisticated invocation policies were discussed in [2,4]. In short, we make a distinction between synchronous and asynchronous implementations of load balancing according to their invocation policies. Figure 1 presents one example of these two implementation models in a system of five processors. The dots and the triangles represent the computational operations and the load balancing operations, respectively.

[Figure 1: An illustration of generic models of load balancing. (a) Asynchronous implementation; (b) synchronous implementation. Axes: processors 1 to 5 against time, from t to t+15.]

Assume $t = 0$ when processors invoke a synchronous or asynchronous load balancing procedure. We are concerned with subsequent workload distributions resulting from different load balancing algorithms. Denote the overall workload distribution at certain time $t$ by a vector $W^t = (w_1^t, w_2^t, \ldots, w_N^t)$. Denote its corresponding uniform distribution by a vector $\bar{W}^t = (\bar{w}^t, \bar{w}^t, \ldots, \bar{w}^t)$, where $\bar{w}^t = \sum_{i=1}^{N} w_i^t / N$. We define the workload variance, denoted by $\nu^t$, as the deviation of $W^t$ from $\bar{W}^t$; that is,

$$ \nu^t = \| W^t - \bar{W}^t \|^2 = \sum_{i=1}^{N} (w_i^t - \bar{w}^t)^2 . $$

With the workload variance $\nu^t$, we define the efficiency of a load balancing algorithm as the number of load balancing steps required to reduce the variance of the initial state to a tolerable level in the static situation; and define the balance quality as the bound for the variance which is to be guaranteed by the load balancing procedure in the dynamic situation. Load balancing algorithms will be compared in terms of these two measures under the following assumption. Throughout the paper, $E[\cdot]$ denotes the expected value of a random variable.

Assumption 2.1 Initially, processors' workloads, $w_i^0$, $1 \le i \le N$, are $N$ independent and identically distributed (i.i.d.) random variables with expectation $\mu_0$ and variance $\sigma_0^2$. At any time $t$, $t \ge 0$, processors' workload generation/consumption amounts, $\phi_i^t$, $1 \le i \le N$, are zero in the static situation or i.i.d. random variables with expectation $\mu$ and variance $\sigma^2$ in the dynamic situation.
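The two measures defined above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names, the tolerance argument, and the toy balancing step `average_first_pair` are hypothetical and do not come from the paper; only the variance formula itself is taken from the definition of $\nu^t$.

```python
from typing import Callable, List, Sequence


def workload_variance(w: Sequence[float]) -> float:
    """Workload variance: sum of squared deviations from the mean workload."""
    mean = sum(w) / len(w)
    return sum((x - mean) ** 2 for x in w)


def steps_to_tolerance(
    w: Sequence[float],
    step: Callable[[List[float]], List[float]],
    tol: float,
    max_steps: int = 10_000,
) -> int:
    """Efficiency in the static situation: number of balancing steps
    needed before the variance is at most tol."""
    current = list(w)
    for t in range(max_steps + 1):
        if workload_variance(current) <= tol:
            return t
        current = step(current)
    raise RuntimeError("tolerance not reached within max_steps")


def average_first_pair(w: List[float]) -> List[float]:
    """Hypothetical toy step: equalize processors 0 and 1 only."""
    m = (w[0] + w[1]) / 2
    return [m, m] + w[2:]


print(workload_variance([1.0, 2.0, 3.0]))                              # 2.0
print(steps_to_tolerance([4.0, 0.0], average_first_pair, tol=1e-12))   # 1
```

Any balancing operator that maps a workload vector to a new one can be passed as `step`, which mirrors the way the generic model leaves $f_i()$ unspecified.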

3 The dimension-exchange and the diffusion methods

This section briefly describes the dimension-exchange and the diffusion methods. Both of them are parameterized load balancing algorithms. We present several instances of these two methods based on different choices of values for their parameters.

3.1 The dimension-exchange method

With the dimension-exchange method, any processor which invokes a load balancing operation balances its workload with its neighbors successively. For a processor $i$, it works in the following way

$$ f():\quad \text{for } (c = 1;\ c \le d(i);\ c{+}{+}) \qquad w_i = w_i + \lambda (w_{j_c} - w_i) \tag{3} $$

where $j_c \in A(i)$; and $0 < \lambda < 1$, called the dimension-exchange parameter, is given a fixed value beforehand which determines the fraction of excess workload to be migrated between a pair of processors. The formula says that a balancing operation in the dimension-exchange method comprises $d(i)$ pairwise balancing steps for processor $i$. At each step, processor $i$ balances its workload with one of its neighbors, and uses the new result for the subsequent balancing. Because of the sequential nature of the balancing steps, a load balancing operation requires $d(i)$ communication steps in both the all-port and the one-port communication models.

The efficiency of the dimension-exchange method is determined by the dimension-exchange parameter. A dimension-exchange operation with different choices of the parameter will reduce the workload variance of the system by different degrees. In the following, we present two choices of the parameter which have been suggested as rational choices in the literature.

1. Average dimension-exchange (ADE) equally splits the total workload of a pair of processors, that is, $\lambda = 1/2$. It is a straightforward choice for local balancing at each pairwise operation, and has been favored in hypercube-structured systems [8,26,27].

2. Optimally-tuned dimension-exchange (ODE) takes certain specific parameter values that have the effect of maximizing efficiency in static and synchronous balancing [5,9]. The optimal parameter depends on the topology and the size of the underlying communication network. Let $k = \max\{k_i,\ 1 \le i \le n\}$ in the $k_1 \times k_2 \times \cdots \times k_n$ mesh and torus. Then, their optimal parameter values were shown, in [9], to be
   $\lambda = 1/(1 + \sin(\pi/k))$ in the mesh,
   $\lambda = 1/(1 + \sin(2\pi/k))$ in the torus.

The dimension-exchange method can be implemented without difficulty in cases where only a few processors that are not close to each other are in need of load balancing at the same time. However, its synchronous implementation requires processors to be coordinated in
order to parallelize balancing operations along different communication channels as well as to avoid communication collisions. The potential for parallel efficiency is due to the fact that the execution order of pairwise balancing steps in the operation $f()$ of Eq. (3) is left undefined. The parallelization of pairwise balancing operations can be realized by partitioning the set of edges into a number of subsets such that no two adjoining edges are in the same subset. The pairwise balancing steps along the channels in the same subset can then be performed concurrently without collisions. Such a graph partition is equivalent to the problem of edge coloring of graphs [28]. Figure 2 shows examples of color graphs of a mesh and a torus. The numbers in parentheses are the assigned chromatic indices. An alternative approach to parallelizing load balancing operations is random matching, which was used in [29].

[Figure 2: Examples of colored mesh and torus. (a) Colored mesh; (b) colored torus.]

3.2 The diffusion method

With the diffusion method, any processor which invokes a load balancing operation compares its workload with those of its nearest neighbors, and then gives away or takes in a certain amount of workload with respect to each of its nearest neighbors. The diffusion operator in a processor $i$ can be written in the form

$$ f_i() = w_i + \sum_{j \in A(i)} \alpha_{ij} (w_j - w_i) \tag{4} $$

where $0 < \alpha_{ij} < 1$, called the diffusion parameter, is predefined to dictate the portion to be migrated between any two processors. Processor $i$ apportions the fraction $\alpha_{ij}$ of the excess workload $|w_i - w_j|$ to processor $j$ if $w_i > w_j$, or fetches some workload from processor $j$ otherwise. Clearly, a load balancing operation with the diffusion method requires only one communication step in the all-port communication model, but $d(i)$ steps in the one-port communication model.

As in the dimension-exchange method, the efficiency of the diffusion method is determined by the diffusion parameter. Following are two common choices of the parameter.

1. Local average diffusion (ADF) takes an average of the workload of neighboring processors
by setting $\alpha_{ij} = 1/(1 + d(i))$ [2,3,4]; the torus is regular in that all processors have the same degree. The mesh is approximately regular when its size is large. For simplicity, we use a single value $\alpha = 1/(1 + d)$ to cover all communication channels in the mesh and in the torus.

2. Optimally-tuned diffusion (ODF) takes certain specific parameter values for maximizing efficiency in static and synchronous balancing [8]. As in the dimension-exchange method, the optimal diffusion parameter depends on the topology and the size of the underlying network. Let $k = \max\{k_1, k_2, \ldots, k_n\}$ in the $k_1 \times k_2 \times \cdots \times k_n$ mesh and torus. Then, their optimal choices were shown, in [8,5], to be
   $\alpha = 1/(2n)$ in the mesh,
   $\alpha = 1/(2n + 1 - \cos(2\pi/k))$ in the torus,
   $\alpha = 1/(n + 1)$ in the n-D hypercube.

4 Asynchronous implementations

In an asynchronous implementation of load balancing, processors perform balancing operations discretely based on their own local workload distributions and invocation policies. Since load balancing algorithms can be treated as orthogonal to invocation policies, we consider the load balancing operations of the processors in one time step so as to isolate their effects on the workload variance from the effects of invocation policies. We focus on the static situation of load balancing in which the underlying computation in a processor is suspended while the processor is performing load balancing operations. The dynamic situation presents only a few relatively minor differences to the analysis of the effects of load balancing.

Let $\nu^0$ be the original system workload variance when $t = 0$, and $\nu$ be the system workload variance when $t = 1$. Our comparison will be made between $\nu_{ade}$, $\nu_{ode}$, $\nu_{adf}$, and $\nu_{odf}$, which are the results from various load balancing operations.

Theorem 4.1 Suppose processors are running an asynchronous load balancing process under Assumption 2.1. Then, $E[\nu_{ade}] \le E[\nu_{df}]$ in the one-port communication model, while $E[\nu_{df}] \le E[\nu_{ade}]$ in the all-port communication model. Moreover, $E[\nu_{adf}] \le E[\nu_{odf}]$ in chain and ring networks, but $E[\nu_{odf}] \le E[\nu_{adf}]$ in two- or higher-dimensional meshes and tori. In addition, $E[\nu_{ade}] \le E[\nu_{ode}]$ in the all-port communication model.
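The two operators compared in Theorem 4.1 can be sketched for a single invoking processor as follows. This is a minimal illustration, not the authors' implementation: the function names, the list-based workload representation, and the choice to update the partner's workload explicitly (so that total workload is conserved) are assumptions made here. The parameter formulas are the ADE, ODE, ADF and ODF values quoted in Section 3.

```python
import math
from typing import List


def dimension_exchange(w: List[float], i: int, neighbors: List[int], lam: float) -> None:
    """Eq. (3) for invoker i: pairwise steps with its neighbors, in the given order."""
    for j in neighbors:
        moved = lam * (w[j] - w[i])   # flows from j to i; negative means i gives
        w[i] += moved                 # the updated w[i] is used in the next step
        w[j] -= moved


def diffusion(w: List[float], i: int, neighbors: List[int], alpha: float) -> None:
    """Eq. (4) for invoker i, with one diffusion parameter for all channels."""
    old_wi = w[i]                     # every exchange uses the pre-operation w_i
    for j in neighbors:
        moved = alpha * (w[j] - old_wi)
        w[i] += moved
        w[j] -= moved


def lambda_ode(k: int, torus: bool) -> float:
    """Optimally-tuned dimension-exchange parameter for mesh or torus."""
    return 1.0 / (1.0 + math.sin((2 if torus else 1) * math.pi / k))


def alpha_adf(d: int) -> float:
    """Local average diffusion parameter for degree d."""
    return 1.0 / (1 + d)


def alpha_odf(n: int, k: int, torus: bool) -> float:
    """Optimally-tuned diffusion parameter for the n-D mesh or torus."""
    if torus:
        return 1.0 / (2 * n + 1 - math.cos(2 * math.pi / k))
    return 1.0 / (2 * n)


# Processor 0 with two neighbors, as in the interior of a chain.
w_de = [9.0, 0.0, 0.0]
dimension_exchange(w_de, 0, [1, 2], lam=0.5)      # ADE
w_df = [9.0, 0.0, 0.0]
diffusion(w_df, 0, [1, 2], alpha=alpha_adf(2))    # ADF
print(w_de, w_df)
```

In this small example ADE leaves the workloads at 2.25, 4.5 and 2.25, while ADF spreads them almost evenly; which method wins in expectation depends on the communication model, as the theorem states.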
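This theorem says that the dimension-exchange and the diffusion methods are suitable for the one-port and the all-port communication models, respectively. More specifically, it reveals that the ODF algorithm outperforms the ADF algorithm in higher dimensional meshes and tori although the ODF was originally proposed for use in synchronous global balancing.

The theorem is proved through the derivation of the closed form of each variance $E[\nu]$. The calculation of $E[\nu]$ is based on a lemma concerning the sample variance of a combination of random variables in a sample set, which we present without proof. It can be easily shown using fundamental statistical theories.

Lemma 4.1 Suppose that $\xi_1, \xi_2, \ldots, \xi_N$ are $N$ i.i.d. random variables with variance $\sigma^2$, and $\bar{\xi} = \sum_{i=1}^{N} \xi_i / N$. Then,

1. for any $k$, $1 \le k \le N$,

$$ E\Big( \Big| \sum_{i=1}^{k} a_i \xi_i - \bar{\xi} \Big|^2 \Big) = \Big( \sum_{i=1}^{k} a_i^2 - \frac{1}{N} \Big) \sigma^2 , \tag{5} $$

where $0 < a_i < 1$ satisfies $\sum_{i=1}^{k} a_i = 1$; and the variance is minimized at $a_i = 1/k$ for a given $k$.

2. for any $k_1$ and $k_2$ and $1 \le k_1 \le k_2 \le N$,

$$ E\Big( \Big| \sum_{i=1}^{k_1} a_i \xi_i - \bar{\xi} \Big|^2 \Big) \ge E\Big( \Big| \sum_{j=1}^{k_2} b_j \xi_j - \bar{\xi} \Big|^2 \Big) \tag{6} $$

where $0 < a_i < 1$ satisfies $\sum_{i=1}^{k_1} a_i = 1$ and $0 < b_j < 1$ satisfies $\sum_{j=1}^{k_2} b_j = 1$.

Proof of Theorem 4.1. At a certain time in an asynchronous load balancing process, there might be more than one processor invoking load balancing within their neighborhoods simultaneously. Let $\tilde{A}(i) = \{i\} \cup A(i)$ denote the balancing domain of an invoker processor $i$. The balancing domains of concurrent invokers may overlap or may be separated from each other. As a whole, those processors that are running load balancing processes are partitioned into a number of separated spheres, some of which are singular balancing domains and some are unions of overlapping domains. Processors in different spheres perform load balancing operations independently, while processors in the same sphere perform load balancing in a synchronous manner.

Suppose initially there are $m$ independent balancing spheres in the system, denoted by $B_1, B_2, \ldots, B_m$. Then, by the definition of the workload variance and Assumption 2.1, we have

$$ \begin{aligned} E[\nu] &= E\Big( \sum_{i=1}^{N} |w_i - \bar{w}|^2 \Big) = \sum_{i=1}^{N} E(|w_i - \bar{w}|^2) \\ &= \sum_{j=1}^{m} \sum_{i \in B_j} E(|w_i - \bar{w}|^2) + \sum_{i \notin \cup_{j=1}^{m} B_j} E(|w_i - \bar{w}|^2) \\ &= \sum_{j=1}^{m} \sum_{i \in B_j} E(|w_i - \bar{w}|^2) + (N - n_0)\Big(1 - \frac{1}{N}\Big)(\sigma_0^2 + \sigma^2), \end{aligned} \tag{7} $$

where $n_0 = |\cup_{i=1}^{m} B_i|$ is the number of processors involved in load balancing. The last term of (7) is due to the underlying computational operations. It is a constant for a given $n_0$ and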
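independent of the topological relationships among the $n_0$ processors. The first term of (7) is due to load balancing operations in all separated balancing spheres. It is a simple arithmetic sum of the workload variance of each sphere, $\sum_{i \in B_j} E(|w_i - \bar{w}|^2)$. As a whole, Eq. (7) implies that the expected value of the system workload variance is influenced independently by load balancing operations within different balancing spheres. Therefore, it suffices to compare the effects of load balancing algorithms within different spheres using Lemma 4.1.

Case 1: load balancing in a singular balancing domain

We first consider load balancing in spheres of singular balancing domains. Suppose $B$ is such a sphere, and without loss of generality, $B = \tilde{A}(1) = \{1, 2, 3, \ldots, d+1\}$. That is, processor 1 invokes a load balancing operation within its $d$ neighbors, which are labeled from 2 to $d+1$. Let $X = \sum_{i=1}^{d+1} E(|w_i - \bar{w}|^2)$, denoting the expected value of workload variance of sphere $B$. Then, with the diffusion algorithm, the workloads of processors at the end of a diffusion operation are given, according to Eq. (4), by

$$ w_i = \begin{cases} (1 - d\alpha)\, w_1^0 + \alpha \sum_{j=2}^{d+1} w_j^0 & \text{if } i = 1; \\ \alpha\, w_1^0 + (1 - \alpha)\, w_i^0 & \text{if } 2 \le i \le d+1. \end{cases} \tag{8} $$

Invoking Lemma 4.1 on each component $w_i$, we have that

$$ \begin{aligned} X_{df} &= \sum_{i=1}^{d+1} E(|w_i - \bar{w}|^2) \\ &= E\Big( \Big| (1 - d\alpha) w_1^0 + \alpha \sum_{i=2}^{d+1} w_i^0 - \bar{w}^0 \Big|^2 \Big) + \sum_{i=2}^{d+1} E\big( | \alpha w_1^0 + (1 - \alpha) w_i^0 - \bar{w}^0 |^2 \big) \\ &= \big[ d\alpha^2 + (1 - d\alpha)^2 - 1/N \big] \sigma_0^2 + d \big[ \alpha^2 + (1 - \alpha)^2 - 1/N \big] \sigma_0^2 \\ &= \Big( d^2\alpha^2 + 3d\alpha^2 - 4d\alpha + d + 1 - \frac{d+1}{N} \Big) \sigma_0^2 . \end{aligned} \tag{9} $$

It can be easily verified that $X_{df}$, as a convex function of $\alpha$, is minimized at $\alpha = 2/(3+d)$. Let $\alpha_{opt} = 2/(3+d)$. We replace $\alpha$ by $\alpha_{opt}$ in the expression of $X_{df}$, and obtain

$$ X_{df}(\alpha_{opt}) = \Big( \frac{d^2 + 3}{d + 3} - \frac{d+1}{N} \Big) \sigma_0^2 . \tag{10} $$

Recall that $\alpha_{adf} = 1/(d+1)$, and that $\alpha_{odf} = 1/d$ in a mesh and $\alpha_{odf} = 1/(d + 1 - \cos(2\pi/k))$ in a torus, where $k$ is the maximum dimensional order of the torus. It follows that

- in the case of a chain (i.e., the 1-D mesh) where $d = 2$, $\alpha_{adf} < 2/(3+d) < \alpha_{odf}$ and $|\alpha_{adf} - \alpha_{opt}| < |\alpha_{odf} - \alpha_{opt}|$;
- in the case of a ring (i.e., the 1-D torus) where $d = 2$, $\alpha_{adf} < \alpha_{opt} < \alpha_{odf}$ and $|\alpha_{adf} - \alpha_{opt}| < |\alpha_{odf} - \alpha_{opt}|$, for $k \ge 12$;
- in the case of higher dimensional meshes and tori where $d \ge 4$, $\alpha_{adf} < \alpha_{odf} < \alpha_{opt}$.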
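The claims about $\alpha_{opt}$ and the ring can be checked numerically. The sketch below is only a sanity check under stated assumptions: it evaluates Eq. (9) in units of $\sigma_0^2$ and drops the $-(d+1)/N$ term, which does not depend on $\alpha$; the function names and the upper bound of 200 on $k$ are arbitrary choices made here.

```python
import math
from fractions import Fraction


def x_df(alpha, d):
    """Eq. (9) in units of sigma_0^2, without the alpha-independent -(d+1)/N term."""
    return d * d * alpha ** 2 + 3 * d * alpha ** 2 - 4 * d * alpha + d + 1


def alpha_opt(d):
    """Minimizer of Eq. (9)."""
    return Fraction(2, d + 3)


# Eq. (10): the value at alpha_opt, and a local check that it is a minimum.
eps = Fraction(1, 1000)
for d in (2, 4, 6):
    a = alpha_opt(d)
    assert x_df(a, d) == Fraction(d * d + 3, d + 3)
    assert x_df(a - eps, d) > x_df(a, d)
    assert x_df(a + eps, d) > x_df(a, d)


def ring_ordering_holds(k):
    """Ring (d = 2): alpha_adf < alpha_opt < alpha_odf, with ADF closer to alpha_opt."""
    adf = 1 / 3
    opt = 2 / 5
    odf = 1 / (3 - math.cos(2 * math.pi / k))
    return adf < opt < odf and abs(adf - opt) < abs(odf - opt)


ks = [k for k in range(2, 201) if ring_ordering_holds(k)]
print(ks[0], ks == list(range(ks[0], 201)))
```

Because $X_{df}$ is a quadratic in $\alpha$, comparing the distances of two parameter values from $\alpha_{opt}$ is enough to order the resulting variances.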

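Consequently, with the diffusion method,
\[ \begin{cases}
X_{odf} \ge X_{adf} \ge X_{df}(\alpha_{opt}) & \text{if } d = 2 \text{ and } k \ge 12; \\
X_{adf} \ge X_{odf} \ge X_{df}(\alpha_{opt}) & \text{if } d \ge 4.
\end{cases} \tag{11} \]

With the dimension-exchange method, processor 1 is assumed to perform pairwise load balancing with processors 2, 3, ..., d+1 in turn in a dimension-exchange load balancing operation. Assume the underlying system is in the one-port communication model. Then, the workload generation/consumption ratio in a round of pairwise balancing steps has the same statistical characteristics as those in a diffusion operation. Consequently, the processors' workloads at the end of a dimension-exchange operation are given, according to Eq. (3), by
\[ w_i = \begin{cases}
(1 - \lambda)^d w_1^0 + \lambda\sum_{j=0}^{d-1}(1 - \lambda)^j w_{d+1-j}^0 & \text{if } i = 1; \\
(1 - \lambda)w_2^0 + \lambda w_1^0 & \text{if } i = 2; \\
(1 - \lambda)w_i^0 + \lambda(1 - \lambda)^{i-2}w_1^0 + \lambda^2\sum_{j=0}^{i-3}(1 - \lambda)^j w_{i-1-j}^0 & \text{if } 3 \le i \le d+1.
\end{cases} \tag{12} \]
Invoking Lemma 4.1 on each component $w_i$, we have that
\[ \begin{aligned}
X_{de} &= \sum_{i=1}^{d+1}E(|w_i - \bar{w}|^2) \\
&= \Big[(1 - \lambda)^{2d} + \lambda^2\sum_{j=0}^{d-1}(1 - \lambda)^{2j} - \frac{1}{N}\Big]\sigma_0^2 + \Big[(1 - \lambda)^2 + \lambda^2 - \frac{1}{N}\Big]\sigma_0^2 \\
&\quad + \sum_{i=3}^{d+1}\Big[(1 - \lambda)^2 + \lambda^2(1 - \lambda)^{2(i-2)} + \lambda^4\sum_{j=0}^{i-3}(1 - \lambda)^{2j} - \frac{1}{N}\Big]\sigma_0^2 \\
&= \Big[d(1 - \lambda)^2 + (1 - \lambda)^{2d} + 2\lambda^2\frac{1 - (1 - \lambda)^{2d}}{1 - (1 - \lambda)^2} + \frac{\lambda^4}{1 - (1 - \lambda)^2}\Big(d - \frac{1 - (1 - \lambda)^{2d}}{1 - (1 - \lambda)^2}\Big) - \frac{d+1}{N}\Big]\sigma_0^2.
\end{aligned} \]
In particular, substituting $\lambda = 1/2$ in the expression of $X_{de}$, and calling the result $X_{ade}$, leads to
\[ X_{ade} = \Big(\frac{3d + 5 + 2^{2-2d}}{9} - \frac{d+1}{N}\Big)\sigma_0^2. \tag{13} \]
From (10) and (13), it is known that $X_{ade} \le X_{df}(\alpha_{opt})$. It is thus proved that in the one-port communication model,
\[ X_{ade} \le X_{df}. \tag{14} \]

In the all-port communication model, a dimension-exchange pairwise balancing step takes as much time as a diffusion load balancing operation. That is, in a time step of the diffusion method, a processor balances with only one of its neighbors with the dimension-exchange method. It results in that
\[ X_{de} = 2\Big[(1 - \lambda)^2 + \lambda^2 - \frac{1}{N}\Big]\sigma_0^2 + (d - 1)\Big(1 - \frac{1}{N}\Big)\sigma_0^2. \]
Consequently, $X_{ade}$ is less than $X_{ode}$ but larger than $X_{df}$.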
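The comparison in Case 1 can be checked numerically. The sketch below is only an illustration under our own naming (diffusion_coefficient and exchange_coefficient are hypothetical helpers, not part of the original analysis). It tracks, for each processor of a singular balancing domain, the coefficients of the initial workloads after one operation, and sums their squares as Lemma 4.1 prescribes. The common term $-(d+1)/N$ and the factor $\sigma_0^2$ are omitted on both sides, so only the leading coefficients of (10) and (13) are compared.

```python
from fractions import Fraction


def diffusion_coefficient(d, alpha):
    """Sum of squared combination coefficients over the d+1 processors
    after one diffusion operation, following Eq. (8)."""
    invoker = (1 - d * alpha) ** 2 + d * alpha ** 2
    neighbor = alpha ** 2 + (1 - alpha) ** 2
    return invoker + d * neighbor


def exchange_coefficient(d, lam):
    """Same quantity after the invoker (index 0) has balanced pairwise
    with its d neighbors in turn, as described for Eq. (12)."""
    size = d + 1
    # coeffs[i][j]: weight of the initial workload of j in the workload of i
    coeffs = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for nb in range(1, size):
        inv, oth = coeffs[0], coeffs[nb]
        coeffs[0] = [(1 - lam) * a + lam * b for a, b in zip(inv, oth)]
        coeffs[nb] = [(1 - lam) * b + lam * a for a, b in zip(inv, oth)]
    return sum(c * c for row in coeffs for c in row)


for d in (2, 4, 6):
    x_df_opt = diffusion_coefficient(d, Fraction(2, d + 3))
    x_ade = exchange_coefficient(d, Fraction(1, 2))
    closed_form = (3 * d + 5 + Fraction(2) ** (2 - 2 * d)) / 9
    print(d, x_df_opt, x_ade, closed_form, x_ade <= x_df_opt)
```

Exact rational arithmetic is used so that the simulated coefficient can be compared with the closed form of (13) without rounding concerns.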

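Case 2: load balancing in a union of overlapping domains

We now consider load balancing in spheres which are unions of overlapping balancing domains. A balancing sphere can be a union of any number of overlapping domains. In consideration of the likelihood that few processors will be invoking load balancing simultaneously in asynchronous implementations, we focus on the union of two balancing domains only. Figure 3 illustrates three topological relationships between a pair of processors which have overlapping balancing domains in 2-D meshes and tori. The triangles are invokers of load balancing processes and the dots are processors being involved in load balancing.

Figure 3: Illustrations of overlapping balancing domains (panels (a), (b), (c))

Assume $B_2$ is a union of balancing domains of processors $j_1$ and $j_2$. That is, $B_2 = \tilde{A}(j_1) \cup \tilde{A}(j_2)$. Let Y denote the expected value of the workload variance of $B_2$, i.e., $Y = \sum_{i\in B_2}E(|w_i - \bar{w}|^2)$. Suppose processors $j_1$ and $j_2$ have the same number of direct neighbors. Then, in the case that processors $j_1$ and $j_2$ are directly connected, as in Figure 3(a), we have that in the diffusion method,
\[ Y_{df} = 2X_{df} - 2\Big[(1 - \alpha)^2 + \alpha^2 - \frac{1}{N}\Big]\sigma_0^2; \tag{15} \]
and in the dimension-exchange method,
\[ Y_{de} \le 2X_{de} - 2\Big[(1 - \lambda)^2 + \lambda^2(1 - \lambda)^{2(d-1)} + \lambda^4\frac{1 - (1 - \lambda)^{2(d-1)}}{1 - (1 - \lambda)^2} - \frac{1}{N}\Big]\sigma_0^2. \tag{16} \]
In particular,
\[ Y_{ade} \le 2X_{ade} - 2\Big(\frac{1}{3} + \frac{2}{3\cdot 2^{2d}} - \frac{1}{N}\Big)\sigma_0^2. \tag{17} \]

The equation of the diffusion method is due to the fact that each processor in $\tilde{A}(j_1)\setminus\{j_2\}$ (or $\tilde{A}(j_2)\setminus\{j_1\}$) changes its workload in the same way as in load balancing within a singular balancing domain $\tilde{A}(j_1)$ (or $\tilde{A}(j_2)$). The reasons for the inequality of $Y_{de}$ in the dimension-exchange method are as follows. With the dimension-exchange method, both processors $j_1$ and $j_2$ perform pairwise balancing operations with their neighbors in turn according to orders which are preset through edge-coloring of the system graph. The change of the workload distribution in $B_2$ is thus influenced by the execution order across the communication channels. Suppose the channel $(j_1, j_2)$ is indexed as the c-th. Without loss of generality, we relabel processor $j_1$ as 1, processor $j_2$ as c+1, and the other neighboring processors of processor $j_1$ as 2 to d+1 excluding c+1. Then, it is clear that processors from 2 to c change their workloads in the same way as their counterparts would if they were performing load balancing within a singular domain $\tilde{A}(i)$ alone, while the behaviors of the other processors will be influenced by processors in $\tilde{A}(j)\setminus\{i\}$. From Lemma 4.1(2), it is also known that each processor i, i > c, will possess less workload variance $E(|w_i - \bar{w}|^2)$ in a union of overlapping domains than in $\tilde{A}(i)$ alone. Since $E(|w_{d+1} - \bar{w}|^2) \le E(|w_i - \bar{w}|^2)$ for i > c, the bound of $Y_{de}$ is hence obtained.

It can be easily verified that $Y_{df}$ is minimized at $\alpha = (2d - 1)/(d^2 + 3d - 2)$. Let $\alpha_{opt}$ be the optimal choice of $\alpha$. Then,
\[ Y_{df}(\alpha_{opt}) = 2\Big(d - \frac{4d^2 - 4d + 1}{d^2 + 3d - 2} - \frac{d}{N}\Big)\sigma_0^2. \tag{18} \]
As in the case of a singular balancing domain, it can be shown that

- in the case of a chain (i.e., 1-D mesh) where d = 2, $\alpha_{adf} < \alpha_{opt} < \alpha_{odf}$ and $|\alpha_{adf} - \alpha_{opt}| < |\alpha_{odf} - \alpha_{opt}|$;
- in the case of a ring (i.e., 1-D torus) where d = 2, $\alpha_{adf} < \alpha_{opt} < \alpha_{odf}$ and $|\alpha_{adf} - \alpha_{opt}| < |\alpha_{odf} - \alpha_{opt}|$, for $k > 6$;
- in the case of higher dimensional meshes and tori where $d \ge 4$, $\alpha_{adf} < \alpha_{odf} < \alpha_{opt}$.

Consequently, with the diffusion method,
\[ \begin{cases}
Y_{odf} \ge Y_{adf} \ge Y_{df}(\alpha_{opt}) & \text{if } d = 2 \text{ and } k > 6; \\
Y_{adf} \ge Y_{odf} \ge Y_{df}(\alpha_{opt}) & \text{if } d \ge 4.
\end{cases} \tag{19} \]
On the other hand, the comparison between $Y_{ade}$ of Eq. (17) and $Y_{df}|_{\alpha=\alpha_{opt}}$ of Eq. (18) yields $Y_{ade} \le Y_{df}(\alpha_{opt})$.

In cases where processors i and j are nonadjacent, as illustrated in Figures 3(b) and 3(c), there are at most two processors in the intersection of their balancing domains in the mesh and torus networks. Let s be the cardinality of the intersection $\tilde{A}(i) \cap \tilde{A}(j)$, s = 1 or 2. Then, with the diffusion method,
\[ \begin{aligned}
Y_{df} &= 2X_{df} - 2s\Big[(1 - \alpha)^2 + \alpha^2 - \frac{1}{N}\Big]\sigma_0^2 + s\Big[(1 - 2\alpha)^2 + 2\alpha^2 - \frac{1}{N}\Big]\sigma_0^2 \\
&= 2X_{df} - s\Big[(1 - 2\alpha^2) - \frac{1}{N}\Big]\sigma_0^2,
\end{aligned} \tag{20} \]
and with the ADE algorithm,
\[ Y_{ade} \le 2X_{ade} - s\Big[\frac{1}{3} + \frac{2}{3\cdot 2^{2d}} - \frac{1}{N}\Big]\sigma_0^2. \tag{21} \]
Similarly to the case of a singular balancing domain, we have the result that $Y_{ade} \le Y_{df}$, $Y_{odf} \ge Y_{adf}$ in case d = 2, and $Y_{adf} \ge Y_{odf}$ in case $d \ge 4$.

Notice that the preceding analysis of the dimension-exchange method is implicitly based on the assumption of the one-port communication model. In the all-port communication model, a dimension-exchange pairwise balancing step corresponds to a diffusion balancing operation. Because two pairwise balancing operations in a union of two balancing domains can be performed concurrently, we thus have
\[ Y_{de} = 4\Big[(1 - \lambda)^2 + \lambda^2 - \frac{1}{N}\Big]\sigma_0^2 + (2d - 4 - s)\Big(1 - \frac{1}{N}\Big)\sigma_0^2, \]
where s = 1 or 2. Obviously, $Y_{ade}$ is less than $Y_{ode}$ but larger than $Y_{df}$. The theorem is then proved.

Note that even though the proof of the theorem assumes static workload behaviors, the theorem still holds in the dynamic situation. Consider processors in balancing sphere B. Since the workloads generated/consumed from time 0 to time 1 in any processor i, $i \in B$, will not be considered in its load balancing operation at time step 1, the operation in the dynamic situation then results in a workload variance $E(|w_i - \bar{w}|^2) + \sigma^2$, where $E(|w_i - \bar{w}|^2)$ is the processor's workload variance in the static situation. As a whole, the accumulative workload variance of processors in balancing sphere B in the dynamic situation is $\sum_{i\in B}E(|w_i - \bar{w}|^2) + N_0\sigma^2$, where $N_0 = |B|$. The added term is a constant for a given $N_0$ and independent of the load balancing algorithm used. Hence, the arguments in the proof of the theorem are valid in the dynamic situation.

To conclude this section, we remark that a diffusion load balancing operation averages the workload of a processor, say processor 1, and its surrounding d processors, labeled from 2 to d+1, in a simple way: processor 1 gives $\alpha(w_1 - w_i)$ loads to processor i in the case of $w_1 > w_i$, and takes $\alpha(w_i - w_1)$ loads from processor i otherwise ($2 \le i \le d+1$). In a singular balancing domain, there might be a variant of the ADF algorithm which strives for local load balancing. Specifically, processor 1 calculates the local average $\bar{w}$ as
\[ \bar{w} = \frac{w_1 + \sum_{2\le i\le d+1}w_i}{1 + d}, \]
and then gives or takes $|w_i - \bar{w}|$ loads to or from processor i according to whether processor i is deficient or not. After such an operation, each processor i, $2 \le i \le d+1$, ends up with the same workload as processor 1. Consequently, the expected workload variance of the domain, $\sum_{i=1}^{d+1}E(|w_i - \bar{w}|^2)$, becomes
\[ \Big(1 - \frac{d+1}{N}\Big)\sigma_0^2, \]
which is obviously smaller than that of the ADE method even in the one-port communication model. Although it incurs more overhead than an ODF or ADF operation, such a variant of the diffusion operation is preferred in singular balancing domains. However, it may not be effective in balancing spheres where a number of balancing domains overlap with each other, because processors in such a sphere are unable to balance their workloads with all the processors in such an operation.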
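The local-averaging variant described in the closing remark can be sketched as follows. This is a minimal illustration under our own assumptions: the function name local_average, the list-based workload representation and the sign convention of the transfers are hypothetical and are not taken from any implementation discussed in the paper.

```python
def local_average(workloads, invoker, neighbors):
    """Balance a singular domain to its local average.

    Returns the new workload list and, for each neighbor, the amount it
    receives from the invoker (a negative value means the invoker takes
    load from that neighbor).
    """
    members = [invoker] + list(neighbors)
    mean = sum(workloads[i] for i in members) / len(members)
    # Each neighbor is moved to the local average; a deficient neighbor
    # receives mean - w_i, an overloaded one gives up w_i - mean.
    transfers = {i: mean - workloads[i] for i in neighbors}
    balanced = list(workloads)
    for i in members:
        balanced[i] = mean
    return balanced, transfers


loads = [10, 2, 6, 6, 40]  # processor 4 lies outside the domain
balanced, transfers = local_average(loads, invoker=0, neighbors=[1, 2, 3])
print(balanced)
print(transfers)
```

The total workload of the domain is preserved, and processors outside the domain are untouched, which is why the variant cannot help when several domains overlap and each invoker sees only its own neighbors.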

5 Synchronous implementations

In a synchronous implementation of load balancing, all processors perform load balancing operations concurrently and continuously for a time period in order to achieve a global balanced state in the static situation or to keep the varying system workload variance bounded in the dynamic situation.

The synchronous implementation of the diffusion and the dimension-exchange methods can be modeled by linear iterative processes, as illustrated in [12, 8, 15]. From Eq. (4), the change of the workload distribution at time t in the diffusion method can be modeled by the equation
\[ W^{t+1} = DW^t + \Phi^{t+1}, \tag{22} \]
where D, called a diffusion matrix, is given by
\[ D_{ij} = \begin{cases}
\alpha & \text{if processors } i \text{ and } j \text{ are directly connected;} \\
1 - d(i)\alpha & \text{if } i = j; \\
0 & \text{otherwise.}
\end{cases} \]
With the above formulation, the features of the synchronous implementation of the diffusion method are fully captured by the iterative process governed by D.

Let $\kappa$ be the minimum number of colors required for edge coloring of the system graph. Then, from Eq. (3), the change of the workload distribution at time t in the dimension-exchange method can be modeled by the equation
\[ W^{t+1} = MW^t + \Phi^{t+1}, \]
where M, called the dimension-exchange matrix, is given by
\[ M = M_\kappa M_{\kappa-1} \cdots M_1. \tag{23} \]
Each $M_c$ ($1 \le c \le \kappa$) reflects the change of the workload distribution of the system at pairwise balancing step c of time t. Thus, the features of the synchronous implementation of the dimension-exchange method are fully captured by the iterative process governed by M.

5.1 Static situation

In a static synchronous load balancing process, all processors are assumed to perform load balancing operations simultaneously and all computational operations are suspended. This situation is true of periodic load balancing, as experimented in [30, 31, 25, 19]. The efficiency of a load balancing algorithm in this situation is reflected by the number of communication steps required for arriving at a global balanced state from any initial load distribution.

Let F be either the dimension-exchange matrix M or the diffusion matrix D. Then, $W^t = F^tW^0$, where $F^t = F \cdot F \cdots F$ (t times). Since $\bar{W}^t = \bar{W}^0$ in the static situation, and $F^t\bar{W}^t = \bar{W}^t$, it follows that
\[ W^t - \bar{W}^t = F(W^{t-1} - \bar{W}^{t-1}) = F^t(W^0 - \bar{W}^0). \]

Then, by the definition of the workload variance, we have
\[ \nu^t = \|W^t - \bar{W}\|^2 = \|F^t(W^0 - \bar{W})\|^2 \le \gamma^{2t}(F)\,\nu^0, \]
where $\gamma(F)$ is the sub-dominant eigenvalue of F in modulus. It says that the workload variance is reduced geometrically, and its scale factor is upper bounded by $\gamma(F)$. The bound is tight, and $\nu^t$ satisfies
\[ \nu^t \simeq \gamma^{2t}(F)\,\nu^0 \quad \text{for large } t. \tag{24} \]
(By $g(t) \simeq h(t)$ for large t, we mean that $g(t)/h(t) \to 1$ for large t.) The sub-dominant eigenvalue in modulus $\gamma(F)$ is thus referred to also as the convergence factor of a load balancing algorithm.

Let T be the number of iteration steps required to reduce the workload variance of an initial state to some prescribed bound $\epsilon$. Then, from Eq. (24), it follows that
\[ T = \frac{\ln\epsilon - \ln\nu^0}{2\ln\gamma(F)}. \tag{25} \]
Hence,
\[ T = O(1/\ln\gamma(F)). \tag{26} \]

The convergence factors of the dimension-exchange and the diffusion methods were evaluated by a number of researchers. In [12], Boillat presented the convergence factors of the ADF algorithm, $\gamma_{adf}$, when applied to a broad variety of structures. In [5, 15], Xu and Lau analyzed the effects of the dimension-exchange and the diffusion parameters on the efficiency of load balancing, and proposed the ODE and ODF algorithms by choosing the optimal values for the parameters $\lambda$ and $\alpha$. The corresponding convergence factors, $\gamma_{ode}$ and $\gamma_{odf}$, are hence readily available from their proofs. Also, the convergence factor $\gamma_{ade}$ can be derived easily from the work. We summarize the convergence factors in Table 1.

Table 1: Convergence factors of the dimension-exchange and the diffusion methods, where k is the maximum number of nodes over all dimensions of an n-D network

          DE method                                                      Diffusion method
          ADE               ODE                                          ADF                                ODF
mesh      $\cos^2(\pi/k)$   $\frac{1-\sin(\pi/k)}{1+\sin(\pi/k)}$        $\frac{2n-1+2\cos(\pi/k)}{2n+1}$   $\frac{n-1+\cos(\pi/k)}{n}$
torus     $\cos^2(2\pi/k)$  $\frac{1-\sin(2\pi/k)}{1+\sin(2\pi/k)}$      $\frac{2n-1+2\cos(2\pi/k)}{2n+1}$  $\frac{2n-1+\cos(2\pi/k)}{2n+1-\cos(2\pi/k)}$

Notice that the convergence factor is in iteration steps, each of which is what we called a load balancing operation before. In the one-port communication model, such an operation in both the dimension-exchange and the diffusion methods requires 2n communication steps in an n-D network. In the all-port communication model, a diffusion load balancing operation requires only one communication step while a dimension-exchange operation still requires 2n steps. Therefore, Table 1 and Eq. (26) lead to Table 2. It presents the time complexities in communication steps necessary for various load balancing algorithms in both one-port and all-port communication models.

Table 2: Time complexities of the dimension-exchange and the diffusion methods, where k is the maximum number of nodes over all dimensions in an n-D network and *-port means the all-port communication model.

          ADE                     ODE                   ADF                       ODF
          1-port     *-port       1-port    *-port      1-port       *-port       1-port       *-port
mesh      O(nk^2)    O(nk^2)      O(nk)     O(nk)       O(n^2k^2)    O(nk^2)      O(n^2k^2)    O(nk^2)
torus     O(nk^2)    O(nk^2)      O(nk)     O(nk)       O(n^2k^2)    O(nk^2)      O(n^2k^2)    O(nk^2)

The time complexities given in Table 2 are inferred from the convergence factors. For example, the O(nk) estimate for the ODE algorithm follows from the following derivation.
\[ \begin{aligned}
\ln(\gamma_{ode}) &= \ln\Big(\frac{1 - \sin(2\pi/k)}{1 + \sin(2\pi/k)}\Big) = \ln\Big(1 - \frac{2\sin(2\pi/k)}{1 + \sin(2\pi/k)}\Big) \\
&\simeq \ln\Big(1 - \frac{4\pi}{k + 2\pi}\Big) \simeq -\frac{4\pi}{k + 2\pi} \quad \text{for large } k.
\end{aligned} \]
From Eq. (26), we have $T_{ode} = O(k)$ in balancing operations. Since an ODE load balancing operation requires O(n) communication steps in both communication models, the estimate O(nk) is thus proved.

The entries of the table show the following.

Theorem 5.1. Suppose processors are running synchronous load balancing processes in the static situation. Then, both the ADE and the ODE algorithms converge asymptotically faster than the diffusion method in the one-port communication model. In the all-port communication model, the ODE algorithm converges also faster than the other three algorithms by a factor of k.

5.2 Dynamic situation

In dynamic synchronous implementations, all processors are performing load balancing and computation concurrently. Dynamic load balancing in this situation aims at keeping the variance of processors' workloads bounded as tightly as possible. The performance of the synchronous implementation of the diffusion method in the dynamic situation has been evaluated in [8, 13, 14]. In [8], Cybenko showed that the diffusion method keeps the asymptotic variance bounded. He also proved that the asymptotic variance from the diffusion method is larger than the variance from the ADE algorithm when both are applied to the hypercube network. Given this, we are still unable to draw a conclusion regarding the superiority of the ADE algorithm in terms of the balance quality during load balancing. In [13], Hong, Tan and Chen reported a constant bound for the workload variance when the ADF algorithm runs in the hypercube network. This result was extended later by Qian and Yang to generalized hypercubes and mesh structures [14]. Although the bounds they derived are independent of time, they are too loose to be used for the comparison of balance qualities during load balancing. Also, the approaches used in [13, 14] are unsuitable for the analysis of the dimension-exchange method because of their different operational behaviors.

In this subsection, we develop a new approach for analyzing the balance qualities of different algorithms. We present a closed form of the workload variance when a load balancing process runs in the torus and the hypercube networks. The approach is not applicable to the case of the mesh networks as they are not regular networks. Nevertheless, since an n-D mesh has only a fraction of its processors whose degree is smaller than 2n, our results, as a reasonable approximation, can be applied to the mesh structure as well; this is supported by our simulation results to be presented in the next section.

Throughout the subsection, we assume load balancing in an n-D torus network, and let d = 2n be the degree of the network.

Lemma 5.1. Suppose processors are running a synchronous diffusion load balancing process under Assumption 2.1. Then, $E(W^t)$ is a uniform distribution at any time t and
\[ E[\nu_{df}^t] = \Big(a^{t+1}\sigma_0^2 + \frac{1 - a^{t+1}}{1 - a}\sigma^2\Big)N - (t+1)\sigma^2 - \sigma_0^2, \tag{27} \]
where $a = (1 - d\alpha)^2 + d\alpha^2$.

Proof. The uniform distribution of $E(W^t)$ resulting from the diffusion method can be easily shown. We omit its proof because it is also available as a special case in the proof of the uniform distribution of $E(W^t)$ resulting from the dimension-exchange method in the next lemma.

Consider the expected workload variance $E[\nu_{df}^t]$. By its definition and Assumption 2.1, we have
\[ \begin{aligned}
E[\nu_{df}^t] &= E(\|W^t - \bar{W}^t\|^2) \\
&= E(\|DW^{t-1} - \bar{W}^{t-1}\|^2) + E(\|\Phi^t - \bar{\Phi}^t\|^2) \\
&= E(\|D^{t+1}W^0 - \bar{W}^0\|^2) + \sum_{i=0}^{t}E(\|D^i\Phi^{t+1-i} - \bar{\Phi}^{t+1-i}\|^2) \\
&= N\Big(a^{t+1} - \frac{1}{N}\Big)\sigma_0^2 + \sum_{i=0}^{t}N\Big(a^i - \frac{1}{N}\Big)\sigma^2 \\
&= \Big(a^{t+1}\sigma_0^2 + \frac{1 - a^{t+1}}{1 - a}\sigma^2\Big)N - (t+1)\sigma^2 - \sigma_0^2,
\end{aligned} \]

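where the fourth step is based on the following observations. An operation D on the workload distribution changes each of its components to become a linear combination of the component's d+1 sub-components with coefficients $1 - d\alpha, \alpha, \alpha, \ldots, \alpha$; and a sequence of operations $D^t$ changes each component of $\Phi$ to become a linear combination of its $(d+1)^t$ sub-components. From Lemma 4.1, it is known that the variance of a combination of random variables is determined only by their combinatorial coefficients. Therefore, we have $E(\|D\Phi - \bar{\Phi}\|^2) = N(a - 1/N)\sigma^2$, and $E(\|D^t\Phi - \bar{\Phi}\|^2) = N(a^t - 1/N)\sigma^2$, where $a = (1 - d\alpha)^2 + d\alpha^2$.

Consider the term $a = (1 - d\alpha)^2 + d\alpha^2$ in Lemma 5.1. It is minimized at $\alpha = 1/(d+1)$ over all possible choices of the parameter $\alpha$, which happens to be the choice of the ADF algorithm in n-D meshes and tori. Immediately, we obtain
\[ E[\nu_{adf}^t] \le E[\nu_{odf}^t]. \tag{28} \]

Next, we consider synchronous implementations of the dimension-exchange method. We present a companion to Lemma 5.1 in the following.

Lemma 5.2 Suppose processors are running a synchronous dimension-exchange load balancing process under Assumption 2.1, except that processors generate/consume $\phi_i$ workload at a pairwise balancing step. Then, $E(W^t)$ is a uniform distribution at any time t, and
\[ E[\nu_{de}^t] = \Big(s\,b^{td+d}\sigma_0^2 + s\,\frac{1 - b^{td+d}}{1 - b^d}\sigma^2\Big)N - (t+1)d\sigma^2 - \sigma_0^2, \tag{29} \]
where $b = (1 - \lambda)^2 + \lambda^2$ and $s = 1 + b + b^2 + \cdots + b^{d-1}$.

Proof. Recall that t is the index of load balancing operations in the dimension-exchange algorithm. A load balancing operation comprises d pairwise balancing steps in both the torus and the mesh structures. To examine closely the dynamic behavior of the dimension-exchange algorithm at the level of pairwise operations, we introduce one more variable $t'$ to denote the index of pairwise steps. t = 0 if and only if $t' = 0$, and t indexes the time instances $t'$ that are integer multiples of d. Then, at time $t'$ such that $t' = dt$,
\[ \begin{aligned}
W^{t'} &= M_dW^{t'-1} + \Phi^{t'} \\
&= M_dM_{d-1}W^{t'-2} + M_d\Phi^{t'-1} + \Phi^{t'} \\
&= \prod_{c=d}^{1}M_c\,W^{t'-d} + \prod_{c=d}^{2}M_c\,\Phi^{t'-d+1} + \cdots + M_d\Phi^{t'-1} + \Phi^{t'} \\
&= MW^{t'-d} + \sum_{c=1}^{d}\Big(\prod_{j=d}^{c+1}M_j\,\Phi^{t'-d+c}\Big),
\end{aligned} \tag{30} \]
where $\prod_{j=d}^{c}M_j = M_dM_{d-1}\cdots M_c$, and $\prod_{j=d}^{d+1}M_j = I$.

Let $\Psi^t = \sum_{c=1}^{d}(\prod_{j=d}^{c+1}M_j\,\Phi^{t'-d+c})$ be the distribution of workloads which are generated/consumed from time $t'-d$ to $t'$, i.e., in the t-th dimension-exchange balancing operation. Using index t instead of $t'$, Eq. (30) leads to
\[ W^t = MW^{t-1} + \Psi^t = M^tW^0 + \sum_{j=1}^{t}M^{t-j}\Psi^j. \]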

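Using the linearity of the expectation operation, E, we obtain that
\[ \begin{aligned}
E(W^t) &= E\Big(M^tW^0 + \sum_{j=1}^{t}M^{t-j}\Psi^j\Big) \\
&= M^tE(W^0) + \sum_{j=1}^{t}M^{t-j}E(\Psi^j) \\
&= \mu_0u + dt\mu u,
\end{aligned} \]
where u is a unitary vector of size N. It is a uniform distribution. The first part of the lemma is thus proved.

Next, we consider the workload variance at time t, $E[\nu_{de}^t]$. Let $\bar{\Psi}^t = \sum_{c=1}^{d}\bar{\Phi}^{t'-d+c}$, that is, let $\bar{\Psi}^t$ be the uniform distribution of workloads that are generated/consumed in round t. Then, $\bar{W}^t = \bar{W}^{t-1} + \bar{\Psi}^t$. By the definition of workload variance, we have
\[ \begin{aligned}
E[\nu_{de}^t] &= E(\|W^t - \bar{W}^t\|^2) \\
&= E(\|MW^{t-1} - \bar{W}^{t-1}\|^2) + E(\|\Psi^t - \bar{\Psi}^t\|^2) \\
&= E(\|M^{t+1}W^0 - \bar{W}^0\|^2) + \sum_{i=0}^{t}E(\|M^i\Psi^{t+1-i} - \bar{\Psi}^{t+1-i}\|^2).
\end{aligned} \]
To prove the lemma regarding $E[\nu_{de}^t]$, it suffices to show that for $0 \le i \le t$,
\[ E(\|M^i\Psi^{t+1-i} - \bar{\Psi}^{t+1-i}\|^2) = b^{id}sN\sigma^2 - d\sigma^2. \]
It can be shown by induction. We first consider $E(\|\Psi^t - \bar{\Psi}^t\|^2)$. It is the workload variance augmented in a round of dimension-exchange pairwise balancing operations. By the definition of $\Psi$, we have that
\[ \begin{aligned}
E(\|\Psi^t - \bar{\Psi}^t\|^2) &= E\Big(\Big\|\sum_{c=1}^{d}\Big(\prod_{j=d}^{c+1}M_j\,\Phi^{t'-d+c} - \bar{\Phi}^{t'-d+c}\Big)\Big\|^2\Big) \\
&= \sum_{c=1}^{d}\Big[E\Big(\Big\|\prod_{j=d}^{c+1}M_j\,\Phi^{t'-d+c} - \bar{\Phi}^{t'-d+c}\Big\|^2\Big)\Big] \\
&= \sum_{c=1}^{d-1}(b^cN - 1)\sigma^2 + (N - 1)\sigma^2 \\
&= sN\sigma^2 - d\sigma^2,
\end{aligned} \]
where the second step is due to the fact that $\prod_{j=d}^{c+1}M_j\,\Phi^{t'-d+c} - \bar{\Phi}^{t'-d+c}$ are zero-mean independent random variables for $1 \le c \le d$, and the third step is due to the following reasons. Each component of $\prod_{j=d}^{c}M_j\Phi$ for any c, c < d, is recursively a combination of two components of $\prod_{j=d}^{c+1}M_j\Phi$ with coefficients $1 - \lambda$ and $\lambda$. It can thus be inferred that a component of $\prod_{j=d}^{c}M_j\Phi$ is a combination of $2^{d-c+1}$ components of $\Phi$ with coefficients as follows.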

3 2 2 2 3 b2 b3 d 2222 2 d?2 d? d dd?2d?2 Combinatorialcoecientsai,where=? d d?22d?d?pa2i b Consequently,fromLemma4.,itfollowsthatE(jjcdMj?jj2)=Nbd?c+2?2. Weproceedbyinductiononi.AssumeE(jjMi d t+?i? t+?ijj2)=bidsn2?d2.we bd tisindependentoftaswell.then,e(jjmi+ thenconsidere(jjmi+ t?i? t?ijj2).sincetiisassumedtobeindependentoftimet, suxoperatorsj=dmjredistributestheworkloadsofmi t?i+jj2).fromtheargumentintheprecedingparagraph,itisknownthatasequenceof t?i? t?ijj2)=e(jj(j=dmj)mi insuchawaythateachofits t?i+? whichconcludestheinductionandprovesthesecondpartofthelemma. ofthetable.consequentlye(jjmi+ componentsbecomesacombinationofits2dcomponentswithcoecientsasinthelastrow Fromthelemma,itisevidentthatE[tde]isminimizedat==2overallpossiblechoices t?i? t?ijj2)=bdbidsn2?d2=bid+dsn2?d2, ofthedimension-exchangeparameter.thus,wehave models.noticethatlemma5.2holdsundertheassumptionthattheworkloadgeneration/consumptionratiostiineachpairwisebalancingstepofaroundofdimension-exchange operationhasthesamestatisticalcharacteristicsasthoseinadiusionoperation.itistherefore fairtocomparee[tdf]ofeq.(27)withe[tde]ofeq.(29).considertheall-portcommunication model.substituting=d+forine[tdf]and=2forine[tde],weobtain WefurthercompareE[tade]withE[tadf]inbothone-portandall-portcommunication E[tade]E[tode]: (3) E[tade]=(2? E[tadf]=d+ d[?(d 2d?)?=2t+ d+)t+]n2?(t+)2+?=2dn2?(t+)d2+(2? (d+)t+20?20 Itcanbeeasilyveriedthatthecoecientof20inE[ade]issmallerthanthatinE[tadf],and 2d?) thatthecoecientof2ine[tade]issmallerthanthatine[tadf]whentn=d.hence,for 2td+d20?20: tn=d, interestinpractice. SinceNdinthemeshandthetorus,theaboverelationshipholdsforanytimeinstantof E[tade]E[tadf]: processorinasinglediusionoperationisexpectedtobedwithvarianced2.then,e[tdf] ofeq.(27)becomes Inthecaseoftheone-portcommunicationmodel,theworkloadgenerated/consumedbya (at+20+?at+?ad2)n?(t+)d2?20: 23

Clearly, $E[\nu_{ade}^t] \le E[\nu_{adf}^t]$ at any time t. Conclusively, we obtain the following theorem.

Theorem 5.2 Suppose processors are running synchronous dimension-exchange and diffusion load balancing processes under Assumption 2.1. Then, $E[\nu_{ade}^t] \le E[\nu_{ode}^t]$, $E[\nu_{adf}^t] \le E[\nu_{odf}^t]$, and $E[\nu_{ade}^t] \le E[\nu_{adf}^t]$ in both one-port and all-port communication models.

6 Numerical experiments

In the preceding two sections, we explored a number of relationships between the dimension-exchange and the diffusion methods with respect to their efficiencies and balancing qualities. In order to obtain an idea of the magnitude of their differences, we conducted a statistical simulation of these load balancing algorithms on various topologies and sizes of communication networks, using synthetic workload distributions. The experimental results also serve to verify the theoretical results.

The experiment includes three parts: a simulation of synchronous load balancing in a static workload situation, a simulation of asynchronous load balancing in the dynamic situation, and a simulation of synchronous load balancing in the dynamic situation. In each simulation, the initial workload distribution W is assumed to be a random vector, each element w of which is drawn independently from an identical uniform distribution in [0, 1000]. Each data point obtained in the experiment is the average of 20 runs, using different random initial workload distributions and different workload generation ratios. We also assume that the underlying system implements the all-port communication model, so that a dimension-exchange balancing operation takes 2n diffusion operations in an n-D mesh or torus. A diffusion operation is taken as a basic time step in a load balancing process.

In the simulation of static synchronous load balancing processes, we measure the number of communication steps, denoted by T, necessary for arriving at a globally balanced state. In the simulation, we define the global balanced state to be the state in which the system workload variance is less than or equal to one. Figure 4 and Figure 5 plot the simulation results from different load balancing algorithms executed in the ring of sizes (N) varying from 2 to 128 nodes and in the 2-D mesh of sizes varying from 2×2 to 32×32, respectively. These two figures clearly indicate that the dimension-exchange method outperforms the diffusion method even in the all-port communication model. In particular, we see that the ODE algorithm does accelerate the dimension-exchange load balancing process significantly. In the ring of 64 nodes, for example, $T_{ode} = 98$ with the ODE algorithm while $T_{ade} \simeq T_{odf} = 1305$ and $T_{adf} = 1684$ with the others. Its improvement over the ADE algorithm reaches as high as 92.5%. In Figure 5, we also see that the number of communication steps T in a 2-D mesh is dependent only on the size of its larger dimension and is insensitive to the size of its smaller dimension. This observation was proved to be true in both the mesh and the torus in [9]. Thus, an ODE load balancing process in a 64-ary 2-cube only requires about 196 communication steps for arriving at a global balanced state. It really puts forth the ODE algorithm as a practical method for dynamic global balancing in real multicomputers.
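The static synchronous experiment can be imitated on a small scale. The sketch below is not the simulator used for Figures 4 and 5; it is a minimal illustration on a ring under our own assumptions: a ring of 16 nodes, a fixed random seed, a two-color edge ordering for the dimension-exchange sweep, and the hypothetical helper names variance, diffusion_step, exchange_sweep and steps_to_balance. It counts communication steps in the all-port model, where a diffusion operation costs one step and a dimension-exchange operation on a ring costs 2n = 2 steps, until the workload variance drops to one or below.

```python
import math
import random


def variance(w):
    """System workload variance: squared distance from the uniform distribution."""
    mean = sum(w) / len(w)
    return sum((x - mean) ** 2 for x in w)


def diffusion_step(w, alpha):
    """One synchronous diffusion operation on a ring."""
    n = len(w)
    return [w[i] + alpha * (w[i - 1] - w[i]) + alpha * (w[(i + 1) % n] - w[i])
            for i in range(n)]


def exchange_sweep(w, lam):
    """One dimension-exchange operation on an even ring (two edge colors)."""
    n = len(w)
    w = list(w)
    for color in (0, 1):
        for i in range(color, n, 2):
            j = (i + 1) % n
            wi, wj = w[i], w[j]
            w[i] = (1 - lam) * wi + lam * wj
            w[j] = (1 - lam) * wj + lam * wi
    return w


def steps_to_balance(w, operation, cost, bound=1.0, max_ops=100000):
    """Apply `operation` until the variance is at most `bound`."""
    ops = 0
    while variance(w) > bound and ops < max_ops:
        w = operation(w)
        ops += 1
    return ops * cost, w


random.seed(1)
k = 16
initial = [random.uniform(0, 1000) for _ in range(k)]
alpha_adf = 1 / 3                                  # 1/(d+1) with d = 2
alpha_odf = 1 / (3 - math.cos(2 * math.pi / k))    # 1/(d+1-cos(2*pi/k))
results = {
    "ADE": steps_to_balance(initial, lambda w: exchange_sweep(w, 0.5), cost=2),
    "ADF": steps_to_balance(initial, lambda w: diffusion_step(w, alpha_adf), cost=1),
    "ODF": steps_to_balance(initial, lambda w: diffusion_step(w, alpha_odf), cost=1),
}
for name, (steps, _) in results.items():
    print(name, steps)
```

The ODE algorithm is left out on purpose, since its optimal exchange parameter is not restated in this section; it can be added by passing the appropriate value of lam to exchange_sweep.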

Figure 4: The number of communication steps necessary for reaching a globally balanced state during a static synchronous load balancing process in the ring of sizes varying from 4 to 128 nodes (log2(T) versus log2(N); curves: ADE, ODE, ADF, ODF)

Figure 5: The number of communication steps necessary for reaching a globally balanced state during a static synchronous load balancing process in the 2-D mesh of sizes varying from 2×2 to 32×32 (log2(T) versus mesh size; curves: ADE, ODE, ADF, ODF)

Furthermore, in order to examine the effects of a single load balancing operation, we plot in Figure 6 the system workload variance in the first 100 steps of various load balancing processes in the ring of 32 nodes. The figure illustrates that the ODE algorithm pulls down the system workload variance sharply although its initial reduction ratio seems to be not as satisfactory as that of the ADE algorithm. The ODF and the ADF algorithms have the same relationship in their reduction ratios. This says that both the ODE and the ODF algorithms may not outperform their local average balancing counterparts in the short term.

Figure 6: Reduction of the workload variance during a static synchronous load balancing process in the ring of 32 nodes (curves: ADE, ODE, ADF, ODF)

In the dynamic situation, we assume that the expected workload generation ratio of a processor at each time step is 100 with the variance of 30 and the consumption ratio is a constant 100. In the simulation of asynchronous load balancing, we use a simple invocation policy such that once a processor's workload drops or rises beyond a pair of preset bounds, $w_{underload}$ and $w_{overload}$, the processor would activate a load balancing operation. Evidently, the pair of thresholds determines the degree of asynchronism of a load balancing process. Suppose $w_{underload}$ and $w_{overload}$ are symmetric with respect to the expected workload of a processor, E(w). We then measure the range between $w_{underload}$ and $w_{overload}$ by an index range. Since E(w) = 500 at any time, it follows that $w_{underload} = 500 - range/2$ and $w_{overload} = 500 + range/2$. Figures 7 and 8 plot the system workload variances resulting from different load balancing algorithms in a ring of 64 nodes and a mesh of size 16×16 for the case of range = 600.

From these two figures, it can be seen that the ADE algorithm reduces the initial system workload variance more rapidly than the diffusion method and keeps it bounded at a much lower level. It can also be observed that both the ODE and the ODF algorithms, the optimally tuned algorithms for global synchronous load balancing, do not gain significant benefits in asynchronous implementations.
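The invocation policy above amounts to a pair of symmetric thresholds around the expected workload. A minimal sketch, assuming hypothetical helper names thresholds and invokers and a strict comparison against the bounds:

```python
def thresholds(expected, span):
    """Return (w_underload, w_overload), symmetric around the expected workload."""
    return expected - span / 2, expected + span / 2


def invokers(workloads, expected=500, span=600):
    """Indices of processors whose workload lies beyond the preset bounds
    and which would therefore activate a load balancing operation."""
    low, high = thresholds(expected, span)
    return [i for i, w in enumerate(workloads) if w < low or w > high]


print(thresholds(500, 600))
print(invokers([150, 500, 801, 200, 800]))
```

A smaller span makes more processors invoke load balancing at the same time, so the process moves toward the synchronous case as the span shrinks.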

Figure 7: Change of the workload variance in the first 200 steps of a dynamic asynchronous load balancing process in the ring of size 64 (curves: ADE, ODE, ADF, ODF)

Figure 8: Change of the workload variance in the first 200 steps of a dynamic asynchronous load balancing process in the mesh of size 16×16 (curves: ADE, ODE, ADF, ODF)

Figure 9: Change of the workload variances during a dynamic synchronous load balancing process in a 16×16 torus (curves: ADE, ODE, ADF, ODF)

Figure 10: Change of the system workload variance during a dynamic synchronous load balancing process in a 16×16 mesh (curves: ADE, ODE, ADF, ODF)

Synchronous implementations of load balancing are special cases of asynchronous implementations in which range is set to zero so that all processors participate in load balancing simultaneously. The simulation of synchronous implementations in the static situation was conducted in the first experiment, and its results in a ring of 32 nodes are reported in Figure 6. Figures 9 and 10 present the simulation results of dynamic synchronous implementations in the 16×16 torus and the 16×16 mesh. In agreement with the findings from Figure 6, Figures 9 and 10 show that the superiority of the dimension-exchange method over the diffusion method

holds under the synchronous invocation policies as well, and that the ADE algorithm has an advantage over the diffusion method in both short and long terms.

7 Conclusions

In this paper, we made a comparison between two classes of nearest neighbor load balancing algorithms, the dimension-exchange (DE) and the diffusion (DF) methods, with respect to their efficiency in driving any initial workload distribution to a uniform distribution and their ability in controlling the growth of the variance among the processors' workloads. We focused on their four instances, the ADE, the ODE, the ADF and the ODF, which are the most common versions in practice. The comparison was made comprehensively in both one-port and all-port communication models with consideration of various implementation strategies: synchronous/asynchronous invocation policies and static/dynamic random workload behaviors.

Let "$a \succ b$" denote the relationship that a outperforms b, and "$a \approx b$" the relationship that a is approximately equivalent to b in performance. Then, our comparative results can be summarized as in Tables 3 and 4.

Table 3: Summary of comparative results in the one-port communication model in n-D meshes and n-D tori.

                Static load balancing                        Dynamic load balancing
Synchronous     ODE $\succ$ ADE $\succ$ ODF $\approx$ ADF    ADE $\succ$ ODE; ADF $\succ$ ODF; ADE $\succ$ ADF
Asynchronous    ADE $\succ$ {ADF, ODF};                      same as left
                ADF $\succ$ ODF in case n = 1;
                ODF $\succ$ ADF in case n $\ge$ 2

Table 4: Summary of comparative results in the all-port communication model in n-D meshes and n-D tori.

                Static load balancing                          Dynamic load balancing
Synchronous     ODE $\succ$ ADE $\approx$ ODF $\approx$ ADF    ADE $\succ$ ODE; ADF $\succ$ ODF; ADE $\succ$ ADF
Asynchronous    {ADF, ODF} $\succ$ ADE $\succ$ ODE;            same as left
                ADF $\succ$ ODF in case n = 1;
                ODF $\succ$ ADF in case n $\ge$ 2

Specifically, we showed that the dimension-exchange method outperforms the diffusion method in the one-port communication model. In particular, the ODE algorithm lends itself best to synchronous implementation in the static situation. We also revealed the superiority of the dimension-exchange method in synchronous load balancing even in the all-port communication model. The strength of the diffusion method is in asynchronous implementation in the all-port communication model. The ODF algorithm performs best in that case.

The comparative study not only provides an insight into nearest neighbor load balancing algorithms, but also offers practical guidelines to system developers in designing load balancing architectures for various parallel computational paradigms. We applied both the diffusion and the dimension-exchange methods in distributed branch-and-bound computations, and partly verified our comparative results in both static and dynamic asynchronous implementations on the platforms of Parsytec GCPP (PowerPC-based) and Parsytec GCel (transputer-based) multicomputers [17]. We also evaluated their synchronous performances in real applications in periodic re-mapping of data parallel computations in [19].

Acknowledgments

This work is supported in part by NSF MIP-9309489 and the DFG-Forschergruppe "Effiziente Nutzung massiv paralleler Systems". We are grateful to H. L. Xie for his careful proofreading and the anonymous referees for their valuable comments.

References

[1] I. Ahmad, A. Ghafoor, and K. Mehrotra. Performance prediction for distributed load balancing on multicomputer systems. In Proceedings of Supercomputing'91, pages 830–839 (1991).
[2] L. V. Kale. Comparing the performance of two dynamic load distribution methods. In Proceedings of International Conference on Parallel Processing, pages 8–12 (1988).
[3] V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing, 22(1):60–79 (1994).
[4] M. Willebeek-LeMair and A. P. Reeves. Strategies for dynamic load balancing on highly parallel computers. IEEE Transactions on Parallel and Distributed Systems, 4(9):979–993 (1993).
[5] C.-Z. Xu and F. C. M. Lau. Analysis of the generalized dimension exchange method for dynamic load balancing. Journal of Parallel and Distributed Computing, 16(4):385–393 (1992).
[6] C.-Z. Xu and F. C. M. Lau. Iterative dynamic load balancing in multicomputers. Journal of Operational Research Society, 45(7):786–796 (1994).

[7] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed computation: Numerical methods. Prentice-Hall Inc. (1989).
[8] G. Cybenko. Load balancing for distributed memory multiprocessors. Journal of Parallel and Distributed Computing, 7:279–301 (1989).
[9] C.-Z. Xu and F. C. M. Lau. The generalized dimension exchange method for load balancing in k-ary n-cubes and variants. Journal of Parallel and Distributed Computing, 24(1):72–85 (1995).
[10] S. L. Johnsson and C.-T. Ho. Spanning graphs for optimum broadcasting and personalized communication in hypercubes. IEEE Transactions on Computers, 38(9):1249–1268 (1989).
[11] D. W. Krumme, G. Cybenko, and K. N. Venkataraman. Gossiping in minimal time. SIAM Journal on Computing, 21(1):111–139 (1992).
[12] J. B. Boillat. Load balancing and Poisson equation in a graph. Concurrency: Practice and Experience, 2(4):289–313 (1990).
[13] J.-W. Hong, X.-N. Tan, and M. Chen. From local to global: an analysis of nearest neighbor balancing on hypercube. In Proceedings of ACM-SIGMETRICS, pages 73–82 (1988).
[14] X.-S. Qian and Q. Yang. Load balancing on generalized hypercube and mesh multiprocessors with LAL. In Proceedings of 11th International Conference on Distributed Computing Systems, pages 402–409 (1991).
[15] C.-Z. Xu and F. C. M. Lau. Optimal parameters for load balancing with the diffusion method in mesh networks. Parallel Processing Letters, 4(2):139–147 (1994).
[16] S. H. Hosseini, B. Litow, M. Malkawi, J. McPherson, and K. Vairavan. Analysis of a graph coloring based distributed load balancing algorithm. Journal of Parallel and Distributed Computing, 10:160–166 (1990).
[17] C.-Z. Xu, S. Tschoeke, and B. Monien. Performance evaluation of load distribution strategies in parallel branch-and-bound computations. Technical report, Dept. of Electrical and Computer Engg., Wayne State University (1995).
[18] R. Diekmann, D. Meyer, and B. Monien. Parallel decomposition of unstructured FEM-meshes. Technical report, Dept. of Mathematics and Computer Science, University of Paderborn, Germany (1995).
[19] C.-Z. Xu and F. C. M. Lau. Decentralized remapping of data-parallel computations with the generalized dimension exchange method. In Proceedings of 1994 Scalable High Performance Computing Conference, pages 414–421. IEEE Computer Society Press (1994).
[20] J. Song. A partially asynchronous and iterative algorithm for distributed load balancing. Parallel Computing, 20(6):853–868 (1994).

[21] R. Luling and B. Monien. A dynamic distributed load balancing algorithm with provable good performance. In Proceedings of 5th ACM Symposium on Parallel Algorithms and Architectures, pages 164–172 (1993).
[22] W. J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 39(6):775–785 (1990).
[23] L. M. Ni and P. K. McKinley. A survey of wormhole routing techniques in direct networks. IEEE Computer, 26:62–76 (1993).
[24] G. Ramanathan and J. Oren. Survey of commercial parallel machines. ACM Computer Architecture News, 21(3):13–33 (1993).
[25] D. M. Nicol and J. H. Saltz. Dynamic remapping of parallel computations with varying resource demands. IEEE Transactions on Computers, 37(9):1073–1087 (1988).
[26] S. Ranka, Y. Won, and S. Sahni. Programming a hypercube multicomputer. IEEE Software, 5:69–77 (1988).
[27] Y. Shih and J. Fier. Hypercube systems and key applications. In K. Hwang and D. Degroot, editors, Parallel Processing for Supercomputers and Artificial Intelligence, pages 203–243. McGraw-Hill Publishing Co. (1989).
[28] S. Fiorini and R. J. Wilson. Edge-coloring of graphs. In L. W. Beineke and R. J. Wilson, editors, Selected Topics in Graph Theory, pages 103–125. Academic Press (1978).
[29] B. Ghosh and S. Muthukrishnan. Dynamic load balancing in distributed networks by random matchings. In Proceedings of 6th ACM Symposium on Parallel Algorithms and Architectures (1994).
[30] A. N. Choudhary, B. Narahari, and R. Krishnamurti. An efficient heuristic scheme for dynamic remapping of parallel computations. Parallel Computing, 19:621–632 (1993).
[31] J. De Keyser and D. Roose. Load balancing data parallel programs on distributed memory computers. Parallel Computing, 19:1199–1219 (1993).