
Compiler and Run-Time Support for Adaptive Load Balancing in Software Distributed Shared Memory Systems*

Sotiris Ioannidis and Sandhya Dwarkadas
{si,sandhya}@cs.rochester.edu
Department of Computer Science, University of Rochester, Rochester, NY 14627-0226

Abstract. Networks of workstations offer inexpensive and highly available high performance computing environments. A critical issue for achieving good performance in any parallel system is load balancing, even more so in workstation environments where the machines might be shared among many users. In this paper, we present and evaluate a system that combines compiler and run-time support to achieve load balancing dynamically on software distributed shared memory programs. We use information provided by the compiler to help the run-time system distribute the work of the parallel loops, not only according to the relative power of the processors, but also in such a way as to minimize communication and page sharing.

1 Introduction

Clusters of workstations, whether uniprocessors or symmetric multiprocessors (SMPs), offer cost-effective and highly available parallel computing environments. Software distributed shared memory (SDSM) provides a shared memory abstraction on a distributed memory machine, with the advantage of ease-of-use. Previous work [5] has shown that an SDSM run-time can prove to be an effective target for a parallelizing compiler. The advantages of using an SDSM system include reduced complexity at compile-time, and the ability to combine compile-time and run-time information to achieve better performance ([6, 18]).

One issue in achieving good performance in any parallel system is load balancing. This issue is even more critical in a workstation environment where the machines might be shared among many users. In order to maximize performance based on available resources, the parallel system must not only optimally distribute the work according to the inherent computation and communication demands of the application, but also according to available computation and communication resources.

* This work was supported in part by NSF grants CDA-9401142, CCR-9702466, and CCR-9705594; and an external research grant from Digital Equipment Corporation.

In this paper, we present and evaluate a system that combines compiler and run-time support to achieve load balancing dynamically on SDSM programs. The compiler provides access pattern information to the run-time at the points in the code that will be executed in parallel. The run-time uses these points to determine available computational and communication resources. Based on the access patterns across phases, as well as on available computing power, the run-time can then make intelligent decisions not only to distribute the computational load evenly, but also to minimize communication overhead in the future. The result is a system that adapts both to changes in access patterns as well as to changes in computational power, resulting in reduced execution time.

Our target run-time system is TreadMarks [2], along with the extensions for prefetching and consistency/communication avoidance described in [6]. We implemented the necessary compiler extensions in the SUIF [1] compiler framework. Our experimental environment consists of eight DEC AlphaServer 2100 4/233 computers, each with four 21064A processors operating at 233MHz. Preliminary results show that our system is able to adapt to changes in load, with performance within 20% of ideal.

The rest of this paper is organized as follows. Section 2 describes the run-time system, the necessary compiler support, and the algorithm used to make dynamic load balancing decisions. Section 3 presents some preliminary results. Section 4 describes related work. Finally, we present our conclusions and discuss on-going work in Section 5.

2 Design and Implementation

We first provide some background on TreadMarks [2], the run-time system we used in our implementation. We then describe the compiler support followed by the run-time support necessary for load balancing.

2.1 The Base Software DSM Library

TreadMarks [2] is an SDSM system built at Rice University. It is an efficient user-level SDSM system that runs on commonly available Unix systems. TreadMarks provides parallel programming primitives similar to those used in hardware shared memory machines, namely, process creation, shared memory allocation, and lock and barrier synchronization. The system supports a release consistent (RC) memory model [10], requiring the programmer to use explicit synchronization to ensure that changes to shared data become visible.

The virtual memory hardware is used to detect accesses to shared memory. Consequently, the consistency unit is a virtual memory page. TreadMarks uses a lazy invalidate [14] version of RC and a multiple-writer protocol [3] to reduce the overhead involved in implementing the shared memory abstraction. The multiple-writer protocol reduces the effects of false sharing with such a large consistency unit. With this protocol, two or more processors can simultaneously modify their own copy of a shared page. Their modifications are merged at the next synchronization operation in accordance with the definition of RC, thereby reducing the effects of false sharing. The merge is accomplished through the use of diffs. A diff is a run-length encoding of the modifications made to a page, generated by comparing the page to a copy saved prior to the modifications (called a twin).

With the lazy invalidate protocol, a process invalidates, at the time of an acquire synchronization operation [10], those pages for which it has received notice of modifications by other processors. On a subsequent page fault, the process fetches the diffs necessary to update its copy.

2.2 Compile-Time Support for Load Balancing

For the source-to-source translation from a sequential program to a parallel program using TreadMarks, we use the Stanford University Intermediate Format (SUIF) [11] compiler. The SUIF system is organized as a set of compiler passes built on top of a kernel that defines the intermediate format. The passes are implemented as separate programs that typically perform a single analysis or transformation and then write the results out to a file. The files always use the same format.

The input to the compiler is a sequential version of the code. The output that we start with is a version of the code parallelized for shared address space machines. The compiler generates a single-program, multiple-data (SPMD) program that we modified to make calls to the TreadMarks run-time library. Alternatively, the user can provide the SPMD program (instead of having the SUIF compiler generate it) by identifying the parallel loops in the program that are executed by all processors.

Our SUIF pass extracts the shared data access patterns in each of the SPMD regions, and feeds this information to the run-time system. The pass is also responsible for adding hooks in the parallelized code to allow the run-time library to change the load distribution in the parallel loops if necessary.

Access pattern extraction. In order to generate access pattern summaries, our SUIF pass walks through the program looking for accesses to shared memory (identified using the sh_ prefix). A regular section [12] is then created for each such shared access. Regular section descriptors (RSDs) concisely represent the array accesses in a loop nest. The RSDs represent the accessed data as linear expressions of the upper and lower loop bounds along each dimension, and include stride information. This information is combined with the corresponding loop boundaries of that index, and the size of each dimension of the array, to determine the access pattern. Depending on the kind of data sharing between parallel tasks, we follow different strategies of load redistribution in case of imbalance. We will discuss these strategies further in Section 2.3.
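As a concrete illustration of the kind of summary an RSD carries, the following sketch (ours, not the paper's actual SUIF data structures; all names are hypothetical) records a one-dimensional access of the form coeff*i + offset over a strided loop, and derives from it the byte range and number of pages the task touches, a proxy for potential page sharing:

```c
#include <stddef.h>

/* Hypothetical regular section descriptor for a 1-D access coeff*i + offset,
 * with i ranging over [lower, upper) in steps of stride. */
typedef struct {
    long lower, upper, stride;   /* loop bounds and stride                 */
    long coeff, offset;          /* linear index expression coeff*i+offset */
    size_t elem_size;            /* size in bytes of one array element     */
} rsd_t;

/* Byte range [first, last] touched by the access (assumes coeff > 0). */
void rsd_byte_range(const rsd_t *r, long *first, long *last)
{
    /* last iteration index actually reached by the strided loop */
    long last_i = r->lower + ((r->upper - 1 - r->lower) / r->stride) * r->stride;
    *first = (r->coeff * r->lower + r->offset) * (long)r->elem_size;
    *last  = (r->coeff * last_i + r->offset + 1) * (long)r->elem_size - 1;
}

/* Number of pages the access spans, given the consistency unit size. */
long rsd_pages(const rsd_t *r, long page_size)
{
    long first, last;
    rsd_byte_range(r, &first, &last);
    return last / page_size - first / page_size + 1;
}
```

Combined with the array dimension sizes, such per-loop summaries are enough for the run-time to estimate which tasks share pages under a given partitioning.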

Prefetching. The access pattern information can also be used to prefetch data [6]. The TreadMarks library offers prefetching calls. These calls, given a range of addresses, prefetch the data contained in the pages in that range, and provide appropriate (read/write) permissions on the pages. This prefetching prevents faulting and consistency actions on uncached data that is guaranteed to be accessed in the future, as well as allows communication optimization by taking advantage of bulk transfer.

Load balancing interface and strategy. The run-time system needs a way of changing the amount of work assigned to each parallel task. This essentially means changing the number of loop iterations performed by each task. To accomplish this, we augment the code with calls to the run-time library before the parallel loops. This call is responsible for changing the loop bounds and consequently the amount of work done by each task.

The compiler can direct the run-time to choose between two partitioning strategies for distributing the parallel loops. The goal is to minimize execution time by considering both the communication and the computation components.

1. Shifting of loop boundaries: This approach changes the upper and lower bounds of each parallel task, so that tasks on lightly loaded processors will end up with more work than tasks on heavily loaded processors. With this scheme we avoid the creation of new boundaries, and therefore possible sharing, on the data accessed by our tasks. Applications with nearest neighbor sharing will benefit from this scheme. This policy however has the drawback of causing more communication at the time of load redistribution, since data has to be moved between all neighboring tasks rather than only from the slow processor.

2. Multiple loop bounds: This scheme is aimed at minimizing unnecessary data movement. Each process that uses this policy can access non-continuous data by using multiple loop bounds. This policy fragments the shared data among the processors, but reduces communication at load redistribution time. Hence, care must be taken to ensure that this fragmentation does not result in either false sharing or excess true sharing due to load redistribution.

2.3 Run-Time Load Balancing Support

The run-time library is responsible for keeping track of the progress of each process. It collects statistics about the execution time of each parallel task and adjusts the load accordingly. The execution time for each parallel task is maintained on a per-processor basis (TaskTime). The relative processing power of the processor (RelativePower) is calculated on the basis of the current load distribution (RelativePower) as well as the per-processor TaskTime, as described in Figure 1. Each processor executes this code prior to each parallel loop (SPMD region).

    float RelativePower[NumOfProcessors];
    float TaskTime[NumOfProcessors];
    float SumOfPowers;

    forall processors i
        RelativePower[i] /= TaskTime[i];
        SumOfPowers += RelativePower[i];
    forall processors i
        RelativePower[i] /= SumOfPowers;

Fig. 1. Algorithm to Determine Relative Processing Power

It is crucial not to try to adjust too quickly to changes in execution time, because sudden changes in the distribution of the data might cause the system to oscillate. To make this clear, imagine a processor that for some reason is very slow the first time we gather statistics. If we adjust the load, we will end up sending most of its work to another processor. This will cause it to be very fast the second time around, resulting in a redistribution once again.

For this reason we have added some hysteresis in our system. We redistribute the load only if the relative power remains consistently at odds with the current allocation through a certain number of task creation points. Similarly, load is balanced only if the variance in relative power exceeds a threshold. If the time of the slowest process is within n% of the time of the fastest process, we don't change the distribution of work. Otherwise, minor oscillations may result as communication is generated due to the adjusted load. In our experiments, we collect statistics for 10 task creation points before trying to adjust, and then if the time of the slowest process is not within 10% of the time of the fastest process we redistribute the work. These cut-offs were heuristically determined on the basis of our experimental platform, and are a function of the amount of computation and any extra communication.

Load Balancing vs. Locality Management. Previous work [20] has shown that locality management is at least as important as load balancing. This is even more so in software DSM, where the processors are not tightly coupled, making communication expensive. Consequently, we need to avoid unnecessary movement of data and at the same time minimize page sharing. In order to deal with this problem, the run-time library uses the information supplied by the compiler about what loop distribution strategy to use. In addition, it keeps track of accesses to the shared array as declared in previous SPMD regions. Changes in partitioning that might result in extra communication are avoided in favor of a small amount of load imbalance. We call this method locality-conscious load balancing.
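The relative-power computation of Figure 1 and the 10% hysteresis cut-off described above can be fleshed out into runnable routines. This is our sketch, with hypothetical function names, not the actual TreadMarks run-time code:

```c
/* Recompute each processor's share of the work from its measured task
 * time, following Fig. 1: scale the current share by the inverse of the
 * task time, then renormalize so the shares sum to 1. */
void update_relative_power(double rel_power[], const double task_time[], int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        rel_power[i] /= task_time[i];
        sum += rel_power[i];
    }
    for (int i = 0; i < n; i++)
        rel_power[i] /= sum;
}

/* Hysteresis cut-off: redistribute only when the slowest task time is
 * not within `threshold` (e.g. 0.10 for 10%) of the fastest. */
int should_redistribute(const double task_time[], int n, double threshold)
{
    double fastest = task_time[0], slowest = task_time[0];
    for (int i = 1; i < n; i++) {
        if (task_time[i] < fastest) fastest = task_time[i];
        if (task_time[i] > slowest) slowest = task_time[i];
    }
    return (slowest - fastest) / fastest > threshold;
}
```

For example, with equal initial shares and one processor taking twice as long as the other three, the update shifts that processor's share from 1/4 down to 1/7 of the work, and the 10% rule reports that redistribution is warranted.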

2.4 Example

Consider the parallel loop of Figure 2. Our compiler pass transforms this loop into that in Figure 3. The new code makes a redistribute call to the run-time library, providing it with all the necessary information to compute the access patterns (the arrays, the types of accesses, the upper and lower bounds of the loops, and the format of the expressions for the indices). The redistribute call computes the relative powers of the processors (using the algorithm shown in Figure 1), and then uses the access pattern information to decide how to distribute the workload.

    int sh_dat1[n], sh_dat2[n];

    for (i = lower_bound; i < upper_bound; i += stride)
        sh_dat1[a*i+b] += sh_dat2[c*i+d];

Fig. 2. Initial parallel loop.

    int sh_dat1[n], sh_dat2[n];

    redistribute(
        list of shared arrays,         /* sh_dat1, sh_dat2 */
        list of types of accesses,     /* read/write */
        list of lower bounds,          /* lower_bound */
        list of upper bounds,          /* upper_bound */
        list of coefficients and
            constants for indices      /* a, c, b, d */
    );
    while (there are still ranges) {
        lower_bound = new lower bound for that range;
        upper_bound = new upper bound for that range;
        for (i = lower_bound; i < upper_bound; i += stride)
            sh_dat1[a*i+b] += sh_dat2[c*i+d];
        range = range->next;
    }

Fig. 3. Parallel loop with added code that serves as an interface with the run-time library. The run-time system can then change the amount of work assigned to each parallel task.
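The effect of a redistribute call under the shifting-of-loop-boundaries strategy can be sketched as follows: given the relative powers, each task receives one contiguous slice of the iteration space, sized in proportion to its processor's power (a minimal illustration of ours with hypothetical names, not the actual run-time code):

```c
/* Shifting of loop boundaries: carve the iteration space [lower, upper)
 * into n contiguous slices, slice i sized in proportion to rel_power[i].
 * Task i then iterates from bounds[i] to bounds[i+1]; the last slice
 * absorbs rounding so every iteration is assigned exactly once. */
void shift_loop_bounds(long lower, long upper,
                       const double rel_power[], int n, long bounds[])
{
    long total = upper - lower;
    double acc = 0.0;            /* cumulative share of the power */
    bounds[0] = lower;
    for (int i = 1; i < n; i++) {
        acc += rel_power[i - 1];
        bounds[i] = lower + (long)(acc * total);
    }
    bounds[n] = upper;
}
```

Because the slices stay contiguous, no new partition boundaries (and hence no new page sharing) are introduced beyond those that already existed, which is exactly the property the first strategy is chosen for.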

3 Experimental Evaluation

3.1 Environment

Our experimental environment consists of eight DEC AlphaServer 2100 4/233 computers. Each AlphaServer is equipped with four 21064A processors operating at 233MHz and with 256MB of shared memory, as well as a Memory Channel network interface. Each AlphaServer runs Digital Unix 4.0D with TruCluster v. 1.5 extensions. The programs, the run-time library, and TreadMarks were compiled with gcc version 2.7.2.1 using the -O2 optimization flag.

3.2 Load Balancing Results

We evaluate our system on two applications: a matrix multiplication of three 256x256 shared matrices of longs (which is repeated 100 times) and Jacobi, with a matrix size of 2050x2050 floats. The current implementation only uses the first policy, shifting of loop boundaries, and does not use prefetching. To test the performance of our load balancing library, we introduced an artificial load on one of the processors of each SMP. This consists of a tight loop that writes on an array of 10240 longs. This load takes up 50% of the CPU time.

Our preliminary results appear in Figures 4 and 5. We present execution times on 1, 2, 4, 8, and 16 processors, using up to four SMPs. We added one artificial load for every four processors, except in the case of two processors where we only added one load. The load balancing scheme we use is the shifting of loop boundaries (we do not use multiple loop bounds). The first column shows the execution times for the cases where there was no load in the system. The second column shows the execution times with the artificial load, and finally the last column is the case where the system is loaded but we are using our load balancing library.

Fig. 4. Execution Times for Matrix Multiply (times in seconds on 1, 2, 4, 8, and 16 processors: no load, with load, and with load balance).

Fig. 5. Execution Time for Jacobi (times in seconds on 1, 2, 4, 8, and 16 processors: no load, with load, and with load balance).

The introduction of load slows down both matrix multiply and Jacobi by as much as 100% in the case of two processors (with the overhead at 4, 8 and 16 not being far off). Our load balancing strategy provides a significant improvement in performance compared to execution time with load. In order to determine how good the results of our load balancing algorithm are, we compare the execution times obtained using 8 processors with load and our load balance scheme, with that using 7 processors without any load. This 7-processor run serves as a bound on how well we can perform with load balancing, since that is the best we can hope to achieve (two of our eight processors are loaded, and operate at only 50% of their power, giving us the equivalent of seven processors). The results are presented in Figure 6. For matrix multiply, our load balancing algorithm is only 9% slower than the seven processor load free case. Jacobi is 20% slower, partly due to the fact that while computation can be redistributed, communication per processor remains the same.

Fig. 6. Comparison of the running times of the applications using our load balancing algorithm on 8 loaded processors, compared to the performance on 7 load free processors.

In Figure 7, we present a breakdown of the normalized execution time relative to that on 8 processors with no load, indicating the relative time spent in user code, in the protocol, and in communication and wait time (at synchronization points). When we use our load balancing algorithm, we reduce the time spent waiting at synchronization points relative to the execution time with load and no load balance, because we have a better distribution of work, and therefore improve overall performance.

Fig. 7. Breakup of normalized time for matrix multiplication and Jacobi, into time spent in user time, communication and wait at synchronization points, and protocol time.

Finally, we wanted to measure the overhead imposed by our run-time system. We ran matrix multiplication and Jacobi in a load free environment with and without use of our run-time library. The results are presented in Figure 8. In the worst case we impose less than 6% overhead.

Fig. 8. Running times for matrix multiplication and Jacobi in a load free environment, with and without use of our run-time library.

3.3 Locality-conscious Load Balancing Results

For the evaluation of our locality-conscious load balancing policy we used Shallow, with input size 514x514 matrices of doubles. Shallow operates on the interior elements of the arrays and then updates the boundaries along each dimension in parallel. Compiler parallelized code, or a naive implementation, would have each process update a part of the boundaries. This can result in multiple processes writing the same pages, i.e., false sharing. A smarter approach is to have the processes that own the boundary pages do the updates; this eliminates false sharing. Our integrated compiler/run-time system is able to make the decision at run-time, using the access pattern information provided by the compiler. It identifies which process caches the data and can repartition the work so that it maximizes locality.

We present our results in Figure 9. In these experiments we don't introduce any load imbalance to our system, since we want to evaluate our locality-conscious load balancing policy. We have optimized the manual parallelization to eliminate false sharing as suggested earlier. A naive compiler parallelization that doesn't consider data placement performs very poorly as the number of processors increases, because of the multiple writers on the same page. However, when we combine the compiler parallelization with our locality-conscious load balancing run-time system, the performance is equivalent to the hand optimized code.

4 Related Work

There have been several approaches to the problems of locality management and load balancing. Perhaps the most common approach is the task queue model. In this scheme, there is a central queue of loop iterations. Once a processor has finished its assigned portion, more work is obtained from this queue. There are several variations, including self-scheduling [23], fixed-size chunking [15], guided self-scheduling [22], and adaptive guided self-scheduling [7].

Markatos and LeBlanc in [20] argue that locality management is more important than load balancing in thread assignment. They introduce a policy they call memory-conscious scheduling that assigns threads to processors whose local memory holds most of the data the thread will access. Their results show that the looser the interconnection network, the more important the locality management.

Based on the observation that the locality of the data that a loop accesses is very important, affinity scheduling was introduced in [19]. The loop iterations are divided over all the processors equally in local queues. When a processor is idle, it removes 1/k of the iterations in its local work queue and executes them. k is a parameter of their algorithm, which they define as p in most of their experiments. If a processor's work queue is empty, it finds the most loaded processor and removes 1/p of the iterations in that processor's work queue and executes them, where p is the number of processors.

Building on [19], Yan et al. in [24] suggest adaptive affinity scheduling. Their algorithm is similar to affinity scheduling, but their runtime system can modify k during the execution of the program. When a processor is loaded, k is increased so that other processors with a lighter load can get loop iterations from the loaded processor's local work queue. They present four possible policies for changing k: an exponential adaptive mechanism, a linear adaptive mechanism, a conservative adaptive mechanism, and a greedy adaptive mechanism.

In [4], Cierniak et al. study loop scheduling in heterogeneous environments with respect to programs, processors and the interconnection networks. Their results indicate that taking into account the relative computation power, as well as any heterogeneity in the loop format, while doing the loop distribution improves the overall performance of the application. Similarly, Moon and Saltz [21] also looked at applications with irregular access patterns. To compensate for load imbalance, they introduce periodic re-mapping, or re-mapping at predetermined points of the execution, and dynamic re-mapping, in which they determine if repartitioning is required at every time step.

In the context of dynamically changing environments, Edjlali et al. in [8] and Kaddoura in [13] present a run-time approach for handling such environments. Before each parallel section of the program they check if there is a need to re-map the loop. This is similar to our approach; however, their approach deals with message passing programs.

A discussion on global vs. local and distributed vs. centralized strategies for load balancing is presented in [25]. Based on the information they use to make load balancing decisions, they can be divided into local and global. Distributed and centralized refers to whether the load balancer is one master processor or distributed among the processors. The authors argue that depending on the application and system parameters, each of those schemes can be more suitable than the others.

The system that seems most related to ours is Adapt, presented in [17]. Adapt is implemented in concert with the Distributed Filaments software kernel [9], a DSM system. It monitors communication and page faults, and dynamically modifies loop boundaries so that the processes access data that are local if possible. Adapt is able to extract the access patterns by inspecting the patterns of the page faults. It can only recognize two patterns, nearest-neighbor and broadcast, and this limits its flexibility. In our system we use the compiler to extract the access patterns and provide them to the run-time system, making our approach more general and flexible.

Finally, there are systems like Condor [16] that support transparent migration of processes from one workstation to another. However, such systems don't support parallel programs efficiently.
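As a rough illustration of the affinity-scheduling family discussed above, the following sketch (ours; sequential and simplified, with hypothetical names, not the code of [19]) captures the core stealing rule: an idle processor takes 1/k of its own queue, or 1/p of the most loaded processor's queue when its own is empty:

```c
/* Sequential model of the stealing rule in affinity scheduling:
 * queue[i] holds the number of loop iterations left in processor i's
 * local queue.  Processor `self` takes 1/k of its own queue (rounded
 * up); if its queue is empty it steals 1/p of the most loaded queue.
 * Returns the number of iterations taken (0 when all queues are empty). */
long affinity_take(long queue[], int p, int self, int k)
{
    if (queue[self] > 0) {
        long take = (queue[self] + k - 1) / k;   /* ceil(len / k) */
        queue[self] -= take;
        return take;
    }
    int victim = 0;                              /* most loaded processor */
    for (int i = 1; i < p; i++)
        if (queue[i] > queue[victim]) victim = i;
    if (queue[victim] == 0)
        return 0;                                /* nothing left anywhere */
    long take = (queue[victim] + p - 1) / p;     /* ceil(len / p) */
    queue[victim] -= take;
    return take;
}
```

Chunks shrink geometrically as a queue drains, so most iterations run on the processor that owns their data, while idle processors still find work.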

Fig. 9. Running times of the three different implementations of Shallow, in seconds, on 1, 2, 4, 8, and 16 processors. The manual parallelization takes into account data placement in order to avoid page sharing. The compiler parallelization doesn't consider data placement. The adaptive parallelization uses the compiler parallelization with our run-time library, which adjusts the workload taking data placement into account dynamically.

Our system deals with software distributed shared memory programs, in contrast to closely coupled shared memory or message passing. Our load balancing method targets both irregularities of the loops as well as possible heterogeneous processors and load caused by competing programs. Furthermore, our system addresses locality management by trying to minimize communication and page sharing.

5 Conclusions

In this paper, we address the problem of load balancing in SDSM systems by coupling compile-time and run-time information. SDSM has unique characteristics that are attractive: it offers the ease of programming of a shared memory model in a widely available workstation-based message passing environment. However, multiple users and loosely connected processors challenge the performance of SDSM programs on such systems due to load imbalances and high communication latencies.

Our integrated system uses access information available at compile-time to dynamically adjust load at run-time based on the available relative processing power and communication speeds. The same access pattern information is also used to prefetch data. Preliminary results are encouraging. Performance tests on two applications and a fixed load indicate that the performance with load balance is within 9 and 20% of the ideal performance. Additionally, our system is able to partition the work so that processes access only their local data, minimizing false sharing. Our system identified regions where false sharing existed and changed the loop boundaries to avoid it. The performance on our third application, when the number of processors was high, was equivalent to the best possible workload partitioning.

Further work to collect results on a larger number of applications is necessary. In addition, for a more thorough evaluation, we need to determine the sensitivity of our strategy to dynamic changes in load, as well as to changes in the hysteresis factor used when determining when to redistribute work. The tradeoff between locality management and load must also be further investigated.

References

1. S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and C. W. Tseng. The SUIF compiler for scalable parallel machines. In Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, February 1995.
2. C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18-28, February 1996.
3. J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Techniques for reducing consistency-related information in distributed shared memory systems. ACM Transactions on Computer Systems, 13(3):205-243, August 1995.
4. Michal Cierniak, Wei Li, and Mohammed Javeed Zaki. Loop scheduling for heterogeneity. In Fourth International Symposium on High Performance Distributed Computing, August 1995.
5. A. L. Cox, S. Dwarkadas, H. Lu, and W. Zwaenepoel. Evaluating the performance of software distributed shared memory as a target for parallelizing compilers. In Proceedings of the 11th International Parallel Processing Symposium, pages 474-482, April 1997.
6. S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. An integrated compile-time/run-time software distributed shared memory system. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, October 1996.
7. D. L. Eager and J. Zahorjan. Adaptive guided self-scheduling. Technical Report 92-01-01, Department of Computer Science, University of Washington, January 1992.
8. Guy Edjlali, Gagan Agrawal, Alan Sussman, and Joel Saltz. Data parallel programming in an adaptive environment. In International Parallel Processing Symposium, April 1995.
9. V. W. Freeh, D. K. Lowenthal, and G. R. Andrews. Distributed Filaments: Efficient fine-grain parallelism on a cluster of workstations. In Proceedings of the First USENIX Symposium on Operating System Design and Implementation, pages 201-213, November 1994.
10. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26, May 1990.
11. The SUIF Group. An overview of the SUIF compiler system.
12. P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350-360, July 1991.
13. Maher Kaddoura. Load balancing for regular data-parallel applications on workstation network. In Communication and Architecture Support for Network-Based Parallel Computing, pages 173-183, February 1997.
14. P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13-21, May 1992.
15. C. Kruskal and A. Weiss. Allocating independent subtasks on parallel processors. In Transactions on Computer Systems, October 1985.
16. M. Litzkow and M. Solomon. Supporting checkpointing and process migration outside the Unix kernel. In Usenix Winter Conference, 1992.
17. David K. Lowenthal and Gregory R. Andrews. An adaptive approach to data placement.
18. H. Lu, A. L. Cox, S. Dwarkadas, R. Rajamony, and W. Zwaenepoel. Software distributed shared memory support for irregular applications. In Proceedings of the 6th Symposium on the Principles and Practice of Parallel Programming, pages 48-56, June 1996.
19. Evangelos P. Markatos and Thomas J. LeBlanc. Using processor affinity in loop scheduling on shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 5(4):379-400, April 1994.
20. Evangelos P. Markatos and Thomas J. LeBlanc. Load balancing versus locality management in shared-memory multiprocessors. In Proceedings of the 1992 International Conference on Parallel Processing, pages I:258-267, August 1992.
21. Bongki Moon and Joel Saltz. Adaptive runtime support for direct simulation Monte Carlo methods on distributed memory architectures. In Scalable High Performance Computing Conference, May 1994.
22. C. D. Polychronopoulos and D. J. Kuck. Guided self-scheduling: a practical scheduling scheme for parallel supercomputers. In Transactions on Computers, September 1992.
23. P. Tang and P. C. Yew. Processor self-scheduling: A practical scheduling scheme for parallel computers. In International Conference on Parallel Processing, August 1986.
24. Yong Yan, Canming Jin, and Xiaodong Zhang. Adaptively scheduling parallel loops in distributed shared-memory systems. In Transactions on Parallel and Distributed Systems, volume 8, January 1997.
25. Mohammed Javeed Zaki, Wei Li, and Srinivasan Parthasarathy. Customized dynamic load balancing for a network of workstations. Technical Report 602, Department of Computer Science, University of Rochester, December 1995.