
Compiler and Run-Time Support for Adaptive Load Balancing in Software Distributed Shared Memory Systems*

Sotiris Ioannidis and Sandhya Dwarkadas
{si,sandhya}@cs.rochester.edu
Department of Computer Science, University of Rochester, Rochester, NY 14627-0226

Abstract. Networks of workstations offer inexpensive and highly available high performance computing environments. A critical issue for achieving good performance in any parallel system is load balancing, even more so in workstation environments where the machines might be shared among many users. In this paper, we present and evaluate a system that combines compiler and run-time support to achieve load balancing dynamically on software distributed shared memory programs. We use information provided by the compiler to help the run-time system distribute the work of the parallel loops, not only according to the relative power of the processors, but also in such a way as to minimize communication and page sharing.

1 Introduction

Clusters of workstations, whether uniprocessors or symmetric multiprocessors (SMPs), offer cost-effective and highly available parallel computing environments. Software distributed shared memory (SDSM) provides a shared memory abstraction on a distributed memory machine, with the advantage of ease-of-use. Previous work [5] has shown that an SDSM run-time can prove to be an effective target for a parallelizing compiler. The advantages of using an SDSM system include reduced complexity at compile-time, and the ability to combine compile-time and run-time information to achieve better performance ([6, 18]).

One issue in achieving good performance in any parallel system is load balancing. This issue is even more critical in a workstation environment where the machines might be shared among many users. In order to maximize performance based on available resources, the parallel system must not only optimally distribute the work according to the inherent computation and communication demands of the application, but also according to available computation and communication resources.

* This work was supported in part by NSF grants CDA-9401142, CCR-9702466, and CCR-9705594; and an external research grant from Digital Equipment Corporation.

In this paper, we present and evaluate a system that combines compiler and run-time support to achieve load balancing dynamically on SDSM programs. The compiler provides access pattern information to the run-time at the points in the code that will be executed in parallel. The run-time uses these points to determine available computational and communication resources. Based on the access patterns across phases, as well as on available computing power, the run-time can then make intelligent decisions not only to distribute the computational load evenly, but also to minimize communication overhead in the future. The result is a system that adapts both to changes in access patterns as well as to changes in computational power, resulting in reduced execution time.

Our target run-time system is TreadMarks [2], along with the extensions for prefetching and consistency/communication avoidance described in [6]. We implemented the necessary compiler extensions in the SUIF [1] compiler framework. Our experimental environment consists of eight DEC AlphaServer 2100 4/233 computers, each with four 21064A processors operating at 233MHz. Preliminary results show that our system is able to adapt to changes in load, with performance within 20% of ideal.

The rest of this paper is organized as follows. Section 2 describes the run-time system, the necessary compiler support, and the algorithm used to make dynamic load balancing decisions. Section 3 presents some preliminary results. Section 4 describes related work. Finally, we present our conclusions and discuss on-going work in Section 5.

2 Design and Implementation

We first provide some background on TreadMarks [2], the run-time system we used in our implementation. We then describe the compiler support followed by the run-time support necessary for load balancing.

2.1 The Base Software DSM Library

TreadMarks [2] is an SDSM system built at Rice University. It is an efficient user-level SDSM system that runs on commonly available Unix systems. TreadMarks provides parallel programming primitives similar to those used in hardware shared memory machines, namely, process creation, shared memory allocation, and lock and barrier synchronization. The system supports a release consistent (RC) memory model [10], requiring the programmer to use explicit synchronization to ensure that changes to shared data become visible.

The virtual memory hardware is used to detect accesses to shared memory. Consequently, the consistency unit is a virtual memory page. TreadMarks uses a lazy invalidate [14] version of RC and a multiple-writer protocol [3] to reduce the overhead involved in implementing the shared memory abstraction. The multiple-writer protocol reduces the effects of false sharing with such a large consistency unit. With this protocol, two or more processors can simultaneously modify their own copy of a shared page. Their modifications are merged at the next synchronization operation in accordance with the definition of RC, thereby reducing the effects of false sharing. The merge is accomplished through the use of diffs. A diff is a run-length encoding of the modifications made to a page, generated by comparing the page to a copy saved prior to the modifications (called a twin).

With the lazy invalidate protocol, a process invalidates, at the time of an acquire synchronization operation [10], those pages for which it has received notice of modifications by other processors. On a subsequent page fault, the process fetches the diffs necessary to update its copy.

2.2 Compile-Time Support for Load Balancing

For the source-to-source translation from a sequential program to a parallel program using TreadMarks, we use the Stanford University Intermediate Format (SUIF) [11] compiler. The SUIF system is organized as a set of compiler passes built on top of a kernel that defines the intermediate format. The passes are implemented as separate programs that typically perform a single analysis or transformation and then write the results out to a file. The files always use the same format.

The input to the compiler is a sequential version of the code. The output that we start with is a version of the code parallelized for shared address space machines. The compiler generates a single-program, multiple-data (SPMD) program that we modified to make calls to the TreadMarks run-time library. Alternatively, the user can provide the SPMD program (instead of having the SUIF compiler generate it) by identifying the parallel loops in the program that are executed by all processors.

Our SUIF pass extracts the shared data access patterns in each of the SPMD regions, and feeds this information to the run-time system. The pass is also responsible for adding hooks in the parallelized code to allow the run-time library to change the load distribution in the parallel loops if necessary.

Access pattern extraction. In order to generate access pattern summaries, our SUIF pass walks through the program looking for accesses to shared memory (identified using the sh_ prefix). A regular section [12] is then created for each such shared access. Regular section descriptors (RSDs) concisely represent the array accesses in a loop nest. The RSDs represent the accessed data as linear expressions of the upper and lower loop bounds along each dimension, and include stride information. This information is combined with the corresponding loop boundaries of that index, and the size of each dimension of the array, to determine the access pattern. Depending on the kind of data sharing between parallel tasks, we follow different strategies of load redistribution in case of imbalance. We will discuss these strategies further in Section 2.3.
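As a concrete illustration of the kind of summary an RSD carries, the following sketch (ours, not the paper's actual SUIF data structures; all names are hypothetical) records a one-dimensional access of the form coeff*i + offset over a strided loop, and derives from it the byte range and number of pages the task touches, a proxy for potential page sharing:

```c
#include <stddef.h>

/* Hypothetical regular section descriptor for a 1-D access coeff*i + offset,
 * with i ranging over [lower, upper) in steps of stride. */
typedef struct {
    long lower, upper, stride;   /* loop bounds and stride                 */
    long coeff, offset;          /* linear index expression coeff*i+offset */
    size_t elem_size;            /* size in bytes of one array element     */
} rsd_t;

/* Byte range [first, last] touched by the access (assumes coeff > 0). */
void rsd_byte_range(const rsd_t *r, long *first, long *last)
{
    /* last iteration index actually reached by the strided loop */
    long last_i = r->lower + ((r->upper - 1 - r->lower) / r->stride) * r->stride;
    *first = (r->coeff * r->lower + r->offset) * (long)r->elem_size;
    *last  = (r->coeff * last_i + r->offset + 1) * (long)r->elem_size - 1;
}

/* Number of pages the access spans, given the consistency unit size. */
long rsd_pages(const rsd_t *r, long page_size)
{
    long first, last;
    rsd_byte_range(r, &first, &last);
    return last / page_size - first / page_size + 1;
}
```

Combined with the array dimension sizes, such per-loop summaries are enough for the run-time to estimate which tasks share pages under a given partitioning.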

Prefetching. The access pattern information can also be used to prefetch data [6]. The TreadMarks library offers prefetching calls. These calls, given a range of addresses, prefetch the data contained in the pages in that range, and provide appropriate (read/write) permissions on the pages. This prefetching prevents faulting and consistency actions on uncached data that is guaranteed to be accessed in the future, as well as allows communication optimization by taking advantage of bulk transfer.

Load balancing interface and strategy. The run-time system needs a way of changing the amount of work assigned to each parallel task. This essentially means changing the number of loop iterations performed by each task. To accomplish this, we augment the code with calls to the run-time library before the parallel loops. This call is responsible for changing the loop bounds and consequently the amount of work done by each task.

The compiler can direct the run-time to choose between two partitioning strategies for distributing the parallel loops. The goal is to minimize execution time by considering both the communication and the computation components.

1. Shifting of loop boundaries: This approach changes the upper and lower bounds of each parallel task, so that tasks on lightly loaded processors will end up with more work than tasks on heavily loaded processors. With this scheme we avoid the creation of new boundaries, and therefore possible sharing, on the data accessed by our tasks. Applications with nearest neighbor sharing will benefit from this scheme. This policy however has the drawback of causing more communication at the time of load redistribution, since data has to be moved between all neighboring tasks rather than only from the slow processor.

2. Multiple loop bounds: This scheme is aimed at minimizing unnecessary data movement. Each process that uses this policy can access non-continuous data by using multiple loop bounds. This policy fragments the shared data among the processors, but reduces communication at load redistribution time. Hence, care must be taken to ensure that this fragmentation does not result in either false sharing or excess true sharing due to load redistribution.

2.3 Run-Time Load Balancing Support

The run-time library is responsible for keeping track of the progress of each process. It collects statistics about the execution time of each parallel task and adjusts the load accordingly. The execution time for each parallel task is maintained on a per-processor basis (TaskTime). The relative processing power of the processor (RelativePower) is calculated on the basis of the current load distribution (RelativePower) as well as the per-processor TaskTime, as described in Figure 1. Each processor executes this code prior to each parallel loop (SPMD region).

    float RelativePower[NumOfProcessors];
    float TaskTime[NumOfProcessors];
    float SumOfPowers;

    forall processors i
        RelativePower[i] /= TaskTime[i];
        SumOfPowers += RelativePower[i];
    forall processors i
        RelativePower[i] /= SumOfPowers;

Fig. 1. Algorithm to Determine Relative Processing Power

It is crucial not to try to adjust too quickly to changes in execution time, because sudden changes in the distribution of the data might cause the system to oscillate. To make this clear, imagine a processor that for some reason is very slow the first time we gather statistics. If we adjust the load, we will end up sending most of its work to another processor. This will cause it to be very fast the second time around, resulting in a redistribution once again.

For this reason we have added some hysteresis in our system. We redistribute the load only if the relative power remains consistently at odds with the current allocation through a certain number of task creation points. Similarly, load is balanced only if the variance in relative power exceeds a threshold. If the time of the slowest process is within n% of the time of the fastest process, we don't change the distribution of work. Otherwise, minor oscillations may result as communication is generated due to the adjusted load. In our experiments, we collect statistics for 10 task creation points before trying to adjust, and then if the time of the slowest process is not within 10% of the time of the fastest process we redistribute the work. These cut-offs were heuristically determined on the basis of our experimental platform, and are a function of the amount of computation and any extra communication.

Load Balancing vs. Locality Management. Previous work [20] has shown that locality management is at least as important as load balancing. This is even more so in software DSM, where the processors are not tightly coupled, making communication expensive. Consequently, we need to avoid unnecessary movement of data and at the same time minimize page sharing. In order to deal with this problem, the run-time library uses the information supplied by the compiler about what loop distribution strategy to use. In addition, it keeps track of accesses to the shared array as declared in previous SPMD regions. Changes in partitioning that might result in extra communication are avoided in favor of a small amount of load imbalance. We call this method locality-conscious load balancing.
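The relative-power computation of Figure 1 and the 10% hysteresis cut-off described above can be fleshed out into runnable routines. This is our sketch, with hypothetical function names, not the actual TreadMarks run-time code:

```c
/* Recompute each processor's share of the work from its measured task
 * time, following Fig. 1: scale the current share by the inverse of the
 * task time, then renormalize so the shares sum to 1. */
void update_relative_power(double rel_power[], const double task_time[], int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        rel_power[i] /= task_time[i];
        sum += rel_power[i];
    }
    for (int i = 0; i < n; i++)
        rel_power[i] /= sum;
}

/* Hysteresis cut-off: redistribute only when the slowest task time is
 * not within `threshold` (e.g. 0.10 for 10%) of the fastest. */
int should_redistribute(const double task_time[], int n, double threshold)
{
    double fastest = task_time[0], slowest = task_time[0];
    for (int i = 1; i < n; i++) {
        if (task_time[i] < fastest) fastest = task_time[i];
        if (task_time[i] > slowest) slowest = task_time[i];
    }
    return (slowest - fastest) / fastest > threshold;
}
```

For example, with equal initial shares and one processor taking twice as long as the other three, the update shifts that processor's share from 1/4 down to 1/7 of the work, and the 10% rule reports that redistribution is warranted.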

2.4 Example

Consider the parallel loop of Figure 2. Our compiler pass transforms this loop into that in Figure 3. The new code makes a redistribute call to the run-time library, providing it with all the necessary information to compute the access patterns (the arrays, the types of accesses, the upper and lower bounds of the loops, and the format of the expressions for the indices). The redistribute call computes the relative powers of the processors (using the algorithm shown in Figure 1), and then uses the access pattern information to decide how to distribute the workload.

    int sh_dat1[n], sh_dat2[n];

    for (i = lower_bound; i < upper_bound; i += stride)
        sh_dat1[a*i+b] += sh_dat2[c*i+d];

Fig. 2. Initial parallel loop.

    int sh_dat1[n], sh_dat2[n];

    redistribute(
        list of shared arrays,         /* sh_dat1, sh_dat2 */
        list of types of accesses,     /* read/write */
        list of lower bounds,          /* lower_bound */
        list of upper bounds,          /* upper_bound */
        list of coefficients and
            constants for indices      /* a, c, b, d */
    );
    while (there are still ranges) {
        lower_bound = new lower bound for that range;
        upper_bound = new upper bound for that range;
        for (i = lower_bound; i < upper_bound; i += stride)
            sh_dat1[a*i+b] += sh_dat2[c*i+d];
        range = range->next;
    }

Fig. 3. Parallel loop with added code that serves as an interface with the run-time library. The run-time system can then change the amount of work assigned to each parallel task.
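The effect of a redistribute call under the shifting-of-loop-boundaries strategy can be sketched as follows: given the relative powers, each task receives one contiguous slice of the iteration space, sized in proportion to its processor's power (a minimal illustration of ours with hypothetical names, not the actual run-time code):

```c
/* Shifting of loop boundaries: carve the iteration space [lower, upper)
 * into n contiguous slices, slice i sized in proportion to rel_power[i].
 * Task i then iterates from bounds[i] to bounds[i+1]; the last slice
 * absorbs rounding so every iteration is assigned exactly once. */
void shift_loop_bounds(long lower, long upper,
                       const double rel_power[], int n, long bounds[])
{
    long total = upper - lower;
    double acc = 0.0;            /* cumulative share of the power */
    bounds[0] = lower;
    for (int i = 1; i < n; i++) {
        acc += rel_power[i - 1];
        bounds[i] = lower + (long)(acc * total);
    }
    bounds[n] = upper;
}
```

Because the slices stay contiguous, no new partition boundaries (and hence no new page sharing) are introduced beyond those that already existed, which is exactly the property the first strategy is chosen for.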

3 Experimental Evaluation

3.1 Environment

Our experimental environment consists of eight DEC AlphaServer 2100 4/233 computers. Each AlphaServer is equipped with four 21064A processors operating at 233MHz and with 256MB of shared memory, as well as a Memory Channel network interface. Each AlphaServer runs Digital Unix 4.0D with TruCluster v. 1.5 extensions. The programs, the run-time library, and TreadMarks were compiled with gcc version 2.7.2.1 using the -O2 optimization flag.

3.2 Load Balancing Results

We evaluate our system on two applications: a matrix multiplication of three 256x256 shared matrices of longs (which is repeated 100 times) and Jacobi, with a matrix size of 2050x2050 floats. The current implementation only uses the first policy, shifting of loop boundaries, and does not use prefetching. To test the performance of our load balancing library, we introduced an artificial load on one of the processors of each SMP. This consists of a tight loop that writes on an array of 10240 longs. This load takes up 50% of the CPU time.

Our preliminary results appear in Figures 4 and 5. We present execution times on 1, 2, 4, 8, and 16 processors, using up to four SMPs. We added one artificial load for every four processors, except in the case of two processors where we only added one load. The load balancing scheme we use is the shifting of loop boundaries (we do not use multiple loop bounds). The first column shows the execution times for the cases where there was no load in the system. The second column shows the execution times with the artificial load, and finally the last column is the case where the system is loaded but we are using our load balancing library.

Fig. 4. Execution Times for Matrix Multiply (times in seconds on 1, 2, 4, 8, and 16 processors: no load, with load, and with load balance).

Fig. 5. Execution Time for Jacobi (times in seconds on 1, 2, 4, 8, and 16 processors: no load, with load, and with load balance).

The introduction of load slows down both matrix multiply and Jacobi by as much as 100% in the case of two processors (with the overhead at 4, 8 and 16 not being far off). Our load balancing strategy provides a significant improvement in performance compared to execution time with load. In order to determine how good the results of our load balancing algorithm are, we compare the execution times obtained using 8 processors with load and our load balance scheme, with that using 7 processors without any load. This 7-processor run serves as a bound on how well we can perform with load balancing, since that is the best we can hope to achieve (two of our eight processors are loaded, and operate at only 50% of their power, giving us the equivalent of seven processors). The results are presented in Figure 6. For matrix multiply, our load balancing algorithm is only 9% slower than the seven processor load free case. Jacobi is 20% slower, partly due to the fact that while computation can be redistributed, communication per processor remains the same.

Fig. 6. Comparison of the running times of the applications using our load balancing algorithm on 8 loaded processors, compared to the performance on 7 load free processors.

In Figure 7, we present a breakdown of the normalized execution time relative to that on 8 processors with no load, indicating the relative time spent in user code, in the protocol, and in communication and wait time (at synchronization points). When we use our load balancing algorithm, we reduce the time spent waiting at synchronization points relative to the execution time with load and no load balance, because we have a better distribution of work, and therefore improve overall performance.

Fig. 7. Breakup of normalized time for matrix multiplication and Jacobi, into time spent in user time, communication and wait at synchronization points, and protocol time.

Finally, we wanted to measure the overhead imposed by our run-time system. We ran matrix multiplication and Jacobi in a load free environment with and without use of our run-time library. The results are presented in Figure 8. In the worst case we impose less than 6% overhead.

Fig. 8. Running times for matrix multiplication and Jacobi in a load free environment, with and without use of our run-time library.

3.3 Locality-conscious Load Balancing Results

For the evaluation of our locality-conscious load balancing policy we used Shallow, with input size 514x514 matrices of doubles. Shallow operates on the interior elements of the arrays and then updates the boundaries along each dimension in parallel. Compiler parallelized code, or a naive implementation, would have each process update a part of the boundaries. This can result in multiple processes writing the same pages, i.e., false sharing. A smarter approach is to have the processes that own the boundary pages do the updates; this eliminates false sharing. Our integrated compiler/run-time system is able to make the decision at run-time, using the access pattern information provided by the compiler. It identifies which process caches the data and can repartition the work so that it maximizes locality.

We present our results in Figure 9. In these experiments we don't introduce any load imbalance to our system, since we want to evaluate our locality-conscious load balancing policy. We have optimized the manual parallelization to eliminate false sharing as suggested earlier. A naive compiler parallelization that doesn't consider data placement performs very poorly as the number of processors increases, because of the multiple writers on the same page. However, when we combine the compiler parallelization with our locality-conscious load balancing run-time system, the performance is equivalent to the hand optimized code.

4 Related Work

There have been several approaches to the problems of locality management and load balancing. Perhaps the most common approach is the task queue model. In this scheme, there is a central queue of loop iterations. Once a processor has finished its assigned portion, more work is obtained from this queue. There are several variations, including self-scheduling [23], fixed-size chunking [15], guided self-scheduling [22], and adaptive guided self-scheduling [7].

Markatos and LeBlanc in [20] argue that locality management is more important than load balancing in thread assignment. They introduce a policy they call memory-conscious scheduling that assigns threads to processors whose local memory holds most of the data the thread will access. Their results show that the looser the interconnection network, the more important the locality management.

Based on the observation that the locality of the data that a loop accesses is very important, affinity scheduling was introduced in [19]. The loop iterations are divided over all the processors equally in local queues. When a processor is idle, it removes 1/k of the iterations in its local work queue and executes them. k is a parameter of their algorithm, which they define as p in most of their experiments. If a processor's work queue is empty, it finds the most loaded processor and removes 1/p of the iterations in that processor's work queue and executes them, where p is the number of processors.

Building on [19], Yan et al. in [24] suggest adaptive affinity scheduling. Their algorithm is similar to affinity scheduling, but their runtime system can modify k during the execution of the program. When a processor is loaded, k is increased so that other processors with a lighter load can get loop iterations from the loaded processor's local work queue. They present four possible policies for changing k: an exponential adaptive mechanism, a linear adaptive mechanism, a conservative adaptive mechanism, and a greedy adaptive mechanism.

In [4], Cierniak et al. study loop scheduling in heterogeneous environments with respect to programs, processors and the interconnection networks. Their results indicate that taking into account the relative computation power, as well as any heterogeneity in the loop format, while doing the loop distribution improves the overall performance of the application. Similarly, Moon and Saltz [21] also looked at applications with irregular access patterns. To compensate for load imbalance, they introduce periodic re-mapping, or re-mapping at predetermined points of the execution, and dynamic re-mapping, in which they determine if repartitioning is required at every time step.

In the context of dynamically changing environments, Edjlali et al. in [8] and Kaddoura in [13] present a run-time approach for handling such environments. Before each parallel section of the program they check if there is a need to re-map the loop. This is similar to our approach; however, their approach deals with message passing programs.

A discussion on global vs. local and distributed vs. centralized strategies for load balancing is presented in [25]. Based on the information they use to make load balancing decisions, they can be divided into local and global. Distributed and centralized refers to whether the load balancer is one master processor or distributed among the processors. The authors argue that depending on the application and system parameters, each of those schemes can be more suitable than the others.

The system that seems most related to ours is Adapt, presented in [17]. Adapt is implemented in concert with the Distributed Filaments software kernel [9], a DSM system. It monitors communication and page faults, and dynamically modifies loop boundaries so that the processes access data that are local if possible. Adapt is able to extract the access patterns by inspecting the patterns of the page faults. It can only recognize two patterns, nearest-neighbor and broadcast, and this limits its flexibility. In our system we use the compiler to extract the access patterns and provide them to the run-time system, making our approach more general and flexible.

Finally, there are systems like Condor [16] that support transparent migration of processes from one workstation to another. However, such systems don't support parallel programs efficiently.
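As a rough illustration of the affinity-scheduling family discussed above, the following sketch (ours; sequential and simplified, with hypothetical names, not the code of [19]) captures the core stealing rule: an idle processor takes 1/k of its own queue, or 1/p of the most loaded processor's queue when its own is empty:

```c
/* Sequential model of the stealing rule in affinity scheduling:
 * queue[i] holds the number of loop iterations left in processor i's
 * local queue.  Processor `self` takes 1/k of its own queue (rounded
 * up); if its queue is empty it steals 1/p of the most loaded queue.
 * Returns the number of iterations taken (0 when all queues are empty). */
long affinity_take(long queue[], int p, int self, int k)
{
    if (queue[self] > 0) {
        long take = (queue[self] + k - 1) / k;   /* ceil(len / k) */
        queue[self] -= take;
        return take;
    }
    int victim = 0;                              /* most loaded processor */
    for (int i = 1; i < p; i++)
        if (queue[i] > queue[victim]) victim = i;
    if (queue[victim] == 0)
        return 0;                                /* nothing left anywhere */
    long take = (queue[victim] + p - 1) / p;     /* ceil(len / p) */
    queue[victim] -= take;
    return take;
}
```

Chunks shrink geometrically as a queue drains, so most iterations run on the processor that owns their data, while idle processors still find work.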

Fig. 9. Running times of the three different implementations of Shallow, in seconds, on 1, 2, 4, 8, and 16 processors. The manual parallelization takes into account data placement in order to avoid page sharing. The compiler parallelization doesn't consider data placement. The adaptive parallelization uses the compiler parallelization with our run-time library, which adjusts the workload taking data placement into account dynamically.

Our system deals with software distributed shared memory programs, in contrast to closely coupled shared memory or message passing. Our load balancing method targets both irregularities of the loops as well as possible heterogeneous processors and load caused by competing programs. Furthermore, our system addresses locality management by trying to minimize communication and page sharing.

5 Conclusions

In this paper, we address the problem of load balancing in SDSM systems by coupling compile-time and run-time information. SDSM has unique characteristics that are attractive: it offers the ease of programming of a shared memory model in a widely available workstation-based message passing environment. However, multiple users and loosely connected processors challenge the performance of SDSM programs on such systems due to load imbalances and high communication latencies.

Our integrated system uses access information available at compile-time to dynamically adjust load at run-time based on the available relative processing power and communication speeds. The same access pattern information is also used to prefetch data. Preliminary results are encouraging. Performance tests on two applications and a fixed load indicate that the performance with load balance is within 9 and 20% of the ideal performance. Additionally, our system is able to partition the work so that processes access only their local data, minimizing false sharing. Our system identified regions where false sharing existed and changed the loop boundaries to avoid it. The performance on our third application, when the number of processors was high, was equivalent to the best possible workload partitioning.

Further work to collect results on a larger number of applications is necessary. In addition, for a more thorough evaluation, we need to determine the sensitivity of our strategy to dynamic changes in load, as well as to changes in the hysteresis factor used when determining when to redistribute work. The tradeoff between locality management and load must also be further investigated.

References

1. S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and C. W. Tseng. The SUIF compiler for scalable parallel machines. In Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, February 1995.
2. C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18-28, February 1996.
3. J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Techniques for reducing consistency-related information in distributed shared memory systems. ACM Transactions on Computer Systems, 13(3):205-243, August 1995.
4. Michal Cierniak, Wei Li, and Mohammed Javeed Zaki. Loop scheduling for heterogeneity. In Fourth International Symposium on High Performance Distributed Computing, August 1995.
5. A. L. Cox, S. Dwarkadas, H. Lu, and W. Zwaenepoel. Evaluating the performance of software distributed shared memory as a target for parallelizing compilers. In Proceedings of the 11th International Parallel Processing Symposium, pages 474-482, April 1997.
6. S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. An integrated compile-time/run-time software distributed shared memory system. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, October 1996.
7. D. L. Eager and J. Zahorjan. Adaptive guided self-scheduling. Technical Report 92-01-01, Department of Computer Science, University of Washington, January 1992.
8. Guy Edjlali, Gagan Agrawal, Alan Sussman, and Joel Saltz. Data parallel programming in an adaptive environment. In International Parallel Processing Symposium, April 1995.
9. V. W. Freeh, D. K. Lowenthal, and G. R. Andrews. Distributed Filaments: Efficient fine-grain parallelism on a cluster of workstations. In Proceedings of the First USENIX Symposium on Operating System Design and Implementation, pages 201-213, November 1994.
10. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26, May 1990.
11. The SUIF Group. An overview of the SUIF compiler system.
12. P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350-360, July 1991.
13. Maher Kaddoura. Load balancing for regular data-parallel applications on workstation network. In Communication and Architecture Support for Network-Based Parallel Computing, pages 173-183, February 1997.
14. P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13-21, May 1992.
15. C. Kruskal and A. Weiss. Allocating independent subtasks on parallel processors. In Transactions on Computer Systems, October 1985.
16. M. Litzkow and M. Solomon. Supporting checkpointing and process migration outside the Unix kernel. In Usenix Winter Conference, 1992.
17. David K. Lowenthal and Gregory R. Andrews. An adaptive approach to data placement.
18. H. Lu, A. L. Cox, S. Dwarkadas, R. Rajamony, and W. Zwaenepoel. Software distributed shared memory support for irregular applications. In Proceedings of the 6th Symposium on the Principles and Practice of Parallel Programming, pages 48-56, June 1996.
19. Evangelos P. Markatos and Thomas J. LeBlanc. Using processor affinity in loop scheduling on shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 5(4):379-400, April 1994.
20. Evangelos P. Markatos and Thomas J. LeBlanc. Load balancing versus locality management in shared-memory multiprocessors. In Proceedings of the 1992 International Conference on Parallel Processing, pages I:258-267, August 1992.
21. Bongki Moon and Joel Saltz. Adaptive runtime support for direct simulation Monte Carlo methods on distributed memory architectures. In Scalable High Performance Computing Conference, May 1994.
22. C. D. Polychronopoulos and D. J. Kuck. Guided self-scheduling: a practical scheduling scheme for parallel supercomputers. In Transactions on Computers, September 1992.
23. P. Tang and P. C. Yew. Processor self-scheduling: A practical scheduling scheme for parallel computers. In International Conference on Parallel Processing, August 1986.
24. Yong Yan, Canming Jin, and Xiaodong Zhang. Adaptively scheduling parallel loops in distributed shared-memory systems. In Transactions on Parallel and Distributed Systems, volume 8, January 1997.
25. Mohammed Javeed Zaki, Wei Li, and Srinivasan Parthasarathy. Customized dynamic load balancing for a network of workstations. Technical Report 602, Department of Computer Science, University of Rochester, December 1995.