A Hybrid Multithreading/Message-Passing Approach for Solving Irregular Problems on SMP Clusters

Jan-Jan Wu
Institute of Information Science
Academia Sinica
Taipei, Taiwan, R.O.C.

Chia-Lien Chiang, Nai-Wei Lin
Dept. of Computer Science
National Chung Cheng Univ.
Chia-Yi, Taiwan, R.O.C.

Abstract
This paper reports the design of a runtime library for solving irregularly structured problems on clusters of symmetric multiprocessors (SMP clusters). Our algorithms exploit a hybrid methodology which maps directly to the underlying hierarchical memory system in SMP clusters by combining two styles of programming: threads (shared-memory programming) within an SMP node and message passing between SMP nodes. This hybrid approach has been used in the implementation of a runtime library supporting the inspector/executor model of execution for solving irregular problems. Experimental results on a cluster of Sun UltraSPARC-II workstations are reported.

Keywords: data-parallel programming, workstation clusters, symmetric multiprocessor, multithreaded message-passing, collective communication

1 Introduction

High-performance parallel computers incorporate proprietary interconnection networks allowing low-latency, high-bandwidth interprocessor communication. A large number of scientific computing applications have benefited from such parallel computers. Nowadays, the level of hardware performance found in high-performance workstations, together with advances in distributed systems architecture, makes clusters of workstations (so-called NOW systems) one of the highest-performance and most cost-effective approaches to computing. High-performance workstation clusters can realistically be used in a variety of applications to replace vector supercomputers and massively parallel computers.

In recent years, we have seen a number of such workstations (or PCs) introduced into the market (e.g. the SGI Power Challenge, Sun Ultra Enterprise series, DEC Alpha servers, IBM RS6000 with multiple CPUs, and Intel Pentium with two or four CPUs). The largest commercially available SMPs today (e.g. the SGI Power Challenge with 32 processors) can deliver peak rates of over one billion floating-point operations per second; this is projected to increase by over an order of magnitude within the next few years. Such computing power will make this class of machines an attractive choice for solving large problems. It is predicted that SMPs will be the basis of future workstation clusters [9].

Processors within an SMP share a common memory, while each SMP has its own memory which is not directly accessible from processors residing on other SMPs. The performance of a distributed-memory computer depends to a large extent on how fast data movement can be performed. Despite significant improvement in hardware technology, the improvements in communication cost still lag behind those in the computation power of each processing node. As a result, optimization for communication remains an important issue for SMP clusters.
With a number of implementations of MPI and PVM becoming available, parallel programming on workstation clusters has become an easier task. Although MPI and PVM can also be used to manage communication in SMP clusters, they are not necessarily the most efficient methodologies. Our approach exploits a hybrid methodology which maps directly to the underlying hierarchical memory system in SMP clusters by combining two styles of programming: threads (shared-memory programming) within an SMP node and message passing between SMP nodes.

Our previous experience with the hybrid methodology for regular data-parallel computation (i.e. where the reference patterns of each data element are uniform and predictable) is reported in [3]. In this paper, we extend our previous work to support irregular computations, where reference patterns may depend on input data or runtime-computed values. This paper gives an overview of the hybrid methodology for solving irregular problems and reports the algorithmic design of a runtime library based on this hybrid methodology. The library collects a set of computation and communication primitives that form the runtime kernel for supporting the inspector/executor model of execution on distributed-memory multiprocessors.

Our design goals center on high portability, based on existing standards for lightweight threads and message-passing systems, and high efficiency, based on the utility of threads to improve the efficiency of data movement. The library is intended to support direct programming or to serve as a runtime support library for high-level languages (e.g. HPF2) targeting SMP clusters. Our preliminary experimental results on a cluster of Sun UltraSPARC-II dual-CPU workstations demonstrate the performance improvement of the hybrid methodology over a message-passing implementation.

2 Overview

Preprocessing for Irregular Computation

The most commonly used model for explicit data-parallel programming on distributed-memory multiprocessors is the single-program multiple-data (SPMD) model, where data are distributed among the available processors and the same program is executed concurrently on the processors, with each processor responsible only for the computation on a portion of the data.

For cases where data are uniformly distributed, the location of a data element can be determined statically. In many situations, however, dynamic and irregular data distribution is required to obtain load balancing and data locality. It is not possible to express irregular data distribution in a simple way; the mapping of data is usually described by a distributed mapping table defined by a partitioning algorithm.

During the preprocessing or inspector phase, we carry out a set of interprocessor communications that allow us to anticipate exactly which send and receive communication calls each processor must execute [1]. Once preprocessing is completed, we are in a position to carry out the problem's communication and computation; this phase is called the executor [1]. The exchange of distributed array elements between processors follows the plans established by the inspector. Off-processor data are appended to the end of the local array. Once a communication phase is over, each processor carries out its computation entirely on the locally stored data without further communication. Writes to off-processor data elements are also appended to the end of the local array. When the computational phase is finished, array elements to be stored off-processor are obtained from the end of the local array and sent to the appropriate off-processor locations.
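As a concrete illustration of the index-translation step performed by the inspector, the following C sketch shows one way a mapping table could resolve global indices into (process, local index) pairs. It is our own minimal example, not PARTI/CHAOS code: the table layout and the names mapping_table, map_proc, and map_loc are illustrative assumptions, and the table is assumed to be replicated on every process for simplicity.

    #include <stddef.h>

    /* Illustrative irregular-distribution table: for each global index g,
     * map_proc[g] gives the owning process and map_loc[g] the local index
     * on that process.  A real runtime system would distribute this table
     * itself; here it is assumed replicated on every process. */
    typedef struct {
        size_t n_global;   /* number of global elements          */
        int   *map_proc;   /* owner process of each global index */
        int   *map_loc;    /* local index within the owner       */
    } mapping_table;

    /* Inspector-style index translation: convert the indirect references
     * in 'cols' into (proc, loc) pairs that a schedule builder can use. */
    static void index_translate(const mapping_table *t,
                                const int *cols, size_t n,
                                int *proc, int *loc)
    {
        for (size_t i = 0; i < n; i++) {
            int g   = cols[i];        /* global index referenced by the loop */
            proc[i] = t->map_proc[g]; /* which process owns it               */
            loc[i]  = t->map_loc[g];  /* where it lives on that process      */
        }
    }

A communication schedule can then be built by grouping the resulting (proc, loc) pairs by owning process, which is essentially the role of build-schedule in Figure 1(b).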
We use the PARTI library (single thread on each process) as the basis for our multithreaded implementation. The PARTI/CHAOS runtime library developed by [1, 11] aims to support the inspector/executor style of programming using arrays as the main data structures. The PARTI runtime primitives include communication routines designed to support irregular data distribution and irregular patterns of distributed array access (Figure 3). Figure 1(a) shows the data-parallel code segment for a sparse matrix-vector multiply. Figure 1(b) depicts the SPMD implementation with the PARTI library.

In the inspector phase, index-translation translates the indirect data reference subscripts (given by array cols) to pairs of process index and local index (proc, loc); build-schedule determines which process should send/receive which data elements to/from which processes (called the communication schedule). In the executor phase, off-processor data are gathered from remote processors according to the planned communication schedule and stored at the end of the local array x. The local index array loc is updated accordingly to reflect the actual location of the remotely fetched data in the local array x. Then the two nested loops carry out the matrix-vector multiply entirely on the local array x.

(a) data-parallel program

    do i = 1, nrows
      sum = 0
      do j = 1, ncols(i)
        sum = sum + f(j,i) * x(cols(j,i))
      enddo
      y(i) = y(i) + sum
    enddo

(b) SPMD program

    !! the inspector phase
    call index-translation(proc, loc, cols)
    sched = build-schedule(proc, loc)
    !! the executor phase
    call gather(x, x, sched, loc)
    count = 0
    do i = 1, locnrows
      sum = 0
      do j = 1, ncols(i)
        sum = sum + f(j,i) * x(loc(count))
        count += 1
      enddo
      y(i) = y(i) + sum
    enddo

Figure 1: Inspector/Executor implementation of sparse matrix-vector multiply

Hybrid Model of Execution

Our model of loosely synchronous execution for SMP clusters is depicted in Figure 2. When assigned a task, each SMP node creates an initial thread for housekeeping node execution. The initial thread then launches a number of work threads, each assigned to a processor within that node, that execute in parallel and coordinate with each other via the shared memory (computation phase). When remote accesses to other SMP nodes are required, the work threads may synchronize and enter the communication phase. When the communication is completed, all the threads proceed to the next computation phase, and so on.

Figure 2: Loosely synchronous execution model for an SMP cluster (each of nodes 0 through N-1 runs a set of threads that alternate between a computation phase and a multithreaded communication phase).
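To make the control flow of this execution model concrete, the following C sketch (a minimal illustration of ours, not code from the library) shows an initial thread that launches work threads which alternate computation and communication phases. compute_phase and exchange_with_other_nodes are hypothetical placeholders for application-specific work, and NUM_WORK_THREADS and NUM_STEPS are arbitrary.

    #include <mpi.h>
    #include <pthread.h>

    #define NUM_WORK_THREADS 2   /* e.g. one work thread per CPU of a dual-CPU node */
    #define NUM_STEPS        4   /* number of computation/communication iterations  */

    static pthread_barrier_t phase_barrier;

    /* Hypothetical placeholders for the application-specific pieces. */
    static void compute_phase(long thread_id)   { (void)thread_id; /* shared-memory work */ }
    static void exchange_with_other_nodes(void) { /* MPI sends/receives, e.g. a gather   */ }

    /* Each work thread alternates computation phases (coordinated through
     * shared memory) with communication phases (performed once per node). */
    static void *work_thread(void *arg)
    {
        long id = (long)arg;
        for (int step = 0; step < NUM_STEPS; step++) {
            compute_phase(id);
            pthread_barrier_wait(&phase_barrier);   /* end of computation phase          */
            if (id == 0)
                exchange_with_other_nodes();        /* one thread talks to other nodes   */
            pthread_barrier_wait(&phase_barrier);   /* wait for communication to finish  */
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);                     /* initial thread: node housekeeping */
        pthread_barrier_init(&phase_barrier, NULL, NUM_WORK_THREADS);

        pthread_t tid[NUM_WORK_THREADS];
        for (long i = 0; i < NUM_WORK_THREADS; i++)
            pthread_create(&tid[i], NULL, work_thread, (void *)i);
        for (int i = 0; i < NUM_WORK_THREADS; i++)
            pthread_join(tid[i], NULL);

        pthread_barrier_destroy(&phase_barrier);
        MPI_Finalize();
        return 0;
    }

In this sketch only one thread per node issues MPI calls during the communication phase; the library described below goes further by letting several threads cooperate on the communication itself.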
Multithreaded Runtime Library

Based on the loosely synchronous execution model, we design a multithreaded runtime library. The same set of threads that participate in parallel computation within an SMP node during the computation phase also work together to optimize the communication phase. We have designed several optimizations for collective communications:

- Use multiple threads to speed up local data movement,
- Hide message latency by overlapping local memory copy and sending/receiving messages, by using different threads for these two tasks,
- Pipeline multiple messages to the network interface by using multiple threads for sending/receiving multiple messages.

Figure 3 shows the organization of the multithreaded runtime library. POSIX threads primitives are used to parallelize local-memory computation and message buffering within an SMP node. MPI primitives are used for message passing between SMP nodes.

Figure 3: Organization of the multithreaded runtime library. The library provides inspector primitives (build-mapping-table, index-translation, build-communication-schedule), executor primitives (gather, scatter, scatter-accumulate), and global operations (Broadcast, Reduction, Barrier-sync), built on top of POSIX threads and MPI primitives (Send, Receive, Send-and-receive, Broadcast, Reduce, Barrier-sync).

3 Runtime Primitives

Currently, the library supports a set of global operations (barrier synchronization, broadcast, reduction), inspector operations (index translation, build communication schedule), and collective communication (gather, scatter, scatter-accumulate), which together form the runtime kernel for multithreaded execution of the inspector/executor model. This section describes the algorithms for some of the primitives.

3.0.1 Broadcast and Reduction

A reduction operation on an SMP cluster is carried out in two steps: shared-memory reduction among multiple threads within an SMP node, and then distributed-memory reduction between SMP nodes.

Algorithm reduce(op, source, target, ndata)
op is a binary associative operator,
source is the data array on which the reduction is to be performed,
target stores the reduction result.
{
  Assuming k threads will be used for the reduction on an SMP node, and the
  data elements are evenly partitioned to these k threads.

  1. For each thread do
       /* compute the reduction on the section of the data array that this
          particular thread is responsible for, and store the result to a
          local buffer for this thread. */
       local-memory-reduce(op,
           starting_addr_in_source_for_my_thread,
           local-buffer-for-my-thread, ndata/k);

  2. Combine the partial results from these threads.
     If k is small, the partial results can be combined using a simple
     locking mechanism.  Two shared variables are used:
         temp: to hold the combined partial result from the k threads
         number-of-threads: initialized with k
     For each thread do
         mutual_exclusion_lock
         accumulate local-buffer-for-my-thread to temp
         decrease number-of-threads by 1
         mutual_exclusion_unlock
         if number-of-threads = 0 (i.e. the last thread)
             then call MPI_Reduce(temp, target, ...) to combine final results
                  from all the SMP nodes, and wake up all the other threads
         else wait for other threads to finish
}
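The two-step reduction above can be sketched in C with POSIX threads and MPI as follows. This is a simplified illustration of the algorithm for a sum of doubles, not the library's implementation: K, reduce_arg, and the other names are ours, ndata is assumed to be a multiple of K, and thread creation and error handling are omitted.

    #include <mpi.h>
    #include <pthread.h>

    #define K 2                      /* number of reduction threads per SMP node */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
    static double temp;              /* combined partial result on this node   */
    static int    remaining = K;     /* threads that have not yet contributed  */

    struct reduce_arg { const double *source; long ndata; int tid; double *target; };

    /* Step 1: each thread reduces its own section of the array.
     * Step 2: partial results are combined under a lock; the last thread
     *         calls MPI_Reduce to combine results across SMP nodes. */
    static void *reduce_thread(void *p)
    {
        struct reduce_arg *a = p;
        long chunk = a->ndata / K;                 /* assumes ndata is a multiple of K */
        const double *start = a->source + (long)a->tid * chunk;

        double partial = 0.0;                      /* shared-memory reduction */
        for (long i = 0; i < chunk; i++)
            partial += start[i];

        pthread_mutex_lock(&lock);
        temp += partial;
        if (--remaining == 0) {                    /* last thread on this node */
            MPI_Reduce(&temp, a->target, 1, MPI_DOUBLE,
                       MPI_SUM, 0, MPI_COMM_WORLD);
            pthread_cond_broadcast(&done);         /* wake up the other threads */
        } else {
            while (remaining != 0)
                pthread_cond_wait(&done, &lock);   /* wait for the final result */
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }

A driver would launch K threads per node with pthread_create after MPI_Init, in the same way as the execution-model sketch above; only the last thread to finish its partial sum issues the MPI_Reduce, matching step 2 of the algorithm.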
3.0.2 Gather/Scatter Communication

Data movement, such as gather and scatter, in irregular computation can be formulated as all-to-some personalized communication, in which each SMP node transmits a distinct block of data to a subset of the SMP nodes. The following algorithm performs all-to-some gather communication with a single thread per SMP node. The data structure "sched" describes the communication schedule (including the number of data elements and the local indices of the data elements to send/receive) on this node. A node sends sched.n_data_to_send[j] data elements of the source array, indexed by sched.n_send_list[j], to node j, and stores the received data into the proper locations in the target array.

Algorithm gather(source, dest, sched)
void *source, void *dest;
SCHEDULE *sched;
{
    /* send data */
    for (i = 0; i < num_nodes; i++) {
        if (sched.n_data_to_send[i] > 0) {
            copy the requested data elements from the source array to the
            i-th section of the buffer, and send the buffer data to node i
        }
    }
    /* receive data */
    for (i = 0; i < sched.n_msgs_to_recv; i++) {
        poll for incoming message,
        copy the received data to proper locations in the target array
    }
}

Multithreaded Gather

Suppose there are k CPUs on each SMP node, and k is smaller than the number of SMP nodes (num_nodes); then the gather communication can be further parallelized by using k threads: each thread is responsible for the execution of num_nodes/k send-loop iterations. That is, each thread handles the communication with num_nodes/k SMP nodes. Multithreading also parallelizes the local memory copy to message buffers. If source and destination are different arrays, then each thread accesses different memory locations and thus no memory locking is required. Depending on the characteristics of the target machine, the number of threads involved in the gather communication may be tuned to achieve optimal performance.
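To illustrate how the send loop of the gather can be divided among threads, the following C sketch assigns the destination nodes to k threads in a strided fashion so that each thread packs and sends its own subset of messages. It is a sketch under our own assumptions: the SCHEDULE layout and field names are illustrative, each thread packs into a private heap buffer, and MPI is assumed to be safe to call concurrently from several threads (otherwise the sends would have to be serialized or funneled through a single thread).

    #include <mpi.h>
    #include <pthread.h>
    #include <stdlib.h>

    /* Illustrative schedule layout (field names are ours). */
    typedef struct {
        int  num_nodes;
        int *n_data_to_send;   /* per destination node: element count        */
        int **send_list;       /* per destination node: local source indices */
    } SCHEDULE;

    struct send_arg {
        const double   *source;
        const SCHEDULE *sched;
        int tid, k;            /* thread id and number of send threads */
    };

    /* Each thread handles roughly num_nodes/k destinations: it copies the
     * requested elements into a private buffer (no locking needed) and
     * sends the buffer to the destination node. */
    static void *send_worker(void *p)
    {
        struct send_arg *a = p;
        const SCHEDULE *s = a->sched;
        for (int i = a->tid; i < s->num_nodes; i += a->k) {
            int n = s->n_data_to_send[i];
            if (n <= 0)
                continue;
            double *buf = malloc((size_t)n * sizeof *buf);
            for (int e = 0; e < n; e++)
                buf[e] = a->source[s->send_list[i][e]];   /* local memory copy */
            MPI_Send(buf, n, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
            free(buf);
        }
        return NULL;
    }

The receive loop of the single-threaded algorithm can be split among the same threads in the same manner, and, as noted above, when the source and destination arrays differ no locking is needed for the local copies.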
4 Experimental Results

The experiments were conducted on a network of four Sun UltraSPARC-II workstations located in the Institute of Information Science, Academia Sinica. The workstations are connected by a Fast Ethernet network capable of 100 Mbps per node. Each workstation is an SMP with two 296 MHz CPUs, running the multithreaded OS SunOS 5.5.1. The communication library in the framework is implemented on top of the MPICH implementation of MPI.

We compared two implementation approaches, one using the hybrid thread+MPI methodology and the other using MPI only. For the hybrid version, we created four SPMD processes with the command "mpirun -np 4", which maps each of the four processes to a distinct workstation. Each process launches two threads, which are then automatically assigned to different CPUs by the operating system. Threads within a process communicate through shared variables and memory locks, while processes communicate with each other by passing messages. For the MPI-only version, eight processes, each running a single thread, were created using the command "mpirun -np 8", which assigns each process to a distinct CPU. All processes communicate with each other by passing messages.

Figure 4(a) shows the effectiveness of using multiple threads in global reduction. Using two threads per process improves performance by about 70%. The result also shows that the overhead of shared-memory synchronization in the hybrid approach for summing up partial reduction results from multiple CPUs within an SMP node is negligible compared with the MPI-only implementation.

Figure 4(b) reports the execution time of all-to-some gather communication. Using two threads per process achieves a speedup factor of 1.5, due to the effect of multithreading in overlapping local memory copying with interprocess communication. The advantage of the hybrid approach over the MPI-only implementation is even more significant. This is because with the hybrid approach each process needs to send/receive fewer messages than in the MPI-only implementation. We expect more performance improvement on a larger number of SMP nodes, where the advantage of reducing the number of messages will become more evident.

Figure 4(c) shows the execution time of the sparse matrix-vector multiply, obtained by partitioning the sparse matrix into blocks of rows. Using two threads per process shortens the communication time for both the inspector phase and the executor phase. The computation time is competitive with that of the MPI-only implementation, resulting in an overall speedup factor of 1.5.

Figure 4: Execution time of some collective communications on a cluster of four dual-CPU workstations, comparing three versions: hybrid thread+MPI with a single thread per process, hybrid thread+MPI with two threads per process, and MPI-only (8 processes, no multithreading). (a) global reduction (execution time vs. data size); (b) all-to-some communication; (c) sparse matrix-vector multiply (execution time vs. matrix size 2^n x 2^n).

5 Related Work

A number of thread packages support interprocessor communication mechanisms. Nexus [5] provides a thread abstraction that is capable of interprocessor communication in the form of active messages (asynchronous remote procedure calls). Nexus has not supported either point-to-point communication or collective communication. Panda [2] has similar goals and approaches to Nexus; it also supports distributed threads. Chant [6] is a thread-based communication library which supports both point-to-point primitives and remote procedure calls. ROPES [7] extends Chant by introducing the notion of "context", a scoping mechanism for threads, and by providing group communication between threads in a specific context; some simple kinds of collective operations (e.g. barrier synchronization, broadcast) are supported. The MPC++ multithread template library on MPI [8] is a set of templates and classes to support multithreaded parallel programming. It includes support for multithreaded local/remote function invocation, reduction, and barrier synchronization.
On the part of the Message Passing Interface, Skjellum et al. [4] address the issue of making MPI thread-safe with respect to internal worker threads designed to improve the efficiency of MPI. Later, the same group proposed many possible extensions to the MPI standard which allow user-level threads to be mixed with messages [13, 12].

Just recently, there have been some new results from industry. For example, a new version of HP MPI (released in late June this year) [10] incorporates support for multithreaded MPI processes, in the form of a thread-compliant library. This library allows users to freely mix POSIX threads and messages. However, it does not perform optimization for collective communication as we do.

6 Conclusion

In this paper, we present a hybrid threads and message-passing methodology for solving irregular problems on SMP clusters. We have also described the implementation of a set of collective communications based on this methodology.

Our preliminary experimental results on a network of Sun UltraSPARC-II workstations show that the hybrid methodology improves the performance of global reduction by about 70% and the performance of all-to-some gather by about 80%, compared with MPI-only, non-threaded implementations.

As threads are commonly used in support of parallelism and asynchronous control in applications and language implementations, we believe that the hybrid threads/message-passing methodology has the advantages of both flexibility and efficiency.

Support for this work is provided by the National Science Council of Taiwan under grant 88-2213-E-001-004.

References

[1] H. Berryman, J. Saltz, and J. Scroggs. Execution time support for adaptive scientific algorithms on distributed memory architectures. Concurrency: Practice and Experience, 3(3):159-178, 1991.

[2] R. Bhoedjang, T. Ruhl, R. Hofman, K. Langendoen, H. Bal, and F. Kaashoek. Panda: A portable platform to support parallel programming languages. In Symposium on Experiences with Distributed and Multiprocessor Systems, San Diego, CA, 1993.

[3] C.-L. Chiang, J.-J. Wu, and N.-W. Lin. Toward supporting data-parallel programming for clusters of symmetric multiprocessors. In International Conference on Parallel and Distributed Systems, Tainan, Taiwan, 1998.

[4] A. K. Chowdappa, A. Skjellum, and N. E. Doss. Thread-safe message passing with P4 and MPI. Technical Report TR-CS-941025, Computer Science Department and NSF Engineering Research Center, Mississippi State University, April 1994.

[5] I. Foster, C. Kesselman, R. Olson, and S. Tuecke. Nexus: An interoperability layer for parallel and distributed computer systems. Technical report, Argonne National Laboratory, Dec. 1993.

[6] M. Haines, D. Cronk, and P. Mehrotra. On the design of Chant: A talking threads package. In Proceedings of Supercomputing '94, 1994.

[7] M. Haines, P. Mehrotra, and D. Cronk. ROPES: Support for collective operations among distributed threads. Technical report, ICASE, NASA Langley Research Center, May 1995.

[8] Y. Ishikawa. Multithread template library in C++. http://www.rwcp.or.jp/lab/pdslab/dist.

[9] W. E. Johnston. The case for using commercial symmetric multiprocessors as supercomputers. Technical report, Information and Computing Sciences Division, Lawrence Berkeley National Laboratory, 1997.

[10] Hewlett-Packard. HP Message Passing Interface. http://www.hp.com/go/mpi.

[11] R. Ponnusamy. A manual for the CHAOS runtime library. Technical Report TR 93-105, Computer Science Dept., Univ. of Maryland, Dec. 1993.

[12] A. Skjellum, N. E. Doss, K. Viswanathan, A. Chowdappa, and P. V. Bangalore. Extending the Message Passing Interface (MPI). Technical report, Computer Science Department and NSF Engineering Research Center, Mississippi State University, 1997.

[13] A. Skjellum, B. Protopopov, and S. Hebert. A thread taxonomy for MPI. Technical report, Computer Science Department and NSF Engineering Research Center, Mississippi State University, 1996.