A Hybrid Multithreading/Message-Passing Approach for Solving Irregular Problems on SMP Clusters

Jan-Jan Wu
Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.

Chia-Lien Chiang, Nai-Wei Lin
Dept. of Computer Science, National Chung Cheng Univ., Chia-Yi, Taiwan, R.O.C.

Abstract

This paper reports the design of a runtime library for solving irregularly structured problems on clusters of symmetric multiprocessors (SMP clusters). Our design exploits a hybrid methodology which maps directly to the underlying hierarchical memory system in SMP clusters, by combining two styles of programming: threads (shared-memory programming) within an SMP node and message passing between SMP nodes. This hybrid approach has been used in the implementation of a runtime library supporting the inspector/executor model of execution for solving irregular problems. Experimental results on a cluster of Sun UltraSPARC-II workstations are reported.

Keywords: data-parallel programming, workstation clusters, symmetric multiprocessor, multithreaded message-passing, collective communication

1 Introduction

High performance parallel computers incorporate proprietary interconnection networks allowing low-latency, high-bandwidth interprocessor communication. A large number of scientific computing applications have benefited from such parallel computers. Nowadays, the level of hardware performance found in high-performance workstations, together with advances in distributed systems architecture, makes clusters of workstations (so-called NOW systems) one of the highest performance and most cost-effective approaches to computing. High-performance workstation clusters can realistically be used for a variety of applications to replace vector supercomputers and massively parallel computers.

In recent years, we have seen a number of such workstations (or PCs) introduced into the market (e.g. the SGI Power Challenge, Sun Ultra Enterprise series, DEC Alpha servers, IBM RS6000 with multiple CPUs, and Intel Pentium with two or four CPUs). The largest commercially available SMPs today (e.g. SGI Power Challenge with 32 processors) can deliver peak rates of over one billion floating point operations per second; this is projected to increase by over an order of magnitude within the next few years. Such computing power will make this class of machines an attractive choice for solving large problems. It is predicted that SMPs will be the basis of future workstation clusters [9].

Processors within an SMP share a common memory, while each SMP has its own memory which is not directly accessible from processors residing on other SMPs. The performance of a distributed-memory computer depends to a large extent on how fast data movement can be performed. Despite significant improvements in hardware technology, the improvements in communication cost still lag behind those in the computation power of each processing node. As a result, optimization for communication remains an important issue for SMP clusters.

With a number of implementations of MPI and PVM becoming available, parallel programming on workstation clusters has become an easier task. Although MPI and PVM can also be used to manage communication in SMP clusters, they are not necessarily the most efficient methodologies. Our approach exploits a hybrid methodology which maps directly to the underlying hierarchical memory system in SMP clusters, by combining two styles of programming: threads (shared-memory programming) within an SMP node and message passing between SMP nodes.

Our previous experience with the hybrid methodology for regular data-parallel computation (i.e. where the reference patterns of each data element are uniform and predictable) is reported in [3]. In this paper, we extend our previous work to support irregular computations, where reference patterns may depend on input data or runtime-computed values. This paper gives an overview of the hybrid methodology for solving irregular problems and reports the algorithmic design of a runtime library based on this hybrid methodology. The library collects a set of computation and communication primitives that form the runtime kernel for supporting the inspector/executor model of execution on distributed-memory multiprocessors.

Our design goals center on high portability, based on existing standards for lightweight threads and message-passing systems, and high efficiency, based on the use of threads to improve the efficiency of data movement. The library is intended to be used to support direct programming or as a runtime support library for high-level languages (e.g. HPF2) targeting SMP clusters. Our preliminary experimental results on a cluster of Sun UltraSPARC-II dual-CPU workstations demonstrate the performance improvement of the hybrid methodology against a message-passing implementation.

2 Overview

Preprocessing for Irregular Computation

The most commonly used model for explicit data-parallel programming on distributed-memory multiprocessors is the single-program multiple-data model, where data are distributed among the available processors, and the same program is executed concurrently on the processors, with each processor only responsible for the computation of a portion of the data.

For cases where data are uniformly distributed, the location of a data element can be determined statically. In many situations dynamic and irregular data distribution is required to obtain load balancing and data locality. It is not possible to express irregular data distribution in a simple way. The mapping of data is usually described by a distributed mapping table defined by a partitioning algorithm.

During the preprocessing or inspector phase, we carry out a set of interprocessor communications that allows us to anticipate exactly which send and receive communication calls each processor must execute [1]. Once preprocessing is completed, we are in a position to carry out the problem's communication and computation; this phase is called the executor [1]. The exchange of distributed array elements between processors follows the plans established by the inspector. Off-processor data are appended to the end of the local array. Once a communication phase is over, each processor carries out its computation entirely on the locally stored data without further communication. Writes to off-processor data elements are also appended to the end of the local array. When the computational phase is finished, array elements to be stored off-processor are obtained from the end of the local array and sent to the appropriate off-processor locations.

We use the Parti library (single thread on each process) as the basis for our multithreaded implementation. The Parti/CHAOS runtime library developed by [1, 11] aims to support the inspector/executor style of programming using arrays as the main data structures.
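As a concrete, simplified illustration of how such a mapping table is consulted during the inspector phase, the following C sketch translates global indices into (owner process, local index) pairs. It assumes a fully replicated translation table, whereas Parti/CHAOS also supports distributed translation tables, and all identifiers here are hypothetical rather than the library's actual interface.

#include <stddef.h>

/* Hypothetical replicated translation table: owner[g] is the process that
 * owns global element g, offset[g] is g's local index on that process. */
typedef struct {
    const int *owner;
    const int *offset;
    size_t     nglobal;
} mapping_table_t;

/* Inspector-phase translation of indirect references: for each global
 * index in cols[], record the (proc, loc) pair used later to build the
 * communication schedule. */
static void index_translate(const mapping_table_t *map,
                            const int *cols, size_t ncols,
                            int *proc, int *loc)
{
    for (size_t i = 0; i < ncols; i++) {
        int g   = cols[i];
        proc[i] = map->owner[g];
        loc[i]  = map->offset[g];
    }
}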

The Parti runtime primitives include communication routines designed to support irregular data distribution and irregular patterns of distributed array access (Figure 3). Figure 1(a) shows the data-parallel code segment for a sparse matrix-vector multiply. Figure 1(b) depicts the SPMD implementation with the Parti library.

In the inspector phase, index-translation translates the indirect data reference subscripts (given by array col) to pairs of process index and local index (proc, loc). build-schedule determines which process should send/receive which data elements to/from which processes (called the communication schedule). In the executor phase, off-processor data are gathered from remote processors according to the planned communication schedule and stored at the end of the local array x. The local index array loc is updated accordingly to reflect the actual location of the remotely fetched data in local array x. Then the two nested loops carry out the matrix-vector multiply entirely on the local array x.

(a) data-parallel program

  do i = 1, nrows
    sum = 0
    do j = 1, ncols(i)
      sum = sum + f(j,i) * x(cols(j,i))
    enddo
    y(i) = y(i) + sum
  enddo

(b) SPMD program

  !! the inspector phase
  call index-translation(proc, loc, cols)
  sched = build-schedule(proc, loc)
  !! the executor phase
  call gather(x, x, sched, loc)
  count = 0
  do i = 1, loc_nrows
    sum = 0
    do j = 1, ncols(i)
      sum = sum + f(j,i) * x(loc(count))
      count += 1
    enddo
    y(i) = y(i) + sum
  enddo

Figure 1: Inspector/Executor implementation of sparse matrix-vector multiply

Hybrid Model of Execution

Our model of loosely synchronous execution for SMP clusters is depicted in Figure 2. When assigned a task, each SMP node creates an initial thread for housekeeping node execution. The initial thread then launches a number of work threads, each assigned to a processor within that node, that execute in parallel and coordinate with each other via the shared memory (computation phase). When remote accesses to other SMP nodes are required, the work threads may synchronize and enter the communication phase. When the communication is completed, all the threads proceed to the next computation phase, and so on.

Figure 2: Loosely synchronous execution model for an SMP cluster. (Each of nodes 0 through N-1 runs a set of threads that alternate between computation phases and multithreaded communication phases.)

Multithreaded Runtime Library

Based on the loosely synchronous execution model, we design a multithreaded runtime library. The same set of threads that participate in parallel computation within an SMP node during the computation phase also work together to optimize the communication phase.
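To make the loosely synchronous model described above concrete, the following C sketch shows one possible skeleton of a node program: one MPI process per SMP node launches work threads that compute on shared memory and meet at a barrier before a communication phase. This is a minimal sketch only, assuming an MPI library initialized with MPI_THREAD_SERIALIZED so that a single designated work thread may perform the message passing; the thread count, phase bodies, and names are illustrative, not the library's actual interface.

#include <mpi.h>
#include <pthread.h>

#define NTHREADS 2                     /* work threads per SMP node            */
#define NSTEPS   10                    /* computation/communication cycles     */

static pthread_barrier_t phase_barrier;

static void compute_phase(int tid)    { /* work on this thread's data partition */ }
static void communication_phase(void) { /* gather/scatter according to the schedule */ }

static void *work_thread(void *arg)
{
    int tid = *(int *)arg;
    for (int step = 0; step < NSTEPS; step++) {
        compute_phase(tid);                      /* shared-memory computation phase */
        pthread_barrier_wait(&phase_barrier);    /* all threads reach the phase end */
        if (tid == 0)
            communication_phase();               /* one thread does the message passing */
        pthread_barrier_wait(&phase_barrier);    /* wait for remote data to arrive  */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    pthread_barrier_init(&phase_barrier, NULL, NTHREADS);

    pthread_t th[NTHREADS];
    int id[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {         /* initial thread launches the workers */
        id[i] = i;
        pthread_create(&th[i], NULL, work_thread, &id[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);

    pthread_barrier_destroy(&phase_barrier);
    MPI_Finalize();
    return 0;
}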

We have designed several optimizations for collective communications:

- Use multiple threads to speed up local data movement,
- Hide message latency by overlapping local memory copy and sending/receiving messages, by using different threads for these two tasks,
- Pipeline multiple messages to the network interface by using multiple threads for sending/receiving multiple messages.

Figure 3 shows the organization of the multithreaded runtime library. POSIX threads primitives are used to parallelize local memory computation and message buffering within an SMP node. MPI primitives are used for message passing between SMP nodes.

Figure 3: Organization of the multithreaded runtime library. The library provides Inspector primitives (build-mapping-table, index-translation, build-communication-schedule), Executor primitives (gather, scatter, scatter-accumulate), and Global operations (Broadcast, Reduction, Barrier-sync), built on top of MPI primitives (Send, Receive, Send-and-receive, Broadcast, Reduce, Barrier-sync) and POSIX threads.

3 Runtime Primitives

Currently, the library supports a set of global operations (barrier synchronization, broadcast, reduction), inspector operations (index translation, build communication schedule), and collective communication (gather, scatter, scatter-accumulate), which together form the runtime kernel for multithreaded execution of the inspector/executor model. This section describes the algorithms for some of the primitives.

3.0.1 Broadcast and Reduction

A reduction operation on an SMP cluster is carried out in two steps: shared-memory reduction among multiple threads within an SMP node, and then distributed-memory reduction between SMP nodes.

Algorithm reduce(op, source, target, ndata)
  op is a binary associative operator,
  source is the data array on which the reduction is to be performed,
  target stores the reduction result.
{
  Assuming k threads will be used for the reduction on an SMP node, and the
  data elements are evenly partitioned among these k threads.

  1. For each thread do
     /* compute the reduction on the section of the data array that this
        particular thread is responsible for, and store the result to a
        local buffer for this thread. */
     local-memory-reduce(op, starting_addr_in_source_for_my_thread,
                         local-buffer-for-my-thread, ndata/k);

  2. Combine the partial results from these threads. If k is small, the
     partial results can be combined using a simple locking mechanism.
     Two shared variables are used:
       temp: to hold the combined partial result from the k threads
       number-of-threads: initialized with k

     For each thread do
       mutual_exclusion_lock
       accumulate local-buffer-for-my-thread to temp
       decrease number-of-threads by 1
       if number-of-threads = 0 (i.e. the last thread)
         then call mpi_reduce(temp, target, ...) to combine the final results
              from all the SMP nodes, and wake up all the other threads
         else wait for the other threads to finish
       mutual_exclusion_unlock
}
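As an illustration, the locking variant of step 2 might look like the following C sketch: each thread accumulates its partial sum under a mutex, and the last thread to arrive performs the inter-node reduction with MPI (here MPI_Allreduce is used so every node obtains the result). This is a sketch only, assuming NTHREADS worker threads per process and an MPI library that permits the reduction call to be issued by whichever thread arrives last (at least MPI_THREAD_SERIALIZED); the names are illustrative, not the library's actual interface.

#include <mpi.h>
#include <pthread.h>

#define NTHREADS 2

static pthread_mutex_t lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  all_done = PTHREAD_COND_INITIALIZER;
static int    threads_left = NTHREADS;  /* "number-of-threads" in the algorithm  */
static double temp   = 0.0;             /* combined partial result on this node  */
static double target = 0.0;             /* final reduction result                */

/* Called by each work thread on its own section of the source array. */
static void hybrid_sum_reduce(const double *my_section, int my_ndata)
{
    double partial = 0.0;
    for (int i = 0; i < my_ndata; i++)   /* step 1: shared-memory reduction      */
        partial += my_section[i];

    pthread_mutex_lock(&lock);
    temp += partial;                     /* step 2: combine partials under a lock */
    if (--threads_left == 0) {
        /* last thread: combine results from all SMP nodes, then wake the others */
        MPI_Allreduce(&temp, &target, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        pthread_cond_broadcast(&all_done);
    } else {
        while (threads_left > 0)         /* wait for the other threads to finish  */
            pthread_cond_wait(&all_done, &lock);
    }
    pthread_mutex_unlock(&lock);
}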

3.0.2 Gather/Scatter Communication

Data movement in irregular computation, such as gather and scatter, can be formulated as all-to-some personalized communication, in which each SMP node transmits a distinct block of data to a subset of SMP nodes. The following algorithm performs all-to-some gather communication with a single thread per SMP node. The data structure "sched" describes the communication schedule (including the number of data elements and the local indices of the data elements to send/receive) on this node. A node sends sched.n_data_to_send[j] data elements in the source array, indexed by sched.send_list[j], to node j, and stores the received data into the proper locations in the target array.

Algorithm gather(source, dest, sched)
void *source, *dest;
SCHEDULE *sched;
{
  /* send data */
  for (i = 0; i < num_nodes; i++) {
    if (sched.n_data_to_send[i] > 0) {
      copy the requested data elements from the source array
        to the i-th section of the buffer,
      and send the buffer data to node i
    }
  }
  /* receive data */
  for (i = 0; i < sched.n_msgs_to_recv; i++) {
    poll for an incoming message,
    copy the received data to the proper locations in the target array
  }
}

Multithreaded Gather. Suppose there are k CPUs on each SMP node, and k is smaller than the number of SMP nodes (num_nodes); then the gather communication can be further parallelized by using k threads: each thread is responsible for the execution of num_nodes/k send-loop iterations. That is, each thread handles the communication with num_nodes/k SMP nodes. Multithreading also parallelizes the local memory copy to message buffers. If source and destination are different arrays, then each thread accesses different memory locations and thus no memory locking is required. Depending on the characteristics of the target machine, the number of threads involved in the gather communication may be tuned to achieve optimal performance.
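The send loop of the multithreaded gather might be organized as in the following C sketch, where each of k threads packs and sends the messages destined for its own share of the SMP nodes. It is a sketch under two stated assumptions: the MPI library is initialized with MPI_THREAD_MULTIPLE so that several threads may call MPI_Send concurrently, and the SCHEDULE layout (field names, a fixed per-destination buffer stride MAX_MSG) is illustrative rather than the library's actual one.

#include <mpi.h>
#include <stddef.h>

#define MAX_MSG 4096                /* assumed per-destination buffer capacity     */

typedef struct {
    int   num_nodes;                /* number of SMP nodes                         */
    int  *n_data_to_send;           /* n_data_to_send[j]: elements to send to j    */
    int **send_list;                /* send_list[j]: local indices of those data   */
} SCHEDULE;

typedef struct {
    const double   *source;         /* local source array                          */
    double         *buffer;         /* packing buffer, MAX_MSG doubles per node    */
    const SCHEDULE *sched;
    int             tid, nthreads;
} gather_send_arg;

/* Thread body: handle destinations tid, tid+nthreads, tid+2*nthreads, ... */
static void *gather_send_worker(void *p)
{
    gather_send_arg *a = (gather_send_arg *)p;
    for (int j = a->tid; j < a->sched->num_nodes; j += a->nthreads) {
        int n = a->sched->n_data_to_send[j];
        if (n <= 0)
            continue;
        double *section = a->buffer + (size_t)j * MAX_MSG;
        for (int i = 0; i < n; i++)              /* local memory copy, done in parallel */
            section[i] = a->source[a->sched->send_list[j][i]];
        MPI_Send(section, n, MPI_DOUBLE, j, 0, MPI_COMM_WORLD);  /* send to node j */
    }
    return NULL;
}

The receive side can remain single-threaded or be split among threads in the same way; as noted above, when the source and destination arrays are distinct, the packing copies touch disjoint memory and need no locking.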

4 Experimental Results

The experiments were conducted on a network of four Sun UltraSPARC-II workstations located in the Institute of Information Science, Academia Sinica. The workstations are connected by a Fast Ethernet network capable of 100 Mbps per node. Each workstation is an SMP with two 296 MHz CPUs, running the multithreaded OS SunOS 5.5.1. The communication library in the framework is implemented on top of the MPICH implementation of MPI.

We compared two implementation approaches, one using the hybrid thread+MPI methodology, and the other using MPI only. For the hybrid version, we created four SPMD processes by the command "mpirun -np 4", which maps each of the four processes to a distinct workstation. Each process launches two threads, which are then automatically assigned to different CPUs by the operating system. Threads within a process communicate by shared variables and memory locks, while processes communicate with each other by passing messages. For the MPI-only version, eight processes, each running a single thread, were created using the command "mpirun -np 8", which assigns each process to a distinct CPU. All processes communicate with each other by passing messages.

Figure 4(a) shows the effectiveness of using multiple threads in global reduction. Using two threads per process improves performance by about 70%. The result also shows that the overhead of shared-memory synchronization in the hybrid approach, for summing up partial reduction results from multiple CPUs within an SMP node, is negligible compared with the MPI-only implementation.

Figure 4(b) reports the execution time of the all-to-some gather communication. Using two threads per process achieves a speedup factor of 1.5, due to the effect of multithreading in overlapping local memory copying with interprocess communication. The advantage of the hybrid approach over the MPI-only implementation is even more significant. This is because, with the hybrid approach, each process needs to send/receive fewer messages than in the MPI-only implementation. We expect more performance improvement on a larger number of SMP nodes, where the advantage of reducing the number of messages will become more evident.

Figure 4(c) shows the execution time of the sparse matrix-vector multiply, with the sparse matrix partitioned into blocks of rows. Using two threads per process shortens the communication time for both the inspector phase and the executor phase. The computation time is competitive with that of the MPI-only implementation, resulting in an overall speedup for the hybrid version.

Figure 4: Execution time of some collective communications on a cluster of four dual-CPU workstations, comparing three versions: hybrid thread+MPI with a single thread per process, hybrid thread+MPI with two threads per process, and MPI-only (8 processes, no multithreading). (a) global reduction (execution time in seconds vs. data size, x 1000k); (b) all-to-some communication (execution time in seconds vs. data size n, matrix size = 2^n * 2^n); (c) sparse matrix-vector multiply (execution time in seconds vs. data size n, matrix size = 2^n * 2^n).

5 Related Work

A number of thread packages support interprocessor communication mechanisms. Nexus [5] provides a thread abstraction that is capable of interprocessor communication in the form of active messages (asynchronous remote procedure calls). Nexus has not supported either point-to-point communication or collective communication. Panda [2] has similar goals and approaches to Nexus. It also supports distributed threads. Chant [6] is a thread-based communication library which supports both point-to-point primitives and remote procedure calls. Ropes [7] extends Chant by introducing the notion of "context", a scoping mechanism for threads, and providing group communication between threads in a specific context. Some simple kinds of collective operations (e.g. barrier synchronization, broadcast) are supported. The MPC++ multithread template library on MPI [8] is a set of templates and classes to support multithreaded parallel programming. It includes support for multi-threaded local/remote function invocation, reduction, and barrier synchronization.

On the part of the Message Passing Interface, Skjellum et al. [4] address the issue of making MPI thread-safe with respect to internal worker threads designed to improve the efficiency of MPI. Later, the same group proposed many possible extensions to the MPI standard, which allow user-level threads to be mixed with messages [13, 12].

Just recently, there have been some new results from industry. For example, a new version of HP MPI (released in late June this year) [10] incorporates support for multithreaded MPI processes, in the form of a thread-compliant library. This library allows users to freely mix POSIX threads and messages. However, it does not perform optimization for collective communication as we do.

6 Conclusion

In this paper, we present a hybrid threads and message-passing methodology for solving irregular problems on SMP clusters. We have also described the implementation of a set of collective communications based on this methodology.

Our preliminary experimental results on a network of Sun UltraSPARC-II workstations show that the hybrid methodology improves the performance of global reduction by about 70% and the performance of the all-to-some gather by about 80%, compared with MPI-only, non-threaded implementations.

As threads are commonly used in support of parallelism and asynchronous control in applications and language implementations, we believe that the hybrid threads/message-passing methodology has the advantages of both flexibility and efficiency.

Acknowledgment. Support for this work is provided by the National Science Council of Taiwan under grant E.

References

[1] H. Berryman, J. Saltz, and J. Scroggs. Execution time support for adaptive scientific algorithms on distributed memory architectures. Concurrency: Practice and Experience, 3(3):159-178, 1991.

[2] R. Bhoedjang, T. Ruhl, R. Hofman, K. Langendoen, H. Bal, and F. Kaashoek. Panda: A portable platform to support parallel programming languages. In Symposium on Experiences with Distributed and Multiprocessor Systems, San Diego, CA, 1993.

[3] C.-L. Chiang, J.-J. Wu, and N.-W. Lin. Toward supporting data-parallel programming for clusters of symmetric multiprocessors. In International Conference on Parallel and Distributed Systems, Tainan, Taiwan, 1998.

[4] A. K. Chowdappa, A. Skjellum, and N. E. Doss. Thread-safe message passing with p4 and MPI. Technical Report TR-CS, Computer Science Department and NSF Engineering Research Center, Mississippi State University, April 1994.

[5] Ian Foster, C. Kesselman, R. Olson, and S. Tuecke. Nexus: An interoperability layer for parallel and distributed computer systems. Technical report, Argonne National Labs, Dec.

[6] M. Haines, D. Cronk, and P. Mehrotra. On the design of Chant: A talking threads package. In Proceedings of Supercomputing '94, 1994.

[7] M. Haines, P. Mehrotra, and D. Cronk. ROPES: Support for collective operations among distributed threads. Technical report, ICASE, NASA Langley Research Center, May 1995.

[8] Yutaka Ishikawa. Multithread template library in C++.

[9] William E. Johnston. The case for using commercial symmetric multiprocessors as supercomputers. Technical report, Information and Computing Sciences Division, Lawrence Berkeley National Laboratory, 1997.

[10] Hewlett-Packard. HP Message Passing Interface.

[11] R. Ponnusamy. A manual for the CHAOS runtime library. Technical Report TR93-105, Computer Science Dept., Univ. of Maryland, Dec.

[12] A. Skjellum, N. E. Doss, K. Viswanathan, A. Chowdappa, and P. V. Bangalore. Extending the Message Passing Interface (MPI). Technical report, Computer Science Department and NSF Engineering Research Center, Mississippi State University, 1997.

[13] A. Skjellum, B. Protopopov, and S. Hebert. A thread taxonomy for MPI. Technical report, Computer Science Department and NSF Engineering Research Center, Mississippi State University, 1996.
