A Hybrid Multithreading/Message-Passing Approach for Solving Irregular Problems on SMP Clusters

Jan-Jan Wu
Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.

Chia-Lien Chiang, Nai-Wei Lin
Dept. of Computer Science, National Chung Cheng Univ., Chia-Yi, Taiwan, R.O.C.

Abstract: This paper reports the design of a runtime library for solving irregularly structured problems on clusters of symmetric multiprocessors (SMP clusters). Our algorithms exploit a hybrid methodology that maps directly to the underlying hierarchical memory system in SMP clusters by combining two styles of programming: threads (shared-memory programming) within an SMP node and message passing between SMP nodes. This hybrid approach has been used in the implementation of a runtime library supporting the inspector/executor model of execution for solving irregular problems. Experimental results on a cluster of Sun UltraSPARC-II workstations are reported.

Keywords: data-parallel programming, workstation clusters, symmetric multiprocessor, multithreaded message-passing, collective communication

1 Introduction

High-performance parallel computers incorporate proprietary interconnection networks that allow low-latency, high-bandwidth interprocessor communication. A large number of scientific computing applications have benefited from such parallel computers. Nowadays, the level of hardware performance found in high-performance workstations, together with advances in distributed-systems architecture, makes clusters of workstations (so-called NOW systems) one of the highest-performance and most cost-effective approaches to computing. High-performance workstation clusters can realistically be used for a variety of applications to replace vector supercomputers and massively parallel computers.

In recent years, a number of such workstations (or PCs) have been introduced into the market (e.g. the SGI Power Challenge, the Sun Ultra Enterprise series, DEC Alpha servers, IBM RS6000 machines with multiple CPUs, and Intel Pentium machines with two or four CPUs). The largest commercially available SMPs today (e.g. the SGI Power Challenge with 32 processors) can deliver peak rates of over one billion floating-point operations per second; this is projected to increase by over an order of magnitude within the next few years. Such computing power will make this class of machines an attractive choice for solving large problems. It is predicted that SMPs will be the basis of future workstation clusters [9].

Processors within an SMP share a common memory, while each SMP has its own memory that is not directly accessible from processors residing on other SMPs. The performance of a distributed-memory computer depends to a large extent on how fast data movement can be performed. Despite significant improvements in hardware technology, the improvements in communication cost still lag behind those in the computation power of each processing node. As a result, optimization for communication remains an important issue for SMP clusters.
With a number of implementations of MPI and PVM becoming available, parallel programming on workstation clusters has become an easier task. Although MPI and PVM can also be used to manage communication in SMP clusters, they are not necessarily the most efficient methodologies. Our approach exploits a hybrid methodology that maps directly to the underlying hierarchical memory system in SMP clusters by combining two styles of programming: threads (shared-memory programming) within an SMP node and message passing between SMP nodes.

Our previous experience with the hybrid methodology for regular data-parallel computation (i.e. where the reference patterns of each data element are uniform and predictable) is reported in [3]. In this paper, we extend our previous work to support irregular computations, where reference patterns may depend on input data or runtime-computed values. This paper gives an overview of the hybrid methodology for solving irregular problems and reports the algorithmic design of a runtime library based on this hybrid methodology. The library collects a set of computation and communication primitives that form the runtime kernel for supporting the inspector/executor model of execution on distributed-memory multiprocessors.

Our design goals center on high portability, based on existing standards for lightweight threads and message-passing systems, and high efficiency, based on the use of threads to improve the efficiency of data movement. The library is intended to support direct programming or to serve as a runtime support library for high-level languages (e.g. HPF2) targeting SMP clusters. Our preliminary experimental results on a cluster of Sun UltraSPARC-II dual-CPU workstations demonstrate the performance improvement of the hybrid methodology over a message-passing implementation.

2 Overview

Preprocessing for Irregular Computation

The most commonly used model for explicit data-parallel programming on distributed-memory multiprocessors is the single-program multiple-data (SPMD) model, where data are distributed among the available processors and the same program is executed concurrently on all processors, with each processor responsible only for the computation on its portion of the data.

For cases where data are uniformly distributed, the location of a data element can be determined statically. In many situations, however, dynamic and irregular data distribution is required to obtain load balancing and data locality. It is not possible to express irregular data distribution in a simple way; the mapping of data is usually described by a distributed mapping table defined by a partitioning algorithm.

During the preprocessing or inspector phase, we carry out a set of interprocessor communications that allows us to anticipate exactly which send and receive communication calls each processor must execute [1]. Once preprocessing is completed, we are in a position to carry out the problem's communication and computation; this phase is called the executor [1]. The interprocessor exchange of distributed array elements follows the plans established by the inspector. Off-processor data are appended to the end of the local array. Once a communication phase is over, each processor carries out its computation entirely on the locally stored data without further communication. Writes to off-processor data elements are also appended to the end of the local array. When the computational phase is finished, array elements to be stored off-processor are taken from the end of the local array and sent to the appropriate off-processor locations.

We use the PARTI library (single thread on each process) as the basis for our multithreaded implementation.
The PARTI/CHAOS runtime library developed by [1, 11] aims to support the inspector/executor style of programming using arrays as the main data structures. The PARTI runtime primitives include communication routines designed to support irregular data distribution and irregular patterns of distributed array access (Figure 3). Figure 1(a) shows the data-parallel code segment for a sparse matrix-vector multiply. Figure 1(b) depicts the SPMD implementation with the PARTI library.

(a) data-parallel program

do i = 1, nrows
  sum = 0
  do j = 1, ncols(i)
    sum = sum + f(j,i) * x(cols(j,i))
  enddo
  y(i) = y(i) + sum
enddo

(b) SPMD program

!! the inspector phase
call index-translation(proc, loc, cols)
sched = build-schedule(proc, loc)
!! the executor phase
call gather(x, x, sched, loc)
count = 0
do i = 1, locnrows
  sum = 0
  do j = 1, ncols(i)
    sum = sum + f(j,i) * x(loc(count))
    count += 1
  enddo
  y(i) = y(i) + sum
enddo

Figure 1: Inspector/executor implementation of sparse matrix-vector multiply

In the inspector phase, index-translation translates the indirect data-reference subscripts (given by array cols) into pairs of process index and local index (proc, loc). build-schedule determines which process should send/receive which data elements to/from which processes (called the communication schedule). In the executor phase, off-processor data are gathered from remote processors according to the planned communication schedule and stored at the end of the local array x. The local index array loc is updated accordingly to reflect the actual location of the remotely fetched data in the local array x. Then the two nested loops carry out the matrix-vector multiply entirely on the local array x.

Hybrid Model of Execution

Our model of loosely synchronous execution for SMP clusters is depicted in Figure 2. When assigned a task, each SMP node creates an initial thread for housekeeping node execution. The initial thread then launches a number of work threads, each assigned to a processor within that node, that execute in parallel and coordinate with each other via the shared memory (the computation phase). When remote accesses to other SMP nodes are required, the work threads synchronize and enter the communication phase. When the communication is completed, all the threads proceed to the next computation phase, and so on.

Figure 2: Loosely synchronous execution model for an SMP cluster. Each node runs multiple threads and alternates between multithreaded computation phases and communication phases.
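To make this execution model concrete, the following is a minimal sketch of how a loosely synchronous hybrid program of this kind might be structured with POSIX threads and MPI. It is an illustration under stated assumptions, not the library's actual code; smp_compute and exchange_with_other_nodes are hypothetical placeholders standing in for the application kernel and for the library's collective primitives.

/* Sketch (assumption, not the library's source): one MPI process per SMP
 * node, a fixed number of work threads per process, alternating phases.  */
#include <mpi.h>
#include <pthread.h>

#define NUM_WORK_THREADS 2
#define NUM_STEPS        4

static pthread_barrier_t phase_barrier;

static void smp_compute(int tid)
{
    (void)tid;  /* placeholder: work on this thread's share of the node-local data */
}

static void exchange_with_other_nodes(void)
{
    MPI_Barrier(MPI_COMM_WORLD);  /* placeholder: stand-in for gather/scatter traffic */
}

static void *work_thread(void *arg)
{
    int tid = *(int *)arg;
    for (int step = 0; step < NUM_STEPS; step++) {
        smp_compute(tid);                      /* computation phase (shared memory)  */
        pthread_barrier_wait(&phase_barrier);  /* all work threads reach phase end   */
        if (tid == 0)
            exchange_with_other_nodes();       /* communication phase (MPI)          */
        pthread_barrier_wait(&phase_barrier);  /* proceed to next computation phase  */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    pthread_t t[NUM_WORK_THREADS];
    int id[NUM_WORK_THREADS];

    /* the initial (housekeeping) thread launches the work threads */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    pthread_barrier_init(&phase_barrier, NULL, NUM_WORK_THREADS);
    for (int i = 0; i < NUM_WORK_THREADS; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, work_thread, &id[i]);
    }
    for (int i = 0; i < NUM_WORK_THREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&phase_barrier);
    MPI_Finalize();
    return 0;
}

In this sketch a single work thread funnels the MPI traffic during the communication phase; the multithreaded primitives described in Section 3 spread that traffic over several threads.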
Multithreaded Runtime Library

Based on the loosely synchronous execution model, we design a multithreaded runtime library. The same set of threads that participate in the parallel computation within an SMP node during the computation phase also work together to optimize the communication phase. We have designed several optimizations for collective communications:

- Use multiple threads to speed up local data movement;
- Hide message latency by overlapping local memory copies with the sending/receiving of messages, using different threads for these two tasks;
- Pipeline multiple messages to the network interface by using multiple threads for sending/receiving multiple messages.

Figure 3 shows the organization of the multithreaded runtime library. POSIX threads primitives are used to parallelize local-memory computation and message buffering within an SMP node. MPI primitives are used for message passing between SMP nodes.

Figure 3: Organization of the multithreaded runtime library. Inspector: build-mapping-table, index-translation, build-communication-schedule. Executor: gather, scatter, scatter-accumulate. Global operations: broadcast, reduction, barrier-sync. These are built on POSIX threads and on MPI primitives (send, receive, send-and-receive, broadcast, reduce, barrier-sync).

3 Runtime Primitives

Currently, the library supports a set of global operations (barrier synchronization, broadcast, reduction), inspector operations (index translation, build communication schedule), and collective communications (gather, scatter, scatter-accumulate), which together form the runtime kernel for multithreaded execution of the inspector/executor model. This section describes the algorithms for some of the primitives.

3.0.1 Broadcast and Reduction

A reduction operation on an SMP cluster is carried out in two steps: a shared-memory reduction among the multiple threads within an SMP node, and then a distributed-memory reduction between SMP nodes.

Algorithm reduce(op, source, target, ndata)
op is a binary associative operator, source is the data array on which the
reduction is to be performed, and target stores the reduction result.
{
  Assume k threads will be used for the reduction on an SMP node, and the
  data elements are evenly partitioned among these k threads.

  1. For each thread do
       /* compute the reduction on the section of the data array that this
          thread is responsible for, and store the result in a local buffer
          for this thread */
       local-memory-reduce(op, starting_addr_in_source_for_my_thread,
                           local-buffer-for-my-thread, ndata/k);

  2. Combine the partial results from these threads. If k is small, the
     partial results can be combined using a simple locking mechanism.
     Two shared variables are used:
       temp: holds the combined partial result from the k threads
       number-of-threads: initialized to k

     For each thread do
       mutual_exclusion_lock
         accumulate local-buffer-for-my-thread into temp
         decrease number-of-threads by 1
       mutual_exclusion_unlock
       if number-of-threads = 0 (i.e. this is the last thread)
         then call MPI_Reduce(temp, target, ...) to combine the final results
              from all the SMP nodes, and wake up all the other threads
         else wait for the other threads to finish
}
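As a concrete illustration of this two-step algorithm, the sketch below implements a hybrid sum reduction over doubles with POSIX threads and MPI_Reduce. It is a minimal sketch, assuming two threads per node and a data size divisible by the thread count; it follows the structure of the algorithm above but is not the library's actual source.

/* Sketch (assumption): two-step hybrid reduction, sum of doubles. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define K 2                        /* threads per SMP node (assumption)      */

static double *source;             /* node-local input data                  */
static long    ndata;              /* elements per node, divisible by K      */
static double  temp = 0.0;         /* combined partial result on this node   */
static int     remaining = K;      /* threads that have not yet contributed  */
static double  target = 0.0;       /* final cluster-wide result (on rank 0)  */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;

static void *reduce_thread(void *arg)
{
    long tid = (long)arg;
    long chunk = ndata / K;
    double local = 0.0;
    for (long i = tid * chunk; i < (tid + 1) * chunk; i++)  /* step 1: local reduce */
        local += source[i];

    pthread_mutex_lock(&lock);                              /* step 2: combine      */
    temp += local;
    if (--remaining == 0) {
        /* last thread performs the inter-node reduction and wakes the others */
        MPI_Reduce(&temp, &target, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        pthread_cond_broadcast(&done);
    } else {
        while (remaining != 0)
            pthread_cond_wait(&done, &lock);
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(int argc, char **argv)
{
    int rank, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    ndata = 1000000;
    source = malloc(ndata * sizeof(double));
    for (long i = 0; i < ndata; i++) source[i] = 1.0;

    pthread_t t[K];
    for (long i = 0; i < K; i++) pthread_create(&t[i], NULL, reduce_thread, (void *)i);
    for (int i = 0; i < K; i++)  pthread_join(t[i], NULL);

    if (rank == 0) printf("sum = %f\n", target);
    free(source);
    MPI_Finalize();
    return 0;
}

A full implementation would reuse the node's persistent work threads and support arbitrary operators and data types; the point here is only the shared-memory combine followed by a single inter-node MPI_Reduce issued by the last thread.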
3.0.2 Gather/Scatter Communication

Data movement in irregular computation, such as gather and scatter, can be formulated as all-to-some personalized communication, in which each SMP node transmits a distinct block of data to a subset of the SMP nodes. The following algorithm performs all-to-some gather communication with a single thread per SMP node. The data structure sched describes the communication schedule (including the number of data elements and the local indices of the data elements to send/receive) on this node. A node sends sched.n_data_to_send[j] data elements of the source array, indexed by sched.n_send_list[i], to node j, and stores the received data into the proper locations in the target array.

Algorithm gather(source, dest, sched)
void *source, *dest;
SCHEDULE *sched;
{
  /* send data */
  for (i = 0; i < num_nodes; i++) {
    if (sched.n_data_to_send[i] > 0) {
      copy the requested data elements from the source array to the i-th
      section of the buffer, and send the buffer data to node i
    }
  }
  /* receive data */
  for (i = 0; i < sched.n_msgs_to_recv; i++) {
    poll for an incoming message,
    copy the received data to the proper locations in the target array
  }
}

Multithreaded Gather

Suppose there are k CPUs on each SMP node, and k is smaller than the number of SMP nodes (num_nodes). The gather communication can then be further parallelized by using k threads: each thread is responsible for executing num_nodes/k send-loop iterations, i.e. each thread handles the communication with num_nodes/k SMP nodes. Multithreading also parallelizes the local memory copies into the message buffers. If source and destination are different arrays, each thread accesses different memory locations and thus no memory locking is required. Depending on the characteristics of the target machine, the number of threads involved in the gather communication may be tuned to achieve optimal performance.
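To show what the thread-parallel send loop might look like, here is a hedged sketch in C. The schedule is replaced by a fixed illustrative pattern (every node requests a few elements from every other node), the destinations are split round-robin rather than in contiguous blocks, and concurrent sends from several threads require an MPI library providing MPI_THREAD_MULTIPLE; none of these details are taken from the paper.

/* Sketch (assumption): multithreaded all-to-some gather, k threads split
 * the send loop across destination nodes.                                 */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define K        2          /* send threads per node (assumption)          */
#define PER_DEST 2          /* elements requested by each remote node      */
#define N        16         /* size of the node-local source array         */

static int     num_nodes, my_node;
static double  source[N];                  /* node-local data               */
static double *sendbuf;                    /* one PER_DEST slot per dest    */

static void *send_thread(void *arg)
{
    long tid = (long)arg;
    /* each thread handles the destinations i with i % K == tid */
    for (int i = (int)tid; i < num_nodes; i += K) {
        if (i == my_node) continue;        /* nothing to send to ourselves  */
        double *buf = sendbuf + (long)i * PER_DEST;
        for (int e = 0; e < PER_DEST; e++) /* pack the requested elements   */
            buf[e] = source[(i + e) % N];  /* illustrative index list       */
        MPI_Send(buf, PER_DEST, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* without full thread support the sends would have to be serialized */
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_size(MPI_COMM_WORLD, &num_nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_node);

    for (int i = 0; i < N; i++) source[i] = my_node * 100.0 + i;
    sendbuf = malloc((size_t)num_nodes * PER_DEST * sizeof(double));

    /* pre-post one receive per remote node (the real library appends the
       off-processor data to the end of the local array; a flat buffer here) */
    int nrecv = num_nodes - 1;
    double      *recvbuf  = malloc((size_t)(nrecv > 0 ? nrecv : 1) * PER_DEST * sizeof(double));
    MPI_Request *requests = malloc((size_t)(nrecv > 0 ? nrecv : 1) * sizeof(MPI_Request));
    for (int i = 0, r = 0; i < num_nodes; i++) {
        if (i == my_node) continue;
        MPI_Irecv(recvbuf + (long)r * PER_DEST, PER_DEST, MPI_DOUBLE,
                  i, 0, MPI_COMM_WORLD, &requests[r]);
        r++;
    }

    pthread_t t[K];
    for (long i = 0; i < K; i++) pthread_create(&t[i], NULL, send_thread, (void *)i);
    for (int i = 0; i < K; i++)  pthread_join(t[i], NULL);

    MPI_Waitall(nrecv, requests, MPI_STATUSES_IGNORE);
    printf("node %d gathered %d off-processor elements\n", my_node, nrecv * PER_DEST);

    free(sendbuf); free(recvbuf); free(requests);
    MPI_Finalize();
    return 0;
}

With the receives pre-posted by the initial thread, the k send threads overlap buffer packing with message injection, which is the effect the multithreaded gather is designed to exploit.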
4 Experimental Results

The experiments were conducted on a network of four Sun UltraSPARC-II workstations located in the Institute of Information Science, Academia Sinica. The workstations are connected by a fast Ethernet network capable of 100 Mbps per node. Each workstation is an SMP with two 296 MHz CPUs, running the multithreaded OS SunOS 5.5.1. The communication library in the framework is implemented on top of the MPICH implementation of MPI.

We compared two implementation approaches, one using the hybrid thread+MPI methodology and the other using MPI only. For the hybrid version, we created four SPMD processes with the command "mpirun -np 4", which maps each of the four processes to a distinct workstation. Each process launches two threads, which are then automatically assigned to different CPUs by the operating system. Threads within a process communicate through shared variables and memory locks, while processes communicate with each other by passing messages. For the MPI-only version, eight processes, each running a single thread, were created with the command "mpirun -np 8", which assigns each process to a distinct CPU. All processes communicate with each other by passing messages.

Figure 4(a) shows the effectiveness of using multiple threads in global reduction. Using two threads per process improves performance by about 70%. The result also shows that the overhead of shared-memory synchronization in the hybrid approach, for summing up the partial reduction results from the multiple CPUs within an SMP node, is negligible compared with the MPI-only implementation.

Figure 4(b) reports the execution time of all-to-some gather communication. Using two threads per process achieves a speedup factor of 1.5, due to the effect of multithreading in overlapping local memory copying with interprocess communication. The advantage of the hybrid approach over the MPI-only implementation is even more significant. This is because with the hybrid approach each process needs to send/receive fewer messages than in the MPI-only implementation. We expect more performance improvement on larger numbers of SMP nodes, where the advantage of reducing the number of messages will become more evident.

Figure 4(c) shows the execution time of the sparse matrix-vector multiply, obtained by partitioning the sparse matrix into blocks of rows. Using two threads per process shortens the communication time for both the inspector phase and the executor phase. The computation time is competitive with that of the MPI-only implementation, resulting in an overall speedup over the MPI-only version.

Figure 4: Execution time of some collective communications on a cluster of four dual-CPU workstations, comparing three versions: hybrid thread+MPI with a single thread per process, hybrid thread+MPI with two threads per process, and MPI-only (8 processes, no multithreading). (a) global reduction; (b) all-to-some communication; (c) sparse matrix-vector multiply.

5 Related Work

A number of thread packages support interprocessor communication mechanisms. Nexus [5] provides a thread abstraction that is capable of interprocessor communication in the form of active messages (asynchronous remote procedure calls). Nexus has not supported either point-to-point communication or collective communication. Panda [2] has similar goals and approaches to Nexus; it also supports distributed threads. Chant [6] is a thread-based communication library which supports both point-to-point primitives and remote procedure calls. Ropes [7] extends Chant by introducing the notion of a "context", a scoping mechanism for threads, and by providing group communication between threads in a specific context; some simple kinds of collective operations (e.g. barrier synchronization, broadcast) are supported. The MPC++ multithread template library on MPI [8] is a set of templates and classes to support multithreaded parallel programming. It includes support for multithreaded local/remote function invocation, reduction, and barrier synchronization.
On the Message Passing Interface side, Skjellum et al. [4] address the issue of making MPI thread-safe with respect to internal worker threads designed to improve the efficiency of MPI. Later, the same group proposed many possible extensions to the MPI standard that allow user-level threads to be mixed with messages [13, 12].

Just recently, there have been some new results from industry. For example, the new version of HP MPI (released in late June this year) [10] incorporates support for multithreaded MPI processes in the form of a thread-compliant library. This library allows users to freely mix POSIX threads and messages; however, it does not perform optimization for collective communication as we do.

6 Conclusion

In this paper, we present a hybrid threads and message-passing methodology for solving irregular problems on SMP clusters. We have also described the implementation of a set of collective communications based on this methodology.

Our preliminary experimental results on a network of Sun UltraSPARC-II workstations show that the hybrid methodology improves the performance of global reduction by about 70% and the performance of all-to-some gather by about 80%, compared with MPI-only, non-threaded implementations.

As threads are commonly used in support of parallelism and asynchronous control in applications and language implementations, we believe that the hybrid threads/message-passing methodology has the advantages of both flexibility and efficiency.

Acknowledgment: Support for this work is provided by the National Science Council of Taiwan under grant E

References

[1] H. Berryman, J. Saltz, and J. Scroggs. Execution time support for adaptive scientific algorithms on distributed memory architectures. Concurrency: Practice and Experience, 3(3):159-178, 1991.
[2] R. Bhoedjang, T. Ruhl, R. Hofman, K. Langendoen, H. Bal, and F. Kaashoek. Panda: A portable platform to support parallel programming languages. In Symposium on Experiences with Distributed and Multiprocessor Systems, San Diego, CA, 1993.
[3] C.-L. Chiang, J.-J. Wu, and N.-W. Lin. Toward supporting data-parallel programming for clusters of symmetric multiprocessors. In International Conference on Parallel and Distributed Systems, Tainan, Taiwan, 1998.
[4] A. K. Chowdappa, A. Skjellum, and N. E. Doss. Thread-safe message passing with P4 and MPI. Technical Report TR-CS, Computer Science Department and NSF Engineering Research Center, Mississippi State University, April 1994.
[5] I. Foster, C. Kesselman, R. Olson, and S. Tuecke. Nexus: An interoperability layer for parallel and distributed computer systems. Technical report, Argonne National Labs, Dec.
[6] M. Haines, D. Cronk, and P. Mehrotra. On the design of Chant: A talking threads package. In Proceedings of Supercomputing '94, 1994.
[7] M. Haines, P. Mehrotra, and D. Cronk. ROPES: Support for collective operations among distributed threads. Technical report, ICASE, NASA Langley Research Center, May 1995.
[8] Y. Ishikawa. Multithread template library in C++.
[9] W. E. Johnston. The case for using commercial symmetric multiprocessors as supercomputers. Technical report, Information and Computing Sciences Division, Lawrence Berkeley National Laboratory, 1997.
[10] Hewlett-Packard. HP Message Passing Interface.
[11] R. Ponnusamy. A manual for the CHAOS runtime library. Technical Report TR93-105, Computer Science Dept., Univ. of Maryland, Dec.
[12] A. Skjellum, N. E. Doss, K. Viswanathan, A. Chowdappa, and P. V. Bangalore. Extending the Message Passing Interface (MPI). Technical report, Computer Science Department and NSF Engineering Research Center, Mississippi State University, 1997.
[13] A. Skjellum, B. Protopopov, and S. Hebert. A thread taxonomy for MPI. Technical report, Computer Science Department and NSF Engineering Research Center, Mississippi State University, 1996.
