A Hybrid Multithreading/Message-Passing Approach for Solving Irregular Problems on SMP Clusters

Jan-Jan Wu, Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.
Chia-Lien Chiang and Nai-Wei Lin, Dept. of Computer Science, National Chung Cheng Univ., Chia-Yi, Taiwan, R.O.C.

Abstract: This paper reports the design of a runtime library for solving irregularly structured problems on clusters of symmetric multiprocessors (SMP clusters). Our algorithms exploit a hybrid methodology that maps directly to the underlying hierarchical memory system in SMP clusters by combining two programming styles: threads (shared-memory programming) within an SMP node and message passing between SMP nodes. This hybrid approach has been used in the implementation of a runtime library supporting the inspector/executor model of execution for solving irregular problems. Experimental results on a cluster of Sun UltraSPARC-II workstations are reported.

Keywords: data-parallel programming, workstation clusters, symmetric multiprocessor, multithreaded message-passing, collective communication

1 Introduction

High-performance parallel computers incorporate proprietary interconnection networks that allow low-latency, high-bandwidth interprocessor communication. A large number of scientific computing applications have benefited from such parallel computers. Nowadays, the level of hardware performance found in high-performance workstations, together with advances in distributed systems architecture, makes clusters of workstations (so-called NOW systems) one of the highest-performance and most cost-effective approaches to computing. High-performance workstation clusters can realistically be used for a variety of applications to replace vector supercomputers and massively parallel computers.

In recent years, a number of such workstations (or PCs) have been introduced into the market (e.g. the SGI Power Challenge, the Sun Ultra Enterprise series, DEC Alpha servers, IBM RS6000 machines with multiple CPUs, and Intel Pentium machines with two or four CPUs). The largest commercially available SMPs today (e.g. the SGI Power Challenge with 32 processors) can deliver peak rates of over one billion floating-point operations per second; this is projected to increase by over an order of magnitude within the next few years. Such computing power makes this class of machines an attractive choice for solving large problems. It is predicted that SMPs will be the basis of future workstation clusters [9].

Processors within an SMP share a common memory, while each SMP has its own memory that is not directly accessible from processors residing on other SMPs. The performance of a distributed-memory computer depends to a large extent on how fast data movement can be performed. Despite significant improvements in hardware technology, improvements in communication cost still lag behind those in the computation power of each processing node. As a result, optimizing communication remains an important issue for SMP clusters.

With a number of implementations of MPI and PVM becoming available, parallel programming on workstation clusters has become an easier task. Although MPI and PVM can also be used to manage communication in SMP clusters, they are not necessarily the most efficient methodologies. Our approach exploits a hybrid methodology that maps directly to the underlying hierarchical memory system in SMP clusters by combining two programming styles: threads (shared-memory programming) within an SMP node and message passing between SMP nodes.

Our previous experience with the hybrid methodology for regular data-parallel computation (i.e. where the reference pattern of each data element is uniform and predictable) is reported in [3]. In this paper, we extend our previous work to support irregular computations, where reference patterns may depend on input data or runtime-computed values. This paper gives an overview of the hybrid methodology for solving irregular problems and reports the algorithmic design of a runtime library based on this methodology. The library collects a set of computation and communication primitives that form the runtime kernel for supporting the inspector/executor model of execution on distributed-memory multiprocessors.

Our design goals center on high portability, based on existing standards for lightweight threads and message-passing systems, and high efficiency, based on the use of threads to improve the efficiency of data movement. The library is intended to support direct programming or to serve as a runtime support library for high-level languages (e.g. HPF2) targeting SMP clusters. Our preliminary experimental results on a cluster of Sun UltraSPARC-II dual-CPU workstations demonstrate the performance improvement of the hybrid methodology over a message-passing implementation.

2 Overview

Preprocessing for Irregular Computation

The most commonly used model for explicit data-parallel programming on distributed-memory multiprocessors is the single-program multiple-data model, where data are distributed among the available processors and the same program is executed concurrently on all processors, with each processor responsible only for the computation on its portion of the data.

For cases where data are uniformly distributed, the location of a data element can be determined statically. In many situations, however, dynamic and irregular data distribution is required to obtain load balancing and data locality. It is not possible to express irregular data distribution in a simple way; the mapping of data is usually described by a distributed mapping table defined by a partitioning algorithm.

During the preprocessing or inspector phase, we carry out a set of interprocessor communications that allows us to anticipate exactly which send and receive communication calls each processor must execute [1]. Once preprocessing is completed, we are in a position to carry out the problem's communication and computation; this phase is called the executor [1]. The exchange of interprocessor distributed-array data follows the plans established by the inspector. Off-processor data are appended to the end of the local array. Once a communication phase is over, each processor carries out its computation entirely on the locally stored data without further communication. Writes to off-processor data elements are also appended to the end of the local array. When the computational phase is finished, array elements to be stored off-processor are obtained from the end of the local array and sent to the appropriate off-processor locations.

We use the PARTI library (single thread on each process) as the basis for our multithreaded implementation. The PARTI/CHAOS runtime library developed by [1, 11] aims to support the inspector/executor style of programming using arrays as the main data structures. The PARTI runtime primitives include communication routines designed to support irregular data distribution and irregular patterns of distributed-array access (Figure 3). Figure 1(a) shows the data-parallel code segment for a sparse matrix-vector multiply; Figure 1(b) depicts the SPMD implementation with the PARTI library.

(a) data-parallel program

  do i = 1, nrows
    sum = 0
    do j = 1, ncols(i)
      sum = sum + f(j,i) * x(cols(j,i))
    enddo
    y(i) = y(i) + sum
  enddo

(b) SPMD program

  !! the inspector phase
  call index-translation(proc, loc, cols)
  sched = build-schedule(proc, loc)
  !! the executor phase
  call gather(x, x, sched, loc)
  count = 0
  do i = 1, locnrows
    sum = 0
    do j = 1, ncols(i)
      sum = sum + f(j,i) * x(loc(count))
      count += 1
    enddo
    y(i) = y(i) + sum
  enddo

Figure 1: Inspector/executor implementation of sparse matrix-vector multiply

In the inspector phase, index-translation translates the indirect data-reference subscripts (given by array cols) into pairs of process index and local index (proc, loc). build-schedule then determines which process should send/receive which data elements to/from which processes (the communication schedule). In the executor phase, off-processor data are gathered from remote processors according to the planned communication schedule and stored at the end of the local array x. The local index array loc is updated accordingly to reflect the actual locations of the remotely fetched data in the local array x. The two nested loops then carry out the matrix-vector multiply entirely on the local array x.

Hybrid Model of Execution

Our model of loosely synchronous execution for SMP clusters is depicted in Figure 2. When assigned a task, each SMP node creates an initial thread for housekeeping node execution. The initial thread then launches a number of work threads, each assigned to a processor within that node, that execute in parallel and coordinate with each other via the shared memory (computation phase). When remote accesses to other SMP nodes are required, the work threads may synchronize and enter the communication phase. When the communication is completed, all the threads proceed to the next computation phase, and so on.

Figure 2: Loosely synchronous execution model for an SMP cluster. On each of NODE 0 .. NODE N-1, the threads alternate between a computation phase and a multithreaded communication phase.
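
The loosely synchronous model can be made concrete with a short C sketch using POSIX threads and MPI. It assumes one MPI process per SMP node and a hypothetical compute_local() kernel, and, for simplicity, funnels all MPI calls through the initial thread; the library described in this paper additionally lets several work threads take part in the communication phase, which requires a thread-compliant MPI. This is an illustration of the execution pattern only, not the library's code.

/* Minimal sketch of the loosely synchronous hybrid model (threads + MPI).
 * Assumptions: one MPI process per SMP node, a hypothetical compute_local()
 * kernel, and only the initial thread makes MPI calls (funneled). */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_WORK_THREADS 2
#define NUM_PHASES 4

/* Barrier shared by the initial thread and all work threads. */
static pthread_barrier_t phase_barrier;

static void compute_local(int thread_id, int phase)
{
    /* Placeholder for the shared-memory computation of one work thread. */
    printf("work thread %d: computation phase %d\n", thread_id, phase);
}

static void *work_thread(void *arg)
{
    int id = (int)(long)arg;
    for (int phase = 0; phase < NUM_PHASES; phase++) {
        compute_local(id, phase);             /* computation phase */
        pthread_barrier_wait(&phase_barrier); /* enter communication phase */
        pthread_barrier_wait(&phase_barrier); /* communication done, go on */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    pthread_t tid[NUM_WORK_THREADS];
    pthread_barrier_init(&phase_barrier, NULL, NUM_WORK_THREADS + 1);
    for (long i = 0; i < NUM_WORK_THREADS; i++)   /* launch work threads */
        pthread_create(&tid[i], NULL, work_thread, (void *)i);

    /* The initial thread handles inter-node communication in every phase. */
    for (int phase = 0; phase < NUM_PHASES; phase++) {
        pthread_barrier_wait(&phase_barrier); /* wait for local computation */
        MPI_Barrier(MPI_COMM_WORLD);          /* stand-in for gather/reduce */
        pthread_barrier_wait(&phase_barrier); /* release the work threads */
    }

    for (int i = 0; i < NUM_WORK_THREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&phase_barrier);
    MPI_Finalize();
    return 0;
}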

Multithreaded Runtime Library

Based on the loosely synchronous execution model, we designed a multithreaded runtime library. The same set of threads that participate in parallel computation within an SMP node during the computation phase also work together to optimize the communication phase. We have designed several optimizations for collective communications:

  - use multiple threads to speed up local data movement;
  - hide message latency by overlapping local memory copying with sending/receiving messages, using different threads for these two tasks;
  - pipeline multiple messages to the network interface by using multiple threads for sending/receiving multiple messages.

Figure 3 shows the organization of the multithreaded runtime library. POSIX threads primitives are used to parallelize local-memory computation and message buffering within an SMP node. MPI primitives are used for message passing between SMP nodes.

Figure 3: Organization of the multithreaded runtime library.
  Multithreaded Runtime Library:
    Inspector: build-mapping-table, index-translation, build-communication-schedule
    Executor: gather, scatter, scatter-accumulate
    Global operations: Broadcast, Reduction, Barrier-sync
  MPI primitives: Send, Receive, Send-and-receive, Broadcast, Reduce, Barrier-sync
  POSIX threads

3 Runtime Primitives

Currently, the library supports a set of global operations (barrier synchronization, broadcast, reduction), inspector operations (index translation, build communication schedule), and collective communication (gather, scatter, scatter-accumulate), which together form the runtime kernel for multithreaded execution of the inspector/executor model. This section describes the algorithms for some of the primitives.

3.0.1 Broadcast and Reduction

A reduction operation on an SMP cluster is carried out in two steps: a shared-memory reduction among multiple threads within an SMP node, followed by a distributed-memory reduction between SMP nodes.

Algorithm reduce(op, source, target, ndata)
  op is a binary associative operator, source is the data array on which the
  reduction is to be performed, and target stores the reduction result.
  {
    Assume k threads will be used for the reduction on an SMP node, and the
    data elements are evenly partitioned among these k threads.

    1. For each thread do
         /* Compute the reduction on the section of the data array that this
            thread is responsible for, and store the result in a local buffer
            for this thread. */
         local-memory-reduce(op, starting_addr_in_source_for_my_thread,
                             local-buffer-for-my-thread, ndata/k);

    2. Combine the partial results from these threads. If k is small, the
       partial results can be combined using a simple locking mechanism.
       Two shared variables are used:
         temp: holds the combined partial result from the k threads
         number-of-threads: initialized to k

       For each thread do
         mutual_exclusion_lock
         accumulate local-buffer-for-my-thread into temp
         decrease number-of-threads by 1
         mutual_exclusion_unlock
         if number-of-threads = 0 (i.e. this is the last thread)
           then call MPI_Reduce(temp, target, ...) to combine the final
                results from all the SMP nodes, and wake up the other threads
           else wait for the other threads to finish
  }
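
As a concrete illustration of this two-step reduction, the following C sketch assumes a sum reduction over doubles and a caller-initialized context (remaining set to k, temp to 0.0, and the mutex/condition variable initialized before the phase). It follows the locking scheme above, except that it calls MPI_Allreduce rather than MPI_Reduce so that every node obtains the result; the names reduce_ctx_t and hybrid_sum_reduce are ours, not the library's API.

/* Sketch of the two-step reduction: a shared-memory sum among the k threads
 * of one SMP node, then an MPI reduction across nodes. */
#include <mpi.h>
#include <pthread.h>

typedef struct {
    const double   *source;     /* data to be reduced on this node */
    double         *target;     /* receives the cluster-wide result */
    long            ndata;      /* number of elements in source */
    int             k;          /* work threads on this node */
    int             remaining;  /* threads that have not yet contributed */
    double          temp;       /* combined partial result on this node */
    pthread_mutex_t lock;
    pthread_cond_t  done;
} reduce_ctx_t;

/* Called concurrently by each of the k threads with my_id in 0..k-1. */
void hybrid_sum_reduce(reduce_ctx_t *ctx, int my_id)
{
    /* Step 1: local-memory reduction over this thread's slice. */
    long chunk = ctx->ndata / ctx->k;
    long begin = my_id * chunk;
    long end   = (my_id == ctx->k - 1) ? ctx->ndata : begin + chunk;
    double partial = 0.0;
    for (long i = begin; i < end; i++)
        partial += ctx->source[i];

    /* Step 2: combine the partial results under a lock; the last thread
     * performs the distributed-memory reduction and wakes the others. */
    pthread_mutex_lock(&ctx->lock);
    ctx->temp += partial;
    ctx->remaining--;
    if (ctx->remaining == 0) {
        MPI_Allreduce(&ctx->temp, ctx->target, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
        pthread_cond_broadcast(&ctx->done);
    } else {
        while (ctx->remaining > 0)
            pthread_cond_wait(&ctx->done, &ctx->lock);
    }
    pthread_mutex_unlock(&ctx->lock);
}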

3.0.2 Gather/Scatter Communication

Data movement in irregular computation, such as gather and scatter, can be formulated as all-to-some personalized communication, in which each SMP node transmits a distinct block of data to a subset of the SMP nodes. The following algorithm performs all-to-some gather communication with a single thread per SMP node. The data structure "sched" describes the communication schedule (including the number of data elements and the local indices of the data elements to send/receive) on this node. A node sends sched.n_data_to_send[j] data elements of the source array, indexed by sched.n_send_list[i], to node j, and stores the received data into the proper locations in the target array.

Algorithm gather(source, dest, sched)
  void *source, *dest;
  SCHEDULE *sched;
  {
    /* send data */
    for (i = 0; i < num_nodes; i++) {
      if (sched.n_data_to_send[i] > 0) {
        copy the requested data elements from the source array to the i-th
        section of the buffer, and send the buffer data to node i
      }
    }
    /* receive data */
    for (i = 0; i < sched.n_msgs_to_recv; i++) {
      poll for an incoming message,
      copy the received data to the proper locations in the target array
    }
  }

Multithreaded Gather. Suppose there are k CPUs on each SMP node and k is smaller than the number of SMP nodes (num_nodes). The gather communication can then be further parallelized by using k threads: each thread executes num_nodes/k send-loop iterations, i.e. each thread handles the communication with num_nodes/k SMP nodes. Multithreading also parallelizes the local memory copies into message buffers. If source and destination are different arrays, each thread accesses different memory locations and thus no memory locking is required. Depending on the characteristics of the target machine, the number of threads involved in the gather communication may be tuned to achieve optimal performance.
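
The multithreaded gather can be sketched in C as follows. Only the send side is shown, with a block-cyclic assignment of destination nodes to threads; the schedule fields (n_data_to_send, send_list), the double element type, and the message tag are illustrative simplifications of the SCHEDULE structure above, and a thread-compliant MPI (MPI_THREAD_MULTIPLE) is assumed so that several threads may call MPI_Send concurrently.

/* Sketch of the send side of the multithreaded all-to-some gather: each of
 * k threads packs and sends the messages destined for num_nodes/k nodes. */
#include <mpi.h>
#include <stdlib.h>

typedef struct {
    int   num_nodes;
    int  *n_data_to_send;  /* n_data_to_send[j]: elements owed to node j */
    int **send_list;       /* send_list[j]: local indices of those elements */
} schedule_t;

typedef struct {
    const double     *source;  /* local source array */
    const schedule_t *sched;   /* communication schedule for this node */
    int               my_id;   /* thread index, 0..k-1 */
    int               k;       /* threads cooperating in this gather */
} gather_task_t;

void *gather_send_thread(void *arg)
{
    gather_task_t *t = (gather_task_t *)arg;
    const schedule_t *s = t->sched;

    /* Thread my_id handles destination nodes my_id, my_id + k, my_id + 2k, ...
     * so the threads pack into disjoint buffers and need no locking. */
    for (int node = t->my_id; node < s->num_nodes; node += t->k) {
        int n = s->n_data_to_send[node];
        if (n <= 0)
            continue;
        double *buf = malloc((size_t)n * sizeof *buf);
        for (int i = 0; i < n; i++)          /* local memory copy into buffer */
            buf[i] = t->source[s->send_list[node][i]];
        MPI_Send(buf, n, MPI_DOUBLE, node, 0, MPI_COMM_WORLD);
        free(buf);
    }
    return NULL;
}

The receive side, which polls for incoming messages and copies them into the target array, follows the single-threaded algorithm above; because each thread packs into its own buffers and the source and destination arrays are distinct, no memory locking is required, matching the observation in the text.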

4 Experimental Results

The experiments were conducted on a network of four Sun UltraSPARC-II workstations located in the Institute of Information Science, Academia Sinica. The workstations are connected by a fast Ethernet network capable of 100 Mbps per node. Each workstation is an SMP with two 296 MHz CPUs, running the multithreaded OS SunOS 5.5.1. The communication library in the framework is implemented on top of the MPICH implementation of MPI.

We compared two implementation approaches, one using the hybrid thread+MPI methodology and the other using MPI only. For the hybrid version, we created four SPMD processes with the command "mpirun -np 4", which maps each of the four processes to a distinct workstation. Each process launches two threads, which are then automatically assigned to different CPUs by the operating system. Threads within a process communicate through shared variables and memory locks, while processes communicate with each other by passing messages. For the MPI-only version, eight processes, each running a single thread, were created with the command "mpirun -np 8", which assigns each process to a distinct CPU. All processes communicate with each other by passing messages.

Figure 4(a) shows the effectiveness of using multiple threads in global reduction. Using two threads per process improves performance by about 70%. The result also shows that the overhead of shared-memory synchronization in the hybrid approach, for summing up partial reduction results from multiple CPUs within an SMP node, is negligible compared with the MPI-only implementation.

Figure 4(b) reports the execution time of all-to-some gather communication. Using two threads per process achieves a speedup factor of 1.5, due to the effect of multithreading in overlapping local memory copying with interprocess communication. The advantage of the hybrid approach over the MPI-only implementation is even more significant, because with the hybrid approach each process needs to send/receive fewer messages than in the MPI-only implementation. We expect larger performance improvements on larger numbers of SMP nodes, where the advantage of reducing the number of messages will become more evident.

Figure 4(c) shows the execution time of the sparse matrix-vector multiply obtained by partitioning the sparse matrix into blocks of rows. Using two threads per process shortens the communication time for both the inspector phase and the executor phase. The computation time is competitive with that of the MPI-only implementation, resulting in an overall speedup factor of 1.5.

Figure 4: Execution time (in seconds) of some collective communications on a cluster of four dual-CPU workstations, comparing three versions: hybrid thread+MPI with one thread per process, hybrid thread+MPI with two threads per process, and MPI-only (8 processes, no multithreading). (a) global reduction (execution time vs. data size); (b) all-to-some communication and (c) sparse matrix-vector multiply (execution time vs. data size n, matrix size 2^n x 2^n).

5 Related Work

A number of thread packages support interprocessor communication mechanisms. Nexus [5] provides a thread abstraction that is capable of interprocessor communication in the form of active messages (asynchronous remote procedure calls); Nexus does not support point-to-point or collective communication. Panda [2] has similar goals and approaches to Nexus; it also supports distributed threads. Chant [6] is a thread-based communication library which supports both point-to-point primitives and remote procedure calls. ROPES [7] extends Chant by introducing the notion of "context", a scoping mechanism for threads, and by providing group communication between threads in a specific context; some simple kinds of collective operations (e.g. barrier synchronization, broadcast) are supported. The MPC++ multithread template library on MPI [8] is a set of templates and classes supporting multithreaded parallel programming; it includes support for multithreaded local/remote function invocation, reduction, and barrier synchronization.

On the part of the Message Passing Interface, Skjellum et al. [4] address the issue of making MPI thread-safe with respect to internal worker threads designed to improve the efficiency of MPI. Later, the same group proposed many possible extensions to the MPI standard that allow user-level threads to be mixed with messages [13, 12].

Recently, there have been some new results from industry. For example, a new version of HP MPI (released in late June this year) [10] incorporates support for multithreaded MPI processes in the form of a thread-compliant library. This library allows users to freely mix POSIX threads and messages; however, it does not perform optimizations for collective communication as we do.

6 Conclusion

In this paper, we present a hybrid threads and message-passing methodology for solving irregular problems on SMP clusters. We have also described the implementation of a set of collective communications based on this methodology.

Our preliminary experimental results on a network of Sun UltraSPARC-II workstations show that the hybrid methodology improves the performance of global reduction by about 70% and the performance of all-to-some gather by about 80%, compared with MPI-only, non-threaded implementations.

As threads are commonly used to support parallelism and asynchronous control in applications and language implementations, we believe that the hybrid threads/message-passing methodology has the advantages of both flexibility and efficiency.

Support for this work is provided by the National Science Council of Taiwan under grant 88-2213-E-001-004.

References

[1] H. Berryman, J. Saltz, and J. Scroggs. Execution time support for adaptive scientific algorithms on distributed memory architectures. Concurrency: Practice and Experience, 3(3):159-178, 1991.
[2] R. Bhoedjang, T. Ruhl, R. Hofman, K. Langendoen, H. Bal, and F. Kaashoek. Panda: A portable platform to support parallel programming languages. In Symposium on Experiences with Distributed and Multiprocessor Systems, San Diego, CA, 1993.
[3] C.-L. Chiang, J.-J. Wu, and N.-W. Lin. Toward supporting data-parallel programming for clusters of symmetric multiprocessors. In International Conference on Parallel and Distributed Systems, Tainan, Taiwan, 1998.
[4] A. K. Chowdappa, A. Skjellum, and N. E. Doss. Thread-safe message passing with P4 and MPI. Technical Report TR-CS-941025, Computer Science Department and NSF Engineering Research Center, Mississippi State University, April 1994.
[5] I. Foster, C. Kesselman, R. Olson, and S. Tuecke. Nexus: An interoperability layer for parallel and distributed computer systems. Technical report, Argonne National Laboratory, Dec. 1993.
[6] M. Haines, D. Cronk, and P. Mehrotra. On the design of Chant: A talking threads package. In Proceedings of Supercomputing '94, 1994.
[7] M. Haines, P. Mehrotra, and D. Cronk. ROPES: Support for collective operations among distributed threads. Technical report, ICASE, NASA Langley Research Center, May 1995.
[8] Y. Ishikawa. Multithread template library in C++. http://www.rwcp.or.jp/lab/pdslab/dist.
[9] W. E. Johnston. The case for using commercial symmetric multiprocessors as supercomputers. Technical report, Information and Computing Sciences Division, Lawrence Berkeley National Laboratory, 1997.
[10] Hewlett-Packard. HP message passing interface. http://www.hp.com/go/mpi.
[11] R. Ponnusamy. A manual for the CHAOS runtime library. Technical Report TR 93-105, Computer Science Dept., Univ. of Maryland, Dec. 1993.
[12] A. Skjellum, N. E. Doss, K. Viswanathan, A. Chowdappa, and P. V. Bangalore. Extending the Message Passing Interface (MPI). Technical report, Computer Science Department and NSF Engineering Research Center, Mississippi State University, 1997.
[13] A. Skjellum, B. Protopopov, and S. Hebert. A thread taxonomy for MPI. Technical report, Computer Science Department and NSF Engineering Research Center, Mississippi State University, 1996.