The Effect of Network Total Order, Broadcast, and Remote-Write Capability on Network-Based Shared Memory Computing

Robert Stets, Sandhya Dwarkadas, Leonidas Kontothanassis†, Michael L. Scott
Department of Computer Science, University of Rochester, Rochester, NY 14627-0226
†Compaq Cambridge Research Lab, One Kendall Sq., Bldg. 700, Cambridge, MA 02139

DRAFT COPY. Please do not redistribute.

Abstract

Emerging system-area networks provide a variety of features that can dramatically reduce network communication overhead. Such features include reduced latency, protected remote memory access, cheap broadcasting, and ordering guarantees for network packets. Some of these features come at the expense of scalability for the network fabric or at a significant implementation cost. In this paper we evaluate the impact of these features on the implementation of software distributed shared memory (SDSM) systems, in particular, on the Cashmere protocol. Cashmere has been implemented for the Compaq Memory Channel network, which has support for write access to remote memory, inexpensive broadcast, and total ordering of network packets. Our evaluation framework divides SDSM protocol communication into three areas: shared data propagation, protocol meta-data maintenance, and synchronization; it demonstrates the performance impact of exploiting Memory Channel features in each of these three areas.
We found that eight of eleven well-known benchmark applications perform better on the base version of Cashmere, which maximizes its leverage of the Memory Channel features, than on a comparable version that uses explicit messages (and no broadcast) for all protocol communication. The performance difference is 37% in one application, which dynamically distributes its work, and 11% or less in the other seven applications. The remaining three applications show no performance differences. In general, the differences are due to reduced protocol-induced application perturbation and more efficient meta-data maintenance. Reduced perturbation accounts for the large 37% improvement by decreasing interference with the application's dynamic work distribution mechanism.

In addition, we have investigated home node migration to reduce shared data propagation. Our results show that this optimization recoups the performance lost by abandoning the use of special Memory Channel features. In fact, the optimization is so effective that three of the applications perform 18% to 34% better on a protocol with migration and explicit messages than on our base protocol that fully leverages the Memory Channel. The message-based protocol has the additional advantage of allowing shared memory to grow beyond the amount that can be mapped through the network interface.

This work was supported in part by NSF grants CDA-9401142, CCR-9702466, and CCR-9705594, and an external research grant from Compaq.
1 Introduction

Clusters of workstations connected by commodity networks have long provided aggregate power comparable to special-purpose parallel machines. In practice, however, performance has been limited by the relatively high cost of inter-processor communication. Recent trends, particularly the introductions of commodity-priced symmetric multiprocessors (SMPs) and low-latency (in the microseconds) system area networks (SANs), have improved the potential performance of clusters. In addition to low messaging latency, many SANs provide other overhead-reducing features such as remote memory access, inexpensive broadcast, and total ordering in the network [10, 15, 16]. On SMP clusters connected by SANs, communication overhead can be greatly reduced. Communication within the same node can occur through hardware, while across SMPs, communication overhead can be ameliorated by the high performance network.

Since shared memory is available in hardware within SMP nodes, perhaps the most natural programming paradigm for these clusters is software distributed shared memory (SDSM), since it utilizes the hardware within a node efficiently. Several studies have already determined the positive impact of SMP-based clusters on SDSM performance [12, 14, 21, 22, 25]. Many of these same studies utilized low latency networks. However, the benefits of advanced network features (for example, remote memory access) have not been directly quantified.

In this paper, we examine the impact of advanced networking features on the performance of the state-of-the-art Cashmere-2L [25] protocol. The Cashmere protocol uses the virtual memory subsystem to track data accesses, allows multiple concurrent writers, employs home nodes (i.e., maintains one master copy of each shared data page), and leverages shared memory within SMPs to reduce protocol overhead. In practice, Cashmere-2L has been shown to have very good performance [12, 17, 25].
Cashmere was originally designed for a cluster consisting of AlphaServer SMPs connected by a Compaq Memory Channel network, which offers low messaging latencies, write access to remote memory, inexpensive broadcast, and total ordering. Cashmere therefore attempted to maximize performance by placing shared data directly in remotely accessible memory, using broadcast to replicate the directory among the nodes, and relying on network total order and reliability to avoid acknowledging the receipt of meta-data information.
The purpose of this paper is to evaluate the performance impact of each of these design decisions. We have structured our evaluation to determine not only the overall impact of the special Memory Channel features, but also their impact on protocol communication and related design. In general, an SDSM protocol incurs communication in three areas: the propagation of shared data, the maintenance of internal protocol data structures (called protocol meta-data), and synchronization. To evaluate the impact of network support in these terms, we have constructed six Cashmere variants. Four of the variants are used to isolate the impact of the Memory Channel features on protocol communication in the above areas. The final two variants employ a protocol optimization that allows home nodes to migrate to active writers, thereby reducing remote propagation of shared data. This optimization is only possible when shared data is not in remotely accessible memory, since migration of remotely accessible memory is an expensive operation involving synchronization of all the nodes mapping the data.

Our results show that the Memory Channel features improve performance by an average of 8% across eleven standard benchmarks. The largest improvement is 37% and occurs in a program with dynamic work distribution. This application benefits from reduced protocol-induced overhead imbalances, resulting in more effective work distribution. Importantly, the use of the Memory Channel features never degrades performance in any of the applications. In terms of protocol design, meta-data maintenance benefits the most from the network support. In addition, we found that home node migration can recover most of the benefits lost by not using remote write access to propagate shared data, and, importantly, allows shared data size to scale beyond our network's limited remotely-accessible memory space.¹ Three of the applications actually obtain their best performance (by a factor of 18-34%) on a protocol with migration and explicit messages.

¹Most current commodity remote access networks have a limited remotely-accessible memory space. Methods to eliminate this restriction are a focus of ongoing research [6, 26].

The next section discusses the Memory Channel and its special features, along with the Cashmere protocol. Section 3 evaluates the impact of the Memory Channel features and the home node migration optimization. Section 4 covers related work, and Section 5 outlines our conclusions.
2 Protocol Variants and Implementation

Cashmere was designed for SMP clusters connected by a high performance network, specifically, Compaq's Memory Channel network [15]. Earlier work [12, 14, 21, 22, 25] on Cashmere and other systems has quantified the benefits of SMP nodes to SDSM performance. In this paper, we will examine the performance impact of the network features exploited.

We begin by providing an overview of the Memory Channel network and its programming interface. Following this overview is a description of the Cashmere protocol. In keeping with the focus of this paper, the design discussion will primarily focus on the aspects of network communication. A discussion of the design decisions related to the SMP nodes can be found in an earlier paper [25].

2.1 Memory Channel

The Memory Channel is a reliable, low-latency network. The hardware provides remote-write capability, which allows processors to modify remote memory without remote processor intervention. The Memory Channel uses a memory-mapped, programmed I/O interface. To work with remotely-accessible memory, a processor must attach to regions in the Memory Channel's address space. The regions can be mapped for either transmit or receive. The physical addresses of transmit regions map to I/O space, in particular, to addresses on the Memory Channel's network adapter. I/O space is uncacheable, but writes can be coalesced in the processor's write buffer. Receive regions map directly to physical memory.

After initial connection setup, the network can be accessed directly from user level. Writes to transmit regions are routed to the network adapter (on the PCI bus), which automatically constructs and launches a data message. Upon message reception, a node's adapter performs a DMA access to main memory if the region is mapped for receive. Otherwise, the message is dropped.

Normally, a write to a transmit region is not reflected to a corresponding receive region on the source node. By placing a region in loopback mode, however, we can arrange for the source adapter to include itself in the outgoing message's destination. The message will move through the hub and arrive back at the source, where it will be processed as a normal incoming message. The adapter will place the data in the appropriate receive region.
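The transmit/receive/loopback behavior just described can be illustrated with a small toy model. This sketch is our own simplification for exposition; the class and method names are invented and are not part of the Memory Channel API:

```python
# Toy model of Memory Channel region semantics: a write to a transmit
# region becomes a message; every node that mapped the region for
# receive gets the data via DMA; the sender observes its own write only
# if the region is in loopback mode. All names here are illustrative.

class ToyMemoryChannel:
    def __init__(self):
        self.receivers = {}    # region id -> {node id: local page (dict)}
        self.loopback = set()  # region ids placed in loopback mode

    def map_receive(self, region, node):
        # Attach a node's physical memory to a region, for receive.
        self.receivers.setdefault(region, {})[node] = {}

    def write_transmit(self, region, src_node, offset, value):
        # The adapter constructs and launches a message; each receiving
        # adapter deposits it into the mapped memory.
        for node, page in self.receivers.get(region, {}).items():
            if node == src_node and region not in self.loopback:
                continue       # source normally does not see its own write
            page[offset] = value

mc = ToyMemoryChannel()
mc.map_receive("dir", 0)
mc.map_receive("dir", 1)
mc.loopback.add("dir")
mc.write_transmit("dir", src_node=0, offset=16, value=7)
```

In this model, as on the real network, node 1 observes node 0's write because it mapped the region for receive, while node 0 observes its own write only because the region is in loopback mode.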
The Memory Channel guarantees total order: all writes to the network are observed in the same order by all receivers. This guarantee is provided by a serializing hub that connects all the machines in the cluster. The hub is bus-based, which ensures serialization and also accounts for the network's inexpensive broadcast support. (The second generation of the Memory Channel has a crossbar hub. Broadcast support will still be available, however at a higher cost.)

2.2 Protocol Overview

Cashmere uses the virtual memory (VM) subsystem to track accesses to shared data, and so, naturally, the unit of coherence is a virtual memory page (8K on our system). To use Cashmere, an application must be data-race-free [1]. Simply stated, one process must synchronize with another in order to see its modifications. Also, all synchronization primitives must be visible to the system. These primitives can be constructed from basic acquire and release operations. The former is used to enter a critical section; the latter is used to exit.

Cashmere implements a variant of delayed consistency [9]. In this variant, data modifications become visible at a processor at the time of its next acquire operation. This model lies in between the consistency models implemented by Munin [5] and TreadMarks [2]. In the former, modifications become visible at the time of the modifier's release operation. In the latter, modifications become visible at the time of the next causally related acquire.

In Cashmere, each page of shared memory has a single, distinguished home node. Home nodes are initially assigned using a first-touch policy. The home node collects all modifications into a master copy of the page. Sharing set information and home node locations are maintained in a directory containing one entry per page.

The main protocol entry points are page faults and synchronization operations. On a page fault, the protocol updates the sharing set information in the directory and obtains an up-to-date copy of the page from the home node. If the fault is due to a write access, the protocol will also create a pristine copy (called a twin) of the page and add the page to the dirty list. As an optimization in the write fault handler, a page that is shared by only one node is moved into exclusive mode. In this case, the twin and dirty list operations are skipped, and the page will incur no protocol overhead until another sharer emerges.
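The write-fault path just described (directory update, twin creation, dirty-list insertion, and the exclusive-mode shortcut) can be sketched as follows. The sketch is illustrative only: all names are our own, and the real handler also fetches an up-to-date copy of the page from the home node.

```python
# Sketch of the Cashmere write-fault path described above (not the
# actual implementation; names are our own).

PAGE_SIZE = 8192               # the coherence unit: one 8K VM page

class Page:
    def __init__(self, home):
        self.home = home       # home node holding the master copy
        self.data = bytearray(PAGE_SIZE)  # local working copy
        self.twin = None       # pristine copy, created on a write fault
        self.sharers = set()   # directory sharing set for this page
        self.exclusive = False

dirty_list = []                # pages with outstanding modifications

def on_write_fault(page, node_id):
    page.sharers.add(node_id)  # directory update: join the sharing set
    if page.sharers == {node_id}:
        # Only one sharing node: enter exclusive mode and skip the twin
        # and dirty-list work until another sharer emerges.
        page.exclusive = True
        return
    # Otherwise keep a pristine copy (the "twin") so that release-time
    # comparison can uncover exactly the modifications, and remember
    # the page for the next release operation.
    page.twin = bytes(page.data)
    dirty_list.append(page)
```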
Protocol Name   Data      Meta-data   Synchronization   Home Migration
CSM-DMS         MC        MC          MC                No
CSM-MS          Explicit  MC          MC                No
CSM-S           Explicit  Explicit    MC                No
CSM-None        Explicit  Explicit    Explicit          No
CSM-MS-Mg       Explicit  MC          MC                Yes
CSM-None-Mg     Explicit  Explicit    Explicit          Yes

Table 1: These protocol variants have been chosen to isolate the performance impact of special network features on the areas of SDSM communication. Use of special Memory Channel features is denoted by an "MC" under the area of communication. Otherwise, explicit messages are used. The use of Memory Channel features is also denoted in the protocol suffix (D, M, and/or S), as is the use of home node migration (Mg).

At the next release operation, the protocol examines each page in the dirty list and compares the page to its twin in order to uncover the modifications. These modifications are placed in a diff message and sent to the home node to be incorporated into the master copy of the page. Upon completion of the diff message, the protocol downgrades permissions on the dirty pages and sends write notices to all nodes in the sharing set. These write notices are accumulated into a list at the destination and processed at the node's next acquire operation. All pages named by write notices are invalidated as part of the acquire.

2.3 Protocol Communication

As described earlier, protocol communication can be broken down into three areas: shared data propagation, protocol meta-data maintenance, and synchronization. In order to isolate the effects of the special Memory Channel features on these three areas, we have prepared six variants of the Cashmere protocol. Table 1 lists the variants and characterizes their use of the Memory Channel. For each of the areas of protocol communication, the protocols either leverage the full Memory Channel capabilities (i.e., remote write access, total ordering, and inexpensive broadcast) or instead send explicit messages between processors. We assume a reliable network (as is common in current SANs). If we wish to establish ordering, however, explicit messages require an acknowledgement.
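The release-time twin comparison described above can be sketched in a few lines. This is a simplified, byte-granularity illustration with hypothetical helper names, not the Cashmere implementation:

```python
# Byte-granularity sketch of release-time diffing: compare a dirty page
# to its twin, collect the modified runs, and apply them to the master
# copy at the home node. Helper names are hypothetical.

def make_diff(twin, current):
    """Return the modified runs of `current` relative to `twin`
    as a list of (offset, bytes) pairs -- the diff sent to the home."""
    diff = []
    i, n = 0, len(current)
    while i < n:
        if current[i] != twin[i]:
            j = i
            while j < n and current[j] != twin[j]:
                j += 1         # extend the run of modified bytes
            diff.append((i, bytes(current[i:j])))
            i = j
        else:
            i += 1
    return diff

def apply_diff(master, diff):
    # At the home node: fold the modifications into the master copy.
    for offset, data in diff:
        master[offset:offset + len(data)] = data
```

A real implementation would compare at a coarser granularity and ship the runs in a single message; byte runs keep the sketch short.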
2.3.1 CSM-DMS: Data, Meta-data, and Synchronization using Memory Channel

The base protocol, denoted CSM-DMS, is the same Cashmere-2L protocol described in our study on the effects of SMP clusters [25]. This protocol fully exploits the Memory Channel for all SDSM communication: to propagate shared data, to maintain protocol meta-data, and for synchronization. The following paragraphs describe how the features are leveraged.

Data: Shared data is fetched from the home node, and modifications are written back, in the form of diffs, to the home node.² The fetch operation could be optimized by a remote read operation or by allowing the home node to write the data directly to the working address on the requesting node. Unfortunately, the first optimization is not available on the Memory Channel. The second optimization requires shared data to be mapped at distinct Memory Channel addresses on each node. With only 128M of Memory Channel address space, this significantly limits the maximum dataset size. (For eight nodes, the maximum dataset would be only about 16M.) For this reason, CSM-DMS does not use the second optimization either.

Instead of using Memory Channel address space for all shared data copies, CSM-DMS uses it for home node copies only. This still limits dataset size, but the limit is much higher. With home node copies in Memory Channel space, a processor can use remote writes to apply diffs at release time. This usage avoids the need to interrupt a home node processor.

To avoid race conditions, Cashmere must be sure all diffs are completed before exiting a critical section. Rather than requiring home nodes to return diff acknowledgements, CSM-DMS instead relies on the Memory Channel's total ordering. CSM-DMS performs all diff operations and then completes the release operation by resetting the corresponding synchronization location in Memory Channel space. Since the network is totally ordered, the diff is guaranteed to be completed by the time other processors observe the completion of the release operation.

²An earlier Cashmere study [17] investigated using write-through to propagate data modifications. Diffs were found to use bandwidth more efficiently than write-through, and to provide better performance.

Meta-data: System-wide meta-data in CSM-DMS consists of the page directory and write notices. CSM-DMS replicates the page directory on each node and then uses a remote write to broadcast all changes. Cashmere also uses remote writes to deliver write notices to a well-known location on each node. At an acquire, the node simply reads the write notices from that location. As with diffs, Cashmere takes advantage of the guaranteed network ordering to avoid write notice acknowledgements.

Synchronization: Application locks, barriers, and flags all leverage the Memory Channel's broadcast and write ordering capabilities. Locks are represented by an 8-entry array in Memory Channel space, and by a test-and-set flag on each node. A process begins a global lock acquire operation by first acquiring the local test-and-set lock. Then the process asserts its node entry in the 8-entry array, waits for the write to appear via loop-back, and then reads the entire array. If any of the other entries are set, the process resets its entry, backs off, and tries again. If no other entries are set, the lock has been acquired. Barriers are represented by an 8-entry array and a "sense" variable in Memory Channel space, and a local counter on each node. A processor atomically reads and increments the local node counter to determine if it is the last processor on the node to enter the barrier. If so, the processor updates the node's entry in the 8-entry array. A single master processor waits for all nodes to arrive and then toggles the sense variable. This releases all the nodes, which are spinning on the sense variable. Flags simply use the Memory Channel's remote write and broadcast.

2.3.2 CSM-MS: Meta-data and Synchronization using Memory Channel

As mentioned above, CSM-DMS places home node page copies in the Memory Channel address space, which limits the maximum dataset size. CSM-MS does not place shared data in Memory Channel space and so avoids network-induced limitations on dataset size. The tradeoff is that CSM-MS cannot leverage the Memory Channel to optimize diff communication. Instead, diffs are sent as explicit messages, which must interrupt the home node and which require explicit acknowledgements. (Interrupts are achieved via Memory Channel writes to a shared flag, coupled with receive-side polling on loopback edges.) In CSM-MS, meta-data and synchronization still leverage all Memory Channel features.
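The global lock acquire described in the Synchronization paragraph above can be sketched as follows. The sketch models the 8-entry array as ordinary shared state and omits the per-node test-and-set (which selects one competing process per node) and the loop-back wait; all names are our own:

```python
# Sketch of the cluster-wide lock described above: one entry per node
# in an 8-entry array that lives in broadcast Memory Channel space.
# Simplified: the per-node test-and-set and loop-back wait are omitted.

import random
import time

NUM_NODES = 8
entries = [0] * NUM_NODES      # the 8-entry array, replicated by broadcast

def try_global_acquire(node_id):
    entries[node_id] = 1       # assert our entry (a remote write)
    # ...here the real protocol waits for the write to return via
    # loop-back, guaranteeing the entry is globally visible...
    if any(entries[i] for i in range(NUM_NODES) if i != node_id):
        entries[node_id] = 0   # contention: reset our entry and back off
        return False
    return True                # no other entry set: lock acquired

def global_release(node_id):
    entries[node_id] = 0       # a single remote write releases the lock

def global_acquire(node_id, backoff=1e-6):
    while not try_global_acquire(node_id):
        time.sleep(backoff * (1 + random.random()))
        backoff *= 2           # exponential backoff between retries
```

Because every node's entry travels through the totally ordered network, two contending nodes each observe the other's entry set, both back off, and both retry, so the lock is never granted twice.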
2.3.3 CSM-S: Synchronization using Memory Channel

The third protocol variant, CSM-S, only leverages the Memory Channel for synchronization. Explicit messages are used both to propagate shared data and to maintain meta-data. Instead of broadcasting
a directory change, a process must send the change to the home node in an explicit message. The home node updates the entry and acknowledges the request. The home node is the only node guaranteed to have an up-to-date directory entry.

In most cases, a separate directory update (or read) message can be avoided. Instead, the update can be piggybacked onto an existing message. For example, a directory update is implicit in a page fetch request and so can be piggybacked. Also, write notices always follow diff operations, so the home node can simply piggyback the sharing set (needed to identify where to send write notices) onto the diff acknowledgment. In fact, an explicit directory message is needed only when a page is invalidated.³

³The protocol could be designed to lazily downgrade the directory entry in this case. However, the directory entry would often over-estimate the number of sharers and compromise the effectiveness of Cashmere's exclusive-mode optimization.

2.3.4 CSM-None: No Use of Special Memory Channel Features

The fourth protocol, CSM-None, uses explicit messages (and acknowledgments) for all communication. This protocol variant relies only on low-latency messaging, and so could easily be ported to other low-latency network architectures. Rather than inter-processor interrupts, our low-latency messaging relies on the efficient polling mechanism described above. Earlier Cashmere work [17] found that the expensive kernel transition incurred by inter-processor interrupts limited the benefits of the low-latency network. In our implementation, we poll a well-known message arrival flag that is updated through remote-write. This mechanism should be considered independent of our above use of remote write, since efficient polling can be implemented on other network interfaces [10, 26] that lack the ability to write to arbitrary, user-defined locations.

2.3.5 CSM-MS-Mg and CSM-None-Mg: Home Node Migration

All of the above protocol variants use first-touch home node assignment [18]. Home assignment is extremely important because processors on the home node write directly to the master copy and so do not incur costly twin and diff overheads. If a page has multiple writers during the course of execution, protocol overhead can potentially be reduced by migrating the home node to the active writers. Due to the high cost of remapping Memory Channel addresses, migrating home nodes cannot be used when data is remotely accessible. Hence, CSM-MS-Mg and CSM-None-Mg both keep shared data in private memory, and allow the home to migrate during execution. When a processor incurs a write fault, the protocol checks the local copy of the directory to see if the home is actively writing the page. If not, a migration request is sent to the home. The request is granted if received when the home is not writing the page. If granted, the home simply changes the directory entry to point to the new home. CSM-MS-Mg uses Memory Channel features for meta-data maintenance and for synchronization, while CSM-None-Mg uses only explicit messages. The latter protocol can suffer from unnecessary migration requests since the cached directory entries may be out-of-date. We do not present CSM-S-Mg since the results of using the Memory Channel for synchronization are qualitatively the same regardless of whether the home node is fixed or migrating.

3 Results

We begin this section with a brief description of our hardware platform and our application suite. Next, we discuss the results of our investigation of the impact of Memory Channel features and the home node migration optimization.

3.1 Platform and Basic Operation Costs

Our experimental environment consists of four DEC AlphaServer 2100 4/233 computers. Each AlphaServer is equipped with four 21064A processors operating at 233 MHz and with 256 MB of shared memory, as well as a Memory Channel network interface. The 21064A has two on-chip caches: a 16K instruction cache and a 16K data cache. The off-chip secondary cache size is 1 Mbyte. A cache line is 64 bytes. Each AlphaServer runs Digital Unix 4.0D with TruCluster v. 1.5 (Memory Channel) extensions. The systems execute in multi-user mode, but with the exception of normal Unix daemons no other processes were active during the tests. In order to increase cache efficiency, application processes are pinned to a processor at startup. No other processors are connected to the Memory Channel. Execution times represent the median values of three runs.

On our platform, the Memory Channel has a point-to-point bandwidth of approximately 33 MBytes/sec. One-way latency for a 64-bit remote-write operation is 4.3 μsecs. In practice, the round-trip latency for a null message in Cashmere is 39 μsecs. This time includes the transfer of the message header and the invocation of a null handler function.
Operation              Memory Channel Features   Explicit Messages
Diff (μsecs)           290-363                   485-760
Lock Acquire (μsecs)   46
Barrier (μsecs)        103                       158

Table 2: Basic operation costs at 16 processors. Diff cost varies according to the size of the diff.

Program          Problem Size               Time (sec.)
Barnes           128K bodies (26 Mbytes)    469.4
CLU              2048x2048 (33 Mbytes)      137.3
LU               2500x2500 (50 Mbytes)      294.7
EM3D             64000 nodes (52 Mbytes)    254.8
Gauss            2048x2048 (33 Mbytes)      948.1
Ilink            CLP (15 Mbytes)            4036.24
SOR              3072x4096 (50 Mbytes)      194.8
TSP              17 cities (1 Mbyte)        755.9
Volrend          Head (23 Mbytes)           11.6
Water-nsquared   9261 mols. (16 Mbytes)     74.0
Water-spatial    9261 mols. (6 Mbytes)      12.01

Table 3: Dataset sizes and sequential execution time of applications.

As described earlier, Memory Channel features can be used to significantly reduce the cost of diffs, directory updates, write notice propagation, and synchronization. Table 2 shows the costs for diff operations, lock acquires, and barriers, both when leveraging and not leveraging the Memory Channel features. The cost of diff operations varies according to the size of the diff. Directory updates, write notices, and flag synchronization all use the Memory Channel's remote-write and total ordering features. (Directory updates and flag synchronization also rely on the inexpensive broadcast support.) Without these features, these operations are accomplished via explicit messages. Directory updates are small messages with simple handlers, so their cost is only slightly more than the cost of a null message. The cost of write notices will depend greatly on the write notice count and destinations. Write notices sent to different destinations can be overlapped, thus reducing the operation's overall latency. Flags are inherently broadcast operations, but again the flag update messages to the processors can be overlapped, so perceived latency should not be much more than that of a null message.
3.2 Application Suite

Our applications are well-known benchmarks that have not been modified from their original distribution.

Barnes: an N-body simulation from the TreadMarks [2] distribution (and based on the same application in the SPLASH-1 [23] suite), using the hierarchical Barnes-Hut method. Bodies in the simulation space are placed into nodes in a tree structure based on their physical locations, and this tree structure is used to control the computation. Synchronization consists of barriers between phases.

CLU: from the SPLASH-2 [27] benchmark suite. The kernel factors a matrix into the product of a lower-triangular and an upper-triangular matrix. Work is distributed by splitting the matrix into blocks and assigning each block to a processor. Blocks modified by a single processor are allocated contiguously in order to increase spatial locality. Barriers are used for synchronization.

LU: also from SPLASH-2. The implementation is identical to CLU except that blocks are not allocated contiguously, resulting in multiple write sharers per coherence block.

EM3D: a program to simulate electromagnetic wave propagation through 3D objects [8]. The primary computational element is a set of magnetic and electric nodes that are equally distributed among the processors. These nodes are only shared amongst neighboring processors. Phases of the simulation are synchronized through barriers.

Gauss: a locally-developed solver for a system of linear equations AX = B using Gaussian Elimination and back-substitution. Rows are distributed among processors cyclically. Synchronization flags are used to signal when a pivot row becomes available.

Ilink: a widely used genetic linkage analysis program from the FASTLINK 2.3P [11] package that locates disease genes on chromosomes. A master processor performs a round-robin assignment of elements in a sparse array to a pool of slave processors. The slaves perform calculations on the assigned probabilities and report the results to the master. Barriers are used to synchronize between the master and slaves and between iterations in the program.

SOR: a Red-Black Successive Over-Relaxation program from the TreadMarks distribution. The program solves partial differential equations. The red and black arrays are divided into roughly equal size bands of rows, with each band assigned to a different processor. Processors synchronize using barriers.
TSP: a branch-and-bound solution to the traveling salesman problem. The program, also from the TreadMarks distribution, distributes work through a task queue. It is non-deterministic, in that parts of the search space can be pruned, depending on when short paths are found. The task queues are protected by locks.

Volrend: a SPLASH-2 application that renders a three-dimensional volume using a ray casting technique. The image plane is partitioned among processors in contiguous blocks, which are further partitioned into small tiles. These tiles serve as the basic unit of work and are distributed through a set of task queues. Again, the task queues are protected by locks.

Water-nsquared: a fluid flow simulation from the SPLASH-2 benchmark suite. The molecule structures are kept in a shared array that is divided into contiguous chunks and assigned to processors. The bulk of the interprocessor communication occurs during a phase that updates intermolecular forces (from within a radius of n/2 molecules, where n is the number of molecules), using per-molecule locks, resulting in a migratory sharing pattern.

Water-spatial: another SPLASH-2 fluid flow simulation that solves the same problem as Water-nsquared. The simulation space is placed under a uniform 3-D grid of cells, with each cell assigned to a processor. Sharing occurs when molecules move from one cell to another. In comparison with Water-nsquared, this application uses a more efficient, linear algorithm. The application uses barriers and locks to synchronize.

The dataset sizes and uniprocessor execution times for these applications are presented in Table 3. The size of shared memory space is listed in parentheses. Execution times were measured by running each uninstrumented application sequentially without linking it to the protocol library.
3.3 Performance

This subsection begins by discussing the impact of Memory Channel support, in particular, remote-write capabilities, inexpensive broadcast, and total-ordering properties, on the three types of protocol communication: shared data propagation, protocol meta-data maintenance, and synchronization. All protocols described in this subsection use a first-touch home node assignment.⁴ We found that eight of our eleven benchmark applications benefited from the special Memory Channel features. The improvement can be especially large (up to 37% over an explicit messaging protocol) in an application that dynamically distributes work. In this case, the special Memory Channel features serve to reduce protocol-induced overhead, thereby reducing load imbalance and costly work re-distributions.

⁴In the case of multiple sharers per page, the timing differences between protocol variants can lead to first-touch differences. To eliminate these differences and isolate Memory Channel impact, we captured the first-touch assignments from CSM-DMS and used them to explicitly assign home nodes in the other protocols.

Throughout this section, we will refer to Figure 1 and Table 4. Figure 1 shows a breakdown of execution time, normalized to that of the CSM-DMS protocol, for the six protocol variants. Execution time is broken down to show the time spent executing application code (User), executing protocol code (Protocol), waiting on synchronization operations (Synchronization), and sending or receiving messages (Message). Table 4 lists the speedups and statistics on protocol communication for each of the applications running on 16 processors. The statistics include the number of page transfers, invalidations, and diff operations. The table also lists the number of home migrations, along with the number of migration attempts (listed in parentheses).

3.3.1 The Impact of Memory Channel Features

Eight of our eleven applications show measurable performance improvements running on CSM-DMS (fully leveraging Memory Channel features) as opposed to CSM-None (using explicit messages). Volrend runs 37% faster on CSM-DMS than it does on CSM-None. Barnes, EM3D, LU, and Water-nsquared run 7-11% faster. Gauss and SOR run less than 4% faster. Three applications, CLU, Ilink, and TSP, are not sensitive to the use of Memory Channel features and do not show any significant performance differences across our protocols.

Volrend's performance is very sensitive to its workload distribution. Perturbation introduced by the protocol induces load imbalance and triggers expensive task stealing. As can be seen from Table 4, the number of page transfers and diffs increases as shared data propagation and protocol meta-data maintenance no longer leverage Memory Channel features. Despite performing all protocol communication with explicit messages, CSM-None performs better than CSM-S. On CSM-None, the application has
better load balance and incurs less task stealing. CSM-None performs fewer diff operations, and instrumentation shows it also performs fewer accesses to the task queue lock. Regardless, Volrend performs
poorly overall on all protocol versions: the best achieved speedup is only two on 16 processors.

Barnes exhibits a high degree of sharing and incurs a large amount of protocol and synchronization overhead. Performance slowly degrades across the protocols. In this application, remote writes and total ordering have the biggest impact on the cost of meta-data maintenance. These features permit an approximate 5% reduction in the large number of invalidations. Without the use of Memory Channel features, the invalidations require explicit messages to update the master directory entry. These messages result in higher protocol overhead and poorer synchronization characteristics (see Figure 1).

Water-nsquared has a large number of synchronization operations due to its use of per-molecule locks. However, across protocols, the application does not show a large synchronization time relative to some of the other applications. The large number of locks reduces per-lock contention, and largely limits the synchronization overhead to the synchronization mechanism and associated protocol overhead. This application benefits most from using the Memory Channel features to optimize lock synchronization. Figure 1 shows that the synchronization cost is highest in CSM-None. The application is written to encourage lock handoffs between neighboring processors, which incur negligible protocol overhead if within the same SMP node. However, as protocol-induced overhead increases across the protocols, more lock handoffs occur between processors on different nodes. These inter-node handoffs lead to more diffs and higher protocol and synchronization times. As in Volrend, CSM-DMS incurs the least amount of perturbation (imbalance) due to the protocol, which helps keep lock accesses inside nodes, thereby avoiding diff operations.

At the given matrix size, LU incurs a large amount of protocol communication due to the write-write sharing at row boundaries. Remote writes and total ordering are very effective at reducing the overhead of the diff and invalidation operations. The protocols using explicit messages show higher protocol and synchronization overhead, due to more expensive diffs and invalidations.
EM3D, Gauss, SOR, and Water-spatial all benefit from protocols that leverage the special Memory Channel support. In these applications, our instrumentation shows that most diffs are handled by an idle processor. For these applications, meta-data maintenance is again the area that benefits most from special Memory Channel support.

Of the remaining applications, CLU, Ilink, and TSP are not noticeably affected by the underlying
Memory Channel support. CLU and TSP have little communication that can be optimized. Ilink, however, performs a large number of diffs, and might be expected to benefit significantly from remote-write support. However, 90% of the diffs are applied at the home node by idle processors, so the extra overhead is somewhat hidden from application computation.

3.3.2 Home Node Migration: Optimization for a Scalable Data Space

Home node migration can reduce the number of remote memory accesses by moving the home node to active writers. Our results show that this optimization is very effective. Six of our eleven applications perform better using home node migration and explicit data propagation (CSM-MS-Mg) than using first-touch and remote-write data propagation (CSM-DMS).⁵ Home node migration can reduce protocol overhead by reducing the number of twins/diffs and invalidations. In fact, this reduction can be so great that three of our applications obtain the best overall performance when using migration and explicit messages for all protocol communication.

Volrend, Water-spatial, and LU all benefit greatly from migration because the number of diff (and attendant twin) operations is significantly reduced (see Table 4). In fact, for these applications, CSM-None-Mg, which does not leverage the special Memory Channel features at all, outperforms the full Memory Channel protocol CSM-DMS by a range of 18% to 34%. Figure 1 shows that the protocol component of execution time is significantly decreased for these applications. In Volrend, this decrease is especially important since the reduced protocol overhead leads to better load balance and less task stealing.

On EM3D and Water-nsquared, the migration protocols CSM-MS-Mg and CSM-None-Mg perform better than their first-touch counterparts that use explicit messages for at least some protocol communication (CSM-MS, CSM-S, and CSM-None). The migration optimization again reduces the number of diff operations. However, this gain is offset by the increased overhead of migration requests. The two migration protocols perform basically the same as the full MC protocol, CSM-DMS.
Barnes and Gauss are the only two applications to suffer under the migration optimization.

^5 Migration cannot be used when data is placed in remotely-accessible network address space, because of the high cost of remapping.
[Figure 1: Normalized execution time breakdown for the applications on the protocols at 16 processors, one panel per application (Barnes, CLU, LU, EM3D, Ilink, Gauss, SOR, TSP, Volrend, Water-NSQ, Water-SP); each bar splits execution into Message, Synchronization, Protocol, and User time. The suffix on the protocol name represents the areas of communication using Memory Channel features (D: shared Data propagation, M: protocol Meta-data maintenance, S: Synchronization, None: No use of Memory Channel features). Mg denotes a migrating home node policy.]
Table 4: Application speedups and statistics at 16 processors. The suffix on the protocol name represents the areas of communication using Memory Channel features (D: shared Data propagation, M: protocol Meta-data maintenance, S: Synchronization, None: No use of Memory Channel features). Mg denotes a migrating home node policy.
In Barnes, the degree of sharing is very high and there is a large number of migration requests. The extra overhead of these requests balances the reduction of diff operations in CSM-MS-Mg. CSM-None-Mg loses performance since directory state is no longer kept consistent globally. As a result, CSM-None-Mg sends approximately 580K unsuccessful migration requests. As shown in Table 4, Gauss performs many more invalidations when using migration. These invalidations result in increased protocol and messaging overhead with respect to the first-touch protocols.

CLU, Ilink, and TSP again are relatively insensitive to the underlying Memory Channel support or to the migration mechanism. In Ilink the number of diff operations is significantly reduced, but again the benefits are offset by increased overhead due to migration costs.

4 Related Work

In a technical report, Bilas et al. [4] also examine the impact of special network features on SDSM performance. Their network has both remote-write and remote-read capabilities, but no broadcast or total ordering. Their results show that advanced network features provide large improvements in SDSM performance. However, their base protocol uses inter-processor interrupts to signal message delivery. Interrupts on commodity machines are typically on the order of hundreds of microseconds, and so largely erase the benefits of a low-latency network. Our evaluation here assumes that messages can be detected through a much more efficient polling mechanism, as is found with other SANs [10, 13], and so each of our protocols benefits from the same low messaging latency.
Amza et al. [3] describe adaptive extensions to the TreadMarks [2] protocol that avoid twin/diff operations on shared pages with only a single writer. (Pages with multiple writers still use twins and diffs.) Our home node migration scheme is similar in principle. If a page has only a single writer, the home always migrates to that writer, and so twin/diff operations are avoided. In the presence of multiple concurrent writers, our scheme will always migrate to one of the multiple concurrent writers, thereby avoiding twin/diff overhead at one node. Cashmere is also able to take advantage of the replicated directory when making migration decisions (i.e., to determine if the home is currently writing the page).

There has also been much work in adapting coherence protocol operations to migratory access patterns. In a migratory access pattern, a piece of data is read and written by a succession of processors
in a lockstep manner. This pattern results in the transfer of data from one processor to another, and usually involves two coherence operations (each with multiple messages), one for the read and one for the write. Recent work [24, 7, 19] in both hardware and software coherent systems discusses methods to classify migratory data and then collapse the two coherence messages into one. This technique could be built into our system, and may be very helpful in reducing the overhead due to unnecessary migration requests.

5 Conclusions

In this paper, we have studied the effect of advanced network features, in particular, remote writes, inexpensive broadcast, and total ordering, on SDSM. Our evaluation used the state-of-the-art Cashmere protocol, which was designed with these network features specifically in mind.

We have found that these features never hurt performance and do indeed lead to modest performance improvements (up to 11%) for most applications. The improvements are due to a decrease in communication, and correspondingly protocol, overhead. One application, however, improves dramatically, by 37%. This application uses a dynamic work distribution scheme, which operates more effectively with the reduced protocol overhead. Unfortunately, even after the improvement, the application only obtains an extremely poor speedup of two on 16 processors.

Virtually all of the performance differences we have seen are due to optimized meta-data maintenance. The use of remote writes to propagate data modifications has little impact. In barrier-based programs, this can be expected: instrumentation shows that most diff messages are handled by idle processors. The network features have little effect on the operational cost of synchronization primitives, so optimization in this area has little effect on overall performance.

Finally, we also found that home node migration is a very effective mechanism for reducing the number of twin/diff operations and the resulting protocol overhead. The mechanism is so effective that the benefits outweigh those from using the network features for shared data propagation. Shared data can thus safely be placed in the node's private memory. The pressure on remotely accessible memory is thereby greatly reduced, providing more flexibility and scalability for the system.
References

[1] S. V. Adve and M. D. Hill. A Unified Formulation of Four Shared-Memory Models. IEEE Transactions on Parallel and Distributed Systems, 4(6):613-624, June 1993.

[2] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. Computer, 29(2):18-28, February 1996.

[3] C. Amza, A. Cox, S. Dwarkadas, and W. Zwaenepoel. Software DSM Protocols that Adapt between Single Writer and Multiple Writer. In Proceedings of the Third International Symposium on High Performance Computer Architecture, San Antonio, TX, February 1997.

[4] A. Bilas, C. Liao, and J. P. Singh. Network Interface Support for Shared Virtual Memory on Clusters. Technical Report TR-579-98, Department of Computer Science, Princeton University, March 1998.

[5] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, pages 152-164, Pacific Grove, CA, October 1991.

[6] Y. Chen, A. Bilas, S. N. Damianakis, C. Dubnicki, and K. Li. UTLB: A Mechanism for Address Translation on Network Interfaces. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, October 1998.

[7] A. L. Cox and R. J. Fowler. Adaptive Cache Coherency for Detecting Migratory Shared Data. In Proceedings of the Twentieth International Symposium on Computer Architecture, San Diego, CA, May 1993.

[8] D. Culler, A. Dusseau, S. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proceedings, Supercomputing '93, pages 262-273, Portland, OR, November 1993.

[9] M. Dubois, J. C. Wang, L. A. Barroso, K. L. Lee, and Y.-S. Chen. Delayed Consistency and its Effect on the Miss Rate of Parallel Programs. In Supercomputing '91 Proceedings, pages 197-206, Albuquerque, NM, November 1991.

[10] D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A. M. Merritt, E. Gronke, and C. Dodd. The Virtual Interface Architecture. IEEE Micro, pages 66-76, March 1998.

[11] S. Dwarkadas, A. A. Schaffer, R. W. Cottingham Jr., A. L. Cox, P. Keleher, and W. Zwaenepoel. Parallelization of General Linkage Analysis Problems. Human Heredity, 44:127-141, 1994.

[12] S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory. In Proceedings of the Fifth International Symposium on High Performance Computer Architecture, Orlando, FL, January 1999.

[13] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, Copper Mountain, CO, December 1995.

[14] A. Erlichson, N. Nuckolls, G. Chesson, and J. Hennessy. SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 210-220, Cambridge, MA, October 1996.

[15] R. Gillett. Memory Channel: An Optimized Cluster Interconnect. IEEE Micro, 16(2):12-18, February 1996.

[16] R. W. Horst and D. Garcia. ServerNet SAN I/O Architecture. In Proceedings of Hot Interconnects V Symposium, Palo Alto, CA, August 1997.

[17] L. Kontothanassis, G. Hunt, R. Stets, N. Hardavellas, M. Cierniak, S. Parthasarathy, W. Meira, S. Dwarkadas, and M. L. Scott. VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks. In Proceedings of the Twenty-Fourth International Symposium on Computer Architecture, pages 157-169, Denver, CO, June 1997.

[18] M. Marchetti, L. Kontothanassis, R. Bianchini, and M. L. Scott. Using Simple Page Placement Policies to Reduce the Cost of Cache Fills in Coherent Shared-Memory Systems. In Proceedings of the Ninth International Parallel Processing Symposium, Santa Barbara, CA, April 1995.

[19] L. R. Monnerat and R. Bianchini. Efficiently Adapting to Sharing Patterns in Software DSMs. In Proceedings of the Fourth International Symposium on High Performance Computer Architecture, Las Vegas, NV, February 1998.

[20] R. Samanta, A. Bilas, L. Iftode, and J. Singh. Home-Based SVM Protocols for SMP Clusters: Design and Performance. In Proceedings of the Fourth International Symposium on High Performance Computer Architecture, pages 113-124, February 1998.

[21] D. J. Scales and K. Gharachorloo. Towards Transparent and Efficient Software Distributed Shared Memory. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, St. Malo, France, October 1997.

[22] D. J. Scales, K. Gharachorloo, and A. Aggarwal. Fine-Grain Software Distributed Shared Memory on SMP Clusters. In Proceedings of the Fourth International Symposium on High Performance Computer Architecture, Las Vegas, NV, February 1998.

[23] J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. ACM SIGARCH Computer Architecture News, 20(1):5-44, March 1992.

[24] P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the Twentieth International Symposium on Computer Architecture, San Diego, CA, May 1993.

[25] R. Stets, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott. Cashmere-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, St. Malo, France, October 1997.

[26] M. Welsh, A. Basu, and T. von Eicken. A Comparison of ATM and Fast Ethernet Network Interfaces for User-Level Communication. In Proceedings of the Third International Symposium on High Performance Computer Architecture, San Antonio, TX, February 1997.

[27] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. Methodological Considerations and Characterization of the SPLASH-2 Parallel Application Suite. In Proceedings of the Twenty-Second International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 1995.