The Effect of Network Total Order, Broadcast, and Remote-Write Capability on Network-Based Shared Memory Computing

Robert Stets, Sandhya Dwarkadas, Leonidas Kontothanassis†, Michael L. Scott
Department of Computer Science, University of Rochester, Rochester, NY 14627-0226
†Compaq Cambridge Research Lab, One Kendall Sq., Bldg. 700, Cambridge, MA 02139

DRAFT COPY - Please do not redistribute.

Abstract

Emerging system-area networks provide a variety of features that can dramatically reduce network communication overhead. Such features include reduced latency, protected remote memory access, cheap broadcasting, and ordering guarantees for network packets. Some of these features come at the expense of scalability for the network fabric or at a significant implementation cost. In this paper we evaluate the impact of these features on the implementation of software distributed shared memory (SDSM) systems, in particular, on the Cashmere protocol. Cashmere has been implemented for the Compaq Memory Channel network, which has support for write access to remote memory, inexpensive broadcast, and total ordering of network packets. Our evaluation framework divides SDSM protocol communication into three areas: shared data propagation, protocol meta-data maintenance, and synchronization; it demonstrates the performance impact of exploiting Memory Channel features in each of these three areas.
We found that eight of eleven well-known benchmark applications perform better on the base version of Cashmere, which maximizes its leverage of the Memory Channel features, than on a comparable version that uses explicit messages (and no broadcast) for all protocol communication. The performance difference is 37% in one application, which dynamically distributes its work, and 11% or less in the other seven applications. The remaining three applications show no performance differences. In general, the differences are due to reduced protocol-induced application perturbation and more efficient meta-data maintenance. Reduced perturbation accounts for the large 37% improvement by decreasing interference with the application's dynamic work distribution mechanism.

In addition, we have also investigated home node migration to reduce shared data propagation. Our results show that this optimization recoups the performance lost by abandoning the use of special Memory Channel features. In fact, the optimization is so effective that three of the applications perform 18% to 34% better on a protocol with migration and explicit messages than on our base protocol that fully leverages the Memory Channel. The message-based protocol has the additional advantage of allowing shared memory to grow beyond the amount that can be mapped through the network interface.

This work was supported in part by NSF grants CDA-9401142, CCR-9702466, and CCR-9705594; and an external research grant from Compaq.

1 Introduction

Clusters of workstations connected by commodity networks have long provided aggregate power comparable to special-purpose parallel machines. In practice, however, performance has been limited by the relatively high cost of inter-processor communication. Recent trends, particularly the introductions of commodity-priced symmetric multiprocessors (SMPs) and low-latency (in the microseconds) system area networks (SANs), have improved the potential performance of clusters. In addition to low messaging latency, many SANs provide other overhead-reducing features such as remote memory access, inexpensive broadcast, and total ordering in the network [10, 15, 16]. On SMP clusters connected by SANs, communication overhead can be greatly reduced. Communication within the same node can occur through hardware, while across SMPs, communication overhead can be ameliorated by the high performance network.

Since shared memory is available in hardware within SMP nodes, perhaps the most natural programming paradigm for these clusters is software distributed shared memory (SDSM) since it utilizes the hardware within a node efficiently. Several studies have already determined the positive impact of SMP-based clusters on SDSM performance [12, 14, 21, 22, 25]. Many of these same studies utilized low latency networks. However, the benefits of advanced network features (for example, remote memory access) have not been directly quantified.

In this paper, we examine the impact of advanced networking features on the performance of the state-of-the-art Cashmere-2L [25] protocol. The Cashmere protocol uses the virtual memory subsystem to track data accesses, allows multiple concurrent writers, employs home nodes (i.e. maintains one master copy of each shared data page), and leverages shared memory within SMPs to reduce protocol overhead. In practice, Cashmere-2L has been shown to have very good performance [12, 17, 25].
Cashmere was originally designed for a cluster consisting of AlphaServer SMPs connected by a Compaq Memory Channel network, which offers low messaging latencies, write access to remote memory, inexpensive broadcast, and total ordering. Cashmere therefore attempted to maximize performance by placing shared data directly in remotely accessible memory, using broadcast to replicate the directory among the nodes, and relying on network total order and reliability to avoid acknowledging the receipt of meta-data information.

The purpose of this paper is to evaluate the performance impact of each of these design decisions. We have structured our evaluation to determine not only the overall impact of the special Memory Channel features, but also their impact on protocol communication and related design. In general, an SDSM protocol incurs communication in three areas: the propagation of shared data, the maintenance of internal protocol data structures (called protocol meta-data), and synchronization. To evaluate the impact of network support in these terms, we have constructed six Cashmere variants. Four of the variants are used to isolate the impact of the Memory Channel features on protocol communication in the above areas. The final two variants employ a protocol optimization that allows home nodes to migrate to active writers, thereby reducing remote propagation of shared data. This optimization is only possible when shared data is not in remotely accessible memory, since migration of remotely accessible memory is an expensive operation involving synchronization of all the nodes mapping the data.

Our results show that the Memory Channel features improve performance by an average of 8% across eleven standard benchmarks. The largest improvement is 37% and occurs in a program with dynamic work distribution. This application benefits from reduced protocol-induced overhead imbalances, resulting in more effective work distribution. Importantly, the use of the Memory Channel features never degrades performance in any of the applications. In terms of protocol design, meta-data maintenance benefits the most from the network support. In addition, we found that home node migration can recover most of the benefits lost by not using remote write access to propagate shared data, and importantly, allows shared data size to scale beyond our network's limited remotely-accessible memory space.¹ Three of the applications actually obtain their best performance (by a factor of 18-34%) on a protocol with migration and explicit messages.

The next section discusses the Memory Channel and its special features, along with the Cashmere protocol. Section 3 evaluates the impact of the Memory Channel features and the home node migration optimization. Section 4 covers related work, and Section 5 outlines our conclusions.

¹ Most current commodity remote access networks have a limited remotely-accessible memory space. Methods to eliminate this restriction are a focus of ongoing research [6, 26].

2 Protocol Variants and Implementation

Cashmere was designed for SMP clusters connected by a high performance network, specifically, Compaq's Memory Channel network [15]. Earlier work [12, 14, 21, 22, 25] on Cashmere and other systems has quantified the benefits of SMP nodes to SDSM performance. In this paper, we will examine the performance impact of the network features exploited.

We begin by providing an overview of the Memory Channel network and its programming interface. Following this overview is a description of the Cashmere protocol. In keeping with the focus of this paper, the design discussion will primarily focus on the aspects of network communication. A discussion of the design decisions related to the SMP nodes can be found in an earlier paper [25].

2.1 Memory Channel

The Memory Channel is a reliable, low-latency network. The hardware provides remote-write capability, which allows processors to modify remote memory without remote processor intervention. The Memory Channel uses a memory-mapped, programmed I/O interface. To work with remotely-accessible memory, a processor must attach to regions in the Memory Channel's address space. The regions can be mapped for either transmit or receive. The physical addresses of transmit regions map to I/O space, in particular, to addresses on the Memory Channel's network adapter. I/O space is uncacheable, but writes can be coalesced in the processor's write buffer. Receive regions map directly to physical memory.

After initial connection setup, the network can be accessed directly from user level. Writes to transmit regions are routed to the network adapter (on the PCI bus), which automatically constructs and launches a data message. Upon message reception, a node's adapter performs a DMA access to main memory if the region is mapped for receive. Otherwise, the message is dropped.

Normally, a write to a transmit region is not reflected to a corresponding receive region on the source node. By placing a region in loopback mode, however, we can arrange for the source adapter to include itself in the outgoing message's destination. The message will move through the hub and arrive back at the source, where it will be processed as a normal incoming message. The adapter will place the data in the appropriate receive region.
The Memory Channel guarantees total order: all writes to the network are observed in the same order by all receivers. This guarantee is provided by a serializing hub that connects all the machines in the cluster. The hub is bus-based, which ensures serialization and also accounts for the network's inexpensive broadcast support. (The second generation of the Memory Channel has a crossbar hub. Broadcast support will still be available, however at a higher cost.)

2.2 Protocol Overview

Cashmere uses the virtual memory (VM) subsystem to track accesses to shared data, and so naturally, the unit of coherence is a virtual memory page (8K on our system). To use Cashmere, an application must be data-race-free [1]. Simply stated, one process must synchronize with another in order to see its modifications. Also, all synchronization primitives must be visible to the system. These primitives can be constructed from basic acquire and release operations. The former is used to enter a critical section; the latter is used to exit.

Cashmere implements a variant of delayed consistency [9]. In this variant, data modifications become visible at a processor at the time of its next acquire operation. This model lies in between the consistency models implemented by Munin [5] and TreadMarks [2]. In the former, modifications become visible at the time of the modifier's release operation. In the latter, modifications become visible at the time of the next causally related acquire.

In Cashmere, each page of shared memory has a single, distinguished home node. Home nodes are initially assigned using a first-touch policy. The home node collects all modifications into a master copy of the page. Sharing set information and home node locations are maintained in a directory containing one entry per page.

The main protocol entry points are page faults and synchronization operations. On a page fault, the protocol updates the sharing set information in the directory and obtains an up-to-date copy of the page from the home node. If the fault is due to a write access, the protocol will also create a pristine copy (called a twin) of the page and add the page to the dirty list. As an optimization in the write fault handler, a page that is shared by only one node is moved into exclusive mode. In this case, the twin and dirty list operations are skipped, and the page will incur no protocol overhead until another sharer emerges.
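The write-fault path just described can be condensed into a short sketch. This is our own minimal, single-threaded model for illustration only: names such as Page, Node, and write_fault are ours, a "page" is a short list of words rather than 8K, and revoking exclusive mode when a second sharer later appears is omitted.

```python
PAGE_WORDS = 8   # stand-in for an 8K page

class Page:
    def __init__(self):
        self.data = [0] * PAGE_WORDS   # working copy
        self.twin = None               # pristine copy made on a write fault
        self.exclusive = False

class Node:
    def __init__(self, nid, directory, masters):
        self.id = nid
        self.directory = directory     # per-page entry: sharing set + home
        self.masters = masters         # master copies kept at home nodes
        self.pages = {}
        self.dirty_list = []

    def write_fault(self, pid):
        entry = self.directory[pid]
        entry["sharers"].add(self.id)
        page = self.pages.setdefault(pid, Page())
        page.data = list(self.masters[pid])    # fetch up-to-date copy
        if entry["sharers"] == {self.id}:
            page.exclusive = True     # sole sharer: skip twin and dirty list
        else:
            page.twin = list(page.data)        # pristine twin for diffing
            self.dirty_list.append(pid)
        return page
```

The first writer on a page pays nothing beyond the fetch; only when a second node joins the sharing set does the twin/dirty-list machinery engage.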

Protocol Name   Data      Meta-data   Synchronization   Home Migration
CSM-DMS         MC        MC          MC                No
CSM-MS          Explicit  MC          MC                No
CSM-S           Explicit  Explicit    MC                No
CSM-None        Explicit  Explicit    Explicit          No
CSM-MS-Mg       Explicit  MC          MC                Yes
CSM-None-Mg     Explicit  Explicit    Explicit          Yes

Table 1: These protocol variants have been chosen to isolate the performance impact of special network features on the areas of SDSM communication. Use of special Memory Channel features is denoted by an "MC" under the area of communication; otherwise, explicit messages are used. The use of Memory Channel features is also denoted in the protocol suffix (D, M, and/or S), as is the use of home node migration (Mg).

At the next release operation, the protocol examines each page in the dirty list and compares the page to its twin in order to uncover the modifications. These modifications are placed in a diff message and sent to the home node to be incorporated into the master copy of the page. Upon completion of the diff message, the protocol downgrades permissions on the dirty pages and sends write notices to all nodes in the sharing set. These write notices are accumulated into a list at the destination and processed at the node's next acquire operation. All pages named by write notices are invalidated as part of the acquire.

2.3 Protocol Communication

As described earlier, protocol communication can be broken down into three areas: shared data propagation, protocol meta-data maintenance, and synchronization. In order to isolate the effects of the special Memory Channel features on these three areas, we have prepared six variants of the Cashmere protocol. Table 1 lists the variants and characterizes their use of the Memory Channel. For each of the areas of protocol communication, the protocols either leverage the full Memory Channel capabilities (i.e. remote write access, total ordering, and inexpensive broadcast) or instead send explicit messages between processors. We assume a reliable network (as is common in current SANs). If we wish to establish ordering, however, explicit messages require an acknowledgement.
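The twin comparison performed at a release (Section 2.2) can likewise be sketched. The helper names below are ours, pages are short lists of words, and permission downgrading is omitted:

```python
def make_diff(twin, working):
    # Compare the dirty page word-by-word against its pristine twin.
    return [(i, w) for i, (t, w) in enumerate(zip(twin, working)) if t != w]

def apply_diff(master, diff):
    # At the home node: fold the changed words into the master copy.
    for i, w in diff:
        master[i] = w

def release(dirty_pages, masters, sharers, write_notices):
    # For each page on the dirty list: diff against the twin, send the diff
    # to the home node, then post write notices to the sharing set.
    for pid, (twin, working) in dirty_pages.items():
        diff = make_diff(twin, working)
        apply_diff(masters[pid], diff)
        for node in sharers[pid]:
            write_notices[node].append(pid)   # invalidated at next acquire
    dirty_pages.clear()
```

Only the words that actually changed travel to the home node, which is what lets multiple concurrent writers of one page merge their updates there.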

2.3.1 CSM-DMS: Data, Meta-data, and Synchronization using Memory Channel

The base protocol, denoted CSM-DMS, is the same Cashmere-2L protocol described in our study on the effects of SMP clusters [25]. As described in the subsequent paragraphs, this protocol fully exploits the Memory Channel for all SDSM communication: to propagate shared data, to maintain protocol meta-data, and for synchronization. The following text describes how the features are leveraged.

Data: Shared data is fetched from the home node and modifications are written back, in the form of diffs, to the home node.² The fetch operation could be optimized by a remote read operation or by allowing the home node to write the data directly to the working address on the requesting node. Unfortunately, the first optimization is not available on the Memory Channel. The second optimization requires shared data to be mapped at distinct Memory Channel addresses on each node. With only 128M of Memory Channel address space, this significantly limits the maximum dataset size. (For eight nodes, the maximum dataset would be only about 16M.) For this reason, CSM-DMS does not use the second optimization either.

Instead of using Memory Channel address space for all shared data copies, CSM-DMS uses it for home node copies only. This still limits dataset size, but the limit is much higher. With home node copies in Memory Channel space, a processor can use remote writes to apply diffs at release time. This usage avoids the need to interrupt a home node processor.

To avoid race conditions, Cashmere must be sure all diffs are completed before exiting a critical section. Rather than requiring home nodes to return diff acknowledgements, CSM-DMS instead relies on the Memory Channel's total ordering. CSM-DMS performs all diff operations and then completes the release operation by resetting the corresponding synchronization location in Memory Channel space. Since the network is totally ordered, the diff is guaranteed to be completed by the time other processors observe the completion of the release operation.

² An earlier Cashmere study [17] investigated using write-through to propagate data modifications. Diffs were found to use bandwidth more efficiently than write-through, and to provide better performance.
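The ordering argument above can be modeled with a toy totally-ordered network (our own construction, not Memory Channel code). Every write enters one global FIFO that is delivered to all nodes in order, so a node that observes the reset of the lock location must already have received the diff writes:

```python
from collections import deque

class OrderedNetwork:
    # All remote writes enter a single FIFO and are broadcast to every node
    # in that order, modeling the network's total-order guarantee.
    def __init__(self, nodes):
        self.fifo = deque()
        self.mem = [dict() for _ in range(nodes)]  # per-node receive regions

    def remote_write(self, addr, value):
        self.fifo.append((addr, value))

    def deliver_one(self):
        addr, value = self.fifo.popleft()
        for m in self.mem:
            m[addr] = value

def ordered_release(net, diff_writes, lock_addr):
    # Perform all diff writes, then reset the synchronization location.
    # No per-diff acknowledgements are needed: ordering does the work.
    for addr, value in diff_writes:
        net.remote_write(addr, value)
    net.remote_write(lock_addr, 0)
```

Delivering the FIFO one write at a time, any node that sees lock_addr become 0 is guaranteed to have seen every earlier diff write as well.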

Meta-data: System-wide meta-data in CSM-DMS consists of the page directory and write notices. CSM-DMS replicates the page directory on each node and then uses a remote write to broadcast all changes. Cashmere also uses remote writes to deliver write notices to a well-known location on each node. At an acquire, the node simply reads the write notices from that location. As with diffs, Cashmere takes advantage of the guaranteed network ordering to avoid write notice acknowledgements.

Synchronization: Application locks, barriers, and flags all leverage the Memory Channel's broadcast and write ordering capabilities. Locks are represented by an 8-entry array in Memory Channel space, and by a test-and-set flag on each node. A process begins a global lock acquire operation by first acquiring the local test-and-set lock. Then the process asserts its node entry in the 8-entry array, waits for the write to appear via loop-back, and then reads the entire array. If any of the other entries are set, the process resets its entry, backs off, and tries again. If no other entries are set, the lock has been acquired. Barriers are represented by an 8-entry array, a "sense" variable in Memory Channel space, and a local counter on each node. A processor atomically reads and increments the local node counter to determine if it is the last processor on the node to enter the barrier. If so, the processor updates the node's entry in the 8-entry array. A single master processor waits for all nodes to arrive and then toggles the sense variable. This releases all the nodes, which are spinning on the sense variable. Flags simply use the Memory Channel's remote write and broadcast.

2.3.2 CSM-MS: Meta-data and Synchronization using Memory Channel

As mentioned above, CSM-DMS places home node page copies in the Memory Channel address space, which limits the maximum dataset size. CSM-MS does not place shared data in Memory Channel space and so avoids network-induced limitations on dataset size. The tradeoff is that CSM-MS cannot leverage the Memory Channel to optimize diff communication. Instead, diffs are sent as explicit messages, which must interrupt the home node and which require explicit acknowledgements. (Interrupts are achieved via Memory Channel writes to a shared flag, coupled with receive-side polling on loopback edges.) In CSM-MS, meta-data and synchronization still leverage all Memory Channel features.
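The global lock acquire described under Synchronization above (one array slot per node, local test-and-set first, then a scan of the whole array) can be sketched as follows. This is our own sequential model of the algorithm's control flow; real concurrency, the loop-back wait, and backoff timing are reduced to comments:

```python
NODES = 8

class ClusterLock:
    # One slot per node in "Memory Channel space", plus a per-node local
    # test-and-set flag that serializes processors within an SMP node.
    def __init__(self):
        self.entries = [0] * NODES
        self.local_tas = [False] * NODES

    def try_acquire(self, node):
        if self.local_tas[node]:
            return False                # lost the local test-and-set
        self.local_tas[node] = True
        self.entries[node] = 1          # remote write; the real protocol
                                        # waits for it to loop back first
        if any(self.entries[n] for n in range(NODES) if n != node):
            self.entries[node] = 0      # contention: reset, back off, retry
            self.local_tas[node] = False
            return False
        return True                     # no other slot set: lock is held

    def release(self, node):
        self.entries[node] = 0
        self.local_tas[node] = False
```

Because the slot write is broadcast and totally ordered, two nodes that assert their slots concurrently will each see the other's entry when they scan the array, so at most one winner emerges per round.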
2.3.3 CSM-S: Synchronization using Memory Channel

The third protocol variant, CSM-S, only leverages the Memory Channel for synchronization. Explicit messages are used both to propagate shared data and to maintain meta-data.

Instead of broadcasting a directory change, a process must send the change to the home node in an explicit message. The home node updates the entry and acknowledges the request. The home node is the only node guaranteed to have an up-to-date directory entry.

In most cases, a separate directory update (or read) message can be avoided. Instead, the update can be piggybacked onto an existing message. For example, a directory update is implicit in a page fetch request and so can be piggybacked. Also, write notices always follow diff operations, so the home node can simply piggyback the sharing set (needed to identify where to send write notices) onto the diff acknowledgment. In fact, an explicit directory message is needed only when a page is invalidated.³

³ The protocol could be designed to lazily downgrade the directory entry in this case. However, the directory entry would often over-estimate the number of sharers and compromise the effectiveness of Cashmere's exclusive-mode optimization.

2.3.4 CSM-None: No Use of Special Memory Channel Features

The fourth protocol, CSM-None, uses explicit messages (and acknowledgments) for all communication. This protocol variant relies only on low-latency messaging, and so could easily be ported to other low-latency network architectures. Rather than inter-processor interrupts, our low-latency messaging relies on the efficient polling mechanism described above. Earlier Cashmere work [17] found that the expensive kernel transition incurred by inter-processor interrupts limited the benefits of the low-latency network. In our implementation, we poll a well-known message arrival flag that is updated through remote-write. This mechanism should be considered independent of our above use of remote write, since efficient polling can be implemented on other network interfaces [10, 26] that lack the ability to write to arbitrary, user-defined locations.

2.3.5 CSM-MS-Mg and CSM-None-Mg: Home Node Migration

All of the above protocol variants use first-touch home node assignment [18]. Home assignment is extremely important because processors on the home node write directly to the master copy and so do not incur costly twin and diff overheads. If a page has multiple writers during the course of execution, protocol overhead can potentially be reduced by migrating the home node to the active writers.
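A minimal sketch of this idea, using our own simplified names and reducing the request/grant message exchange to a function call, might look like:

```python
def migrate_on_write_fault(directory, pid, faulting_node, home_is_writing):
    # 'home_is_writing(home, pid)' stands in for the message exchange that
    # asks the current home whether it is actively writing the page.
    entry = directory[pid]
    home = entry["home"]
    if home == faulting_node or home_is_writing(home, pid):
        return False                  # nothing to do, or request denied
    entry["home"] = faulting_node     # grant: repoint the directory entry
    return True
```

After a successful migration the faulting node writes the master copy directly, so its subsequent writes to the page incur no twin or diff work.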

Due to the high cost of remapping Memory Channel addresses, migrating home nodes cannot be used when data is remotely accessible. Hence, CSM-MS-Mg and CSM-None-Mg both keep shared data in private memory, and allow the home to migrate during execution. When a processor incurs a write fault, the protocol checks the local copy of the directory to see if the home is actively writing the page. If not, a migration request is sent to the home. The request is granted if received when the home is not writing the page. If granted, the home simply changes the directory entry to point to the new home. CSM-MS-Mg uses Memory Channel features for meta-data maintenance and for synchronization, while CSM-None-Mg uses only explicit messages. The latter protocol can suffer from unnecessary migration requests since the cached directory entries may be out-of-date. We do not present CSM-S-Mg since the results of using the Memory Channel for synchronization are qualitatively the same regardless of whether the home node is fixed or migrating.

3 Results

We begin this section with a brief description of our hardware platform and our application suite. Next, we discuss the results of our investigation of the impact of Memory Channel features and the home node migration optimization.

3.1 Platform and Basic Operation Costs

Our experimental environment consists of four DEC AlphaServer 2100 4/233 computers. Each AlphaServer is equipped with four 21064A processors operating at 233 MHz and with 256 MB of shared memory, as well as a Memory Channel network interface. The 21064A has two on-chip caches: a 16K instruction cache and a 16K data cache. The off-chip secondary cache size is 1 Mbyte. A cache line is 64 bytes. Each AlphaServer runs Digital Unix 4.0D with TruCluster v. 1.5 (Memory Channel) extensions. The systems execute in multi-user mode, but with the exception of normal Unix daemons no other processes were active during the tests. In order to increase cache efficiency, application processes are pinned to a processor at startup. No other processors are connected to the Memory Channel. Execution times represent the median values of three runs.

On our platform, the Memory Channel has a point-to-point bandwidth of approximately 33 MBytes/sec. One-way latency for a 64-bit remote-write operation is 4.3 μsecs. In practice, the round-trip latency for a null message in Cashmere is 39 μsecs. This time includes the transfer of the message header and the invocation of a null handler function.
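As a rough use of these numbers (our own back-of-the-envelope calculation; it ignores handler and protocol time), a lower bound on fetching one 8K page is the null-message round trip plus the wire time:

```python
BANDWIDTH = 33e6     # bytes/sec, Memory Channel point-to-point
NULL_RTT = 39e-6     # sec, round trip for a null Cashmere message

def page_fetch_lower_bound(page_bytes=8192):
    # request/reply overhead plus wire time for the page itself
    return NULL_RTT + page_bytes / BANDWIDTH
```

This works out to roughly 290 μsecs per page, which suggests why optimizations that avoid page transfers altogether (exclusive mode, home node migration) can matter.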

Operation              Memory Channel Features   Explicit Messages
Diff (μsecs)           290-363                   485-760
Lock Acquire (μsecs)   46
Barrier (μsecs)        103                       158

Table 2: Basic operation costs at 16 processors. Diff cost varies according to the size of the diff.

Program          Problem Size               Time (sec.)
Barnes           128K bodies (26 Mbytes)    469.4
CLU              48x48 (33 Mbytes)          137.3
LU               2500x2500 (50 Mbytes)      294.7
EM3D             64000 nodes (52 Mbytes)    254.8
Gauss            48x48 (33 Mbytes)          948.1
Ilink            CLP (15 Mbytes)            4036.24
SOR              3072x4096 (50 Mbytes)      755.9
TSP              17 cities (1 Mbyte)        194.8
Volrend          Head (23 Mbytes)           11.6
Water-nsquared   9261 mols. (16 Mbytes)     12.01
Water-spatial    9261 mols. (6 Mbytes)      74.0

Table 3: Data set sizes and sequential execution time of applications.

As described earlier, Memory Channel features can be used to significantly reduce the cost of diffs, directory updates, write notice propagation, and synchronization. Table 2 shows the costs for diff operations, lock acquires, and barriers, both when leveraging and not leveraging the Memory Channel features. The cost of diff operations varies according to the size of the diff. Directory updates, write notices, and flag synchronization all use the Memory Channel's remote-write and total ordering features. (Directory updates and flag synchronization also rely on the inexpensive broadcast support.) Without these features, these operations are accomplished via explicit messages. Directory updates are small messages with simple handlers, so their cost is only slightly more than the cost of a null message. The cost of write notices will depend greatly on the write notice count and destinations. Write notices sent to different destinations can be overlapped, thus reducing the operation's overall latency. Flags are inherently broadcast operations, but again the flag update messages to the processors can be overlapped so perceived latency should not be much more than that of a null message.

3.2 Application Suite

Our applications are well-known benchmarks that have not been modified from their original distribution.

Barnes: an N-body simulation from the TreadMarks [2] distribution (and based on the same application in the SPLASH-1 [23] suite), using the hierarchical Barnes-Hut method. Bodies in the simulation space are placed into nodes in a tree structure based on their physical locations, and this tree structure is used to control the computation. Synchronization consists of barriers between phases.

CLU: from the SPLASH-2 [27] benchmark. The kernel factors a matrix into the product of a lower-triangular and an upper-triangular matrix. Work is distributed by splitting the matrix into blocks and assigning each block to a processor. Blocks modified by a single processor are allocated contiguously in order to increase spatial locality. Barriers are used for synchronization.

LU: also from SPLASH-2. The implementation is identical to CLU except that blocks are not allocated contiguously, resulting in multiple write sharers per coherence block.

EM3D: a program to simulate electromagnetic wave propagation through 3D objects [8]. The primary computational element is a set of magnetic and electric nodes that are equally distributed among the processors. These nodes are only shared amongst neighboring processors. Phases of the simulation are synchronized through barriers.

Gauss: a locally-developed solver for a system of linear equations AX = B using Gaussian elimination and back-substitution. Rows are distributed among processors cyclically. Synchronization flags are used to signal when a pivot row becomes available.

Ilink: a widely used genetic linkage analysis program from the FASTLINK 2.3P [11] package that locates disease genes on chromosomes. A master processor performs a round-robin assignment of elements in a sparse array to a pool of slave processors. The slaves perform calculations on the assigned probabilities and report the results to the master. Barriers are used to synchronize between the master and slaves and between iterations in the program.

SOR: a Red-Black Successive Over-Relaxation program from the TreadMarks distribution. The program solves partial differential equations. The red and black arrays are divided into roughly equal size bands of rows, with each band assigned to a different processor. Processors synchronize using barriers.

TSP: a branch-and-bound solution to the traveling salesman problem. The program, also from the TreadMarks distribution, distributes work through a task queue. It is non-deterministic, in that parts of the search space can be pruned, depending on when short paths are found. The task queues are protected by locks.

Volrend: a SPLASH-2 application that renders a three-dimensional volume using a ray casting technique. The image plane is partitioned among processors in contiguous blocks, which are further partitioned into small tiles. These tiles serve as the basic unit of work and are distributed through a set of task queues. Again, the task queues are protected by locks.

Water-nsquared: a fluid flow simulation from the SPLASH-2 benchmark suite. The molecule structures are kept in a shared array that is divided into contiguous chunks and assigned to processors. The bulk of the interprocessor communication occurs during a phase that updates intermolecular forces (from within a radius of n/2 molecules, where n is the number of molecules), using per-molecule locks, resulting in a migratory sharing pattern.

Water-spatial: another SPLASH-2 fluid flow simulation that solves the same problem as Water-nsquared. The simulation space is placed under a uniform 3-D grid of cells, with each cell assigned to a processor. Sharing occurs when molecules move from one cell to another. In comparison with Water-nsquared, this application also uses a more efficient, linear algorithm. The application uses barriers and locks to synchronize.

The data set sizes and uniprocessor execution times for these applications are presented in Table 3. The size of shared memory space is listed in parentheses. Execution times were measured by running each uninstrumented application sequentially without linking it to the protocol library.
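Several of these kernels are simple enough to sketch compactly. For instance, SOR's red-black update (our own sequential sketch; the choice of omega is arbitrary, and in the parallel version each processor sweeps its own band of rows, with barriers between the red and black phases):

```python
def red_black_sor_step(grid, omega=1.5):
    # One red/black sweep over the interior of a 2-D grid (list of lists).
    # Cells of one color depend only on cells of the other color, so each
    # half-sweep is free of intra-sweep dependences and parallelizes by rows.
    rows, cols = len(grid), len(grid[0])
    for parity in (0, 1):                      # red sweep, then black sweep
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                if (i + j) % 2 != parity:
                    continue
                avg = (grid[i - 1][j] + grid[i + 1][j] +
                       grid[i][j - 1] + grid[i][j + 1]) / 4.0
                grid[i][j] += omega * (avg - grid[i][j])
```

Because only the band-boundary rows are read by neighboring processors, the sharing pattern under an SDSM system is confined to the pages holding those rows.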
3.3 Performance

This subsection begins by discussing the impact of Memory Channel support, in particular, remote-write capabilities, inexpensive broadcast, and total-ordering properties, on the three types of protocol communication: shared data propagation, protocol meta-data maintenance, and synchronization. All protocols described in this subsection use a first-touch home node assignment.⁴ We found that eight of our eleven benchmark applications benefited from the special Memory Channel features. The improvement can be especially large (up to 37% over an explicit messaging protocol) in an application that dynamically distributes work. In this case, the special Memory Channel features serve to reduce protocol-induced overhead, thereby reducing load imbalance and costly work re-distributions.

⁴ In the case of multiple sharers per page, the timing differences between protocol variants can lead to first-touch differences. To eliminate these differences and isolate Memory Channel impact, we captured the first-touch assignments from CSM-DMS and used them to explicitly assign home nodes in the other protocols.

Throughout this section, we will refer to Figure 1 and Table 4. Figure 1 shows a breakdown of execution time, normalized to that of the CSM-DMS protocol, for the six protocol variants. Execution time is broken down to show the time spent executing application code (User), executing protocol code (Protocol), waiting on synchronization operations (Synchronization), and sending or receiving messages (Message). Table 4 lists the speedups and statistics on protocol communication for each of the applications running on 16 processors. The statistics include the number of page transfers, invalidations, and diff operations. The table also lists the number of home migrations, along with the number of migration attempts (listed in parentheses).

3.3.1 The Impact of Memory Channel Features

Eight of our eleven applications show measurable performance improvements running on CSM-DMS (fully leveraging Memory Channel features) as opposed to CSM-None (using explicit messages). Volrend runs 37% faster on CSM-DMS than it does on CSM-None. Barnes, EM3D, LU, and Water-nsquared run 7-11% faster. Gauss and SOR run less than 4% faster. Three applications, CLU, Ilink, and TSP, are not sensitive to the use of Memory Channel features and do not show any significant performance differences across our protocols.

Volrend's performance is very sensitive to its workload distribution. Perturbation introduced by the protocol induces load imbalance and triggers expensive task stealing. As can be seen from Table 4, the number of page transfers and diffs increases as shared data propagation and protocol meta-data maintenance no longer leverage Memory Channel features. Despite performing all protocol communication with explicit messages, CSM-None performs better than CSM-S. On CSM-None, the application has
better load balance and incurs less task stealing. CSM-None performs fewer diff operations, and instrumentation shows it also performs fewer accesses to the task queue lock. Regardless, Volrend performs poorly overall on all protocol versions; the best achieved speedup is only two on 16 processors.

Barnes exhibits a high degree of sharing and incurs a large amount of protocol and synchronization overhead. Performance slowly degrades across the protocols. In this application, remote writes and total ordering have the biggest impact on the cost of meta-data maintenance. These features permit an approximate 5% reduction in the large number of invalidations. Without the use of Memory Channel features, the invalidations require explicit messages to update the master directory entry. These messages result in higher protocol overhead and poorer synchronization characteristics (see Figure 1).

Water-Nsquared has a large number of synchronization operations due to its use of per-molecule locks. However, across protocols, the application does not show a large synchronization time relative to some of the other applications. The large number of locks reduces per-lock contention, and largely limits the synchronization overhead to the synchronization mechanism and associated protocol overhead. This application benefits most from using the Memory Channel features to optimize lock synchronization. Figure 1 shows that the synchronization cost is highest in CSM-None. The application is written to encourage lock handoffs between neighboring processors, which incur negligible protocol overhead if within the same SMP node. However, as protocol-induced overhead increases across the protocols, more lock handoffs occur between processors on different nodes. These inter-node handoffs lead to more diffs and higher protocol and synchronization times. As in Volrend, CSM-DMS incurs the least amount of perturbation (imbalance) due to the protocol, which helps keep lock accesses inside nodes, thereby avoiding diff operations.

At the given matrix size, LU incurs a large amount of protocol communication due to the write-write sharing at row boundaries. Remote writes and total ordering are very effective at reducing the overhead of the diffs and invalidation operations. The protocols using explicit messages show higher protocol and synchronization overhead, due to more expensive diffs and invalidations.
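The twin/diff mechanism referenced throughout these comparisons can be illustrated with a small sketch. This is not Cashmere's actual implementation (which operates on VM pages in C); the names and the tiny page size are hypothetical. A twin is a pristine copy of a page saved before the first local write, and a diff records only the words that subsequently changed, so the home node can apply the changes without transferring the whole page.

```python
# Illustrative twin/diff sketch; all names are hypothetical.
WORDS_PER_PAGE = 8  # tiny page for illustration

def make_twin(page):
    # Pristine copy taken before the first write to the page.
    return list(page)

def compute_diff(twin, page):
    # (word index, new value) pairs for every modified word.
    return [(i, page[i]) for i in range(len(page)) if page[i] != twin[i]]

def apply_diff(home_copy, diff):
    # The home node patches only the changed words.
    for i, val in diff:
        home_copy[i] = val

page = [0] * WORDS_PER_PAGE
twin = make_twin(page)
page[2], page[5] = 42, 7            # local writes after twinning
diff = compute_diff(twin, page)

home = [0] * WORDS_PER_PAGE
apply_diff(home, diff)
```

On a remote-write network the diff can be written directly into the home node's copy; without that support, the same (index, value) pairs must travel in an explicit message, which is the extra cost the protocols labeled CSM-S and CSM-None pay.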
EM3D, Gauss, SOR, and Water-spatial all benefit from protocols that leverage the special Memory Channel support. In these applications, our instrumentation shows that most diffs are handled by an idle processor. For these applications, meta-data maintenance is again the area that benefits most from special Memory Channel support.

Of the remaining applications, CLU, Ilink, and TSP are not noticeably affected by the underlying Memory Channel support. CLU and TSP have little communication that can be optimized. Ilink, however, performs a large number of diffs, and might be expected to benefit significantly from remote-write support. However, 90% of the diffs are applied at the home node by idle processors, so the extra overhead is somewhat hidden from application computation.

3.3.2 Home Node Migration: Optimization for a Scalable Data Space

Home node migration can reduce the number of remote memory accesses by moving the home node to active writers. Our results show that this optimization is very effective. Six of our eleven applications perform better using home node migration and explicit data propagation (CSM-MS-Mg) than using first-touch and remote-write data propagation (CSM-DMS).⁵ Home node migration can reduce protocol overhead by reducing the number of twin/diffs and invalidations. In fact, this reduction can be so great that three of our applications obtain the best overall performance when using migration and explicit messages for all protocol communication.

Volrend, Water-spatial, and LU all benefit greatly from migration because the number of diff (and attendant twin) operations is significantly reduced (see Table 4). In fact, for these applications, CSM-None-Mg, which does not leverage the special Memory Channel features at all, outperforms the full Memory Channel protocol CSM-DMS by a range of 18% to 34%. Figure 1 shows that the protocol component of execution time is significantly decreased for these applications. In Volrend, this decrease is especially important since the reduced protocol overhead leads to better load balance and less task stealing.

On EM3D and Water-Nsquared, the migration protocols CSM-MS-Mg and CSM-None-Mg perform better than their first-touch counterparts that use explicit messages for at least some protocol communication (CSM-MS, CSM-S, and CSM-None). The migration optimization again reduces the number of diff operations. However, this gain is offset by increased overhead of migration requests. The two migration protocols perform basically the same as the full MC protocol, CSM-DMS.
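As a rough sketch of the migration policy just described (a simplification with hypothetical names; the real protocol must also handle races between concurrent requests), the home moves to a requesting writer unless the current home is itself actively writing the page, a check the protocol can make locally against its replicated directory. A refused request counts as an unsuccessful migration attempt, the quantity tallied in parentheses in Table 4.

```python
# Simplified home-node migration decision; field names are hypothetical.
class PageEntry:
    def __init__(self, home):
        self.home = home           # current home processor for this page
        self.writers = set()       # processors currently writing the page

    def request_migration(self, requester):
        # If the current home is itself writing, refuse the request
        # (an "unsuccessful migration attempt"); otherwise move the
        # home to the requesting writer, avoiding its twin/diff work.
        if self.home in self.writers:
            return False
        self.home = requester
        return True

entry = PageEntry(home=0)
entry.writers = {4}
moved = entry.request_migration(4)     # home 0 is not writing: migrate to 4
refused = entry.request_migration(2)   # home 4 is now writing: refuse
```

Because the directory is replicated, this check needs no message round-trip; the cost of migration is confined to the request traffic itself, which is why applications with many writers per page (discussed below) can lose the benefit.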
Barnes and Gauss are the only two applications to suffer under the migration optimization.

⁵ Migration cannot be used when data is placed in remotely-accessible network address space, because of the high cost of remapping.

[Figure 1 appears here: per-application stacked bars showing User, Protocol, Synchronization, and Message time under the six protocols, for Barnes, CLU, LU, EM3D, Ilink, Gauss, SOR, TSP, Volrend, Water-NSQ, and Water-SP.]

Figure 1: Normalized execution time breakdown for the applications on the protocols at 16 processors. The suffix on the protocol name represents the areas of communication using Memory Channel features (D: shared Data propagation, M: protocol Meta-data maintenance, S: Synchronization, None: No use of Memory Channel features). Mg denotes a migrating home node policy.

Table 4: Application speedups and statistics at 16 processors. The suffix on the protocol name represents the areas of communication using Memory Channel features (D: shared Data propagation, M: protocol Meta-data maintenance, S: Synchronization, None: No use of Memory Channel features). Mg denotes a migrating home node policy.

In Barnes, the degree of sharing is very high and there is a large number of migration requests. The extra overhead of these requests balances the reduction of diff operations in CSM-MS-Mg. CSM-None-Mg loses performance since directory state is no longer kept consistent globally. As a result, CSM-None-Mg sends approximately 580K unsuccessful migration requests. As shown in Table 4, Gauss performs many more invalidations when using migration. These invalidations result in increased protocol and messaging overhead with respect to the first-touch protocols.

CLU, Ilink, and TSP again are relatively insensitive to the underlying Memory Channel support or to the migration mechanism. In Ilink the number of diff operations is significantly reduced, but again the benefits are offset by increased overhead due to migration costs.

4 Related Work

In a technical report, Bilas et al. [4] also examine the impact of special network features on SDSM performance. Their network has both remote-write and remote-read capabilities, but no broadcast or total ordering. Their results show that advanced network features provide large improvements in SDSM performance. However, their base protocol uses inter-processor interrupts to signal message delivery. Interrupts on commodity machines typically take on the order of hundreds of microseconds, and so largely erase the benefits of a low-latency network. Our evaluation here assumes that messages can be detected through a much more efficient polling mechanism, as is found with other SANs [10, 13], and so each of our protocols benefits from the same low messaging latency.
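The distinction between interrupt-driven and polled message detection can be sketched as follows. This is a minimal, hypothetical model (the buffer layout and names are illustrative, not any real SAN API): the sender performs a remote write into a receive buffer and sets a flag word last, and the receiver checks the flag from its protocol loop instead of fielding an interrupt; total ordering guarantees the payload is visible before the flag.

```python
# Hedged sketch of polled message detection over a remote-write buffer.
class PollingChannel:
    def __init__(self):
        self.flag = 0              # flag word the receiver polls
        self.payload = None        # message body

    def remote_write(self, msg):
        # Sender side: payload first, flag last. On a totally ordered
        # remote-write network the receiver can never see the flag
        # before the payload.
        self.payload = msg
        self.flag = 1

    def poll(self):
        # Receiver side: a cheap check in the protocol loop, in place
        # of a hundreds-of-microseconds interrupt.
        if self.flag:
            self.flag = 0
            return self.payload
        return None

ch = PollingChannel()
empty = ch.poll()                  # nothing pending yet
ch.remote_write("diff request")
msg = ch.poll()
```

In Python the two writes are trivially ordered; the comments only indicate which property of the network the model stands in for.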
Amza et al. [3] describe adaptive extensions to the TreadMarks [2] protocol that avoid twin/diff operations on shared pages with only a single writer. (Pages with multiple writers still use twins and diffs.) Our home node migration scheme is similar in principle. If a page has only a single writer, the home always migrates to that writer, and so twin/diff operations are avoided. In the presence of multiple concurrent writers, our scheme will always migrate to one of the multiple concurrent writers, thereby avoiding twin/diff overhead at one node. Cashmere is also able to take advantage of the replicated directory when making migration decisions (i.e., to determine if the home is currently writing the page).

There has also been much work in adapting coherence protocol operations to migratory access patterns. In a migratory access pattern, a piece of data is read and written by a succession of processors in a lockstep manner. This pattern results in the transfer of data from one processor to another, and usually involves two coherence operations (each with multiple messages), one for the read and one for the write. Recent work [24, 7, 19] in both hardware and software coherent systems discusses methods to classify migratory data and then collapse the two coherence messages into one. This technique could be built into our system, and may be very helpful in reducing the overhead due to unnecessary migration requests.

5 Conclusions

In this paper, we have studied the effect of advanced network features, in particular, remote writes, inexpensive broadcast, and total ordering, on SDSM. Our evaluation used the state-of-the-art Cashmere protocol, which was designed with these network features specifically in mind.

We have found that these features never hurt performance and do indeed lead to modest performance improvements (up to 11%) for most applications. The improvements are due to a decrease in communication, and correspondingly protocol, overhead. One application, however, improves dramatically, by 37%. This application uses a dynamic work distribution scheme, which operates more effectively with the reduced protocol overhead. Unfortunately, even after the improvement, the application only obtains an extremely poor speedup of two on 16 processors.

Virtually all of the performance differences we have seen are due to optimized meta-data maintenance. The use of remote writes to propagate data modifications has little impact. In barrier-based programs, this can be expected: instrumentation shows that most diff messages are handled by idle processors. The network features have little effect on the operational cost of synchronization primitives, so optimization in this area has little effect on overall performance.

Finally, we also found that home node migration is a very effective mechanism for reducing the number of twin/diff operations and the resulting protocol overhead. The mechanism is so effective that the benefits outweigh those from using the network features for shared data propagation. Shared data can thus safely be placed in the node's private memory. The pressure on remotely accessible memory is thereby greatly reduced, providing more flexibility and scalability for the system.

References

[1] S. V. Adve and M. D. Hill. A Unified Formulation of Four Shared-Memory Models. IEEE Transactions on Parallel and Distributed Systems, 4(6):613-624, June 1993.

[2] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. Computer, 29(2):18-28, February 1996.

[3] C. Amza, A. Cox, S. Dwarkadas, and W. Zwaenepoel. Software DSM Protocols that Adapt between Single Writer and Multiple Writer. In Proceedings of the Third International Symposium on High Performance Computer Architecture, San Antonio, TX, February 1997.

[4] A. Bilas, C. Liao, and J. P. Singh. Network Interface Support for Shared Virtual Memory on Clusters. Technical Report TR-579-98, Department of Computer Science, Princeton University, March 1998.

[5] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, pages 152-164, Pacific Grove, CA, October 1991.

[6] Y. Chen, A. Bilas, S. N. Damianakis, C. Dubnicki, and K. Li. UTLB: A Mechanism for Address Translation on Network Interfaces. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, October 1998.

[7] A. L. Cox and R. J. Fowler. Adaptive Cache Coherency for Detecting Migratory Shared Data. In Proceedings of the Twentieth International Symposium on Computer Architecture, San Diego, CA, May 1993.

[8] D. Culler, A. Dusseau, S. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proceedings, Supercomputing '93, pages 262-273, Portland, OR, November 1993.

[9] M. Dubois, J. C. Wang, L. A. Barroso, K. L. Lee, and Y.-S. Chen. Delayed Consistency and its Effect on the Miss Rate of Parallel Programs. In Supercomputing '91 Proceedings, pages 197-206, Albuquerque, NM, November 1991.

[10] D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A. M. Merritt, E. Gronke, and C. Dodd. The Virtual Interface Architecture. IEEE Micro, pages 66-76, March 1998.

[11] S. Dwarkadas, A. A. Schäffer, R. W. Cottingham Jr., A. L. Cox, P. Keleher, and W. Zwaenepoel. Parallelization of General Linkage Analysis Problems. Human Heredity, 44:127-141, 1994.

[12] S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory. In Proceedings of the Fifth International Symposium on High Performance Computer Architecture, Orlando, FL, January 1999.

[13] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, Copper Mountain, CO, December 1995.

[14] A. Erlichson, N. Nuckolls, G. Chesson, and J. Hennessy. SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 210-220, Cambridge, MA, October 1996.

[15] R. Gillett. Memory Channel: An Optimized Cluster Interconnect. IEEE Micro, 16(2):12-18, February 1996.

[16] R. W. Horst and D. Garcia. ServerNet SAN I/O Architecture. In Proceedings of the Hot Interconnects V Symposium, Palo Alto, CA, August 1997.

[17] L. Kontothanassis, G. Hunt, R. Stets, N. Hardavellas, M. Cierniak, S. Parthasarathy, W. Meira, S. Dwarkadas, and M. L. Scott. VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks. In Proceedings of the Twenty-Fourth International Symposium on Computer Architecture, pages 157-169, Denver, CO, June 1997.

[18] M. Marchetti, L. Kontothanassis, R. Bianchini, and M. L. Scott. Using Simple Page Placement Policies to Reduce the Cost of Cache Fills in Coherent Shared-Memory Systems. In Proceedings of the Ninth International Parallel Processing Symposium, Santa Barbara, CA, April 1995.

[19] L. R. Monnerat and R. Bianchini. Efficiently Adapting to Sharing Patterns in Software DSMs. In Proceedings of the Fourth International Symposium on High Performance Computer Architecture, Las Vegas, NV, February 1998.

[20] R. Samanta, A. Bilas, L. Iftode, and J. Singh. Home-Based SVM Protocols for SMP Clusters: Design and Performance. In Proceedings of the Fourth International Symposium on High Performance Computer Architecture, pages 113-124, February 1998.

[21] D. J. Scales and K. Gharachorloo. Towards Transparent and Efficient Software Distributed Shared Memory. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, St. Malo, France, October 1997.

[22] D. J. Scales, K. Gharachorloo, and A. Aggarwal. Fine-Grain Software Distributed Shared Memory on SMP Clusters. In Proceedings of the Fourth International Symposium on High Performance Computer Architecture, Las Vegas, NV, February 1998.

[23] J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. ACM SIGARCH Computer Architecture News, 20(1):5-44, March 1992.

[24] P. Stenström, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. In Proceedings of the Twentieth International Symposium on Computer Architecture, San Diego, CA, May 1993.

[25] R. Stets, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott. Cashmere-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, St. Malo, France, October 1997.

[26] M. Welsh, A. Basu, and T. von Eicken. A Comparison of ATM and Fast Ethernet Network Interfaces for User-Level Communication. In Proceedings of the Third International Symposium on High Performance Computer Architecture, San Antonio, TX, February 1997.

[27] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. Methodological Considerations and Characterization of the SPLASH-2 Parallel Application Suite. In Proceedings of the Twenty-Second International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 1995.