Multicast Group Communication as a Base for a Load-Balancing Replicated Data Service

Roger Khazan (1), Alan Fekete (2), and Nancy Lynch (1)

(1) MIT LCS, 545 Technology Square, NE43-365, Cambridge, MA 02139, USA.
(2) Basser Dept. of CS, Madsen F09, University of Sydney, NSW 2006, Australia.

Abstract. We give a rigorous account of an algorithm that provides sequentially consistent replicated data on top of the view synchronous group communication service previously specified by Fekete, Lynch and Shvartsman. The algorithm performs updates at all members of a majority view, but rotates the work of queries among the members to equalize the load. The algorithm is presented and verified using I/O automata.

1 Introduction

Multicast group communication services are important building blocks for fault-tolerant applications that require reliable and ordered communication among multiple parties. These services manage their clients as collections of dynamically changing groups and provide strong intra-group multicast primitives. To remedy the existing lack of good specifications for these services and to facilitate consensus on what properties these services should exhibit, Fekete, Lynch and Shvartsman recently gave a simple automaton specification VS for a view-synchronous group communication service and demonstrated its power by using it to support a totally-ordered broadcast application TO [14, 13]. In this paper, we use VS to support a second application: a replicated data service that load balances queries and guarantees sequential consistency.

The service maintains a data object replicated at a fixed set of servers in a consistent and transparent fashion and enables the clients to update and query this object. We assume the underlying network is asynchronous, strongly-connected, and subject to processor and communication failures and recoveries involving omission, crashing or delay, but not Byzantine failures. The failures and recoveries may cause the network or its components to partition and merge. The biggest challenge for the service is to cope with network partitioning while preserving correctness and maintaining liveness.

We assume that executed updates cannot be undone, which implies that update operations must be processed in the same order everywhere. To avoid inconsistencies, the algorithm allows updates to occur only in primary components. Following the commonly used definition, primary components are defined as those containing a majority (or more generally a quorum) of all servers. Nonempty intersection of any two majorities (quorums) guarantees the existence of at most one primary at a given time and allows for the necessary flow of information between consecutive primaries. Our service guarantees processing of update requests whenever there is a stable primary component, regardless of the past network perturbations.

On the other hand, processing of queries is not restricted to primary components, and is guaranteed provided the client's component eventually stabilizes. The service uses a round-robin load-balancing strategy to distribute queries to each server evenly within each component. This strategy makes sense in commonly occurring situations when queries take approximately the same amount of time, which is significant. Each query is processed with respect to a data state that is at least as advanced as the last state witnessed by the query's client. The service is arranged in such a way that the servers are always able to process the assigned queries, that is, they are not blocked by missing update information.

Architecturally, the service consists of the servers' layer and the communication layer. The servers' layer is symmetric: all servers run identical state-machines. The communication layer consists of two parts, a group communication service satisfying VS, and a collection of individual channels providing reliable reordering point-to-point communication between all pairs of servers. The servers use the group communication service to disseminate update and query requests to the members of their groups and rely on the properties of this service to enforce the formation of identical sequences of update requests at all servers and to schedule query requests correctly. The point-to-point channels are used to send the results of processed queries directly to the original servers.

Related Work

Group communication. A good overview of the rationale and usefulness of group communication services is given in [4]. Examples of implemented group communication services are Isis [5], Transis [10], Totem [25], Newtop [12], Relacs [3] and Horus [27]. Different services differ in the way they manage groups and in the specific ordering and delivery properties of their multicast primitives. Even though there is no consensus on what properties these services should provide, a typical requirement is to deliver messages in total order and within a view.

To be most useful, group communication services have to come with precise descriptions of their behavior. Many specifications have been proposed using a range of different formalisms [3, 6, 8, 11, 15, 24, 26]. Fekete, Lynch, and Shvartsman recently presented the VS specification for a partitionable group communication service. Please refer to [14] for a detailed description and comparison of VS with other specifications.

Several papers have since extended the VS specification. Chockler, Huleihel, and Dolev [7] have used the same style to specify a virtually synchronous FIFO group communication service and to model an adaptive totally-ordered group communication service. De Prisco, Fekete, Lynch and Shvartsman [9] have presented a specification for a group communication service that provides a dynamic notion of primary view.

Replication and Load Balancing. The most popular application of group communication services is for maintaining coherent replicated data through applying all operations in the same sequence at all copies. The details of doing this in partitionable systems have been studied by Amir, Dolev, Friedman, Keidar, Melliar-Smith, Moser, and Vaysburd [18, 2, 1, 19, 16, 17].

In his recent book [4, p. 329], Birman points out that process groups are ideally suited for fault-tolerant load-balancing. He suggests two styles of load-balancing algorithms. In the first, more traditional, style, scheduling decisions are made by clients, and tasks are sent directly to the assigned servers. In the second style, tasks are multicast to all servers in the group; each server then applies a deterministic rule to decide on whether to accept each particular task.

In this paper, we use a round-robin strategy originally suggested by Birman [4, p. 329]. According to this strategy, query requests are sent to the servers using totally-ordered multicast; the i-th request delivered in a group of n servers is assigned to the server whose rank within this group is (i mod n). This strategy relies on the fact that all servers receive requests in the same order, and guarantees a uniform distribution of requests among the servers of each group. We extend this strategy with a fail-over policy that reissues requests when group membership changes.

Sequential Consistency. There are many different ways in which a collection of replicas may provide the appearance of a single shared data object. The seminal work in defining these precisely is Lamport's concept of sequential consistency [21]. A system provides sequential consistency when for every execution of the system, there is an execution with a single shared object that is indistinguishable to each individual client. A much stronger coherence property is atomicity, where a universal observer can't distinguish the execution of the system from one with a single shared object. The algorithm of this paper provides an intermediate condition where the updates are atomic, but queries may see results that are not as up-to-date as those previously seen by other clients.

Contributions of this paper

This paper presents a new algorithm for providing replicated data on top of a partitionable group communication system, in which the work of processing queries is rotated among the group replicas in a round-robin fashion. While the algorithm is based on previous ideas (the load-balancing processing of queries is taken from [4] and the update processing relates to [18, 2, 1, 19]), we are unaware of a previously published account of a way to integrate these. In particular, we show how queries can be processed in minority partitions, and how to ensure that the servers always have sufficiently advanced states to process the queries.

Another important advance in this work is that it shows how a verification can use some of the stronger properties of VS. Previous work [14] verified TO, an application in which all nodes within a view process messages identically (in a sense, the TO application is anonymous, since a node uses its identity only to generate unique labels). The proof in [14] uses the property of agreed message sequence, but it does not pay attention to the identical view of membership at all recipients. In contrast, this paper's load-balancing algorithm (and thus the proof) uses the fact that different recipients have the same membership set when they decide which member will respond to a query.
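The round-robin rule described under Replication and Load Balancing can be illustrated with a short sketch. It is a minimal illustration, not the paper's automaton: the function names are hypothetical, requests are counted from 0, and a server's rank is assumed to be its position in the sorted membership list.

```python
from typing import Iterable, List, Tuple


def assigned_rank(i: int, n: int) -> int:
    """Rank of the server that handles the i-th delivered request in a group of n."""
    return i % n


def assign(requests: List[str], members: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each request, taken in delivery order, with the server that accepts it.

    Assumption: rank is the position of a server in the sorted membership.
    Every server can evaluate this rule locally because all members of a
    group see the same membership and the same delivery order.
    """
    ranked = sorted(members)
    return [(req, ranked[assigned_rank(i, len(ranked))])
            for i, req in enumerate(requests)]


if __name__ == "__main__":
    for req, server in assign(["q0", "q1", "q2", "q3", "q4", "q5"], {"s2", "s0", "s1"}):
        print(req, "->", server)
```

Because the rule depends only on the delivery index and the membership set, no extra coordination message is needed to decide which server answers a query.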

The rest of the paper is organized as follows. Section 2 introduces basic terminology. Section 3 presents a formal specification for clients' view of the replicated service. Section 4 contains an intermediate specification for the service, the purpose of which is to simplify the proof of correctness. Section 5 presents an I/O automaton for the server's state-machine and outlines the proof of correctness.

2 Mathematical Foundations

We use standard and self-explanatory notation on sets, sequences, total functions (→), and partial functions (↪). Somewhat non-standard is our use of disjoint unions (+), which differs from the usual set union (∪) in that each element is implicitly tagged with what component it comes from. For simplicity, we use variable name conventions to avoid more formal "injection functions" and "matching constructs." Thus, for example, if Update and Query are the respective types for update and query requests, then type Request = Update + Query defines a general request type. Furthermore, if req ∈ Request, and u and q are the established variable conventions for Update and Query types, then "req ← u" and "req = q" are both valid statements.

The modeling is done in the framework of the I/O automaton model of Lynch and Tuttle [23] (without fairness), also described in Chapter 8 of [22]. An I/O automaton is a simple state-machine in which the transitions are associated with named actions, which can be either input, output, or internal. The first two are externally visible, and the last two are locally controlled. I/O automata are input-enabled, i.e., they cannot control their input actions. An automaton is defined by its signature (input, output and internal actions), set of states, set of start states, and a state-transition relation (a cross-product between states, actions, and states). An execution fragment is an alternating sequence of states and actions consistent with the transition relation. An execution is an execution fragment that begins with a start state. The subsequence of an execution consisting of all the external actions is called a trace. The external behavior is captured by the set of traces generated by its executions. Execution fragments can be concatenated. Compatible I/O automata can be composed to yield a complex system from individual components. The composition identifies actions with the same name in different component automata. When any component automaton performs a step involving an action, so do all component automata that have that action in their signatures. The hiding operation reclassifies output actions of an automaton as internal.

Invariants of an automaton are properties that are true in all reachable states of that automaton. They are usually proved by induction on the length of the execution sequence. A refinement mapping is a single-valued simulation relation. To prove that one automaton implements another in the sense of trace inclusion, it is sufficient to present a refinement mapping from the first to the second. A function is proved to be a refinement mapping by carrying out a simulation proof, which usually relies on invariants (see Chapter 8 of [22]).

We describe the transition relation in a precondition-effect style (as in [22]), which groups together all the transitions that involve each particular type of action into a single atomic piece of code.
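The precondition-effect style and the notion of a trace can be made concrete with a small sketch. The `Action` record, the `run` helper and the one-slot buffer automaton are all hypothetical illustrations and do not come from the paper; the sketch only assumes that input actions are always enabled and that a trace keeps the external (input and output) actions of an execution.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Action:
    kind: str                       # "input", "output", or "internal"
    pre: Callable[[dict], bool]     # precondition on the current state
    eff: Callable[[dict], None]     # effect: mutates the state in place


def run(state: dict, actions: Dict[str, Action], schedule: List[str]) -> List[str]:
    """Execute the scheduled actions and return the trace (external actions only)."""
    trace = []
    for name in schedule:
        act = actions[name]
        # Inputs are always enabled; locally controlled actions need their precondition.
        if act.kind != "input" and not act.pre(state):
            raise ValueError(f"{name} is not enabled")
        act.eff(state)
        if act.kind in ("input", "output"):
            trace.append(name)
    return trace


def one_slot_buffer() -> Tuple[dict, Dict[str, Action]]:
    """A toy automaton: accept an item, process it internally, acknowledge it."""
    state = {"full": False, "done": False}
    actions = {
        "put": Action("input", lambda s: True, lambda s: s.update(full=True)),
        "work": Action("internal", lambda s: s["full"] and not s["done"],
                       lambda s: s.update(done=True)),
        "ack": Action("output", lambda s: s["done"],
                      lambda s: s.update(full=False, done=False)),
    }
    return state, actions


if __name__ == "__main__":
    state, actions = one_slot_buffer()
    print(run(state, actions, ["put", "work", "ack"]))
```

The internal action `work` changes the state but does not appear in the trace, which is why two automata with different internal structure can still have the same external behavior.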

To access components of compound objects we use the dot notation. Thus, if dbs is a state variable of an automaton, then its instance in a state s is expressed as s.dbs. Likewise, if view is a state variable of a server p, then its instance in a state t is expressed as t[p].view or as p.view if t is clear from the discussion.

3 Service Specification S

In this section, we formally specify our replicated data service by giving a centralized I/O automaton S that defines its allowed behavior. The complete information on basic and derived types, along with a convention for variable usage, is given in Figure 1. The automaton S appears in Figure 2.

Fig. 1 Type information

  Var  Type                        Description
  c    C                           Finite set of client IDs. (c.proc refers to the server of c.)
  db   DB                          Database type with a distinguished initial value db0.
  a    Answer                      Answer type for queries. Answers for updates are {ok}.
  u    Update : DB → DB            Updates are functions from database states to database states.
  q    Query : DB → Answer         Queries are functions from database states to answers.
  r    Request = Update + Query    Request is a disjoint union of Update and Query types.
  o    Output = Answer + {ok}      Output is a disjoint union of Answer and {ok} types.

The interface between the service and its blocking clients is typical of a client-server architecture: clients' requests are delivered to S via input actions of the form request(r)_c, representing the submission of request r by a client c; S replies to its clients via actions of the form reply(o)_c, representing the delivery of output value o to a client c.

If our service were to satisfy atomicity (i.e., behave as a non-replicated service), then specification S would include a state variable db of type DB and would apply update and query requests to the latest value of this variable. In the replicated system, this would imply that processing of query requests would have to be restricted to the primary components of the network.

In order to eliminate this restriction and thus increase the availability of the service, we give a slightly weaker specification, which does not require queries to be processed with respect to the latest value of db, only with respect to the value that is at least as advanced as the last one witnessed by the queries' client. For this purpose, S maintains a history dbs of database states and keeps an index last(c) to the latest state seen by each client c.

Even though our service is not atomic, it still appears to each particular client as a non-replicated one, and thus satisfies sequential consistency. Note that, since the atomicity has been relaxed only for queries, the service is actually stronger than the weakest one allowed by sequential consistency.

The assumption that clients block (i.e., do not submit any new requests until they get replies for their current ones) cannot be expressed within automaton S because, as an I/O automaton, it is input-enabled. To express this assumption formally, we close S by composing it with the automaton Env = ∏_{c∈C}(C_c), where each C_c models a nondeterministic blocking client c (see Figure 3); real blocking clients can be shown to implement this automaton. In the closed automaton S̄, the request actions are forced to alternate with the reply actions, which models the assumed behavior. In the rest of the paper, we consider the closed versions of the presented automata, denoting them with a bar (e.g., S̄).

Fig. 2 Specification S

  Signature:
    Input:    request(r)_c, r ∈ Request, c ∈ C
    Output:   reply(o)_c, o ∈ Output, c ∈ C
    Internal: update(c, u), c ∈ C, u ∈ Update
              query(c, q, l), c ∈ C, q ∈ Query, l ∈ N

  State:
    dbs ∈ SEQ0 DB, initially db0. Sequence of database states. Indexing from 0 to |dbs| − 1.
    map ∈ C ↪ (Request + Output), initially ⊥. Buffer for the clients' pending requests or replies.
    last ∈ C → N, initially {* → 0}. Index of the last db state witnessed by id.

  Transitions:
    request(r)_c
      Eff: map(c) ← r

    update(c, u)
      Pre: u = map(c)
      Eff: dbs ← dbs + u(dbs[|dbs| − 1])
           map(c) ← ok
           last(c) ← |dbs| − 1

    query(c, q, l)
      Pre: q = map(c)
           last(c) ≤ l ≤ |dbs| − 1
      Eff: map(c) ← q(dbs[l])
           last(c) ← l

    reply(o)_c
      Pre: map(c) = o
      Eff: map(c) ← ⊥

Fig. 3 Client Specification C_c

  Signature:
    Input:  reply(o)_c, o ∈ Output
    Output: request(r)_c, r ∈ Request

  State: busy ∈ Bool, initially false. Status flag. Keeps track of whether there is a pending request.

  Transitions:
    request(r)_c
      Pre: busy = false
      Eff: busy ← true

    reply(o)_c
      Eff: busy ← false

4 Intermediate Specification D

Action update of specification S accomplishes two logical tasks: it updates the centralized database, and it sets client-specific variables, map(c) and last(c), to their new values. In a distributed setting, these two tasks are generally accomplished by two separate transitions. To simplify the refinement mapping between the implementation and the specification, we introduce an intermediate layer D (see Figure 4), in which these tasks are separated. D is formed by splitting each update action of S into two, update and service. The first one extends dbs with a new database state, but instead of setting map(c) to "ok" and last(c) to its new value as in S, it saves this value (i.e., the index to the most recent database state witnessed by c) in the delay buffer. The second action sets map(c) to "ok" and uses information stored in delay to set last(c) to its value.

Fig. 4 Intermediate Specification D

  Signature: Same as in S, with the addition of an internal action service(c), c ∈ C.

  State: Same as in S, with the addition of a state variable delay ∈ C ↪ N, initially ⊥.

  Transitions: Same as in S, except update is modified and service is defined.
    update(c, u)
      Pre: u = map(c)
           c ∉ dom(delay)
      Eff: dbs ← dbs + u(dbs[|dbs| − 1])
           delay(c) ← |dbs| − 1

    service(c)
      Pre: c ∈ dom(delay)
      Eff: map(c) ← ok
           last(c) ← delay(c)
           delay(c) ← ⊥

Lemma 1 The following function DS() is a refinement from D to S with respect to reachable states of D and S. (1)

  DS(d : D) → S =
    s.dbs    = d.dbs
    s.map    = overlay(d.map, {⟨c, ok⟩ | c ∈ dom(d.delay)})
    s.last   = overlay(d.last, d.delay)
    s.busy_c = d.busy_c for all c ∈ C

Transitions of D simulate transitions of S with the same actions, except for those that involve service; these simulate empty transitions. Given this correspondence, the mapping and the proof are straightforward. The lemma implies that D implements S in the sense of trace inclusion. Later, we prove the same result about implementation T and specification D, which by transitivity of the "implements" relation implies that T implements S in the sense of trace inclusion.

(1) Given f, g : X ↪ Y, overlay(f, g) is as g over dom(g) and as f elsewhere.

5 Implementation T

The figure below depicts the major components of the system and their interactions. Set P represents the set of servers. Each server p ∈ P runs an identical state-machine VStoD_p and serves the clients whose c.proc = p.

[Figure: servers VStoD_p and VStoD_q. Each server exchanges request(r) and reply(r) actions with its own clients, exchanges gpsnd(m), gprcv(m), safe(m) and newview(v) actions with VS, and is connected to the other servers through PTP.]

The I/O automaton T for the service implementation is a composition of the servers' layer I = ∏_{p∈P}(VStoD_p) with the group-communication service specification VS [14, see Appendix A] and a collection PTP of reliable reordering point-to-point channels between any pair of servers [22, pages 460-461], with all the output actions of this composition hidden, except for the servers' replies.

\[ T = \mathrm{hide}_{\,\mathrm{out}(I \times VS \times PTP) \,-\, \{reply(o)_c\}}\,(I \times VS \times PTP) \]
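Specifications S and D of Sections 3 and 4 and the function DS of Lemma 1 can be sketched in executable form. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, preconditions are checked with assertions, updates and queries are plain Python callables, an absent dictionary key stands for ⊥, and the blocking-client automaton (the busy flags) is omitted.

```python
OK = "ok"


class SpecS:
    """Centralized specification S: a history of database states plus client bookkeeping."""

    def __init__(self, db0, clients):
        self.dbs = [db0]                     # dbs[0] is the initial state db0
        self.map = {}                        # pending request or reply per client
        self.last = {c: 0 for c in clients}  # index of the last state each client witnessed

    def request(self, c, r):
        self.map[c] = r

    def update(self, c, u):
        assert self.map.get(c) is u          # Pre: u = map(c)
        self.dbs.append(u(self.dbs[-1]))
        self.map[c] = OK
        self.last[c] = len(self.dbs) - 1

    def query(self, c, q, l):
        assert self.map.get(c) is q          # Pre: q = map(c)
        assert self.last[c] <= l <= len(self.dbs) - 1
        self.map[c] = q(self.dbs[l])
        self.last[c] = l

    def reply(self, c):
        return self.map.pop(c)               # delivers o = map(c) and resets the buffer


class SpecD(SpecS):
    """Intermediate specification D: update is split into update and service."""

    def __init__(self, db0, clients):
        super().__init__(db0, clients)
        self.delay = {}

    def update(self, c, u):
        assert self.map.get(c) is u and c not in self.delay
        self.dbs.append(u(self.dbs[-1]))
        self.delay[c] = len(self.dbs) - 1    # remember the index instead of answering

    def service(self, c):
        assert c in self.delay
        self.map[c] = OK
        self.last[c] = self.delay.pop(c)


def overlay(f, g):
    """As g over dom(g) and as f elsewhere."""
    return {**f, **g}


def DS(d):
    """State of S that corresponds to a state of D, returned as a plain dict."""
    return {
        "dbs": list(d.dbs),
        "map": overlay(d.map, {c: OK for c in d.delay}),
        "last": overlay(d.last, d.delay),
    }


if __name__ == "__main__":
    inc = lambda db: db + 1
    d = SpecD(0, ["c1"])
    d.request("c1", inc)
    d.update("c1", inc)
    print(DS(d))       # already looks like S after its single update step
    d.service("c1")
    print(DS(d))       # unchanged: service simulates an empty transition of S
```

Running the two specifications side by side on the same requests shows the point of the lemma: the image of D's state under DS changes on update and stays fixed on service.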

5.1 The Server's State-Machine VStoD_p

The additional type and variable-name convention information appears in Figure 5. The I/O code for the VStoD_p state machine is given in Figures 6 and 7.

Fig. 5 Additional Type Declaration

  Var  Type                                      Description
  Q    Q ⊆ P(P)                                  Fixed set of quorums. For any Q ∈ Q and Q' ∈ Q, Q ∩ Q' ≠ ∅.
  g    ⟨G, <_G, g0⟩                              Totally-ordered set of view ids with the smallest element.
  v    V = G × P(P)                              An element of this set is called a view. Fields: id and set.
  x    X = G × (C × Update)* × N                 Expertise information for exchange process. Fields: xl, us, su.
  m    M = C × Update + C × Query × N + X        Messages sent via VS: either update requests, query requests, or expertise information for exchange process.
  pkt  Pkt = C × Answer × N × G                  Packets sent via PTP. (N is index of the witnessed db state.)

The activity of the server's state-machine can be either normal, marked by mode being normal, or recovery, marked by mode being either expertise broadcast or expertise collection. Normal activity is associated with the server's participation in an already established view, and recovery activity with its participation in a newly forming one. We also distinguish whether or not the server is a member of a primary view, which is defined as that whose members comprise a quorum (view.set ∈ Q).

Fig. 6 Implementation (VStoD_p): Signature and State Variables

  Signature:
    Input:    request(r)_c, r ∈ Request, c ∈ C, c.proc = p
              gprcv(m)_{p',p}, m ∈ M, p' ∈ P
              safe(m)_{p',p}, m ∈ M, p' ∈ P
              newview(v)_p, v ∈ V
              ptprcv(pkt)_{p',p}, pkt ∈ Pkt, p' ∈ P
    Output:   reply(o)_c, o ∈ Output, c ∈ C, c.proc = p
              gpsnd(m)_p, m ∈ M
              ptpsnd(pkt)_{p,p'}, pkt ∈ Pkt, p' ∈ P
    Internal: update(c, u), c ∈ C, u ∈ Update
              query(c, q, l), c ∈ C, q ∈ Query, l ∈ N

  State:
    db ∈ DB, initially db0. Local replica. Next state depends on current and action.
    map ∈ C|(c.proc = p) ↪ Request + Output, initially ⊥. Buffer that maps clients to their requests or replies.
    pending ∈ P(C|(c.proc = p)), initially ∅. Set of clients whose requests are being processed.
    last ∈ C|(c.proc = p) → N, initially C|(c.proc = p) → 0. Index of the last db state seen by each client.
    updates ∈ (C × Update)*, initially []. Sequence of updates. Indexing from 1. Fields: c and u.
    lastupdate ∈ N, initially 0. Index of the last executed element in updates.
    safetoupdate ∈ N, initially 0. Index of the last "safe to update" element in updates.
    queries ∈ C ↪ (Query + Answer) × N, initially ⊥. Query requests or answers, paired with their last(c).
    querycounter ∈ N, initially 0. Number of queries received within current view.
    view ∈ V, initially v0 = ⟨g0, P⟩. Current view of p. Fields: id and set.
    mode ∈ {normal, expertise broadcast, expertise collection}, initially normal. Modes of operation. The last two are for recovery.
    expertiselevel ∈ G, initially g0. The highest primary view id that p knows of.
    expertisemax ∈ X, initially ⟨g0, [], 0⟩. Cumulative expertise collected during recovery.
    expertcounter1 ∈ N, initially 0. Number of expertise messages received so far.
    expertcounter2 ∈ N, initially 0. Number of expertise messages received so far as safe.

Processing of query requests is handled by actions of the type gpsnd(c, q, l)_p, gprcv(c, q, l)_{p',p}, query(c, q, l)_p, ptpsnd(c, a, l, g)_{p,p'}, and ptprcv(c, a, l, g)_{p',p}. The fact that servers of the same view receive query requests in the same order guarantees that the scheduling function of gprcv(c, q, l)_{p',p} distributes query requests uniformly among the servers of one view.
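The quorum condition of Figure 5 and the definition of a primary view can be checked mechanically. The sketch below assumes that quorums are simple majorities, which is only one admissible choice of the fixed quorum set; the function names are hypothetical.

```python
from itertools import combinations


def majorities(servers):
    """All subsets that contain a strict majority of the servers."""
    servers = sorted(servers)
    need = len(servers) // 2 + 1
    return [frozenset(s)
            for k in range(need, len(servers) + 1)
            for s in combinations(servers, k)]


def pairwise_intersecting(quorums):
    """True when every two distinct quorums share at least one server."""
    return all(a & b for a, b in combinations(quorums, 2))


def is_primary(view_set, quorums):
    """A view is primary when its membership is itself a quorum (view.set in Q)."""
    return frozenset(view_set) in quorums


if __name__ == "__main__":
    Q = majorities({"p1", "p2", "p3"})
    print(pairwise_intersecting(Q))          # majorities always intersect
    print(is_primary({"p1", "p2"}, Q))       # two of three servers form a primary view
    print(is_primary({"p3"}, Q))             # a single server does not
```

The intersection property is what rules out two simultaneous primary views in disjoint partitions.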

Fig. 7 Implementation VStoD_p: Transitions

  Transitions:
    request(r)_c
      Eff: map(c) ← r

    reply(o)_c
      Pre: map(c) = o
      Eff: map(c) ← ⊥

    gpsnd(c, q, l)_p
      Pre: mode = normal
           q = map(c) ∧ c ∉ pending
           l = last(c)
      Eff: pending ← pending ∪ c

    gprcv(c, q, l)_{p',p}
      Eff: if (rank(p, view.set) = querycounter mod |view.set|) then
             queries(c) ← ⟨q, l⟩
           querycounter ← querycounter + 1

    query(c, q, l)
      Pre: ⟨q, l⟩ = queries(c)
           lastupdate ≥ l
      Eff: queries(c) ← ⟨q(db), lastupdate⟩

    ptpsnd(c, a, l, g)_{p,p'}
      Pre: c ∈ dom(queries) ∧ c.proc = p'
           ⟨a, l⟩ = queries(c)
      Eff: queries(c) ← ⊥

    ptprcv(c, a, l, g)_{p',p}
      Eff: if (g = view.id ∧ c.proc = p) then
             map(c) ← a
             last(c) ← l
             pending ← pending − c

    gpsnd(c, u)_p
      Pre: mode = normal ∧ view.set ∈ Q
           u = map(c) ∧ c ∉ pending
      Eff: pending ← pending ∪ c

    gprcv(c, u)_{p',p}
      Eff: updates ← updates + ⟨c, u⟩

    safe(c, u)_{p',p}
      Eff: safetoupdate ← safetoupdate + 1

    update(c, u)
      Pre: lastupdate < safetoupdate
           ⟨c, u⟩ = updates[lastupdate + 1]
      Eff: lastupdate ← lastupdate + 1
           db ← u(db)
           if (c.proc = p) then
             map(c) ← ok
             last(c) ← lastupdate
             pending ← pending − c

    newview(v)_p
      Eff: queries ← ⊥; querycounter ← 0
           pending ← pending − {c | (∃q : ⟨c, q⟩ ∈ map)}
           expertcounter1 ← 0; expertcounter2 ← 0
           view ← v
           mode ← expertise broadcast
           [...]

    gpsnd(x)_p
      Pre: mode = expertise broadcast
           x = ⟨expertiselevel, updates, safetoupdate⟩
      Eff: mode ← expertise collection

    gprcv(x)_{p',p}
      Eff: expertisemax ← max_X(expertisemax, x)
           expertcounter1 ← expertcounter1 + 1
           if (expertcounter1 = |view.set|) then
             expertiselevel ← expertisemax.xl
             updates ← expertisemax.us
             safetoupdate ← expertisemax.su
             if (view.set ∈ Q) then
               expertiselevel ← view.id

    safe(x)_{p',p}
      Eff: expertcounter2 ← expertcounter2 + 1
           if (expertcounter2 = |view.set|) then
             if (view.set ∈ Q) then
               safetoupdate ← |expertisemax.us|
               pending ← pending − {c | c ∈ pending ∧ c ∉ updates[(lastupdate + 1)..safetoupdate].c}
             mode ← normal

    [...: the recovery transitions also contain an assignment to expertisemax in newview and an expression max{last(c) | c ∈ C ∧ c.proc = p} whose placement is not legible here.]

Servicing of each query by a background thread query(c, q, l)_p is allowed only when the current state of the local database is at least as advanced as the last state witnessed by its client. This condition is captured by lastupdate ≥ l. The non-trivial part of this protocol is that the service actually guarantees that the servers always have the sufficiently advanced database states to be able to service the queries that are assigned to them.

When a server learns of its new view, it executes a simple query-related recovery procedure, in which it moves its own pending queries for reprocessing and erases any information pertaining to the queries of others.

Processing of update requests is handled by actions of the type gpsnd(c, u)_p, gprcv(c, u)_{p',p}, safe(c, u)_{p',p}, and update(c, u). Each server maintains a sequence updates of update requests, the purpose of which is to enforce the order in which updates are applied to the local database replica. The sequence is extended each time an update request is delivered via a gprcv action. The sequence has two distinguished prefixes updates[1..safetoupdate] and updates[1..lastupdate], called safe and done, that mark respectively those update requests that are safe to execute and those that have already been executed. The safe prefix is extended to cover a certain update request on the updates sequence when the server learns that the request has been delivered to all other members of that server's view. (2) The service guarantees that at all times safe and done prefixes of all servers are consistent (i.e., given any two, one is a prefix of another). Since done prefixes mark those update requests that have been applied to the database replica, this property implies mutual consistency of database replicas.

When a server learns of its new view, it starts a recovery activity that is handled by actions of the type newview(v)_p, gpsnd(x)_p, gprcv(x)_{p',p}, and safe(x)_{p',p}. The query-related part of this activity was described above. For the update-related part, the server has to collaborate with others on ensuring that the states of all the servers of this view are consistent with their and other servers' past execution histories and are suitable for their subsequent normal activity.

For this purpose, each server has to be able to tell how advanced its state is compared to those of others. The most important criterion is the latest primary view of which the server knows. This knowledge may have come directly from personal participation in that view, or indirectly from another server. The server keeps track of this information in its state variable expertiselevel. Two other criteria are the server's updates sequence and its safe prefix. The values of these three variables comprise the server's expertise.

Definition 1 The cumulative expertise, max_X(X), of a set or a sequence, X, of expertise elements is defined as the following triple

\[ \max\nolimits_X(X) = \Big\langle\, \max\nolimits_{<_G}\{x.xl \mid x \in X\},\;\; \max\nolimits_{<_{|\cdot|}}\{x.us \mid (x \in X) \wedge (x.xl = \max\nolimits_{<_G}\{x.xl \mid x \in X\})\},\;\; \max\nolimits_{<_N}\{x.su \mid x \in X\} \,\Big\rangle \]

(2) Some of the optimistic protocols, such as [16, 17], execute requests as soon as they are delivered by a total order multicast (ABCAST of Horus), but may result in inconsistent replicas, in which case they have to undo actions and roll the replicas' states back. On the other hand, pessimistic protocols, which implement strict mutual consistency among replicas, require additional information before they are able to execute a delivered request. The pessimistic version in [17] allows for a request to be executed only when a server collects a majority of acknowledgments, which have to be multicast by each server once it receives the request. Amir, Dolev, Melliar-Smith, and Moser in [1, 2] eliminate the need for end-to-end acknowledgments by using total order multicast with safe delivery, i.e., a message delivered to one member is guaranteed to be delivered to any other member of the same view provided it does not crash. As pointed out in [14, 13], "a simple 'coordinated attack' argument (as in Chapter 5 of [22]) shows that in a partitionable system, this notion of safe delivery is incompatible with having all recipients in exactly the same view as the sender." As a result, protocols based on this multicast primitive are more complicated than those based on VS, which separates message delivery and safe notification events.
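Definition 1 can be read as a small function over expertise triples. The sketch below is an illustration under stated assumptions: view identifiers are represented by integers, an updates sequence is a tuple of (client, update) names, and the names `Expertise` and `max_x` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Expertise(NamedTuple):
    xl: int                            # highest known primary view id (ints stand in for G)
    us: Tuple[Tuple[str, str], ...]    # updates sequence of (client, update) pairs
    su: int                            # length of the safe prefix


def max_x(xs: List[Expertise]) -> Expertise:
    """Cumulative expertise of a non-empty collection of expertise triples."""
    top = max(x.xl for x in xs)
    # The updates sequence is taken only from elements with the highest view id,
    # choosing the longest one among them.
    longest = max((x.us for x in xs if x.xl == top), key=len)
    return Expertise(top, longest, max(x.su for x in xs))


if __name__ == "__main__":
    xs = [
        Expertise(3, (("c1", "u1"),), 1),
        Expertise(5, (("c1", "u1"), ("c2", "u2")), 1),
        Expertise(4, (("c1", "u1"), ("c2", "u2"), ("c3", "u3")), 2),
    ]
    print(max_x(xs))
```

In the example the longest sequence overall belongs to an element with a lower view id, so it is not selected; only the safe-prefix length is maximized over all elements.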

As a first step, the server's collaboration with others during recovery activity aims at advancing everyone's expertise to the highest one known to them: their cumulative expertise (see Def. 1). Notice that adopting cumulative expertise of other servers cannot cause inconsistency among replicas. The first step is completed with a delivery of the last expertise message via action gprcv(x)_{p',p}.

Advancing the server's expertise achieves two purposes. First, it ensures the propagation of update requests to previously inaccessible replicas. Second, it ensures the future ability of servers to process the queries assigned to them.

In addition to advancing their expertise, the servers of primary views have to ensure their ability to process new update requests once they resume their normal activity, which subsumes that they have to start normal activity with identical updates sequences, the entire content of which is safe and contains as prefixes the safe prefixes of all other servers in the system. For this purpose, once the server of a primary view learns that all expertise messages have been delivered to all servers of this view, it extends its safe prefix to cover the entire updates sequence adopted during the exchange process.

The resultant safe prefix acts as a new base that all servers of the future primary views will contain in their updates sequences. Attainment of this behavior depends on the intersection property of primary views and the fact that subsequent primary views have higher identifiers.

The established base works as a divider: partially processed update requests that are not included in the base will never find a way to a safe prefix unless they are resubmitted by their original servers. Therefore, once a server of a primary view establishes the base, it moves all pending update requests that are not in this base back for reprocessing. After this step, the server may resume its normal activity, which enables it to process new update and query requests.

5.2 Refinement Mapping from T to D
Automaton D has five types of actions. Actions of the types $\mathrm{request}(r)_c$ and $\mathrm{reply}(o)_c$ are simulated when T takes the corresponding actions. Actions of the type $\mathrm{query}(c)$ are simulated when T executes $\mathrm{ptprcv}(c, a, l, g)_{p',p}$ with $g = p.view.id$. The last two types, $\mathrm{update}(c)$ and $\mathrm{service}(c)$, are both simulated under certain conditions when T executes $\mathrm{update}(c, u)_p$. We define actions $\mathrm{update}(c, u)_p$ of T as leading when $T[p].lastupdate = \max_{\wp}\{T[\wp].lastupdate\}$, and as native when $c.proc = p$. Actions that are just leading simulate $\mathrm{update}(c)$, those that are just native simulate $\mathrm{service}(c)$, those that are both leading and native simulate "$\mathrm{update}(c)$, $\mathrm{service}(c)$", and those that are neither simulate empty transitions. Transitions of T with any other actions simulate empty transitions of D.

Lemma 2 The following function is a refinement from T to D with respect to reachable states of T and D.3

3 If $s$ is "$f_1, f_2, \ldots, f_n$" with each $f_i : A \to A$, and if $a \in A$, then $scan(s)$ = "$f_1, (f_2 \circ f_1), \ldots, (f_n \circ \cdots \circ f_2 \circ f_1)$" and $map(s, a)$ = "$f_1(a), f_2(a), \ldots, f_n(a)$".
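The leading/native case analysis and the `scan` and `map` operators of footnote 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `simulated_actions` and `map_seq`, the dictionary representation of the servers' `lastupdate` values, and the string labels for the actions of D are all hypothetical.

```python
from itertools import accumulate
from typing import Callable, Dict, List, Sequence, TypeVar

A = TypeVar("A")


def simulated_actions(p: str, c_proc: str, lastupdate: Dict[str, int]) -> List[str]:
    """Actions of D simulated when T executes update(c, u)_p.

    `lastupdate` maps each server to its lastupdate value in the state
    passed in; `c_proc` is the native server of client c.
    """
    leading = lastupdate[p] == max(lastupdate.values())
    native = c_proc == p
    actions = []
    if leading:
        actions.append("update(c)")
    if native:
        actions.append("service(c)")
    return actions  # an empty list stands for an empty transition of D


def scan(s: Sequence[Callable[[A], A]]) -> List[Callable[[A], A]]:
    """Return f1, (f2 . f1), ..., (fn . ... . f2 . f1)."""
    def then(acc: Callable[[A], A], f: Callable[[A], A]) -> Callable[[A], A]:
        # Apply the composition built so far, then the next function.
        return lambda a: f(acc(a))
    return list(accumulate(s, then))


def map_seq(s: Sequence[Callable[[A], A]], a: A) -> List[A]:
    """Return f1(a), f2(a), ..., fn(a)."""
    return [f(a) for f in s]
```

For instance, with `s = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]`, the call `map_seq(scan(s), 0)` yields `[1, 2, -1]`, the successive states reached by applying the functions in order to the initial value.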

$TD(t : T) \to D$ =
  let $t.done = t[\wp].updates[1..t[\wp].lastupdate]$, where $\wp \in P$ is any such that $t[\wp].lastupdate = \max_{p \in P}\{t[p].lastupdate\}$
  $dbs \leftarrow db_0 + map(scan(t.done), db_0)$
  $map \leftarrow \bigcup_{p \in P} t[p].map$
  $last \leftarrow \bigcup_{p \in P} t[p].last$
  $delay \leftarrow \{\langle t.done[i].c, i \rangle \mid 1 \le i \le |t.done| \wedge t[t.done[i].c.proc].lastupdate < i\}$
  $busy_c \leftarrow t.busy_c$ for all $c \in C$

An invariant will show that sequences of processed requests at different servers are consistent. In particular, all sequences which have maximum length are the same. $t.done$ is a derived variable that denotes the longest sequence of update requests processed in the system. This sequence corresponds to all modifications done to the database of D, which explains the way $TD(t).dbs$ is defined. The domain of $TD(t).delay$ consists of ids of update requests that have been processed somewhere (i.e., in $t.done$) but not at their native locations (i.e., the $lastupdate$ at their native locations has not yet surpassed these update requests). With each $c$ in this domain we associate its position in sequence $t.done$. This position corresponds to the last database state witnessed by client $c$, which explains the way $D.delay$ is defined.

Fig. 8 Invariants used in the proof that TD() is a refinement mapping (Lemma 2)
I1 For each server $p \in P$, $p.lastupdate \le p.safetoupdate \le |p.updates|$.
I2 For any two servers $p_1$ and $p_2 \in P$, if the lengths of their done prefixes are the same, then their done prefixes are the same: $p_1.lastupdate = p_2.lastupdate \Rightarrow p_1.updates[1..p_1.lastupdate] = p_2.updates[1..p_2.lastupdate]$.
I3 Any update request that is safe somewhere but has not been executed at its native location is still reflected in its native map and pending buffers: if $\langle c, u \rangle = p.updates[i]$ and $c.proc.lastupdate < i \le p.safetoupdate$, then $\langle c, u \rangle \in c.proc.map$ and $c \in c.proc.pending$.
I4 At most one unexecuted update request per each client can appear at that client's server: for any client $c \in C$, there exists at most one index $i \in \mathbb{N}$ such that $i > c.proc.lastupdate$ and $c = c.proc.updates[i].c$.
I5 For all PTP packets $\langle c, a, l, g \rangle$ on an in-transit$_{p',p}$ channel, it follows that $c.proc = p$. Moreover, if $p.view.id = g$ then
  (a) $c \in dom(p.map) \wedge p.map(c) \in Query$
  (b) $c \in p.pending$
  (c) $a = p.map(c)(compose(p.updates[1..l])(db_0))$
  (d) $l \le p.last(c)$
  (e) $l \le \max_{\wp}\{\wp.lastupdate\}$

The proof of Lemma 2 is straightforward given the five top-level invariants in Figure 8. To prove these invariants assertionally we have developed an interesting approach [20]: one of the fundamental invariants states that safe prefixes of updates sequences at all servers are consistent. To prove this fact, it is not enough to have properties only about safe prefixes; we need invariants that deal also with unsafe portions of updates sequences (because the latter become the former during an execution). Invariants that relate safe prefixes and updates sequences of different servers depend on the servers' expertise level, which may have come
to a server directly from the participation in a primary view, or indirectly from someone else. In our proof, we have invented a derived function $X$ that expresses recursively the highest expertise achieved by each server in each view in terms of servers' expertise in earlier views. In a sense, it presents the law according to which the replication part of the algorithm operates. The recursive nature of this function makes proofs by induction easy: proving an inductive step involves unwinding only one recursive step of the derived function $X$.

6 Future Work

This paper has dealt with safety properties; future work will consider performance and fault-tolerance properties, stated conditionally to hold in periods of good behavior of the underlying network. In particular, we are planning to compare the response time of this algorithm with others which share query load differently, for example based on recent run-time load reports which are disseminated by multicast.

Other possible extensions to this work involve determining primary views dynamically, using a service such as the one in [9], and integrating the unicast message communication into the group communication layer.

References

1. Y. Amir. Replication using Group Communication over a Partitioned Network. PhD thesis, The Hebrew University of Jerusalem, Israel, 1995.
2. Y. Amir, D. Dolev, P. Melliar-Smith, and L. Moser. Robust and efficient replication using group communication. Technical Report 94-20, The Hebrew University of Jerusalem, Israel, 1994.
3. O. Babaoglu, R. Davoli, L. Giachini, and P. Sabattini. The inherent cost of strong-partial view-synchronous communication. LNCS, 972:72–86, 1995.
4. K. P. Birman. Building Secure and Reliable Network Applications. Manning Publications Co., Greenwich, CT, 1996.
5. K. P. Birman and R. van Renesse, editors. Reliable Distributed Computing with the Isis Toolkit. IEEE Computer Society Press, 1994.
6. T. D. Chandra, V. Hadzilacos, S. Toueg, and B. Charron-Bost. On the impossibility of group membership. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, pages 322–330, New York, USA, May 1996.
7. G. V. Chockler, N. Huleihel, and D. Dolev. An adaptive totally ordered multicast protocol that tolerates partitions. In Proceedings of the 17th Annual ACM Symposium on Principles of Distributed Computing, pages 237–246, 1998.
8. F. Cristian. Group, majority, and strict agreement in timed asynchronous distributed systems. In Proceedings of the Twenty-Sixth International Symposium on Fault-Tolerant Computing, pages 178–189, Washington, June 25–27, 1996. IEEE.
9. R. De Prisco, A. Fekete, N. Lynch, and A. Shvartsman. A dynamic view-oriented group communication service. In Proceedings of the 17th Annual ACM Symposium on Principles of Distributed Computing, pages 227–236, 1998.
10. D. Dolev and D. Malki. The Transis approach to high availability cluster communication. Communications of the ACM, 39(4):64–70, Apr. 1996.
11. D. Dolev, D. Malki, and R. Strong. A framework for partitionable membership service. Technical Report TR94-6, Department of Computer Science, Hebrew University, 1994.

12. P. D. Ezhilchelvan, R. A. Macêdo, and S. K. Shrivastava. Newtop: A fault-tolerant group communication protocol. In Proceedings of the 15th International Conference on Distributed Computing Systems (ICDCS'95), pages 296–306, Los Alamitos, CA, USA, May 30–June 2, 1995. IEEE Computer Society Press.
13. A. Fekete, N. Lynch, and A. Shvartsman. Specifying and using a partitionable group communication service. Extended version, http://theory.lcs.mit.edu/tds.
14. A. Fekete, N. Lynch, and A. Shvartsman. Specifying and using a partitionable group communication service. In Proceedings of the Sixteenth Annual ACM Symposium on Principles of Distributed Computing, pages 53–62, Santa Barbara, California, Aug. 21–24, 1997.
15. R. Friedman and R. van Renesse. Strong and weak virtual synchrony in Horus. Technical Report TR95-1537, Cornell University, Computer Science Department, Aug. 24, 1995.
16. R. Friedman and A. Vaysburd. Implementing replicated state machines over partitionable networks. Technical Report TR96-1581, Cornell University, Computer Science, Apr. 17, 1996.
17. R. Friedman and A. Vaysburd. High-performance replicated distributed objects in partitionable environments. Technical Report TR97-1639, Cornell University, Computer Science, July 16, 1997.
18. I. Keidar. A highly available paradigm for consistent object replication. Master's thesis, Institute of Computer Science, The Hebrew University of Jerusalem, Israel, 1994.
19. I. Keidar and D. Dolev. Efficient message ordering in dynamic networks. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, pages 68–76, New York, USA, May 1996.
20. R. I. Khazan. Group communication as a base for a load-balancing replicated data service. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, May 1998.
21. L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690–691, 1979.
22. N. A. Lynch. Distributed Algorithms. Morgan Kaufmann series in data management systems. Morgan Kaufmann Publishers, Los Altos, CA 94022, USA, 1996.
23. N. A. Lynch and M. R. Tuttle. An introduction to input/output automata. CWI Quarterly, 2(3):219–246, 1989. Also available as MIT Technical Memo MIT/LCS/TM-373.
24. L. E. Moser, Y. Amir, P. M. Melliar-Smith, and D. A. Agarwal. Extended virtual synchrony. In Proceedings of the 14th International Conference on Distributed Computing Systems, pages 56–65, Los Alamitos, CA, USA, June 1994. IEEE Computer Society Press.
25. L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, R. K. Budhia, and C. A. Lingley-Papadopoulos. Totem: A fault-tolerant multicast group communication system. Communications of the ACM, 39(4):54–63, Apr. 1996.
26. A. M. Ricciardi, A. Schiper, and K. P. Birman. Understanding partitions and the "no partition" assumption. Technical Report TR93-1355, Cornell University, Computer Science Department, June 1993.
27. R. van Renesse, K. P. Birman, and S. Maffeis. Horus: A flexible group communication system. Communications of the ACM, 39(4):76–83, Apr. 1996.

A The VS Specification

The VS specification of [14,13] is reprinted in Figure 9. $M$ denotes a message alphabet and $\langle G, <_G, g_0 \rangle$ is a totally-ordered set of view identifiers with an initial view identifier. An element of the set $V = G \times \mathcal{P}(P)$ is called a view. If $v$ is a view, we write $v.id$ and $v.set$ to denote its components.

Fig. 9 VS-machine

Signature:
  Input:
    $\mathrm{gpsnd}(m)_p$, $m \in M$, $p \in P$
  Output:
    $\mathrm{gprcv}(m)_{p,q}$, hidden $g$, $m \in M$, $p, q \in P$, $g \in G$
    $\mathrm{safe}(m)_{p,q}$, hidden $v$, $m \in M$, $p, q \in P$, $v \in views$
    $\mathrm{newview}(v)_p$, $v \in views$, $p \in P$, $p \in v.set$
  Internal:
    $\mathrm{createview}(v)$, $v \in views$
    $\mathrm{vs\text{-}order}(m, p, g)$, $m \in M$, $p \in P$, $g \in G$

State:
  $created \subseteq V$, initially $\{\langle g_0, P \rangle\}$
  for each $p \in P$:
    $currentviewid[p] \in G$, initially $g_0$
  for each $g \in G$:
    $queue[g]$, a finite sequence of $M \times P$, initially empty
  for each $p \in P$, $g \in G$:
    $pending[p, g]$, a finite sequence of $M$, initially empty
    $next[p, g] \in \mathbb{N}^{>0}$, initially 1
    $nextsafe[p, g] \in \mathbb{N}^{>0}$, initially 1

Transitions:
  $\mathrm{createview}(v)$
    Pre: $v.id > \max(g : \exists S, \langle g, S \rangle \in created)$
    Eff: $created \leftarrow created \cup \{v\}$
  $\mathrm{newview}(v)_p$
    Pre: $v \in created$
         $v.id > currentviewid[p]$
    Eff: $currentviewid[p] \leftarrow v.id$
  $\mathrm{gpsnd}(m)_p$
    Eff: append $m$ to $pending[p, currentviewid[p]]$
  $\mathrm{vs\text{-}order}(m, p, g)$
    Pre: $m$ is head of $pending[p, g]$
    Eff: remove head of $pending[p, g]$
         append $\langle m, p \rangle$ to $queue[g]$
  $\mathrm{gprcv}(m)_{p,q}$, hidden $g$
    Pre: $g = currentviewid[q]$
         $queue[g](next[q, g]) = \langle m, p \rangle$
    Eff: $next[q, g] \leftarrow next[q, g] + 1$
  $\mathrm{safe}(m)_{p,q}$, hidden $g, S$
    Pre: $g = currentviewid[q]$
         $\langle g, S \rangle \in created$
         $queue[g](nextsafe[q, g]) = \langle m, p \rangle$
         for all $r \in S$: $next[r, g] > nextsafe[q, g]$
    Eff: $nextsafe[q, g] \leftarrow nextsafe[q, g] + 1$

VS specifies a partitionable service in which, at any moment of time, every client has precise knowledge of its current view. VS does not require clients to learn about every view of which they are members, nor does it place any consistency restrictions on the membership of concurrent views held by different clients. Its only view-related requirement is that views are presented to each client according to the total order on view identifiers. VS provides a multicast service that imposes a total order on messages submitted within each view, and delivers them according to this order, with no omissions, and strictly within a view. In other words, the sequence of messages received by each client while in a certain view is a prefix of the total order on messages associated with that view. Separately from the multicast service, VS provides a "safe" notification once a message has been delivered to all members of the view.
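The transitions of the VS-machine in Figure 9 can be exercised with a small executable model. The sketch below is an illustration, not part of the specification: integer view identifiers, the dictionary encoding of `created`, the `NotEnabled` exception used to signal a false precondition, and the convention that `gprcv` and `safe` return the pair they deliver are all assumptions introduced here.

```python
from collections import defaultdict


class NotEnabled(Exception):
    """Raised when an action's precondition does not hold."""


class VSMachine:
    def __init__(self, procs, g0=0):
        self.created = {g0: frozenset(procs)}   # view id -> membership set
        self.current = {p: g0 for p in procs}   # currentviewid[p]
        self.queue = defaultdict(list)          # queue[g]: list of (m, p)
        self.pending = defaultdict(list)        # pending[p, g]: list of m
        self.next = defaultdict(lambda: 1)      # next[p, g], 1-based
        self.nextsafe = defaultdict(lambda: 1)  # nextsafe[p, g], 1-based

    def createview(self, g, members):
        if g <= max(self.created):
            raise NotEnabled("view id must exceed all created view ids")
        self.created[g] = frozenset(members)

    def newview(self, g, p):
        if g not in self.created or p not in self.created[g]:
            raise NotEnabled("view not created or p not a member")
        if g <= self.current[p]:
            raise NotEnabled("views must be presented in increasing order")
        self.current[p] = g

    def gpsnd(self, m, p):
        self.pending[p, self.current[p]].append(m)

    def vs_order(self, p, g):
        if not self.pending[p, g]:
            raise NotEnabled("nothing pending for p in view g")
        self.queue[g].append((self.pending[p, g].pop(0), p))

    def gprcv(self, q):
        g = self.current[q]
        i = self.next[q, g]
        if i > len(self.queue[g]):
            raise NotEnabled("no further message ordered in q's view")
        self.next[q, g] = i + 1
        return self.queue[g][i - 1]

    def safe(self, q):
        g = self.current[q]
        i = self.nextsafe[q, g]
        if i > len(self.queue[g]):
            raise NotEnabled("no such message in the queue")
        # Every member of the view must already have received message i.
        if not all(self.next[r, g] > i for r in self.created[g]):
            raise NotEnabled("message not yet delivered to all members")
        self.nextsafe[q, g] = i + 1
        return self.queue[g][i - 1]
```

With two processes `p` and `q` in the initial view, after `gpsnd("m1", "p")` and `vs_order("p", 0)`, a call to `gprcv("p")` delivers `("m1", "p")`, but `safe("p")` raises `NotEnabled` until `q` has also received the message. This is the separation of delivery and safe notification described above.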