Multicast Group Communication as a Base for a Load-Balancing Replicated Data Service

Roger Khazan¹, Alan Fekete², and Nancy Lynch¹

¹ MIT LCS, 545 Technology Square, NE43-365, Cambridge, MA 02139, USA.
² Basser Dept. of CS, Madsen F09, University of Sydney, NSW 2006, Australia.

Abstract. We give a rigorous account of an algorithm that provides sequentially consistent replicated data on top of the view synchronous group communication service previously specified by Fekete, Lynch and Shvartsman. The algorithm performs updates at all members of a majority view, but rotates the work of queries among the members to equalize the load. The algorithm is presented and verified using I/O automata.

1 Introduction

Multicast group communication services are important building blocks for fault-tolerant applications that require reliable and ordered communication among multiple parties. These services manage their clients as collections of dynamically changing groups and provide strong intra-group multicast primitives. To remedy the existing lack of good specifications for these services and to facilitate consensus on what properties these services should exhibit, Fekete, Lynch and Shvartsman recently gave a simple automaton specification VS for a view-synchronous group communication service and demonstrated its power by using it to support a totally-ordered broadcast application TO [14, 13]. In this paper, we use VS to support a second application: a replicated data service that load balances queries and guarantees sequential consistency.

The service maintains a data object replicated at a fixed set of servers in a consistent and transparent fashion and enables the clients to update and query this object. We assume the underlying network is asynchronous, strongly-connected, and subject to processor and communication failures and recoveries involving omission, crashing or delay, but not Byzantine failures. The failures and recoveries may cause the network or its components to partition and merge.

The biggest challenge for the service is to cope with network partitioning while preserving correctness and maintaining liveness. We assume that executed updates cannot be undone, which implies that update operations must be processed in the same order everywhere. To avoid inconsistencies, the algorithm allows updates to occur only in primary components. Following the commonly used definition, primary components are defined as those containing a majority (or more generally a quorum) of all servers. Nonempty intersection of any two majorities (quorums) guarantees the existence of at most one primary at a given time and allows for the necessary flow of information between consecutive primaries. Our service guarantees processing of
update requests whenever there is a stable primary component, regardless of the past network perturbations.

On the other hand, processing of queries is not restricted to primary components, and is guaranteed provided the client's component eventually stabilizes. The service uses a round-robin load-balancing strategy to distribute queries to each server evenly within each component. This strategy makes sense in the commonly occurring situations when queries take approximately the same, non-negligible amount of time. Each query is processed with respect to a data state that is at least as advanced as the last state witnessed by the query's client. The service is arranged in such a way that the servers are always able to process the assigned queries, that is, they are not blocked by missing update information.

Architecturally, the service consists of the servers' layer and the communication layer. The servers' layer is symmetric: all servers run identical state-machines. The communication layer consists of two parts, a group communication service satisfying VS, and a collection of individual channels providing reliable reordering point-to-point communication between all pairs of servers. The servers use the group communication service to disseminate update and query requests to the members of their groups and rely on the properties of this service to enforce the formation of identical sequences of update requests at all servers and to schedule query requests correctly. The point-to-point channels are used to send the results of processed queries directly to the original servers.

Related Work

Group communication. A good overview of the rationale and usefulness of group communication services is given in [4]. Examples of implemented group communication services are Isis [5], Transis [10], Totem [25], Newtop [12], Relacs [3] and Horus [27]. Different services differ in the way they manage groups and in the specific ordering and delivery properties of their multicast primitives. Even though there is no consensus on what properties these services should provide, a typical requirement is to deliver messages in total order and within a view.

To be most useful, group communication services have to come with precise descriptions of their behavior. Many specifications have been proposed using a range of different formalisms [3, 6, 8, 11, 15, 24, 26]. Fekete, Lynch, and Shvartsman recently presented the VS specification for a partitionable group communication service. Please refer to [14] for a detailed description and comparison of VS with other specifications.

Several papers have since extended the VS specification. Chockler, Huleihel, and Dolev [7] have used the same style to specify a virtually synchronous FIFO group communication service and to model an adaptive totally-ordered group communication service. De Prisco, Fekete, Lynch and Shvartsman [9] have presented a specification for a group communication service that provides a dynamic notion of primary view.

Replication and Load Balancing. The most popular application of group communication services is for maintaining coherent replicated data through applying all operations in the same sequence at all copies. The details of doing this
in partitionable systems have been studied by Amir, Dolev, Friedman, Keidar, Melliar-Smith, Moser, and Vaysburd [18, 2, 1, 19, 16, 17].

In his recent book [4, p. 329], Birman points out that process groups are ideally suited for fault-tolerant load-balancing. He suggests two styles of load-balancing algorithms. In the first, more traditional, style, scheduling decisions are made by clients, and tasks are sent directly to the assigned servers. In the second style, tasks are multicast to all servers in the group; each server then applies a deterministic rule to decide on whether to accept each particular task.

In this paper, we use a round-robin strategy originally suggested by Birman [4, p. 329]. According to this strategy, query requests are sent to the servers using totally-ordered multicast; the ith request delivered in a group of n servers is assigned to the server whose rank within this group is (i mod n). This strategy relies on the fact that all servers receive requests in the same order, and guarantees a uniform distribution of requests among the servers of each group. We extend this strategy with a fail-over policy that reissues requests when group membership changes.

Sequential Consistency. There are many different ways in which a collection of replicas may provide the appearance of a single shared data object. The seminal work in defining these precisely is Lamport's concept of sequential consistency [21]. A system provides sequential consistency when for every execution of the system, there is an execution with a single shared object that is indistinguishable to each individual client. A much stronger coherence property is atomicity, where a universal observer can't distinguish the execution of the system from one with a single shared object. The algorithm of this paper provides an intermediate condition where the updates are atomic, but queries may see results that are not as up-to-date as those previously seen by other clients.

Contributions of this paper

This paper presents a new algorithm for providing replicated data on top of
a partitionable group communication system, in which the work of processing queries is rotated among the group replicas in a round-robin fashion. While the algorithm is based on previous ideas (the load-balancing processing of queries is taken from [4] and the update processing relates to [18, 2, 1, 19]) we are unaware of a previously published account of a way to integrate these. In particular, we show how queries can be processed in minority partitions, and how to ensure that the servers always have sufficiently advanced states to process the queries.

Another important advance in this work is that it shows how a verification can use some of the stronger properties of VS. Previous work [14] verified TO, an application in which all nodes within a view process messages identically (in a sense, the TO application is anonymous, since a node uses its identity only to generate unique labels). The proof in [14] uses the property of agreed message sequence, but it does not pay attention to the identical view of membership at all recipients. In contrast, this paper's load-balancing algorithm (and thus the proof) uses the fact that different recipients have the same membership set when they decide which member will respond to a query.
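The primary-component rule from the introduction (updates are allowed only in a component holding a majority of all servers, and any two majorities intersect) can be illustrated with a minimal sketch. The function names `is_primary` and `all_majorities` and the server names are hypothetical, chosen only for this example; the paper itself works with a general quorum set rather than simple majorities.

```python
from itertools import combinations


def is_primary(component, servers):
    """Return True when `component` holds a strict majority of `servers`."""
    members = set(component) & set(servers)
    return 2 * len(members) > len(set(servers))


def all_majorities(servers):
    """Enumerate every subset of `servers` that forms a strict majority."""
    ordered = sorted(servers)
    smallest = len(ordered) // 2 + 1
    return [set(combo)
            for size in range(smallest, len(ordered) + 1)
            for combo in combinations(ordered, size)]


# Hypothetical five-server system split into two components by a partition.
servers = {"p1", "p2", "p3", "p4", "p5"}
left, right = {"p1", "p2", "p3"}, {"p4", "p5"}
print(is_primary(left, servers), is_primary(right, servers))
```

Because every pair of majorities shares at least one server, at most one side of any partition can be primary, and some server always carries information from one primary to the next.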
The rest of the paper is organized as follows. Section 2 introduces basic terminology. Section 3 presents a formal specification for clients' view of the replicated service. Section 4 contains an intermediate specification for the service, the purpose of which is to simplify the proof of correctness. Section 5 presents an I/O automaton for the server's state-machine and outlines the proof of correctness.

2 Mathematical Foundations

We use standard and self-explanatory notation on sets, sequences, total functions (→), and partial functions (↪). Somewhat non-standard is our use of disjoint unions (+), which differs from the usual set union (∪) in that each element is implicitly tagged with what component it comes from. For simplicity, we use variable name conventions to avoid more formal "injection functions" and "matching constructs." Thus, for example, if Update and Query are the respective types for update and query requests, then type Request = Update + Query defines a general request type. Furthermore, if req ∈ Request, and u and q are the established variable conventions for Update and Query types, then "req ← u" and "req = q" are both valid statements.

The modeling is done in the framework of the I/O automaton model of Lynch and Tuttle [23] (without fairness), also described in Chapter 8 of [22]. An I/O automaton is a simple state-machine in which the transitions are associated with named actions, which can be either input, output, or internal. The first two are externally visible, and the last two are locally controlled. I/O automata are input-enabled, i.e., they cannot control their input actions. An automaton is defined by its signature (input, output and internal actions), set of states, set of start states, and a state-transition relation (a subset of the cross-product of states, actions, and states). An execution fragment is an alternating sequence of states and actions consistent with the transition relation. An execution is an execution fragment that begins with a start state. The subsequence of an execution consisting of all the external actions is called a trace. The external behavior is captured by the set of traces generated by its executions. Execution fragments can be concatenated. Compatible I/O automata can be composed to yield a complex system from individual components. The composition identifies actions with the same name in different component automata. When any component automaton performs a step involving action π, so do all component automata that have π in their signatures. The hiding operation reclassifies output actions of an automaton as internal.

Invariants of an automaton are properties that are true in all reachable states of that automaton. They are usually proved by induction on the length of the execution sequence. A refinement mapping is a single-valued simulation relation. To prove that one automaton implements another in the sense of trace inclusion, it is sufficient to present a refinement mapping from the first to the second. A function is proved to be a refinement mapping by carrying out a simulation proof, which usually relies on invariants (see Chapter 8 of [22]).

We describe the transition relation in a precondition-effect style (as in [22]), which groups together all the transitions that involve each particular type of action into a single atomic piece of code.
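As a rough illustration of the precondition-effect style, the sketch below encodes a toy FIFO channel as a state machine. The channel, the class name `FifoChannel`, and the helper `run` are assumptions made for this example and are not automata from this paper. The input action `send` has only an effect (inputs are always enabled), while the output action `receive` has a precondition that must hold before its effect is applied.

```python
class FifoChannel:
    """Toy automaton: input action send(m), output action receive(m)."""

    def __init__(self):
        self.queue = []  # start state: no messages in transit

    def send(self, m):
        # Input action. Eff: append m to the queue (no precondition).
        self.queue.append(m)

    def can_receive(self, m):
        # Pre: m is at the head of the queue.
        return bool(self.queue) and self.queue[0] == m

    def receive(self, m):
        # Output action. Eff: remove the head of the queue.
        if not self.can_receive(m):
            raise ValueError("precondition of receive(%r) does not hold" % (m,))
        self.queue.pop(0)


def run(automaton, actions):
    """Apply (name, argument) steps in order and return the resulting trace.

    Both actions of this toy automaton are external, so every step is
    recorded in the trace.
    """
    trace = []
    for name, arg in actions:
        getattr(automaton, name)(arg)
        trace.append((name, arg))
    return trace


channel = FifoChannel()
trace = run(channel, [("send", "a"), ("send", "b"), ("receive", "a")])
```

An invariant of this toy automaton, in the sense defined above, would be any property of `queue` that holds after every prefix of every such run.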
To access components of compound objects we use the dot notation. Thus, if dbs is a state variable of an automaton, then its instance in a state s is expressed as s.dbs. Likewise, if view is a state variable of a server p, then its instance in a state t is expressed as t[p].view or as p.view if t is clear from the discussion.

3 Service Specification S

In this section, we formally specify our replicated data service by giving a centralized I/O automaton S that defines its allowed behavior. The complete information on basic and derived types, along with a convention for variable usage, is given in Figure 1. The automaton S appears in Figure 2.

Fig. 1 Type information

  Var  Type                       Description
  c    C                          Finite set of client IDs. (c.proc refers to the server of c.)
  db   DB                         Database type with a distinguished initial value db0.
  a    Answer                     Answer type for queries. Answers for updates are {ok}.
  u    Update : DB → DB           Updates are functions from database states to database states.
  q    Query : DB → Answer        Queries are functions from database states to answers.
  r    Request = Update + Query   Request is a disjoint union of Update and Query types.
  o    Output = Answer + {ok}     Output is a disjoint union of Answer and {ok} types.

The interface between the service and its blocking clients is typical of a client-server architecture: clients' requests are delivered to S via input actions of the form request(r)_c, representing the submission of request r by a client c; S replies to its clients via actions of the form reply(o)_c, representing the delivery of output value o to a client c.

If our service were to satisfy atomicity (i.e., behave as a non-replicated service), then specification S would include a state variable db of type DB and would apply update and query requests to the latest value of this variable. In the replicated system, this would imply that processing of query requests would have to be restricted to the primary components of the network.

In order to eliminate this restriction and thus increase the availability of the service, we give a slightly weaker specification, which does not require queries to be processed with respect to the latest value of db, only with respect to the value that is at least as advanced as the last one witnessed by the queries' client. For this purpose, S maintains a history dbs of database states and keeps an index last(c) to the latest state seen by each client c.

Even though our service is not atomic, it still appears to each particular client as a non-replicated one, and thus, satisfies sequential consistency. Note that, since the atomicity has been relaxed only for queries, the service is actually stronger than the weakest one allowed by sequential consistency.

The assumption that clients block (i.e., do not submit any new requests until they get replies for their current ones) cannot be expressed within automaton S because, as an I/O automaton, it is input-enabled. To express this assumption formally, we close S by composing it with the automaton Env = ∏_{c∈C}(C_c), where each C_c models a nondeterministic blocking client c (see Figure 3); real blocking clients can be shown to implement this automaton. In the closed automaton S̄, the request actions are forced to alternate with the reply actions,
which models the assumed behavior. In the rest of the paper, we consider the closed versions of the presented automata, denoting them with a bar (e.g., S̄).

Fig. 2 Specification S

  Signature:
    Input:    request(r)_c, r ∈ Request, c ∈ C
    Output:   reply(o)_c, o ∈ Output, c ∈ C
    Internal: update(c, u), c ∈ C, u ∈ Update
              query(c, q, l), c ∈ C, q ∈ Query, l ∈ N

  State:
    dbs ∈ SEQ0 DB, initially db0. Sequence of database states. Indexing from 0 to |dbs| − 1.
    map ∈ C ↪ (Request + Output), initially ⊥. Buffer for the clients' pending requests or replies.
    last ∈ C → N, initially 0 for every client. Index of the last db state witnessed by id.

  Transitions:
    request(r)_c
      Eff: map(c) ← r

    update(c, u)
      Pre: u = map(c)
      Eff: dbs ← dbs + u(dbs[|dbs| − 1])
           map(c) ← ok
           last(c) ← |dbs| − 1

    query(c, q, l)
      Pre: q = map(c)
           last(c) ≤ l ≤ |dbs| − 1
      Eff: map(c) ← q(dbs[l])
           last(c) ← l

    reply(o)_c
      Pre: map(c) = o
      Eff: map(c) ← ⊥

Fig. 3 Client Specification C_c

  Signature:
    Input:  reply(o)_c, o ∈ Output
    Output: request(r)_c, r ∈ Request

  State:
    busy ∈ Bool, initially false. Status flag. Keeps track of whether there is a pending request.

  Transitions:
    request(r)_c
      Pre: busy = false
      Eff: busy ← true

    reply(o)_c
      Eff: busy ← false

4 Intermediate Specification D

Fig. 4 Intermediate Specification D

  Signature: Same as in S, with the addition of an internal action service(c), c ∈ C.

  State: Same as in S, with the addition of a state variable delay ∈ C ↪ N, initially ⊥.

  Transitions: Same as in S, except update is modified and service is defined.
    update(c, u)
      Pre: u = map(c)
           c ∉ dom(delay)
      Eff: dbs ← dbs + u(dbs[|dbs| − 1])
           delay(c) ← |dbs| − 1

    service(c)
      Pre: c ∈ dom(delay)
      Eff: map(c) ← ok
           last(c) ← delay(c)
           delay(c) ← ⊥

Action update of specification S accomplishes two logical tasks: It updates the centralized database, and it sets client-specific variables, map(c) and last(c), to their new values. In a distributed setting, these two tasks are generally accomplished by two separate transitions. To simplify the refinement mapping between the implementation and the specification, we introduce an intermediate layer D (see Figure 4), in which these tasks are separated. D is formed by splitting each update action of S into two, update and service. The first one extends
dbs with a new database state, but instead of setting map(c) to "ok" and last(c) to its new value as in S, it saves this value (i.e., the index to the most recent database state witnessed by c) in delay buffer. The second action sets map(c) to "ok" and uses information stored in delay to set last(c) to its value.

Lemma 1 The following function DS() is a refinement from D to S with respect to reachable states of D and S.¹

  DS(d : D) → S =
    s.dbs    = d.dbs
    s.map    = overlay(d.map, {⟨c, ok⟩ | c ∈ dom(d.delay)})
    s.last   = overlay(d.last, d.delay)
    s.busy_c = d.busy_c for all c ∈ C

Transitions of D simulate transitions of S with the same actions, except for those that involve service; these simulate empty transitions. Given this correspondence, the mapping and the proof are straightforward. The lemma implies that D implements S in the sense of trace inclusion. Later, we prove the same result about implementation T and specification D, which by transitivity of the "implements" relation implies that T implements S in the sense of trace inclusion.

¹ Given f, g : X ↪ Y, overlay(f, g) is as g over dom(g) and as f elsewhere.

5 Implementation T

The figure below depicts the major components of the system and their interactions. Set P represents the set of servers. Each server p ∈ P runs an identical state-machine VStoD_p and serves the clients whose c.proc = p.

[Figure: servers VStoD_p and VStoD_q, each exchanging request(r) and reply(r) actions with its clients c and c′; the servers interact with VS through gpsnd(m), gprcv(m), safe(m) and newview(v) actions, and with each other through the point-to-point channels PTP.]

The I/O automaton T for the service implementation is a composition of the servers' layer I = ∏_{p∈P}(VStoD_p) with the group-communication service specification VS [14, see Appendix A] and a collection PTP of reliable reordering point-to-point channels between any pair of servers [22, pages 460-461], with all the output actions of this composition hidden, except for the servers' replies.

$$T = \mathrm{hide}_{\,\mathrm{out}(I \times VS \times PTP) \,-\, \{reply(o)_c\}}\,(I \times VS \times PTP).$$
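To make Lemma 1 concrete, the overlay operator of footnote 1 and the mapping DS can be sketched over dictionary-based states. This is only an illustration under assumptions: partial functions are modelled as Python dicts, a state is a dict with the keys `dbs`, `map`, `last`, `delay` and `busy`, and the sample state `d` is hypothetical.

```python
def overlay(f, g):
    """overlay(f, g): agrees with g on dom(g) and with f elsewhere."""
    result = dict(f)
    result.update(g)
    return result


def DS(d):
    """Map a state of D to the corresponding state of S (sketch of Lemma 1)."""
    return {
        "dbs": list(d["dbs"]),
        # Clients with a delayed update already appear as answered in S.
        "map": overlay(d["map"], {c: "ok" for c in d["delay"]}),
        # Their witnessed index in S is the index saved in delay.
        "last": overlay(d["last"], d["delay"]),
        "busy": dict(d["busy"]),
    }


# Hypothetical state of D: c1's update has extended dbs, but service(c1)
# has not happened yet; c2 has a pending query "q".
d = {
    "dbs": ["db0", "db1", "db2"],
    "map": {"c1": "u", "c2": "q"},
    "last": {"c1": 0, "c2": 0},
    "delay": {"c1": 2},
    "busy": {"c1": True, "c2": True},
}
s = DS(d)
```

In this sample, a later service(c1) step of D would leave `DS(d)` unchanged, which matches the statement that service steps simulate empty transitions of S.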
5.1 The Server's State-Machine VStoD_p

The additional type and variable-name convention information appears in Figure 5. The I/O code for the VStoD_p state machine is given in Figures 6 and 7.

Fig. 5 Additional Type Declaration

  Var  Type                                   Description
  g    ⟨G, <_G, g0⟩                           Totally-ordered set of view ids with the smallest element.
  v    V = G × 𝒫(P)                           An element of this set is called a view. Fields: id and set.
  Q    𝒬 ⊆ 𝒫(P)                               Fixed set of quorums. For any Q ∈ 𝒬 and Q′ ∈ 𝒬, Q ∩ Q′ ≠ ∅.
  x    X = G × (C × Update)* × N              Expertise information for exchange process. Fields: xl, us, su.
  m    M = C × Update + C × Query × N + X     Messages sent via VS: Either update requests, query requests, or expertise information for exchange process.
  pkt  Pkt = C × Answer × N × G               Packets sent via PTP. (N is index of the witnessed db state.)

The activity of the server's state-machine can be either normal, marked by mode being normal, or recovery, marked by mode being either expertise-broadcast or expertise-collection. Normal activity is associated with the server's participation in an already established view, while recovery activity is associated with a newly forming one. We also distinguish whether or not the server is a member of a primary view, which is defined as that whose members comprise a quorum (view.set ∈ 𝒬).

Fig. 6 Implementation (VStoD_p): Signature and State Variables

  Signature:
    Input:    request(r)_c, r ∈ Request, c ∈ C, c.proc = p
              gprcv(m)_{p′,p}, m ∈ M, p′ ∈ P
              safe(m)_{p′,p}, m ∈ M, p′ ∈ P
              newview(v)_p, v ∈ V
              ptprcv(pkt)_{p′,p}, pkt ∈ Pkt, p′ ∈ P
    Output:   reply(o)_c, o ∈ Output, c ∈ C, c.proc = p
              gpsnd(m)_p, m ∈ M
              ptpsnd(pkt)_{p,p′}, pkt ∈ Pkt, p′ ∈ P
    Internal: update(c, u), c ∈ C, u ∈ Update
              query(c, q, l), c ∈ C, q ∈ Query, l ∈ N

  State:
    db ∈ DB, initially db0. Local replica. Next state depends on current and action.
    map ∈ C|(c.proc = p) ↪ Request + Output, initially ⊥. Buffer that maps clients to their requests or replies.
    pending ∈ 𝒫(C|(c.proc = p)), initially ∅. Set of clients whose requests are being processed.
    last ∈ C|(c.proc = p) → N, initially C|(c.proc = p) → 0. Index of the last db state seen by each client.
    updates ∈ (C × Update)*, initially []. Sequence of updates. Indexing from 1. Fields: c and u.
    last-update ∈ N, initially 0. Index of the last executed element in updates.
    safe-to-update ∈ N, initially 0. Index of the last "safe-to-update" element in updates.
    queries ∈ C ↪ (Query + Answer) × N, initially ⊥. Query requests or answers, paired with their last(c).
    query-counter ∈ N, initially 0. Number of queries received within current view.
    view ∈ V, initially v0 = ⟨g0, P⟩. Current view of p. Fields: id and set.
    mode ∈ {normal, expertise-broadcast, expertise-collection}, initially normal. Modes of operation. The last two are for recovery.
    expertise-level ∈ G, initially g0. The highest primary view id that p knows of.
    expertise-max ∈ X, initially ⟨g0, [], 0⟩. Cumulative expertise collected during recovery.
    expert-counter1 ∈ N, initially 0. Number of expertise messages received so far.
    expert-counter2 ∈ N, initially 0. Number of expertise messages received so far as safe.

Processing of query requests is handled by actions of the type gpsnd(c, q, l)_p, gprcv(c, q, l)_{p′,p}, query(c, q, l)_p, ptpsnd(c, a, l, g)_{p,p′}, and ptprcv(c, a, l, g)_{p′,p}. The fact that servers of the same view receive query requests in the same order guarantees that the scheduling function of gprcv(c, q, l)_{p′,p} distributes query requests uniformly among the servers of one view.
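The scheduling rule can be sketched as follows: every member of a view counts the query requests it is delivered, and it keeps a request only when its own rank equals the counter modulo the view size. This is a minimal illustration, not the paper's automaton. The class `QueryScheduler`, the underscore spelling `query_counter`, and in particular the definition of `rank` as the position in the sorted membership are assumptions of this example; any fixed total order on servers that all members agree on would serve.

```python
def rank(p, members):
    """Position of p in the sorted membership (assumed ranking rule)."""
    return sorted(members).index(p)


class QueryScheduler:
    """Query-scheduling fragment of one server p within a fixed view."""

    def __init__(self, p, view_set):
        self.p = p
        self.view_set = frozenset(view_set)
        self.query_counter = 0   # queries delivered in the current view
        self.queries = {}        # client -> (query, last index seen by client)

    def gprcv_query(self, c, q, l):
        # Accept the request only if it is this server's turn.
        if rank(self.p, self.view_set) == self.query_counter % len(self.view_set):
            self.queries[c] = (q, l)
        self.query_counter += 1


# Hypothetical view of three servers; every member is delivered the same
# six query requests in the same order, as totally-ordered multicast ensures.
view = {"p1", "p2", "p3"}
servers = {p: QueryScheduler(p, view) for p in view}
for i in range(6):
    for s in servers.values():
        s.gprcv_query("c%d" % i, "q%d" % i, 0)
assignment = {p: sorted(s.queries) for p, s in servers.items()}
```

Because all members see the same delivery order and the same membership set, each request is accepted by exactly one server and the load is spread evenly, without any extra coordination messages.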
Fig. 7 Implementation VStoD_p: Transitions

  Transitions:
    request(r)_c
      Eff: map(c) ← r

    reply(o)_c
      Pre: map(c) = o
      Eff: map(c) ← ⊥

    gpsnd(c, q, l)_p
      Pre: mode = normal
           q = map(c) ∧ c ∉ pending
           l = last(c)
      Eff: pending ← pending ∪ c

    gprcv(c, q, l)_{p′,p}
      Eff: if (rank(p, view.set) = query-counter mod |view.set|)
             then queries(c) ← ⟨q, l⟩
           query-counter ← query-counter + 1

    query(c, q, l)
      Pre: ⟨q, l⟩ ∈ queries(c)
           last-update ≥ l
      Eff: queries(c) ← ⟨q(db), last-update⟩

    ptpsnd(c, a, l, g)_{p,p′}
      Pre: c ∈ dom(queries) ∧ c.proc = p′
           ⟨a, l⟩ ∈ queries(c)
           g = view.id
      Eff: queries(c) ← ⊥

    ptprcv(c, a, l, g)_{p′,p}
      Eff: if (g = view.id ∧ c.proc = p) then
             map(c) ← a
             last(c) ← l
             pending ← pending − c

    gpsnd(c, u)_p
      Pre: mode = normal ∧ view.set ∈ 𝒬
           u = map(c) ∧ c ∉ pending
      Eff: pending ← pending ∪ c

    gprcv(c, u)_{p′,p}
      Eff: updates ← updates + ⟨c, u⟩

    safe(c, u)_{p′,p}
      Eff: safe-to-update ← safe-to-update + 1

    update(c, u)
      Pre: last-update < safe-to-update
           ⟨c, u⟩ = updates[last-update + 1]
      Eff: last-update ← last-update + 1
           db ← u(db)
           if (c.proc = p) then
             map(c) ← ok
             last(c) ← last-update
             pending ← pending − c

    newview(v)_p
      Eff: queries ← ⊥; query-counter ← 0
           pending ← pending − {c | (∃q : ⟨c, q⟩ ∈ map)}
           expertise-max ← ⟨g0, [], 0⟩
           expert-counter1 ← 0; expert-counter2 ← 0
           safe-to-update ← max(safe-to-update, max{last(c) | c ∈ C ∧ c.proc = p})
           view ← v
           mode ← expertise-broadcast

    gpsnd(x)_p
      Pre: mode = expertise-broadcast
           x = ⟨expertise-level, updates, safe-to-update⟩
      Eff: mode ← expertise-collection

    gprcv(x)_{p′,p}
      Eff: expertise-max ← max_X(expertise-max, x)
           expert-counter1 ← expert-counter1 + 1
           if (expert-counter1 = |view.set|) then
             expertise-level ← expertise-max.xl
             updates ← expertise-max.us
             safe-to-update ← expertise-max.su

    safe(x)_{p′,p}
      Eff: expert-counter2 ← expert-counter2 + 1
           if (expert-counter2 = |view.set|) then
             if (view.set ∈ 𝒬) then
               expertise-level ← view.id
               safe-to-update ← |expertise-max.us|
               pending ← pending − {c | c ∈ pending ∧ c ∉ updates[(last-update + 1) .. safe-to-update].c}
             mode ← normal

Servicing of each query by a background thread query(c, q, l)_p is allowed only when the current state of the local database is at least as advanced as the last state witnessed by its client. This condition is captured by last-update ≥ l. The non-trivial part of this protocol is that the service actually guarantees that the servers always have sufficiently advanced database states to be able to service the queries that are assigned to them.

When a server learns of its new view, it executes a simple query-related recovery procedure, in which it moves its own pending queries for reprocessing and erases any information pertaining to the queries of others.

Processing of update requests is handled by actions of the type gpsnd(c, u)_p, gprcv(c, u)_{p′,p}, safe(c, u)_{p′,p}, and update(c, u). Each server maintains a sequence updates of update requests, the purpose of which is to enforce the order in which updates are applied to the local database replica. The sequence is extended each time an update request is delivered via a gprcv action. The sequence has two
distinguished prefixes updates[1..safe-to-update] and updates[1..last-update], called safe and done, that mark respectively those update requests that are safe to execute and those that have already been executed. The safe prefix is extended to cover a certain update request on updates sequence when the server learns that the request has been delivered to all other members of that server's view.² The service guarantees that at all times safe and done prefixes of all servers are consistent (i.e., given any two, one is a prefix of another). Since done prefixes mark those update requests that have been applied to database replica, this property implies mutual consistency of database replicas.

When a server learns of its new view, it starts a recovery activity that is handled by actions of the type newview(v)_p, gpsnd(x)_p, gprcv(x)_{p′,p}, and safe(x)_{p′,p}. The query-related part of this activity was described above. For the update-related part, the server has to collaborate with others on ensuring that the states of all the servers of this view are consistent with their and other servers' past execution histories and are suitable for their subsequent normal activity.

For this purpose, each server has to be able to tell how advanced its state is compared to those of others. The most important criterion is the latest primary view of which the server knows. This knowledge may have come directly from personal participation in that view, or indirectly from another server. The server keeps track of this information in its state variable expertise-level. Two other criteria are the server's updates sequence and its safe prefix. The values of these three variables comprise the server's expertise.

Definition 1 The cumulative expertise, max_X(X), of a set or a sequence, X, of expertise elements is defined as the following triple
$$\max\nolimits_X(X) \;=\; \max\nolimits_{<_G}\{\,x.xl \mid x \in X\,\};\quad \max\nolimits_{<_{|\cdot|}}\{\,x.us \mid (x \in X) \wedge (x.xl \in \max\nolimits_{<_G}\{\,x.xl \mid x \in X\,\})\,\};\quad \max\nolimits_{<_N}\{\,x.su \mid x \in X\,\}.$$

² Some of the optimistic protocols, such as [16, 17], execute requests as soon as they are delivered by a total order multicast (ABCAST of Horus), but may result in inconsistent replicas, in which case they have to undo actions and roll the replicas' states back. On the other hand, pessimistic protocols, which implement strict mutual consistency among replicas, require additional information before they are able to execute a delivered request. The pessimistic version in [17] allows for a request to be executed only when a server collects a majority of acknowledgments, which have to be multicast by each server once it receives the request. Amir, Dolev, Melliar-Smith, and Moser in [1, 2] eliminate the need for end-to-end acknowledgments by using total order multicast with safe delivery, i.e., a message delivered to one member is guaranteed to be delivered to any other member of the same view provided it does not crash. As pointed out in [14, 13], "a simple 'coordinated attack' argument (as in Chapter 5 of [22]) shows that in a partitionable system, this notion of safe delivery is incompatible with having all recipients in exactly the same view as the sender." As a result, protocols based on this multicast primitive are more complicated than those based on VS, which separates message delivery and safe notification events.
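Definition 1 can be read as a component-wise maximum with one twist: the updates sequence is taken only from elements whose expertise level is the highest. The sketch below is a minimal illustration under assumptions: view ids are modelled as integers, update requests as strings, and the names `Expertise` and `max_x` are hypothetical. When several candidate sequences have the same maximal length the sketch returns the first one, and it raises ValueError on an empty input.

```python
from typing import NamedTuple, Tuple


class Expertise(NamedTuple):
    xl: int                # highest primary view id known (ints assumed)
    us: Tuple[str, ...]    # updates sequence
    su: int                # safe-to-update index


def max_x(elements):
    """Cumulative expertise of a non-empty collection of Expertise values."""
    elements = list(elements)
    # Highest expertise level under the order on view ids.
    top_level = max(x.xl for x in elements)
    # Longest updates sequence among elements at the highest level.
    longest = max((x.us for x in elements if x.xl == top_level), key=len)
    # Largest safe-to-update index over all elements.
    safest = max(x.su for x in elements)
    return Expertise(top_level, longest, safest)


# Hypothetical expertise reported by three servers during recovery.
reports = [
    Expertise(1, ("a", "b", "c"), 3),
    Expertise(2, ("a", "b"), 1),
    Expertise(2, ("a", "b", "d"), 2),
]
cumulative = max_x(reports)
print(cumulative)
```

Note that the sequence reported at level 1 is ignored even though it is as long as the winner, because only sequences from the highest known primary view are candidates.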
As a first step, the server's collaboration with others during recovery activity aims at advancing everyone's expertise to the highest one known to them, namely their cumulative expertise (see Def. 1). Notice that adopting cumulative expertise of other servers cannot cause inconsistency among replicas. The first step is completed with a delivery of the last expertise message via action gprcv(x)_{p′,p}.

Advancing the server's expertise achieves two purposes. First, it ensures the propagation of update requests to previously inaccessible replicas. Second, it ensures the future ability of servers to process the queries assigned to them.

In addition to advancing their expertise, the servers of primary views have to ensure their ability to process new update requests once they resume their normal activity. This requires them to start normal activity with identical updates sequences, the entire content of which is safe and contains as prefixes the safe prefixes of all other servers in the system. For this purpose, once the server of a primary view learns that all expertise messages have been delivered to all servers of this view, it extends its safe prefix to cover the entire updates sequence adopted during the exchange process.

The resultant safe prefix acts as a new base that all servers of the future primary views will contain in their updates sequences. Attainment of this behavior depends on the intersection property of primary views and the fact that subsequent primary views have higher identifiers.

The established base works as a divider: partially processed update requests that are not included in the base will never find a way to a safe prefix unless they are resubmitted by their original servers. Therefore, once a server of a primary view establishes the base, it moves all pending update requests that are not in this base back for reprocessing. After this step, the server may resume its normal activity, which enables it to process new update and query requests.

5.2 Refinement Mapping from T to D
Automaton D has five types of actions. Actions of the types request(r)_c and reply(o)_c are simulated when T takes the corresponding actions. Actions of the type query(c) are simulated when T executes ptprcv(c, a, l, g)_{p′,p} with g = p.view.id. The last two types, update(c) and service(c), are both simulated under certain conditions when T executes update(c, u)_p. We define actions update(c, u)_p of T as leading when t[p].last-update = max_℘{t[℘].last-update}, and as native when c.proc = p. Actions that are just leading simulate update(c), that are just native simulate service(c), that are both leading and native simulate "update(c), service(c)", and that are neither simulate empty transitions. Transitions of T with any other actions simulate empty transitions of D.

Lemma 2 The following function is a refinement from T to D with respect to reachable states of T and D.³

³ If s is "f1, f2, ..., fn" with each fi : A → A, and if a ∈ A, then scan(s) = "f1, (f2 ∘ f1), ..., (fn ∘ ... ∘ f2 ∘ f1)" and map(s, a) = "f1(a), f2(a), ..., fn(a)".
  TD(t : T) → D =
    let t.done = t[℘].updates[1..t[℘].last-update], where ℘ ∈ P is any such that
        t[℘].last-update = max_{p∈P}{t[p].last-update}
    dbs    = db0 + map(scan(t.done), db0)
    map    = ∪_{p∈P} t[p].map
    last   = ∪_{p∈P} t[p].last
    delay  = {⟨t.done[i].c, i⟩ | 1 ≤ i ≤ |t.done| ∧ t[t.done[i].c.proc].last-update < i}
    busy_c = t.busy_c for all c ∈ C

An invariant will show that sequences of processed requests at different servers are consistent. In particular, all sequences which have maximum length are the same. t.done is a derived variable that denotes the longest sequence of update requests processed in the system. This sequence corresponds to all modifications done to the database of D, which explains the way TD(t).dbs is defined. Domain of TD(t).delay consists of ids of update requests that have been processed somewhere (i.e., in t.done) but not at their native locations (i.e., the last-update at their native locations have not yet surpassed these update requests). With each c in this domain we associate its position in sequence t.done. This position corresponds to the last database state witnessed by client c, which explains the way D.delay is defined.

Fig. 8 Invariants used in the proof that TD() is a refinement mapping (Lemma 2)

  I1 For each server p ∈ P, p.last-update ≤ p.safe-to-update ≤ |p.updates|.
  I2 For any two servers p1 and p2 ∈ P, if the lengths of their done prefixes are the same, then their done prefixes are the same:
     p1.last-update = p2.last-update ⇒ p1.updates[1..p1.last-update] = p2.updates[1..p2.last-update].
  I3 Any update request that is safe somewhere but has not been executed at its native location is still reflected in its native map and pending buffers: if ⟨c, u⟩ = p.updates[i] and c.proc.last-update < i ≤ p.safe-to-update, then ⟨c, u⟩ ∈ c.proc.map and c ∈ c.proc.pending.
  I4 At most one unexecuted update request per each client can appear at that client's server: For any client c ∈ C, there exists at most one index i ∈ N such that i > c.proc.last-update and c = c.proc.updates[i].c.
  I5 For all PTP packets ⟨c, a, l, g⟩ on an in-transit_{p′,p} channel, it follows that c.proc = p. Moreover, if p.view.id = g then
     (a) c ∈ dom(p.map) ∧ p.map(c) ∈ Query
     (b) c ∈ p.pending
     (c) a = p.map(c)(compose(p.updates[1..l])(db0))
     (d) l ≥ p.last(c)
     (e) l ≤ max_℘{℘.last-update}

The proof of Lemma 2 is straightforward given the five top-level invariants in Figure 8. To prove these invariants assertionally we have developed an interesting approach [20]: one of the fundamental invariants states that safe prefixes of updates sequences at all servers are consistent. To prove this fact, it is not enough to have properties only about safe prefixes; we need invariants that deal also with unsafe portions of updates sequences (because the latter become the former during an execution). Invariants that relate safe prefixes and updates sequences of different servers depend on the servers' expertise level, which may have come
to a server directly from the participation in a primary view, or indirectly from someone else. In our proof, we have invented a derived function X that expresses recursively the highest expertise achieved by each server in each view in terms of servers' expertise in earlier views. In a sense, it presents the law according to which the replication part of the algorithm operates. The recursive nature of this function makes proofs by induction easy: proving an inductive step involves unwinding only one recursive step of the derived function X.

6 Future Work

This paper has dealt with safety properties; future work will consider performance and fault-tolerance properties, stated conditionally to hold in periods of good behavior of the underlying network. In particular, we are planning to compare the response time of this algorithm with others which share query load differently, for example based on recent run-time load reports which are disseminated by multicast.

Other possible extensions to this work involve determining primary views dynamically, using a service such as the one in [9], and integrating the unicast message communication into the group communication layer.

References

1. Y. Amir. Replication using Group Communication over a Partitioned Network. PhD thesis, The Hebrew University of Jerusalem, Israel, 1995.
2. Y. Amir, D. Dolev, P. Melliar-Smith, and L. Moser. Robust and efficient replication using group communication. Technical Report 94-20, The Hebrew University of Jerusalem, Israel, 1994.
3. O. Babaoglu, R. Davoli, L. Giachini, and P. Sabattini. The inherent cost of strong-partial view-synchronous communication. LNCS, 972:72-86, 1995.
4. K. P. Birman. Building Secure and Reliable Network Applications. Manning Publications Co., Greenwich, CT, 1996.
5. K. P. Birman and R. van Renesse, editors. Reliable Distributed Computing with the Isis Toolkit. IEEE Computer Society Press, 1994.
6. T. D. Chandra, V. Hadzilacos, S. Toueg, and B. Charron-Bost. On the impossibility of group membership. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, pages 322-330, New York, USA, May 1996.
7. G. V. Chockler, N. Huleihel, and D. Dolev. An adaptive totally ordered multicast protocol that tolerates partitions. In Proceedings of the 17th Annual ACM Symposium on Principles of Distributed Computing, pages 237-246, 1998.
8. F. Cristian. Group, majority, and strict agreement in timed asynchronous distributed systems. In Proceedings of the Twenty-Sixth International Symposium on Fault-Tolerant Computing, pages 178-189, Washington, June 25-27, 1996. IEEE.
9. R. De Prisco, A. Fekete, N. Lynch, and A. Shvartsman. A dynamic view-oriented group communication service. In Proceedings of the 17th Annual ACM Symposium on Principles of Distributed Computing, pages 227-236, 1998.
10. D. Dolev and D. Malki. The Transis approach to high availability cluster communication. Communications of the ACM, 39(4):64-70, Apr. 1996.
11. D. Dolev, D. Malki, and R. Strong. A framework for partitionable membership service. Technical Report TR94-6, Department of Computer Science, Hebrew University, 1994.
12.P.D.Ezhilchelvan,R.A.Mac^edo,andS.K.Shrivastava.Newtop:Afault-tolerant 13.A.Fekete,N.Lynch,andA.Shvartsman.Specifyingandusingapartionablegroup groupcommunicationprotocol.inproceedingsofthe15thinternationalconference 14.A.Fekete,N.Lynch,andA.Shvartsman.Specifyingandusingapartionablegroup ondistributedcomputingsystems(icdcs'95),pages296{306,losalamitos,ca, USA,May30{June2,1995.IEEEComputerSocietyPress. communicationservice.extendedversion,http://theory.lcs.mit.edu/tds. 15.R.FriedmanandR.vanRenesse.StrongandweakvirtualsynchronyinHorus. communicationservice.inproceedingsofthesixteenthannualacmsymposium onprinciplesofdistributedcomputing,pages53{62,santabarbara,california, Aug.21{24,1997. TechnicalReportTR95-1537,CornellUniversity,ComputerScienceDepartment, 17.R.FriedmanandA.Vaysburd.High-performancereplicateddistributedobjects 16.R.FriedmanandA.Vaysburd.Implementingreplicatedstatemachinesoverpartitionablenetworks.TechnicalReportTR96-1581,CornellUniversity,Computer Aug.24,1995. Science,Apr.17,1996. 18.I.Keidar.Ahighlyavailableparadigmforconsistentobjectreplication.Master's ComputerScience,July16,1997. inpartitionableenvironments.technicalreporttr97-1639,cornelluniversity, 19.I.KeidarandD.Dolev.Ecientmessageorderingindynamicnetworks.InProceedingsofthe15thAnnualACMSymposiumonPrinciplesofDistributedComputing, 1994. pages68{76,newyork,usa,may1996. thesis,instituteofcomputerscience,thehebrewuniversityofjerusalem,israel, 20.R.I.Khazan.Groupcommunicationasabaseforaload-balancingreplicated 21.L.Lamport.Howtomakeamultiprocessorcomputerthatcorrectlyexecutes 22.N.A.Lynch.DistributedAlgorithms.MorganKaufmannseriesindatamanagementsystems.MorganKaufmannPublishers,LosAltos,CA94022,USA,1996. dataservice.master'sthesis,departmentofelectricalengineeringandcomputer Science,MassachusettsInstituteofTechnology,Cambridge,MA02139,May1998. multiprocessprograms.ieeetransactionsoncomputers,c-28(9):690{691,1979. 
24.L.E.Moser,Y.Amir,P.M.Melliar-Smith,andD.A.Agarwal.Extendedvirtualsynchrony.InProceedingsofthe14thInternationalConferenceonDistributed 23.N.A.LynchandM.R.Tuttle.Anintroductiontoinput/outputautomata. CWIQuarterly,2(3):219{246,1989.AlsoavailableasMITTechnicalMemo MIT/LCS/TM-373. ComputingSystems,pages56{65,LosAlamitos,CA,USA,June1994.IEEEComputerSocietyPress. 26.A.M.Ricciardi,A.Schiper,andK.P.Birman.Understandingpartitionsand 25.L.E.Moser,P.M.Melliar-Smith,D.A.Agarwal,R.K.Budhia,andC.A.Lingley-Papadopoulos.Totem:Afault-tolerantmulticastgroupcommunicationsystem. 27.R.vanRenesse,K.P.Birman,andS.Maeis.Horus:Aexiblegroupcommunicationsystem.CommunicationsoftheACM,39(4):76{83,Apr.1996. the\nopartition"assumption.technicalreporttr93-1355,cornelluniversity, ComputerScienceDepartment,June1993. CommunicationsoftheACM,39(4):54{63,Apr.1996.
A The VS Specification

The VS specification of [14,13] is reprinted in Figure 9. M denotes a message alphabet and ⟨G, <_G, g0⟩ is a totally-ordered set of view identifiers with an initial view identifier. An element of the set V = G × 𝒫(P) is called a view. If v is a view, we write v.id and v.set to denote its components.

Fig. 9 VS-machine

  Signature:
    Input:
      gpsnd(m)_p, m ∈ M, p ∈ P
    Output:
      gprcv(m)_{p,q}, hidden g, m ∈ M, p, q ∈ P, g ∈ G
      safe(m)_{p,q}, hidden v, m ∈ M, p, q ∈ P, v ∈ views
      newview(v)_p, v ∈ views, p ∈ P, p ∈ v.set
    Internal:
      createview(v), v ∈ views
      vs-order(m, p, g), m ∈ M, p ∈ P, g ∈ G

  State:
    created ⊆ views, initially {⟨g0, P⟩}
    for each p ∈ P:
      currentviewid[p] ∈ G, initially g0
    for each g ∈ G:
      queue[g], a finite sequence of M × P, initially empty
    for each p ∈ P, g ∈ G:
      pending[p, g], a finite sequence of M, initially empty
      next[p, g] ∈ N^{>0}, initially 1
      nextsafe[p, g] ∈ N^{>0}, initially 1

  Transitions:
    createview(v)
      Pre: v.id > max(g : ∃S, ⟨g, S⟩ ∈ created)
      Eff: created ← created ∪ {v}

    newview(v)_p
      Pre: v ∈ created
           v.id > currentviewid[p]
      Eff: currentviewid[p] ← v.id

    gpsnd(m)_p
      Eff: append m to pending[p, currentviewid[p]]

    vs-order(m, p, g)
      Pre: m is head of pending[p, g]
      Eff: remove head of pending[p, g]
           append ⟨m, p⟩ to queue[g]

    gprcv(m)_{p,q}, hidden g
      Pre: g = currentviewid[q]
           queue[g](next[q, g]) = ⟨m, p⟩
      Eff: next[q, g] ← next[q, g] + 1

    safe(m)_{p,q}, hidden g, S
      Pre: g = currentviewid[q]
           ⟨g, S⟩ ∈ created
           queue[g](nextsafe[q, g]) = ⟨m, p⟩
           for all r ∈ S: next[r, g] > nextsafe[q, g]
      Eff: nextsafe[q, g] ← nextsafe[q, g] + 1

VS specifies a partitionable service in which, at any moment of time, every client has precise knowledge of its current view. VS does not require clients to learn about every view of which they are members, nor does it place any consistency restrictions on the membership of concurrent views held by different clients. Its only view-related requirement is that views are presented to each client according to the total order on view identifiers. VS provides a multicast service that imposes a total order on messages submitted within each view, and delivers them according to this order, with no omissions, and strictly within a view. In other words, the sequence of messages received by each client while in a certain view is a prefix of the total order on messages associated with that view.
Separately from the multicast service, VS provides a "safe" notification once a message has been delivered to all members of the view.
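
The state and transitions of the VS-machine can be exercised with a small executable model. The following is a minimal Python sketch under stated assumptions, not part of the specification of [14,13]: the class name VSMachine, the method names, the use of integers as view identifiers, and the convention that gprcv and safe return None when their precondition does not hold are all choices made here for illustration. The "hidden" parameters of the automaton are derived from the state instead of being passed in.

```python
from collections import defaultdict


class VSMachine:
    """Illustrative model of the VS-machine; names are assumptions."""

    def __init__(self, procs, g0=0):
        # created: view id -> member set; initially the single view <g0, P>.
        self.created = {g0: frozenset(procs)}
        self.current = {p: g0 for p in procs}    # currentviewid[p]
        self.queue = defaultdict(list)           # queue[g]: list of (m, p)
        self.pending = defaultdict(list)         # pending[p, g]: list of m
        self.next = defaultdict(lambda: 1)       # next[p, g], 1-based
        self.nextsafe = defaultdict(lambda: 1)   # nextsafe[p, g], 1-based

    def createview(self, gid, members):
        # Pre: v.id is greater than every identifier already created.
        assert gid > max(self.created)
        self.created[gid] = frozenset(members)

    def newview(self, gid, p):
        # Pre: v is created, p is a member, and v.id > currentviewid[p].
        assert gid in self.created and p in self.created[gid]
        assert gid > self.current[p]
        self.current[p] = gid

    def gpsnd(self, m, p):
        # Input action: always enabled; the message joins p's current view.
        self.pending[p, self.current[p]].append(m)

    def vs_order(self, p, g):
        # Moves the head of pending[p, g] to the tail of the total order.
        m = self.pending[p, g].pop(0)
        self.queue[g].append((m, p))

    def gprcv(self, q):
        # Delivers to q the next message of its current view, if any.
        g = self.current[q]
        i = self.next[q, g]
        if i > len(self.queue[g]):
            return None
        self.next[q, g] = i + 1
        return self.queue[g][i - 1]

    def safe(self, q):
        # Reports a message as safe once every member has received it.
        g = self.current[q]
        i = self.nextsafe[q, g]
        if i > len(self.queue[g]):
            return None
        if not all(self.next[r, g] > i for r in self.created[g]):
            return None
        self.nextsafe[q, g] = i + 1
        return self.queue[g][i - 1]


vs = VSMachine(["p", "q"])
vs.gpsnd("a", "p")
vs.vs_order("p", 0)
first = vs.gprcv("p")        # p receives its own message
too_early = vs.safe("p")     # q has not received it yet
vs.gprcv("q")
now_safe = vs.safe("p")      # both members have received it
```

In this run the safe notification at p is withheld until q has also received the message, which mirrors the precondition "for all r ∈ S: next[r, g] > nextsafe[q, g]". Because each client reads queue[g] through its own next index, the messages it receives in a view always form a prefix of that view's total order.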