Distributed Multi-Level Recovery in Main-Memory Databases

Rajeev Rastogi, Philip Bohannon, James Parker, S. Seshadri†, Avi Silberschatz, S. Sudarshan†

Bell Laboratories, Murray Hill, NJ
† Indian Institute of Technology, Bombay, India

Abstract

In this paper we present recovery techniques for distributed main-memory databases, specifically for client-server and shared-disk architectures. We present a recovery scheme for client-server architectures, based on shipping log records to the server, and two recovery schemes for shared-disk architectures, one based on page shipping, and the other based on broadcasting of the log of updates. The schemes offer different tradeoffs, based on factors such as update rates.

Our techniques are extensions to a distributed-memory setting of a centralized recovery scheme for main-memory databases, which has been implemented in the Dalí main-memory database system. Our centralized as well as distributed-memory recovery schemes have several attractive features: they support an explicit multi-level recovery abstraction for high concurrency, reduce disk I/O by writing only redo log records to disk during normal processing, and use per-transaction redo and undo logs to reduce contention on the system log. Further, the techniques use a fuzzy checkpointing scheme that writes only dirty pages to disk, yet minimally interferes with normal processing; all but one of our recovery schemes do not require updaters to even acquire a latch before updating a page. Our log shipping/broadcasting schemes also support concurrent updates to the same page at different sites.

† The work of these authors was performed in part while they were at Bell Labs.

1 Introduction

A large number of applications (e.g., call routing and switching in telecommunications, financial applications, automation control) require high performance access to data with response time requirements of the order of a few milliseconds to tens of milliseconds. Traditional disk-based database systems are incapable of meeting the high performance needs of such applications due to the latency of accessing data that is disk-resident. An attractive approach to providing applications with low (and predictable) response times is to load the entire database into main-memory. Databases for such applications are often of the order of tens or hundreds of megabytes, which can easily be supported in main-memory. Further, machines with main memories of 8 gigabytes or more are already available, and with the falling price of RAM, machines with such large main memories will become cheaper and more common.

One approach for implementing such high performance databases is to provide a large buffer-cache to a traditional disk-based system. In contrast, in a main memory database system (MMDB) (see, e.g., [GMS92, LSC92, JLR+94, DKO+84]), the entire database can be directly mapped into the virtual address space of the process and locked in memory. Data can be accessed either directly by virtual memory pointers, or indirectly via location independent database offsets that can be quickly translated to memory addresses. During data access, there is no need to interact with a buffer manager, either for locating data, or for fetching/pinning buffer pages. Also, objects larger than the system's page size can be stored contiguously, thereby simplifying retrieval or in-place use. Thus, data access using a main-memory database is very fast compared to using disk-based storage managers, even when the disk-based manager has sufficient memory to cache all data pages.

Distributed architectures in which several machines are connected by a fast network, and perform database accesses and updates in parallel, provide significant further performance improvements for a number of applications. For example, consider applications in which transactions are predominantly read-only and update rates are low (e.g., number translation and call routing in telecommunications). Each machine can locally access data cached in memory, thus avoiding network communication which could be fairly expensive. Another example is Computer Aided Design applications, where locality of reference is very high, update transactions are long, and interactive response time is very important.

Distribution also enhances fault tolerance, which is required in many mission-critical applications, even if data fits easily in a single machine's main-memory. In this case, especially with low update rates, a distributed database is preferable to a hot-spare since the load can be distributed in the non-failure case leading to improved performance.

The recovery scheme used in the Dalí main-memory database system [JLR+94] is based on the main-memory recovery scheme presented in [JSS93]. The recovery scheme of [JSS93] provides important features such as transient undo logging in which undo log records are kept in memory and only written to disk if required for

checkpointing, per-transaction logs in memory to reduce contention on the system log tail, and recovery using only a single pass over the system log. The recovery scheme used in Dalí provides several further extensions, such as multi-level recovery ([WHBM90, MHL+92, Lom92]), and fuzzy checkpointing [SGM90a, Hag86].

The goal of the work described here was to extend the Dalí recovery scheme to the distributed memory case, simultaneously maintaining the advantages of the single-site scheme, and efficiently supporting the applications described above. For example, we can make use of transient undo logging to reduce the size of the log written to disk, as well as the size of the log sent across network links in distributed protocols.

We present three distinct but related distributed recovery schemes: the first for client-server architectures, and the second and third for shared disk architectures. These are all "data-shipping" schemes (see, e.g., [FZT+92]) in which a transaction executes at a single site, fetching data (pages) as required from other sites. Distributed commit protocols are not needed as in "function-shipping" environments. While shared disk architectures have traditionally been closely tied to hardware platforms (e.g., VAXCluster), UNIX-based shared disk platforms and network of workstation architectures with similar performance characteristics are becoming more common.

Our distributed recovery algorithms provide the advanced features of our centralized recovery algorithms, such as transient undo logging, explicit multi-level recovery, and fuzzy checkpointing. Site or global recovery requires only a single pass over the system log, starting from the end of the system log recorded in the most recent checkpoint.

A key property of the client-server scheme and one of the shared disk schemes is that concurrent updates are possible at granularities smaller than a page-size, thereby minimizing "false-sharing" (that is, apparent conflicts due to coarse-granularity locking) and, consequently, needless network accesses to resolve false sharing.

The remainder of the paper is organized as follows. We present background on multi-level recovery and the single-site algorithm on which the present work is based in Section 2. Related work is presented in Section 3. We present our client-server recovery algorithm in Section 4. Section 5 describes our shared disk model, while Sections 6 and 7 present our shared disk recovery algorithms. Section 8 concludes the paper.

2 Overview of Main-Memory Recovery

In this section we present a review of multi-level recovery concepts and an overview of the single-site main-memory recovery scheme used in the Dalí system. Low-level details of our scheme are described in [BPR+96].

In our scheme, data is logically organized into regions. A region can be a tuple, an object, or an arbitrary data structure like a list or a tree. Each region has a single associated lock, referred to as the region lock, with exclusive (X) and shared (S) modes that guard updates and accesses to the region, respectively.
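The region lock just described can be illustrated with a minimal sketch. It assumes the usual compatibility rule (any number of S holders, or a single X holder); the class name `RegionLock` and the non-blocking `try_acquire` method are hypothetical names chosen for illustration and are not taken from the Dalí implementation.

```python
from dataclasses import dataclass, field

SHARED, EXCLUSIVE = "S", "X"


@dataclass
class RegionLock:
    """One lock per region; `holders` maps a transaction id to the mode it holds."""
    holders: dict = field(default_factory=dict)

    def try_acquire(self, txn, mode):
        """Grant the lock if `mode` is compatible with all other holders.

        Returns True on success and False on conflict (a real lock manager
        would queue the request instead of refusing it).
        """
        others = [m for t, m in self.holders.items() if t != txn]
        if mode == SHARED:
            compatible = all(m == SHARED for m in others)
        else:
            # X conflicts with every other holder, whatever its mode.
            compatible = not others
        if compatible and self.holders.get(txn) != EXCLUSIVE:
            # Record the mode; an existing X hold is never downgraded.
            self.holders[txn] = mode
        return compatible

    def release(self, txn):
        self.holders.pop(txn, None)
```

Under these assumptions, two readers can hold the lock in S mode at the same time, while an updater must wait until it is the only holder before obtaining X mode.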

[Figure 1: Overview of Recovery Structures. In main memory: the database, the dirty page table, the active transaction table (ATT) with per-transaction local redo and undo logs, and the system log tail. On disk: the stable system log, the checkpoint images Ckpt A and Ckpt B (each with its ckpt_dpt and the ATT undo logs), and cur_ckpt.]

2.1 Multi-Level Recovery

Multi-level recovery [WHBM90, MHL+92, Lom92] provides recovery support for enhanced concurrency based on the semantics of operations. Specifically, it permits the use of weaker operation locks in place of stronger shared/exclusive region locks.

A common example is index management, where holding physical region locks until transaction commit leads to unacceptably low levels of concurrency. If undo logging has been done physically (e.g., recording exactly which bytes were modified to insert a key into the index) then the transaction management system must ensure that these physical undo descriptions are valid until transaction commit. Since the descriptions refer to byte changes at specific positions, this typically implies that the region locks on the updated index nodes must be held till transaction commit to ensure correct recovery, in addition to considerations for concurrent access to the index.

The multi-level recovery approach is to replace these low-level physical undo log records with higher level logical undo log records containing undo descriptions at the operation level. Thus, for an insert operation, physical undo records would be replaced by a logical undo record indicating that the inserted key must be deleted. Once this replacement is made, the region locks may be released and only (less restrictive) operation locks are retained. For example, region locks on the particular nodes involved in an insert can be released, while an operation lock on the newly inserted key that prevents the key from being accessed or deleted is held.

2.2 System Overview

Figure 1 gives an overview of the structures used for recovery. The database (a sequence of fixed size pages) is mapped into the address space of each process and is in main memory, with (two) checkpoint images Ckpt A and Ckpt B on disk. Also stored on disk are 1) cur_ckpt, an "anchor" pointing to the most recent valid checkpoint image for the database, and 2) a single system log containing redo information, with its tail in memory. The variable end_of_stable_log stores a pointer into the system log such that all records prior to the pointer are known to have been flushed to the stable system log.

There is a single active transaction table (ATT) in main-memory which stores separate redo and undo logs for active transactions, in addition to information about transaction status. A dirty page table, dpt, is maintained in memory to record pages that have been updated since the last checkpoint. For simplicity of presentation, we assume that the dirty page table is maintained as a bitmap with one bit per page. The ATT (with undo logs, but without redo logs) and the dirty page table are also stored with each checkpoint image. The dirty page table in a checkpoint image is referred to as ckpt_dpt.

2.3 Transactions and Operations

Transactions, in our model, consist of a sequence of multi-level operations, following [Lom92]. We briefly describe the model below. Each operation has a level L_i associated with it. An operation at level L_i can consist of a sequence of operations at level L_{i-1}. Transactions, assumed to be at level L_n, call operations at level L_{n-1}. Physical updates to regions are level L_0 operations. For transactions, we distinguish between pre-commit, when the commit record enters the system log in memory, establishing a point in the serialization order, and commit, when the commit record hits the stable log. For operations, we use the terms commit and pre-commit interchangeably since both refer to the time when the commit record enters the system log in memory.

Each transaction obtains an operation lock before it executes an operation; the operation lock is granted if the operation commutes with other operation locks held by other active transactions. Level L_0 operations obtain region locks instead of operation locks. The locks on the region are released once the L_1 operation pre-commits; similarly, an operation lock at level L_i is held until the transaction or the containing operation (at level L_{i+1}) commits. All the locks acquired by a transaction are released once it commits.¹

¹ It is possible to release locks for a transaction on pre-commit; as a result read-only transactions may read uncommitted data, and their commit must be delayed until the dirty data they have read has been committed.

2.4 Logging Model

The recovery algorithm maintains separate local undo and redo logs in memory for each transaction. These are stored as a linked list off an entry for the transaction in the ATT. Each physical update (to a part of a region) generates physical undo and redo log records that are appended to the respective local log. When a transaction/operation pre-commits, the current contents of the transaction's local redo log are appended to the system log tail in memory, and the logical undo description

for the operation is included in the operation commit log record appended to the system log. Thus, with the exception of logical undo descriptors, only redo records are written to the system log during normal processing.

Also, when an operation pre-commits, the undo log records for its suboperations/updates are replaced in the transaction's (local) undo log with a logical undo log record containing the undo description for the operation. In-memory undo logs of transactions that have committed are deleted since they are not required again.²

² The logs can be deleted on pre-commit, since, short of a system crash, nothing can result in the transaction aborting.

The system log is flushed to disk when a transaction commits. For each redo log record written to disk, pages touched by the update on the log record are marked dirty in the dirty page table, dpt, by the flushing procedure. In our single-site recovery scheme, update actions do not obtain latches on pages; instead region locks are obtained to ensure that updates do not interfere with each other.³ Eliminating latching significantly decreases access costs in main-memory, and reduces programming complexity. Recovery related actions that are normally taken on page latching, such as setting of dirty bits for the page, are now performed based on log records written to the redo log. (Our distributed-memory schemes, with the exception of one of the shared-disk schemes, do not obtain page latches either; the sole exception uses page latching to ensure cache coherency, which is not a problem in the single-site case.) The redo log is used as a single unifying resource to coordinate the application's interaction with the recovery system, and this approach has proven very useful.

³ In cases when region sizes change, certain additional region locks on storage allocation structures may need to be obtained. For example, in a page based system, if an update causes the size of a tuple to change, then in addition to a region lock on the tuple, an X mode region lock on the storage allocation structures on the page must be obtained.

2.5 Ping-pong Checkpointing

Consistent with the terminology in main-memory databases, we use the term checkpoint to mean a copy of the main-memory database which is stored on disk, and the term checkpointing to refer to the action of creating a checkpoint. This terminology differs slightly from the terminology used, for example, in ARIES [MHL+92].

Traditional recovery schemes implement write-ahead logging (WAL), whereby all undo logs for updates on a page are flushed to disk before the page is flushed to disk. In such systems, to guarantee the WAL property, typically a latch on a page is obtained, all log records pertaining to the page are flushed to stable storage, the page is copied to disk, and the latch released. Updaters also obtain the same page latch, thereby preventing concurrent updates while a page is being flushed to disk. As a result of not obtaining latches on pages during updates, it is not possible to enforce the write-ahead logging policy, since pages may be updated even as they are being written out.

Instead, our recovery algorithm makes use of a strategy called ping-pong checkpointing (see, e.g., [SGM90b]). In ping-pong checkpointing two copies of the database image are stored on disk, and alternate checkpoints write dirty pages to alternate copies. Writing alternate checkpoints to alternate copies permits a checkpoint that is being created to be temporarily inconsistent; i.e., updates may have been written out without corresponding undo records having been written. However, after writing out dirty pages, sufficient redo and undo log information is written out to bring the checkpoint to a consistent state. Even if a failure occurs while creating one checkpoint, the other checkpoint is still consistent and can be used for recovery.

Keeping two copies of a main-memory database on disk for ping-pong checkpointing does not have a very high space penalty, since disk space is much cheaper than main-memory. Further, ping-pong checkpointing has several other benefits. For instance, although many recovery schemes assume page writes are atomic, in reality they are not, and complex schemes are needed to detect and recover from incomplete page writes resulting from, for example, power failures. Incomplete page writes cause no problems with ping-pong checkpointing, since the previous checkpoint image is still available. Ping-pong checkpointing also permits some physical and logical consistency checks to be performed on the checkpoint before declaring it successfully completed.

Before writing any dirty data to disk, the checkpoint notes the current end of the stable log in the variable end_of_stable_log, which will be stored with the checkpoint. This is the start point for scanning the system log when recovering from a crash using this checkpoint. Next, the contents of the (in-memory) ckpt_dpt are set to those of the dpt and the dpt is zeroed (noting of end_of_stable_log and zeroing of dpt are done atomically with respect to flushing). The pages written out are the pages that were either dirty in the ckpt_dpt of the last completed checkpoint, or dirty in the current (in-memory) ckpt_dpt, or in both. In other words, all pages are written out that were modified since the current checkpoint image was previously written, namely, pages that were dirtied since the last-but-one checkpoint. This is necessary to ensure that updates described by log records preceding the current checkpoint's end_of_stable_log have made it into the database image in the current checkpoint.

Checkpoints write out dirty pages without obtaining any latches and thereby avoid interfering with normal operations. The checkpoint image is thus fuzzy. Fuzzy checkpointing, however, could result in two problems for recovery:

• the checkpoint page image may contain partial updates of an operation

• the undo log record for an update may not be in the stable system log (which could result in a problem if the system were to crash immediately after the checkpoint).

The first problem is solved by our policy of always writing physical redo log records. By applying physical redo log records (whose effects are idempotent) to a checkpoint page image we can ensure that we can obtain a page image that does not contain any partial updates.

The second problem is solved by ensuring that for any update whose effects have made it to the checkpoint image, one of the following holds: 1) corresponding physical undo log records are written out to disk after the database image has been

written or 2) all physical redo log records for the operation (corresponding to the partial update) as well as the logical undo descriptor in the operation commit log record are on stable storage. This is performed by checkpointing the ATT and flushing the log after checkpointing the data. The checkpoint of the ATT writes out undo log records, as well as some other status information. In case the operation containing the partial update completes and consequently the undo log records are removed from the ATT before the checkpoint of the ATT, the log flush ensures that all log records corresponding to the operation (containing the partial update) as well as the operation commit log record are on stable storage. The checkpoint is declared completed (and consistent) by toggling cur_ckpt to point to the new checkpoint.

2.6 Abort Processing

When a transaction aborts, that is, does not successfully complete execution, updates/operations described by log records in the transaction's undo log are undone by traversing the undo log backwards from the end. Transaction abort is carried out by executing, in reverse order, every undo record just as if the execution were part of the transaction.

Following the philosophy of repeating history [MHL+92], new physical redo log records are created for each physical undo record encountered during the abort. Similarly, for each logical undo record encountered, a new "compensation" or "proxy" operation is executed based on the undo description. Log records for updates performed by the operation are generated as during normal processing. Furthermore, when the proxy operation commits, all its undo log records are deleted along with the logical undo record for the operation that was undone. The commit record for the proxy operation serves a purpose similar to that served by compensation log records (CLRs) in ARIES: during restart recovery, when it is encountered, the logical undo log record for the operation that was undone is deleted from the transaction's undo log, thus preventing it from being undone again.

2.7 Recovery

Restart recovery begins by initializing the ATT and transaction undo logs to the ATT and undo logs stored in the most recent checkpoint, loading the database image, and setting dpt to zero. Next, recovery processes redo log records. Recall that as part of the checkpoint operation, the end of the system log on disk, end_of_stable_log, is noted before the database image is checkpointed. This value of end_of_stable_log becomes the "begin recovery point" for the checkpoint once the checkpoint has completed. All updates described by log records preceding this point are guaranteed to be reflected in the checkpointed database image.

Thus, during restart recovery only redo log records following the end_of_stable_log for the last completed checkpoint of the database are applied. Restart recovery ignores redo log records for updates performed by an operation if the commit log record for the operation is not found in the system log. Such log records represent

uncommitted updates, and may not have corresponding undo records in the checkpointed ATT. However, if the undo records are absent, the effects of the log records will not be reflected in the checkpointed database image. Such records would be present only due to a crash while the log records for an operation were being flushed.

During the application of redo log records, appropriate pages in dpt are set to dirty for each log record and necessary actions are taken to keep the checkpointed image of the ATT consistent with the log as it is applied. These actions on the ATT mirror the actions taken during normal processing. For example, when an operation commit log record is encountered, lower level log records in the transaction's undo log for the operation are replaced by a higher level undo description.

Once all the redo log records have been applied, the active transactions are rolled back. To do this, all completed operations that have been invoked directly by the transaction, or have been directly invoked by an incomplete operation, have to be rolled back. However, the order in which operations of different transactions are rolled back is very important, so that an undo at level L_i sees data structures that are consistent [Lom92]. First, all operations (across all transactions) at L_0 that must be rolled back are rolled back, followed by all operations at level L_1, then L_2 and so on.

3 Connection to Related Work

Multi-level recovery and variants thereof, primarily for disk-based systems, have been proposed in the literature [WHBM90, Lom92, MHL+92]. Like these schemes, our schemes repeat history, generate log records during undo processing and log operation commits when undo operations complete (similar to CLRs described in [MHL+92]). Also, as in [Lom92], transaction rollback at crash recovery is performed level by level. Some of the features of our main-memory recovery technique which impact the distributed schemes are

1. Due to transient undo logging, no physical undo logs are written out to the global log except during checkpoints.

2. Separate undo logs are maintained in memory for active transactions. A result is that transaction rollback does not need to access the global log, part of which could be on disk.

3. Our single-site scheme does not require latching of pages during updates, which is inconvenient and expensive in either a main-memory DB or an OODB setting. Actions that are normally taken on page latching, such as setting of dirty bits for the page, are efficiently performed based on physical redo log records written to the global log. (One of our shared-disk schemes uses page latching for ensuring cache consistency, while the other shared-disk scheme does not.)

4. The correctness requirements of the write-ahead logging policy are accomplished with a single flush for the entire database during a checkpoint, rather than (potentially) one flush per page.

5. Our scheme does not perform in-place update of the disk image during page flush, instead using ping-pong checkpointing.

In the ARIES-SD [MN91] family of schemes for recovery in the shared disk environment, each site maintains a separate log, and pages are shipped between sites. Our shared-disk log-shipping scheme does not ship pages, but instead broadcasts log records, taking advantage of cheap application of these log records in main-memory, and permitting concurrent updates at a smaller-than-page granularity. In our shared disk schemes, log flushes are driven by the release of a lock from a site, in order to support repeating of history and correct rollback of multi-level actions during crash recovery. The "super fast" method of ARIES-SD [MN91] does not describe flushes to protect the early release of locks, making it unclear how that scheme supports logical undo and high-concurrency index operations.

In [Rah91], the authors propose recovery schemes for the shared disk environment which assume page-level concurrency control and the NO-STEAL page write policy, neither of which is an assumption made in our schemes.

In [MN94], the authors show how the ARIES recovery algorithm described in [MHL+92] can be extended to a client-server environment. In contrast to our client-server scheme, their scheme involves the clients as well as the server in the checkpointing process. We also support concurrent updates to a page by different clients, which is not supported in [MN94].

In [CFZ94], object-level as well as adaptive locking and replica management are discussed, but recovery considerations are not extensively addressed. In [FZT+92], the client-server recovery scheme for the EXODUS storage manager (ESM-CS) is described. This recovery scheme, based on ARIES [MHL+92], requires page-level locking until end of transaction (for example, the commit dirty page list).

4 Client-Server Recovery Scheme

In this section, we describe the client-server recovery scheme. Our system model is as follows.

• There is a single server with stable storage, which is responsible for coordinating all the logging, and for performing checkpoints and recovery (see Figure 2). The server maintains a copy of the entire database in memory.

• Multiple clients may be connected to the server; each client has a copy of the entire database in its memory.

• A transaction executes at a single client and updates/accesses the copy of the database at the client.

• The network is FIFO and reliable.

[Figure 2: Client-Server Architecture. Client nodes (each with the database, ATT, and system log tail in main memory) are connected by a network to the server, which holds the database, ATT, DPT, and system log tail in main memory, and the stable system log, cur_ckpt, and checkpoints Ckpt A and Ckpt B on disk.]

As a result of updating the local copy of the database, database pages updated by a client may not be current at some other client. Therefore, a page at a client is in one of two states: valid or invalid. Invalid pages contain stale versions of certain data due to updates by other clients and are refreshed by obtaining the latest copy of the page from the server.

Transactions follow the callback locking scheme [LLOW91, CFZ94] when obtaining and releasing locks. Each client site has a local lock manager (LLM) which caches locks, and a global lock manager (GLM) at the server keeps track of locks cached at the various clients. Transaction requests for locks cached locally are handled at the client itself. However, requests for locks not cached locally are forwarded to the GLM, which calls back the lock from other clients that may have cached the lock in a conflicting mode (before granting the lock request). A client relinquishes a lock in response to a callback as soon as transactions currently holding the lock (if any) release the lock.

The server maintains the dpt and the ATT (for all transactions in the client-server system) while the clients maintain the ATT for the transactions belonging to that client. The log records for updates generated by a transaction at a client site are stored in that site's ATT. Client sites do not maintain a system log on disk, but keep a system log tail in memory and append log records from the local redo logs to this tail when operations commit/abort. Checkpointing is performed solely at the server, and follows the same procedure as the centralized case.

When a lock is relinquished from a site or a transaction commits, log records in the system log are shipped by the client to the server. In the case of transaction commit, the client waits for the server to flush the newly received log records to disk before

reporting the commit to the user. The shipped redo log records are used to update the server's copy of the affected pages, ensuring that pages shipped to clients from the server are current (note that pages are shipped only from the server to clients and never vice versa). This enables our scheme to support concurrent updates to a single page at multiple clients since re-applying the updates at the server causes them to be merged (this approach is also adopted in [CDF+94]). Shipping the log records will usually be cheaper than shipping pages, and the cost of applying the log records themselves is small since, in our main-memory database context, the server will not have to read the affected pages from disk.

We will now describe our scheme in detail and also outline several possible optimizations to the basic ideas discussed above.

4.1 Basic Operations

We now describe the features which distinguish the client-server scheme from the centralized case, in terms of actions performed at the client and the server at specific points in processing.

• Page Access: In case a client accesses a page that is valid, it simply goes ahead without communicating with the server. Else, if the page is invalid (certain data on the page may be stale), then the client refreshes the page by 1) obtaining the most recent version of the page from the server, and 2) applying to the newly received page any local updates which have not been sent to the server (this step merges local updates with updates from other sites). The client then marks the page as valid. The server keeps track of clients that have the page in a valid state.

To prevent race conditions, the client does not send log records to the server after asking for a page and before receiving it.

An optimization of the above is to check for validity of pages at the time of acquisition of region locks from the server rather than on every access; for this optimization to be used, the set of pages covered by the region lock must be known.

• Operation/Transaction Commit: At the client, redo log records are moved to the system log, a commit record is appended, and appropriate actions are performed on the transaction's undo log in the ATT as described for the centralized case. In case of a transaction commit the log records in the system log are shipped to the server, and commit processing waits until the server has acknowledged that the log records have been flushed to disk.

Finally, all the locks acquired by the operation/transaction are released locally. The local lock manager at the site may however continue to cache the locks locally.

• Lock Release: When a lock is relinquished by a client, all redo log records that were generated under this lock need to be shipped to the server. The

server then applies these log records to its database image to ensure that another client that obtains the same lock gets a copy of the pages which contains the updates described by these log records. A simple way to ensure that all log records generated under the lock are shipped to the server is to flush the system log from the client to the server.

An optimization to avoid flushing the system log each time is to store the end of the client system log with the lock (at the client) when an X mode region lock or an operation lock is released by a transaction. Thus, for any region lock, all redo log records in the system log affecting that region precede the point in the log stored with the lock. Similarly, for an operation lock, all log records relating to the operation (including operation commit) precede the point in the system log stored with the lock. This location in the log is client-site-specific.

Before a client site relinquishes an X mode region lock or operation lock to the server due to call-back, it ships to the server at least the portion of the system log which precedes the log pointer stored with the lock. This ensures that the next lock will not be acquired on the region until the server's copy is up to date, and the history of the update is in place in the server's logs.

For X mode region locks, this flush ensures repeating of history on regions, while for operation locks this flush ensures that the server receives the logical undo descriptors in the operation commit log records for the operation which released the locks. Thus, if the server aborts a transaction after a site failure, the abort of this operation will take place at the logical level of the locks still held for it at the server.

• Log Record Processing: At the server, for each physical redo log record (received from a client), the undo log record is generated by reading the current contents of the page at the server. The new log record is then appended to the undo log for this transaction in the server's ATT. Next the update described by the redo log record is applied, following which the log record is appended to the redo log for the transaction in the server's ATT. Operation/transaction commit log records received from the client are processed by performing the same actions as in the centralized case when the log records were generated. In addition, for operation commit, the logical undo descriptor is extracted from the commit log record and appended to the undo log for the transaction in the server's ATT. For transaction commit, the client whose transaction committed is notified after the log flush to disk succeeds.

By applying all the physical updates described in the physical log records to its pages, the server ensures that it always contains the latest updates on regions for locks which have been released to it from the clients. The effect of the logging scheme, as far as data updates are concerned, is just as if the client transaction actually ran at the server site.

• Transaction Abort/Site Failures: If a client site decides to abort a transaction, it processes the abort (as in the centralized case) using the undo logs

for the transaction in the client's ATT. If the client site itself fails, the server will abort transactions that were active at the client using undo logs for the transaction in its ATT (since the client cannot commit without communicating with the server, in case of partition, a decision to abort is enforceable by the server). If the server fails, then the complete system is brought down, and restart recovery is performed at the server as described in Section 2.7.

Page Invalidation

We complete our client-server scheme by presenting two methods, invalidate-on-update and invalidate-on-lock, for ensuring that data accessed by a client is up-to-date.

All actions described so far are used in common by both methods. In particular, both methods follow the rule that all log records pertaining to updates made under a lock are flushed to the server before the lock is relinquished from the site. Since the server would have applied the log records to its copy of the data, this ensures that when the server grants a lock, it has the current version of all pages containing data covered by that lock. However, when a client acquires a lock, it is still possible that the copy of one or more pages involved in the region for which the lock was obtained are not up-to-date at the client.

Both methods mark pages at the clients as invalid, to denote that some of the data on the page is out of date. Even if a page is marked invalid, some of the data in the page may still be up-to-date, for instance, if the client has a region lock on the data. The first method, invalidate-on-update, is an eager method that marks pages as invalid at clients as soon as an update occurs at the server, while the second, invalidate-on-lock, is a more lazy method, marking pages as invalid at clients when the client gets a lock. The second scheme reduces invalidation messages by keeping extra per-lock information at the server. Details of the two methods are presented in Sections 4.2 and 4.3 respectively.

4.2 Invalidate-On-Update

The invalidate-on-update scheme works as follows. When the server receives log records from a client, it does the following. For each page that it updates, it sends invalidate messages to clients (other than the client that updated the page) that may have the page marked as valid. For all clients other than the client that updated the page, the server notes that the client does not have the page marked valid. Clients, on receiving the invalidate message, mark their page as invalid. Thus invalidation messages are received by clients before they can acquire a region lock on the updated data, and begin accessing the data.

Although the method is very simple and easy to implement, it has some drawbacks. For example, consider two sites s1 and s2 updating the same page concurrently under two different region locks. Let s1 be the site that flushes its updates to the server

15 rst;theupdatewillcausetheservertosendaninvalidatemessagetos2,whichwill underthelockthatitalreadyhas,thentheinvalidatewasnotnecessary,sincethe 4.3Invalidate-On-Lock thenre-readthepagefromtheserver.however,ifsites2accessesthepageagain Theinvalidate-on-lockschemedecreasesunnecessaryinvalidationsandtheoverhead ofsendinginvalidationmessagesbymarkingpagesasinvalidonlywhenalockon thenextsectiontakesadvantageofthisobservationtoreduceoverheads. aregioncoveringthepageisobtainedbyaclient.asaresult,iftwoclientsare updatingdierentregionsonthesamepage,asintheearlierexample,noinvalidationmessagesaresenttoeitherclient.bypiggy-backinginvalidationmessages separateinvalidationmessagesinthepreviousschemeiseliminated. dataintheregionithaslockedhasnotchanged.theinvalidate-on-lockschemein forupdatedpagesonlockgrantmessagesfromtheserver,theoverheadofsending obtainingthisinformationistorequirethatanupdatecallmustspecifynotonly associatedwiththelockfortheupdatedregion.thus,theschemerequiresthatit needtocheckforvalidityofapageoneveryaccessorupdatetothepage itsuces bepossibletodeterminetheregionlockfromtheredorecord.asimplewayof formationaboutupdatestothatregion.specically,whenupdatesdescribedby tocheckforvalidityatlockacquisitiontime. aphysicalredorecordareappliedtopagesattheserver,theupdatedpagesare Toachievetheabove,theschememustassociatewiththelockforaregionin- Thebiggestbenetoftheinvalidate-on-lockscheme,however,isthatthereisno aprogrammertoprovidethisinformation,sinceallupdatesmustbemadeholding aregionlock.thelocknamecanthenbesentwiththeredologrecord. thedatatobeupdated,butalsotheregionlockthatprotectsthedata.itiseasyfor witheachlogrecord,whichreectsboththeorderinwhichtherecordwasapplied totheserver'scopyofthepageandtheorderinwhichitwasaddedtothesystem theclient(valid/invalid),alongwiththelsnforthepagewhenitwaslastshipped eachclient,theservermaintainsinaclientpagetable(cpt),thestateofthepageat log.foreachpage,theserverstoresthelsnofthemostrecentlogrecordthat totheclient. 
updatedthepage,andtheidentityoftheclientwhichissuedit.inaddition,for ThisschemealsorequiresthattheserverassociateaLogSequenceNumber(LSN), toupdatestotheregion.foreachpageinthelist,theserverstoresthelsnofthe mostrecentlogrecordreceivedbytheserverthatrecordedanupdatetothepart oftheregiononthispage,andtheclientwhichperformedtheupdate.thus,when aclientisgrantedaregionlock,if,forapageinthelocklist,thelsnisgreater thanthelsnforthepagewhenitwaslastshippedtotheclient,thentheclient pagecontainsstaledatafortheregionandmustbeinvalidated. Theserveralsomaintainsforeachregionlockalistofpagesthataredirtydue 14
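As an illustration only, the server-side bookkeeping for invalidate-on-lock can be sketched in Python as follows. The class and method names (`Server`, `CptEntry`, `apply_log`, `ship_page`, `grant_lock`) are hypothetical and are not taken from the implementation; the sketch assumes integer LSNs, treats a page that has never been updated as having LSN 0, and skips pages whose most recent update under the lock came from the requesting client itself.

```python
from dataclasses import dataclass, field


@dataclass
class CptEntry:
    """Per-client, per-page entry of the client page table (cpt)."""
    valid: bool = False
    shipped_lsn: int = 0  # LSN of the page when it was last shipped to the client


@dataclass
class Server:
    # page -> (LSN of the most recent update, client that issued it)
    page_lsn: dict = field(default_factory=dict)
    # (client, page) -> CptEntry
    cpt: dict = field(default_factory=dict)
    # region lock -> {page: (LSN of last update under this lock, client)}
    lock_list: dict = field(default_factory=dict)

    def apply_log(self, page, lsn, client, lock):
        """Record a redo log record that was applied to `page` under `lock`."""
        self.page_lsn[page] = (lsn, client)
        self.lock_list.setdefault(lock, {})[page] = (lsn, client)

    def ship_page(self, client, page):
        """Page refresh: the client's copy is now valid and as new as the server's."""
        lsn, _ = self.page_lsn.get(page, (0, None))  # assumption: LSN 0 if never updated
        self.cpt[(client, page)] = CptEntry(valid=True, shipped_lsn=lsn)

    def grant_lock(self, client, lock):
        """Return the pages to invalidate at `client` along with the lock grant."""
        stale = []
        for page, (lsn, updater) in self.lock_list.get(lock, {}).items():
            entry = self.cpt.get((client, page))
            if (entry is not None and entry.valid
                    and entry.shipped_lsn < lsn and updater != client):
                entry.valid = False  # mark invalid in the cpt as well
                stale.append(page)
        return stale
```

With two clients updating the same page under different region locks, neither client is sent an invalidation for the lock it already used; an invalidation is produced only when a client acquires a lock under which another client has made a more recent update.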

The LSN information serves to minimize the shipping of pages to clients, marking a page as invalid only if there is an update performed under the region lock requested by the client, and the update has not yet been propagated to the client.

The additional actions for this scheme are as follows:

Log apply: When the server applies to a page P a redo log record, LR, generated at client C under region lock L, it takes the following actions (after P has been updated). First, the LSN for P is set to the LSN for LR. Second, the entry for P in the list of dirty pages for L is updated (or created), setting the client to C, and the LSN to the LSN for LR.

Lock grant: A set of invalidate messages is passed back to the client with the lock acquisition. The invalidate messages are for pages in the list associated with the lock being acquired that meet three criteria: 1) the page is cached at the client in the valid state, 2) the LSN of the page in the cpt for the client is smaller than the LSN of the page in the lock list, and 3) the client acquiring the lock was not the last to update the page under this lock. The invalidated pages are marked invalid in the cpt for the client and at the client site.

Page refresh: When the server sends a page to a client (page refresh), at the server, the page is marked valid in the cpt for the client and the LSN for the page in the cpt is updated to be the LSN for the page at the server.

Lock list cleanup: We are interested in keeping the list of pages with every lock as small as possible. This can be achieved by periodically deleting pages P from the list of lock L such that the following condition holds, where C is the client noted in the list of pages for L as the last client to update P:

    Every client other than C has the page cached either in an invalid state or with LSN greater than or equal to the LSN for the page in the list for lock L.

The rationale for this rule is that the purpose of region lock lists is to determine pages that must be invalidated. However, if for a page in a client's cpt, the LSN is greater than the LSN for the page in the lock list, then the client has the most recent update to the region on the page, and thus the page will not need to be part of any invalidation list sent to the client.

5 Shared Disk Recovery: Model and Common Structures

In the shared disk approach, a number of machines are interconnected and also have direct access to disks over a fast network. The shared disk environment is used in many systems, such as the DEC VAXclusters, and provides benefits over a shared nothing architecture, such as faster access to non-local disks and fault-tolerance.

Also, the basic advantage of shared disk schemes over the client-server schemes is that the algorithms are symmetric with respect to which site executes them, preventing one system from becoming a bottleneck in the system. As in our client-server scheme, in addition to careful consideration of the interaction with multi-level recovery, our main concern is minimizing false sharing through fine-grained concurrency control. This allows, for example, read-only transactions with a fully cached working set to proceed at main-memory speeds, an important property for our intended applications.

We now describe our shared disk recovery model.

Each site maintains its own copy of the entire database in memory and its own system log on disk. Thus there are multiple logs in the system.

Sites obtain locks from a Global Lock Manager (GLM); the function of the lock manager could be distributed for speed and reliability, but this is orthogonal to our discussion.

Sites cache locks, and relinquish locks based on the callback locking mechanism described in Section 4. We assume the network is FIFO and reliable.

Each site has its own system log on disk and therefore the logs are distributed. To repeat history during restart recovery, we need some mechanism to temporally order log records that affect the same region. To enable this, each site maintains a global timestamp counter TSctr, and a timestamp obtained from this counter is stored in each physical redo log record for an update. We will see the details of how this TSctr is maintained and used later.

Each site maintains its own version of the dirty page table dpt, system log (in memory and on disk), and an ATT (with separate undo and redo log records for each transaction) which stores information relating to transactions that execute at that site.

A single pair of checkpointed images is maintained on disk for the database. A checkpoint image consists of an image of the database, the dirty page table ckpt_dpt, and for every site:

1. end of stable log: the point in the site's system log from which the system log must be scanned during recovery.
2. a copy of the ATT at the site (containing undo logs).

In the next two sections, we present two schemes for shared disk concurrency control and recovery. The first is a page-shipping approach which is similar in spirit to the invalidate-on-update client-server scheme. The second is a log shipping scheme which allows concurrent use of non-overlapping regions on a page across sites.
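To make the role of TSctr in the model above concrete, the following Python sketch shows one way a per-site counter could be stamped into redo log records and carried across sites with X-mode locks. The names (`Site`, `RedoRecord`, `log_update`, `release_x_lock`, `receive_x_lock`) are hypothetical; the update rule used on lock receipt (maximum of the current value and the incoming lock's timestamp plus one) is the one stated in the lock acquisition discussion of Section 6.2, and everything else is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class RedoRecord:
    page: int
    ts: int  # value of TSctr at the site when the record was generated


@dataclass
class Site:
    ts_ctr: int = 0  # the site's timestamp counter, TSctr

    def log_update(self, page):
        """Generate a physical redo log record stamped with the current TSctr."""
        return RedoRecord(page=page, ts=self.ts_ctr)

    def release_x_lock(self):
        """Stamp an X-mode lock with the site's TSctr as it goes back to the GLM."""
        return self.ts_ctr

    def receive_x_lock(self, lock_ts):
        """On receiving an X-mode lock, advance TSctr past the lock's timestamp."""
        self.ts_ctr = max(self.ts_ctr, lock_ts + 1)
```

Under these assumptions, a record generated at a site after it receives an X-mode lock always carries a larger timestamp than any record generated under that lock at the site that previously held it, which is the ordering property needed to repeat history across the distributed logs.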

6 Page-Shipping Shared Disk Recovery Scheme

Our page-shipping scheme is similar in spirit to the Invalidate-on-Update client-server scheme in that a transaction at a site updating a region on a page is guaranteed to have the latest copy of the page. Therefore, concurrent updates to different regions of a page are not possible in this scheme.

6.1 Data Structures

We now describe data structures specific to the page-shipping scheme. Common data structures were described in Section 5. An overview of the data structures for this scheme is given in Figure 3.

Figure 3: Page-Shipping Shared Disk Architecture
[Figure labels: Site 1, Site 2, ..., Site N, each with Memory (DB, DPT, ATT, PTT, Sys Log Tail); Logs (one Stable Sys Log per site); Shared Checkpoints (cur_ckpt; Ckpt A, Ckpt B) holding Database, ckpt_ptt, ckpt_dpt and, for sites 1, 2, ..., N, End of Stable log and ATT (undo logs).]

In addition to the TSctr for the site, a timestamp for each page is maintained at each site in the page timestamp table, ptt, which keeps track of the TSctr value when the page was last updated. Each page has an associated page lock which helps in ensuring that a transaction always has the latest copy of the page while accessing or updating the page. Sites cache locks, and relinquish locks based on the callback locking mechanism described earlier. Along with each of the two checkpoint images of the database is stored a checkpoint page timestamp table, referred to as ckpt_ptt.

6.2 Normal Processing

We describe below the actions taken during normal processing, in addition to those performed in the centralized case, to support distributed concurrency control and recovery. Checkpointing and recovery from system and site failure are described in subsequent subsections.

Update: As in the centralized case, before accessing a region, each transaction obtains a region lock from the LLM. Additional page locks are acquired in S (X) mode while accessing (updating) data on a page. If this lock is not cached at the site, actions are performed as described below under lock acquisition. Page locks for an access are released by a transaction once the access is completed; page locks for an update are released by a transaction only after the update on the page is completed. The value of TSctr at the site when the redo log record was generated is stored in the redo log record corresponding to the update. Also, the timestamp for the updated page (in the ptt) at the site is set to the TSctr stored in the log record.

An important point to note is that log records in the system log may not be ordered on their TSctr values. This is because the value of TSctr is stored in the redo log record when the update is performed, but the log record is appended to the transaction local log, which is not flushed to the system redo log until operation or transaction commit.

Lock Release: When a transaction releases an X mode region lock or operation lock, it stores the end of log in memory with the lock (this is stored to optimize the amount of flushing that needs to be done when a lock is relinquished, as in the client-server scheme). Note that all updates for the operation which held the region lock will be moved to the global log by the normal operation commit semantics prior to the release of this lock. Thus, for a region lock, all redo log records for updates to the region covered by the lock precede the end of log point stored with the lock (similarly for operations). When a site relinquishes an X region lock or operation lock, it flushes the global log at its site until the end of log point stored with the lock. The flush on release of X region or operation locks is done to ensure that it is possible to repeat history during restart recovery, and appropriate locks for undoing operations are held in case of site crashes. Note that no flushes are performed when page locks are released.

Additionally, when a site releases an X page or X region lock back to the GLM, it stamps it with the site's TSctr; the TSctr value of the lock is used by other sites that later acquire the lock, as we will see shortly. The GLM also stores with each page lock the site that last held the page lock in X mode; the information is updated each time a site relinquishes an X mode page lock.

Lock Acquisition: A transaction acquiring a lock cached by the LLM need take no special action. If it is a page lock, then the page is already current at this site.

When an X-mode page or region lock arrives from the GLM, it includes the timestamp from the last site that held the lock in X mode, as described above. Upon receiving an X region lock or page lock at a site, the site's TSctr is set to the maximum of 1) its current value, and 2) the TSctr value associated with the incoming lock plus one.

When a site acquires a page lock on behalf of a transaction from the GLM (that is, the lock is not already cached at the site), the site requests the page from the last site that held the page lock in X mode (using the site identifier sent with the lock). In order to handle single-site recovery, failure of the acquiring site to obtain a copy of the page, due to a failure of the site from which it is being requested, causes the lock acquisition to fail and the lock to be returned to the GLM unchanged.

Shipping timestamps with page locks ensures that log records for successive updates to a page at different sites are assigned increasing timestamp values. Shipping timestamps with region locks ensures that log records generated under conflicting locks are applied in the correct order during recovery even though redo log records in the individual site may not be ordered by timestamp (as mentioned earlier). However, the algorithm still works correctly, as shown in the discussion of recovery and correctness below.

6.3 Checkpointing

Unlike the centralized and client-server schemes, checkpointing in the shared disk environment requires coordination among the various sites. As mentioned above, a single pair of checkpointed images is maintained for all the sites.

The site initiating the checkpoint coordinates the operation, which consists of the following three steps at each site: 1) writing the database file image, 2) writing the ATT and 3) flushing the global log. Below, we describe each step:

1. The coordinator announces the beginning of the checkpoint, at which time all sites (including the coordinator) note their current end of stable log values, then make a copy of their dpts and zero their dpts. Note that zeroing the dpt and recording end of stable log is done atomically with respect to flushes. Each site then makes a copy of its current ptt and sends it to the coordinator along with the end of stable log (noted above), and a copy of the dpt. The coordinator constructs ckpt_dpt by or'ing together the copy of its dpt and all the dpts received from other sites (recall that we are assuming the dpt is a bitmap). The database pages to be written out during the checkpoint are the pages that are dirty in ckpt_dpt or in the ckpt_dpt in the previous checkpoint.

For each page to be written out, the coordinator uses the ptts sent to it by the other sites and its own ptt to determine the site whose ptt contains the highest timestamp for the page. This site is responsible for writing the page to the checkpoint image. Once the coordinator has partitioned the set of pages to be written out among the various sites, each site is sent the set of page identifiers assigned to it. A site, upon receiving its assigned set of pages to write, proceeds to write those pages to the checkpoint image. Since no two sites will be assigned the same page, sites can write pages concurrently.
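The coordinator's partitioning of pages described so far in step 1 can be sketched as follows. This is a minimal illustration, not the implementation: the function name `plan_checkpoint` and its parameters are hypothetical, each dpt is modeled as a Python integer used as a bitmap (bit p set means page p is dirty), a page missing from a site's ptt is assumed to have timestamp 0, and ties between sites are broken by site name, a choice the text does not specify.

```python
def plan_checkpoint(dpts, ptts, prev_ckpt_dpt, num_pages):
    """Build ckpt_dpt and assign each page to be written to exactly one site.

    dpts:          site -> dirty page bitmap copied at checkpoint start
    ptts:          site -> {page: timestamp} copy of the site's ptt
    prev_ckpt_dpt: ckpt_dpt bitmap stored in the previous checkpoint
    """
    # OR together the dpt copies from all sites (including the coordinator).
    ckpt_dpt = 0
    for bitmap in dpts.values():
        ckpt_dpt |= bitmap

    # Pages dirty in this ckpt_dpt or in the previous checkpoint's ckpt_dpt.
    to_write = ckpt_dpt | prev_ckpt_dpt

    assignment = {site: [] for site in ptts}
    for page in range(num_pages):
        if (to_write >> page) & 1:
            # The site with the highest timestamp for the page writes it.
            # Assumption: ties go to the first site in sorted order.
            writer = max(sorted(ptts), key=lambda s: ptts[s].get(page, 0))
            assignment[writer].append(page)
    return ckpt_dpt, assignment
```

Because every page in the write set is assigned to a single site, the per-site lists are disjoint and the sites can write their pages concurrently.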

The coordinator then constructs ckpt_ptt by first reading the ckpt_ptt in the previous checkpoint into memory. For every page that was determined to be written out (by some site i), the timestamp for the page in ckpt_ptt is set to its timestamp in the copy of the ptt for site i. Finally, the ckpt_dpt constructed earlier, ckpt_ptt and the end of stable logs for all the sites are written to the checkpoint.

Note that since the site with the highest timestamp for a page writes the page to the checkpoint image, updates to the page by log records preceding the end of stable log recorded for a site are contained in the checkpoint. Furthermore, as will be discussed in the correctness section below, updates for a page recorded in log records with timestamps less than the timestamp for the page in ckpt_ptt are also contained in the checkpoint.

2. Once every site has written out the database image and reported this to the coordinator, the coordinator instructs each site to write out its ATT. Note that multiple sites can be concurrently writing out the ATT.

3. After writing out the ATT, each site flushes the global log at that site as in the centralized case. Finally, the database checkpoint is committed after all sites have completed their flushing.

6.4 Recovery

In case the entire system fails, restart recovery is performed by any one site, say j. The site j, which we will call the acting coordinator site, reads the following from the most recent checkpoint image: the database image, the ckpt_ptt, and for each site, the ATT and the end of stable log. A separate page table ptt is initialized to ckpt_ptt and for each site i a separate dpt, dpt_i, is initialized to contain zero bits for all pages.

Starting from the end of stable log point stored for a site in the checkpoint, the log records in all the system logs are merged as described below, and applied to the database. To merge the system logs, they are scanned in parallel; at each point, if the next log record in any of the system logs is not a redo log record, then any one such record is processed and the ATT for its site is modified as described for the centralized case in Section 2.7. On the other hand, if the next records in all the system logs are redo log records, then the log record output next is the one amongst them with the lowest timestamp value. If, for a page updated by the log record, the timestamp in the log record is greater than or equal to the timestamp for the page in ckpt_ptt, then 1) the update is applied to the page, 2) the page is marked dirty in the dpt for the site whose system log contains the record, and 3) the timestamp for the page in ptt is set to the maximum of its current value and the timestamp in the log record.

Note that redo records in the system log for a site may not be in timestamp order, as mentioned earlier. However, this does not cause a problem and conflicting log records are applied in the order in which they were generated. The reason for this is that for two conflicting log records in separate system logs, the earlier log record and log records preceding it in its system log have lower timestamps than the log record generated later. This fact is revisited below in our overview of correctness.

Once the last log record has been processed, TSctr at the acting coordinator site j is set to the largest timestamp contained in the ptt at site j. Site j then rolls back in-progress operations in the ATTs for the various sites beginning with level L0 and then considering successive levels L1, L2 and so on (as described in Section 2.7). When an operation in an ATT entry for a site i is being processed, actions are performed on the undo and redo logs for the entry. Furthermore, each redo log record generated when processing an operation for site i is assigned a timestamp equal to TSctr at site j, and when an operation pre-commits/aborts, log records from the redo log are appended to the system log for site i.

Next, site j flushes every site's system logs, causing appropriate pages in the dpt for the site (maintained at site j) to be marked dirty. After this point, the other sites are involved in recovery. The TSctr at every site is set to the TSctr at site j after incrementing it by one. The dpt at each site is then set to the dpt maintained for the site during recovery at site j, and the database image and ptt at each site are set equal to the database image and ptt at site j. Finally, ckpt_ptt and the dpts for other sites are deleted from site j, bringing recovery to completion.

6.5 Overview of Correctness

In this section, we present additional arguments about the correctness of our page-shipping recovery scheme by discussing below several properties on which the correctness is based.

1. A page, i, in a checkpoint image reflects all updates with timestamp less than ckpt_ptt[i].
2. Any log record affecting page i prior to end of stable log at any site has timestamp less than or equal to ckpt_ptt[i] and is reflected in the checkpoint image of page i.
3. If L1 and L2 are conflicting log records and L1 is generated before L2, then if L2 is flushed to the stable log, then so is L1.
4. If L1 and L2 are conflicting log records in different system logs and L1 is generated before L2, then L1 and all log records preceding it in its system log have lower timestamps than L2.

(1) follows from the fact that timestamps for pages in the ptt are set only after they are updated, and passing timestamps with page locks guarantees that successive updates to a page have non-decreasing timestamps (and in turn, assign non-decreasing timestamps to the ptt entry).

(2) For a log record that updates page i prior to end of stable log at a site, (a) ptt[i] at the site is greater than or equal to the timestamp in the log record, (b) the page is in the dpt of the site and (c) the page at the site contains the update