An effective recovery under fuzzy checkpointing in main memory databases

Information and Software Technology 42 (2000)

An effective recovery under fuzzy checkpointing in main memory databases

S.K. Woo, M.H. Kim*, Y.J. Lee

Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1, Kusung-Dong, Yusung-Gu, Taejon, South Korea

Received 13 October 1998; received in revised form 28 June 1999; accepted 30 June 1999

Abstract

In main memory databases, fuzzy checkpointing imposes less transaction overhead because of its asynchronous backup feature. Until now, however, fuzzy checkpointing has been combined only with physical logging schemes. Physical log records are very large, which incurs space overhead and slows recovery processing. In this paper, we propose a recovery method based on a hybrid logging scheme, which permits logical logging under fuzzy checkpointing. The proposed method significantly reduces the size of log data, and hence offers faster recovery processing than conventional fuzzy checkpointing with physical logging. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Database recovery; Main memory databases; Fuzzy checkpointing; Hybrid logging

1. Introduction

In main memory databases (MMDB), since the primary copy of the data resides in main memory, MMDB provides much better performance than disk-resident databases (DRDB). Due to the significant decrease in memory cost and the fast increase in memory capacity, the importance of MMDB has been increasingly recognized [1]. However, due to the volatility of main memory, the updated data in MMDB must be flushed to backup databases on disks in order to maintain the consistency of the database against system failures. The recovery-related tasks, e.g. checkpointing and logging, involve disk I/Os, so they have to incur little overhead in transaction processing while still allowing fast recovery at restart.

Many recovery methods for MMDB have been proposed in the literature [2-10]. So far, recovery methods based on fuzzy checkpointing, introduced in Ref. [4], have shown efficient performance because of the asynchronous backup feature of fuzzy checkpointing [5]. Owing to this asynchronous feature, partially updated pages may be flushed to disks, so fuzzy checkpointing employs a physical logging scheme. This is because under fuzzy checkpointing, the last consistent database state is difficult to reestablish from the last complete checkpoint without physical logging. In general, physical log records are very large, which results in space overhead and long recovery time. Ref. [5] indicates that fuzzy checkpointing incurs longer recovery time than other consistent checkpointing methods with logical logging because of the large size of physical log records.

There have been some studies in the past that attempted to reduce the size of log data. Refs. [2] and [4] present a log compression method, where the redo parts of log records for aborted transactions and the undo parts of log records for committed transactions are not maintained. In Refs. [7] and [8], redo log records are flushed to disks and undo log records are discarded when a transaction commits. However, most of those methods are not based on fuzzy checkpointing. Ref. [5] employs a shadow updating policy to record only redo log data.

* Corresponding author. E-mail addresses: skwoo@cs.kaist.ac.kr (S.K. Woo), mhkim@cs.kaist.ac.kr (M.H. Kim), yjlee@cs.kaist.ac.kr (Y.J. Lee).
The authors, however, also indicate that fuzzy checkpointing with physical logging incurs significant overhead, even though only redo log records are maintained through shadow updating.

In DRDB, there has been some work on logical logging under fuzzy checkpointing. Ref. [11] describes a penultimate fuzzy checkpointing method with logical logging. Ref. [12] introduces a recovery method, called ARIES, that is based on fuzzy checkpointing. ARIES supports logical logging, which is, however, restricted to objects with increment or decrement kinds of operations, e.g. garbage collection and changes to the amount of free space. Note that the two fuzzy checkpointing methods mentioned above are for DRDB and are not directly applicable to MMDB.

[Fig. 1. Commit processing model: Start, transaction execution, Begin of Commit_Work, then (1) copy log to log buffer, (2) propagate to MMDB, (3) release shadow area, (4) release locks, (5) ensure WAL, End of Commit_Work.]

Since MMDB holds its data in main memory permanently, without buffering activities, the idea of penultimate checkpointing in Ref. [11] and the scheme of un-flushing dirty pages in Ref. [12] cannot be applied to MMDB.

In this paper, we propose a recovery method based on a hybrid logging scheme, which permits logical logging under fuzzy checkpointing. Since logical logging can replace a large physical log record by a single record of smaller size, it reduces the size of log data for recovery. In the end, the reduced log data make recovery processing fast. Even though we accommodate logical logging, we still keep the asynchronous backup feature of fuzzy checkpointing.

The rest of the paper is organized as follows. In Section 2, we propose our recovery policies, including the hybrid logging scheme and a revised fuzzy checkpointing. We describe recovery processing and some considerations for our recovery method in Sections 3 and 4, respectively. In Section 5, we analyze the performance of the proposed method. Finally, Section 6 gives concluding remarks.

2. Proposed recovery method

2.1. Basic policies

The database area in main memory consists of the MMDB area, the log buffer, and the shadow area. We assume that the entire database can be stored in the MMDB area. The log buffer is composed of several log pages where the log data of transactions are recorded. A log page is flushed to a log disk when it is full.

We employ a shadow updating policy. The updated data of a transaction are temporarily stored in the shadow area. Then, these data are propagated to the MMDB area during the commit work of the transaction. Shadow updating provides several advantages for MMDB: reduced log space, reduced MMDB access, faster reload processing, and reduced UNDO time [13]. It also prevents the partial undo of a transaction and generates only redo log records. Furthermore, performance studies in Refs. [10] and [14] have shown that shadow updating in MMDB provides better performance in transaction processing and post-crash log processing.

We use a ping-pong policy [15] as the backup method. During checkpointing, only the portions of the database that have been updated are written out to the backup database, and checkpoints alternate between two backup databases. This backup policy increases the number of pages to be flushed, but avoids the violation of write-ahead logging (WAL) under fuzzy checkpointing. Such a violation may occur in MMDB when fuzzy checkpointing is used carelessly [5]. Under shadow updating, a transaction writes its log records to a log page and reflects its updates on MMDB. Then, the transaction waits until the log page is flushed to the log disk. At this time, if checkpointing is in progress, a partially updated page in MMDB can be flushed to a backup database before the corresponding log records are flushed. This is the violation of WAL. When a failure occurs in this situation, the flushed page cannot be recovered, because there are no corresponding log records on the log disk. Here, the ping-pong policy can be used to avoid the violation of WAL.
This is because the ping-pong policy maintains two backup databases, and hence the previous copy of a page being flushed always exists.

2.2. Hybrid logging scheme

Under fuzzy checkpointing [4] in MMDB, the checkpointer flushes dirty pages without considering transaction activities. It may flush partially updated pages. Thus, physical logging is inevitable during checkpointing, because only physical log records can reestablish a previous consistent database state without regard to the current activities on the data [11]. However, during the time between two checkpoints, physical logging may not be necessary. In other words, logical logging can be used during the time when there is no flushing, i.e. the interval between two consecutive checkpoints.

To support logical logging under fuzzy checkpointing, we need a mechanism that can reestablish the consistent database state in which the logical log records were created. That is, the reestablished database state should be either a transaction-consistent or an action-consistent state. Otherwise, we cannot apply logical log records to the recovered checkpoint, because logical log records are effective only when the database is in a consistent state. In our scheme, we establish a transaction-consistent state from a fuzzy checkpoint by applying the physical log records generated during the corresponding checkpointing, so it is possible to use logical logging during the interval between two fuzzy checkpoints.

This is the basic idea of accommodating logical logging under fuzzy checkpointing. We refer to the approach as the hybrid logging scheme.

Hybrid Logging Scheme: Write physical log records during checkpointing, and use logical log records otherwise.

[Fig. 2. Example of the redo point problem: transactions T1 and T2 write their (logical) log records before BC and are still updating MMDB when BC is recorded.]

2.3. Commit processing

When the shadow updating policy is employed, the data updated by a transaction are first written to the shadow area, not to the MMDB area in place. The updated data are propagated to the MMDB area at commit time. Fig. 1 shows the commit processing model in our recovery method, which is similar to the pre-commit scheme in Ref. [2]. Before reaching the beginning of its commit_work, a transaction either aborts or completes its normal operations. All the updated data are temporarily stored in the shadow area, and logical log records are stored in a temporary area called the private log space. After the completion of all the normal operations, a transaction begins the commit processing, which consists of the following five steps.

[Step 1] At the beginning of its commit_work, either the physical log records or the logical log records of the transaction are copied to the current log page. When checkpointing is ongoing, physical log records are copied; otherwise, logical log records are copied. Note that the physical log records can be obtained from the shadow area, and the logical log records from the private log space. The current log page is locked at the start of this step and unlocked at its end.

[Step 2] The updated data in the shadow area are propagated to the MMDB.

[Steps 3 and 4] The pages used in the shadow area and all the acquired locks on the data items are released. This is strict two-phase locking (S2PL), which guarantees crash serialization, i.e. execution of transactions in committed order at restart [11].

[Step 5] If the log page containing the log records of the transaction has not yet been flushed, the transaction waits until the log page is flushed; the end of this step is the completion of the commit_work.

During normal transaction execution, logical log records are produced for the operations of the transaction. These logical log records are stored in the private log space of the transaction. The private log space reduces contention on the log buffer and ensures that only the log records of committing transactions are written to the log buffer. This makes the log applying step at restart simple and fast.

At Step 1 of the commit processing, we first lock the current log page in the log buffer and then determine the type of log. If checkpointing is active, the physical log records are written to the log page; otherwise, the logical log records in the private log space are used. To indicate whether checkpointing is active, we use a global checkpointing flag, chkpt_flag. The flag is maintained by the checkpointer; it is set when the checkpointer writes the beginning record, BC, of checkpointing to the log buffer, and reset when the checkpointer writes the ending record, EC.
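To make Step 1 concrete, the following is a minimal Python sketch of the log-type decision. The class and record layouts (LogBuffer, Transaction, the ("PHYS", ...) tuples) are our own illustration rather than the paper's data structures; only the page lock and chkpt_flag come from the text.

    import threading

    class LogBuffer:
        """Simplified log buffer: one 'current' log page guarded by a lock."""
        def __init__(self):
            self.lock = threading.Lock()   # synchronization point with checkpointer
            self.chkpt_flag = False        # True between BC and EC
            self.current_page = []         # records of the current log page

        def append(self, records):
            self.current_page.extend(records)

    class Transaction:
        def __init__(self, shadow_pages, private_log_space):
            self.shadow_pages = shadow_pages            # page id -> after-image
            self.private_log_space = private_log_space  # logical records

        def physical_records_from_shadow(self):
            # one physical (page after-image) record per updated shadow page
            return [("PHYS", pid, img) for pid, img in self.shadow_pages.items()]

    def commit_step1(txn, log_buffer):
        """Step 1 of commit_work: copy physical or logical log records to the
        current log page, depending on whether checkpointing is active."""
        with log_buffer.lock:              # lock the current log page
            if log_buffer.chkpt_flag:      # checkpointing in progress: physical
                log_buffer.append(txn.physical_records_from_shadow())
            else:                          # otherwise: logical, from private space
                log_buffer.append(txn.private_log_space)

Because the checkpointer takes the same page lock when it writes BC or EC, setting or resetting chkpt_flag can never interleave with a transaction's Step 1, which is exactly the property used in the next paragraph.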
As the lock on the current log page is the synchronization point between transactions and the checkpointer, the checkpointer cannot start (or finish) its work while a transaction is in the middle of Step 1; that is, the checkpointer cannot write its BC (or EC) record to the log buffer while a transaction holds the lock on the current log page for recording its log data. Moreover, under S2PL, after transaction T has read or written an object x, no other transaction can access x in a conflicting mode until T has committed or aborted [11]. This guarantees that only one transaction at a time can update an object. Thus, physical and logical logging never overlap on the same object.

2.4. Checkpointing

When a new fuzzy checkpoint begins, the checkpointer writes a BC record to the current log page, and then flushes dirty data pages to the backup databases without considering locks or other transaction activities. When the backup work finishes, the checkpointer writes an EC record to the current log page and flushes the log pages to the log disks. After the log page with the EC record is written to disk, the checkpointer finally records the position of the BC record at a well-known location on disk. This is the normal fuzzy checkpointing process [5].
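This sequence can be summarized in a short sketch that reuses the LogBuffer from the sketch above. The page and database stubs and the write_bc_slot callback are assumptions for illustration, not the paper's interfaces.

    class Page:
        def __init__(self, pid):
            self.pid, self.dirty, self.data = pid, False, b""

    class MMDB:
        def __init__(self, npages):
            self.pages = [Page(i) for i in range(npages)]
        def dirty_pages(self):
            return [p for p in self.pages if p.dirty]

    def fuzzy_checkpoint(log_buffer, mmdb, backups, ckpt_no, write_bc_slot):
        """One fuzzy checkpoint: BC record, unsynchronized flush of dirty pages
        ("fuzzy"), EC record, then record BC's position at a well-known slot."""
        with log_buffer.lock:
            bc_pos = len(log_buffer.current_page)
            log_buffer.append([("BC", ckpt_no)])
            log_buffer.chkpt_flag = True          # commits now log physically
        target = backups[ckpt_no % 2]             # ping-pong backup databases
        for page in mmdb.dirty_pages():           # no locks: may be partial updates
            target[page.pid] = page.data
            page.dirty = False
        with log_buffer.lock:
            log_buffer.append([("EC", ckpt_no)])
            log_buffer.chkpt_flag = False
        # in the real system the EC log page is flushed to the log disk first,
        # and only then is bc_pos written to the well-known location
        write_bc_slot(bc_pos)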

There is one consideration related to the redo point when the hybrid logging scheme is used. The redo point is the first log record to be applied to the reloaded database during recovery. As we do not quiesce transactions, some transactions may still be updating MMDB (i.e. propagating pages from the shadow area to MMDB) at the BC, and hence partially updated pages can be flushed to the backup databases. The redo point for database recovery is then in the oldest log page among the log pages of the transactions that are updating MMDB at the BC. That is, the redo point can be placed in front of the BC. The problem is the log-record type of a transaction that is still updating pages at the BC: in our scheme it is logical, because under the hybrid logging scheme the transaction uses logical logging during the non-checkpointing interval. As an example, consider the situation in Fig. 2. Since the log records of transactions T1 and T2 are recorded before BC, those log records are logical. If the partially updated pages of T1 and T2 are flushed to the backup databases during checkpointing, we may not be able to recover those pages, because logical log records cannot be applied to partially updated pages.

Our solution to this problem is to delay backup processing until transactions T1 and T2 finish updating MMDB. By delaying the backup start time, the checkpointer can avoid backing up partially updated pages whose corresponding log records are logical. Fig. 3 illustrates the concept of the delayed backup strategy with four transactions T1, T2, T3, and T4. The checkpointer does not begin the backup right after recording the BC, but delays it until all the transactions that were updating MMDB at the BC finish their updates. In the figure, T_backup is the starting point of the backup work. Some pages partially updated by T3 and T4 may still be written to a backup database, but consistent states of those pages can be reestablished because the log records of T3 and T4 are physical. Since the current log page in the log buffer is the synchronization point, the BC cannot be recorded to the log buffer during the logging of a transaction, i.e. until the end of logging by T1 in Fig. 3.

[Fig. 3. The concept of delayed backup: T1 and T2 log (logically) before BC; the checkpointer waits until they finish updating MMDB and starts the backup at T_backup; T3 and T4, which log during the checkpoint, write physical log records; EC closes the checkpoint.]

We can easily implement the delayed backup strategy by using an array variable num_updating_tr[]; num_updating_tr[k] counts the number of transactions that have recorded their last log data (i.e. the commit record) to log page k, but have not yet finished updating MMDB. The size of the array is the number of pages in the log buffer.

[Fig. 4. Checkpointing and commit work algorithm.]

A transaction first acquires the lock on log page k and stores its log records in that log page. Then it increases num_updating_tr[k] by one before releasing the lock on the log page. After the transaction finishes updating MMDB, num_updating_tr[k] is decreased by one. When num_updating_tr[k] is zero, all the transactions whose log records were written to log page k have finished their MMDB updating steps. Using this variable, the checkpointer can determine the backup start time: if BC is stored in log page i, backup processing begins when num_updating_tr[k] is zero for all k <= i.

Fig. 4 shows the proposed fuzzy checkpointing procedure with the delayed backup strategy; a code sketch of its bookkeeping is given at the end of this subsection. After writing BC to the current log page i, the checkpointer waits while tail <= i. The variable tail points to the oldest log page with one or more MMDB-updating transactions, i.e. transactions that have copied the logical log records in their private log spaces to the log buffer but have not finished propagating shadow pages to the MMDB. Suppose BC is written to log page i and tail points to log page k, where k <= i. A transaction that recorded its log data in log page k decreases num_updating_tr[k] by one after finishing its MMDB updates. If num_updating_tr[k] becomes zero, tail is set to the next log page whose num_updating_tr[] is not zero. The checkpointer begins backup processing when tail > i. Note that since all the updates to MMDB are processed in main memory without disk I/Os, the delay time is very small. This is analyzed in Section 5.2.

By using the delayed backup strategy, the BC of the last complete checkpoint becomes the redo point for database recovery. That is, the delayed backup strategy guarantees that a transaction that wrote its log records to the log buffer before BC completes its MMDB updates before backup processing begins. Thus, only log records written to the log buffer after the BC need to be applied for recovery. Note that we do not have to search for the redo point, because the position of BC is recorded at a well-known location on disk. This property is the basis of the log applying rule described in Section 3.

Under shadow updating, only transactions in the commit_work can update MMDB, which prevents the partial undo of a transaction and generates only redo log records. In the proposed method, each transaction writes all of its log records to the log buffer at its logging time (Step 1 in Fig. 1). Therefore, by combining the delayed backup strategy with shadow updating and the private log space, a transaction-consistent database state can be reestablished by applying only the log records generated during checkpointing to the last complete fuzzy checkpoint.
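The bookkeeping of Fig. 4 can be sketched as follows. The condition-variable synchronization and the linear (rather than circular) page indexing are our simplifications; num_updating_tr and tail follow the paper's names.

    import threading

    NUM_LOG_PAGES = 8            # log buffer size (illustrative; really circular)

    # num_updating_tr[k]: transactions that logged to page k but have not yet
    # finished propagating their shadow pages to MMDB
    num_updating_tr = [0] * NUM_LOG_PAGES
    state = threading.Condition()

    def tail():
        """Oldest log page that still has MMDB-updating transactions."""
        for k, n in enumerate(num_updating_tr):
            if n:
                return k
        return NUM_LOG_PAGES     # no page is waiting on updates

    def after_logging(k):
        """Transaction: called just before releasing the lock on log page k."""
        with state:
            num_updating_tr[k] += 1

    def after_mmdb_update(k):
        """Transaction: called after finishing the MMDB updating step."""
        with state:
            num_updating_tr[k] -= 1
            state.notify_all()

    def delayed_backup_wait(i):
        """Checkpointer: after writing BC to log page i, wait while tail <= i,
        i.e. until every transaction that logged to pages 0..i updated MMDB."""
        with state:
            while tail() <= i:
                state.wait()
        # the backup of dirty pages may begin now (T_backup in Fig. 3)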
2.5. Extension to consecutive checkpointing

There has been some research on consecutive fuzzy checkpointing [5,10], in which the EC of a checkpoint becomes the BC of the next checkpoint, so the checkpointer is always active. The proposed hybrid logging scheme can also be applied to consecutive checkpointing by partitioning MMDB into several segments and checkpointing the segments in round-robin fashion. A segment consists of one or more pages, and every database object (relation, index, etc.) is stored in a segment. When the checkpointer is flushing the dirty pages of segment i, we use physical logging for objects in segment i and logical logging for objects in the other segments. The hybrid logging scheme in the segmented MMDB is stated as follows.

Hybrid Logging Scheme for Segmented MMDB: Write physical log records for objects in the segment that is under checkpointing, and use logical log records for objects in the other segments.

The ping-pong backup policy is used in the segmented MMDB as well: the checkpointer flushes the dirty pages of each segment to one of the two backup databases in round-robin fashion; a sketch of this loop follows.
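Below is a sketch of the consecutive, segmented checkpointing loop of Fig. 5, under the same assumed interfaces as the earlier checkpointing sketch. The checkpointing_segment attribute (replacing chkpt_flag) and the current_page_index attribute are our illustration of the mechanism the text describes.

    import itertools

    def consecutive_checkpointing(segments, log_buffer, backups, write_bc_slot):
        """Round-robin consecutive checkpointing of MMDB segments (after Fig. 5).
        BC_segid both opens the checkpoint of seg and closes the previous one."""
        for n, seg in enumerate(itertools.cycle(range(len(segments)))):
            with log_buffer.lock:
                bc_pos = len(log_buffer.current_page)
                log_buffer.append([("BC", seg)])
                # commits log physically for objects in seg, logically otherwise
                log_buffer.checkpointing_segment = seg
            delayed_backup_wait(log_buffer.current_page_index)  # adjust redo point
            write_bc_slot(seg, bc_pos)           # record BC_segid before backing up
            target = backups[n % 2]              # ping-pong backup databases
            for page in segments[seg].dirty_pages():
                target[page.pid] = page.data     # fuzzy: no locks, may be partial
                page.dirty = False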

In order to adjust the redo point of each segment, we also apply the delayed backup strategy to the checkpointing of each segment: before flushing the dirty pages of a segment, the checkpointer delays the backup starting point. A consecutive checkpointing procedure on the segmented MMDB is given in Fig. 5. The record BC_segid indicates the start of the checkpoint of segment segid as well as the end of the checkpoint of the previous segment. Whenever the checkpointer begins the backup of segment segid, it records the position of BC_segid at the well-known location on disk, as in the case of the non-segmented MMDB.

[Fig. 5. Consecutive checkpointing and commit work algorithm.]

3. Recovery processing

[Fig. 6. Consecutive checkpointing in MMDB with four segments: Checkpoint_1 and Checkpoint_2 complete before a crash occurs.]

Since the non-segmented MMDB can be considered a special case of the segmented MMDB with one segment, we describe only the recovery procedure for the segmented MMDB. Recovery processing consists of two phases: reloading the backup database and applying the log. In the reloading phase the last complete backed-up database is restored into main memory, and in the log applying phase log data are applied to the reloaded database. As described in the previous section, the redo point of segment i is BC_i when the delayed backup strategy for each segment is used.

To reestablish a consistent database state, we first have to determine the last complete checkpoint. Consider Fig. 6, which shows consecutive checkpointing in MMDB with four segments. In the non-segmented case, Checkpoint_1 would be the last complete checkpoint. However, the database backed up after checkpointing segment 1 in Checkpoint_2 already includes the updated database images for all the log records about segment 1 generated during Checkpoint_1; the delayed backup strategy guarantees this. The same argument also applies to segments 2 and 3. Therefore, the last complete checkpoint for recovery is Checkpoint_2. This approach reduces the amount of log data required for recovery. Note that a similar approach has been proposed in Ref. [16], which, however, uses stable memory as the log buffer. Based on this approach, the size of log data to be read from disks can be further reduced in the proposed hybrid logging scheme.

When the backup database has been reloaded into memory, the log records generated after the beginning of the last complete checkpoint, i.e. BC_4 in the example of Fig. 6, are applied to the reloaded database. At this point, we do not have to apply all the physical and logical log records to the reloaded backup database. This is because backup processing of segment i begins only after all transactions that wrote their log records before BC_i have finished their updates, according to the delayed backup strategy. This means that the pages backed up during the checkpointing of segment i contain all the after-images for the objects in segment i whose corresponding log records were generated before BC_i. Thus, we do not have to consider log records stored before BC_i when recovering segment i.

For example, consider Fig. 7, which shows the log records in the last complete checkpoint of Fig. 6. Here, L_i^j (or P_i^j) denotes the logical (or physical) log records for the objects in segment i, generated during the checkpointing of segment j. The checkpointing of segment 4 begins with BC_4, and the log records generated during that checkpoint consist of physical log records for the objects in segment 4 (i.e. P_4^4) and logical log records for objects in the other segments (i.e. L_1^4, L_2^4 and L_3^4).
Now consider the checkpointed image of segment 1 after the checkpointing of segment 1, i.e. just before BC_2. It reflects all the updates represented by L_1^4; therefore, L_1^4 need not be applied to the reloaded backup database at all. Likewise, L_2^4 and L_2^1 need not be applied, because the checkpointed image of segment 2 already reflects all the updates represented by L_2^4 and L_2^1. In other words, we can recover a segment by applying to the reloaded backup database only the logical log records for the objects in the previously recovered segments and the physical log records of the segment itself. Those log records are the circled ones in Fig. 7. This log applying method reduces the number of log records needed for recovery processing. The log applying policy is stated as follows.

Log Applying Rule: Consider the recovery of the segmented MMDB with N segments. We first have to establish a consistent database state from the last complete checkpoint. Suppose the last complete checkpoint was made on segments S_i1, S_i2, ..., S_iN in this order, where i1, i2, ..., iN is an arrangement of 1, 2, ..., N in the round-robin order. We scan the log from BC_Si1. While scanning the log records generated during the checkpointing of segment S_ik (k = 1, ..., N), we apply the physical log records for the objects in S_ik, i.e. the records denoted by P_ik^ik, and apply only the logical log records for the objects in the segments S_i1, ..., S_ik-1, i.e. the records denoted by L_q^ik, where q is i1, ..., ik-1. After establishing the consistent database state from the last complete checkpoint, apply to MMDB all the log records of the remaining committed transactions.

[Fig. 7. Contents of the log for the case in Fig. 6: BC_4, L_1^4, L_2^4, L_3^4, P_4^4; BC_1, P_1^1, L_2^1, L_3^1, L_4^1; BC_2, L_1^2, P_2^2, L_3^2, L_4^2; BC_3, L_1^3, L_2^3, P_3^3, L_4^3. Circled records are those applied at recovery.]
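The rule can be sketched as follows. The record shapes and the way the segment order is reconstructed from the BC records are our own illustration; the sketch also assumes, as in the paper, that the log contains only committed transactions' redo records.

    def apply_record(db, rec):
        """Illustrative redo of one record; formats are ours, not the paper's."""
        if rec[0] == "PHYS":                  # ("PHYS", seg, page, image)
            db[rec[2]] = rec[3]
        elif rec[0] == "LOGI":                # ("LOGI", seg, page, op, args)
            _, _, page, op, args = rec
            db[page] = op(db.get(page), *args)
        # BC records and other markers carry no data: ignore

    def recover(backup, log, n_segments):
        """Log Applying Rule: scan from BC_Si1 of the last complete checkpoint;
        within the checkpoint of S_ik apply P_ik^ik plus the logical records of
        already recovered segments; then redo all remaining committed records."""
        db = dict(backup)                     # reloaded backup database
        recovered, current, in_ckpt = set(), None, True
        for rec in log:
            kind, seg = rec[0], rec[1]
            if kind == "BC" and in_ckpt:
                if current is not None:
                    recovered.add(current)
                if len(recovered) == n_segments:
                    in_ckpt, current = False, None   # checkpoint fully scanned
                else:
                    current = seg
            elif in_ckpt:
                if (kind == "PHYS" and seg == current) or \
                   (kind == "LOGI" and seg in recovered):
                    apply_record(db, rec)     # anything else is already backed up
            else:
                apply_record(db, rec)         # records after the last checkpoint
        return db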

Note that in the segmented MMDB under conventional fuzzy checkpointing, which permits only physical logging, the main idea of the above log applying rule can also be used, provided the delayed backup strategy is adopted. In that case, the logical log records L_q^ik in the above description are replaced by physical log records P_q^ik.

4. Discussion

4.1. Fuzzy checkpoint state

To recover the database, the last complete checkpoint first needs to be reestablished by applying the log records from the redo point to the ending record of the checkpoint. In the case of the hybrid logging scheme, the reestablished database state should be either a transaction-consistent or an action-consistent state. Otherwise, we cannot apply logical log records to the recovered checkpoint, because logical log records are effective only when the database is in a consistent state [11]. In our method, we achieve a transaction-consistent database state from the last complete (fuzzy) checkpoint by using shadow updating and a private log space for each transaction, since these policies ensure that the log records of a transaction are copied to the log buffer as one unit at commit time.

4.2. Consistency of logical logging

Under the hybrid logging scheme, checkpointing converts the way of logging from logical to physical, i.e. from object-level log records to page-level log records. To bridge this gap, we must ensure that the redo result of a logical log record is page-action consistent; in other words, the redo of a logical log record must produce the same result as its execution during normal processing. There is little research related to page-action consistency of logical logging [7,17]. Ref. [7] proposes an abstract data-type modelling of logical operations. Ref. [17] presents a physiological logging scheme with another form of logical log record that includes page-number information. Our policy below is based on Ref. [7].

First, we consider locking granularity. To maintain page-action consistency, we must prevent more than one transaction from concurrently updating the same page; that is, executions have to be strict at the level of pages. With page-level locking granularity, S2PL guarantees serializable execution on a page in the commit order of the transactions on the log.

An operation that needs a new page allocation must be handled carefully. The corresponding redo logical log record must contain not only the operation requesting the allocation, but also the allocated page number. During the execution step, we determine the page to which an operation applies and allocate a new page if needed; the page number of the newly allocated page is stored in the log record. In the MMDB updating step of the commit_work, the updated data in the shadow area for the new page are reflected to MMDB. When the operation is re-executed at restart, the allocation module reads the page number from the log record and allocates that specific page.

4.3. Shadow area size

The limited size of the shadow area may cause performance degradation.
However, the requirements of normal transactions are in general confined to a small subset of the entire database [18]. Ref. [19], a study based on actual reference strings, indicates that the minimum cache size for the DB cache is about 100 cache pages each of 200KB. Moreover, since the shadow area contains only portions of updated pages with smaller shadow granules, the required shadow area would be much smaller than this, as described in Ref. [13]. Some approaches proposed in Ref. [13] can also be used to minimize the size of the shadow area.

4.4. Correctness

In this section, we argue that our recovery method restores the database to a consistent state that preserves the serialization order of committed transactions. Since we use S2PL with page-level granularity under shadow updating, no page may be read or overwritten until the transaction that previously wrote it terminates, either by aborting or by committing [11]. Thus, all the transactions' log data are recorded on the log in execution order, because all committed transactions release their locked pages only after the logging and MMDB updating steps of our transaction processing model. By using shadow updating and the private log space, we record only the redo log data of committed transactions; this avoids undoing transactions at restart and makes the log applying step simple. A transaction can commit only when the log page containing its log records has been flushed to the log disk. Furthermore, the delayed backup strategy guarantees that the log records written before BC do not have to be applied, because the backup database generated by the backup processing includes all the images corresponding to those log records, as described in the previous section. This also implies that the beginning record of the last complete checkpoint is the redo point. Thus, the proposed recovery method restores the database to a consistent state preserving the serialization order, by applying to the last complete checkpoint the log records of committed transactions from the redo point.

Table 1
Ratio of size of log data

Case  Logging   Log applying rule  ChkptLogSize            ApplyLogSize
1     Physical  X                  1                       1
2     Physical  O                  1                       1/2 + 1/(2N)
3     Hybrid    O                  L/P + 1/N - L/(NP)      L/(2P) + 1/N - L/(2NP)

5. Performance analysis

In this section, we analyze the performance of the proposed method on the segmented MMDB with consecutive checkpointing. The metrics are the size of the log data generated in a complete checkpoint (ChkptLogSize), the size of the log data to be applied for recovering the last complete checkpoint (ApplyLogSize), and the recovery time. ChkptLogSize has a direct effect on the recovery time; ApplyLogSize measures the effect of the log applying rule.

5.1. Reduced ratio of log data

5.1.1. Without hotspot

MMDB is partitioned into N segments. We assume that all the segments are uniformly accessed by transactions; that is, the log records generated during the checkpoint of one segment consist of S log records for each segment, so SN log records are generated during a segment checkpoint. Let P be the size of a physical log record and L be the size of a logical log record. When all the segments are uniformly accessed, ChkptLogSize under physical-only logging is the per-segment log size multiplied by P and by the number of segments:

    ChkptLogSize_physical = SN * P * N = S N^2 P.

If the hybrid logging scheme is used, the log records generated in a checkpoint consist of physical log records for objects in the checkpointed segment and logical log records for objects in the other segments. The size of the log data generated during one segment checkpoint is SP + S(N-1)L. Thus, ChkptLogSize under hybrid logging is

    ChkptLogSize_hybrid = [SP + S(N-1)L] * N = S N^2 L + SNP - SNL.    (1)

Next, we analyze ApplyLogSize. When only physical logging is used and our log applying rule is not applied, ApplyLogSize in the segmented MMDB is the same as ChkptLogSize_physical, i.e. S N^2 P. When the proposed log applying rule is used, ApplyLogSize under physical logging is

    ApplyLogSize_physical = SP + 2SP + ... + NSP
                          = SP * N(N+1)/2
                          = (1/2) S N^2 P + (1/2) SNP.    (2)

When the proposed hybrid logging scheme is used, only logical log records are applied to the previously recovered segments under the log applying rule. Thus, ApplyLogSize under hybrid logging is

    ApplyLogSize_hybrid = SP + (SP + SL) + (SP + 2SL) + ... + [SP + (N-1)SL]
                        = SPN + SL * N(N-1)/2
                        = (1/2) S N^2 L + SPN - (1/2) SNL.    (3)

[Fig. 8. Ratio of reduced log size versus the number of segments, for ApplyLogSize of case 2, ChkptLogSize of case 3, and ApplyLogSize of case 3.]
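As a quick numerical check of the Table 1 ratios (our sketch, using the record sizes L = 64 and P = 192 words adopted below from Ref. [5]):

    L, P = 64.0, 192.0   # logical / physical log record sizes in words (Ref. [5])

    for N in (1, 2, 4, 8, 16, 32):
        chkpt_hybrid = L/P + 1.0/N - L/(N*P)        # Table 1, case 3 (Eq. (1)/SN^2P)
        apply_phys   = 0.5 + 1.0/(2*N)              # case 2 (Eq. (2)/SN^2P)
        apply_hybrid = L/(2*P) + 1.0/N - L/(2*N*P)  # case 3 (Eq. (3)/SN^2P)
        print(f"N={N:2d}  ChkptLogSize(hybrid)={chkpt_hybrid:.3f}  "
              f"ApplyLogSize(physical)={apply_phys:.3f}  "
              f"ApplyLogSize(hybrid)={apply_hybrid:.3f}")

For N = 8, for instance, the hybrid checkpoint log is about 0.42 of the physical-only size and the applied log about 0.27, consistent with the "about half" and "more than half" reductions discussed next.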

To evaluate the effect of the hybrid logging and the log applying scheme, Table 1 shows the ratio of the log size under hybrid logging to that under physical-only logging; that is, Eqs. (1)-(3) divided by the log size under physical-only logging, ChkptLogSize_physical. According to Ref. [5], we assume that L is 64 words and P is 192 words. The ratio of the size of log data with varying N is shown in Fig. 8. The size of the log data is inversely proportional to the number of segments, because the portion of physical log records in a segment checkpoint shrinks as the number of segments increases. Even with a small number of segments, the size of the log data generated during a checkpoint can be reduced to about half of that under physical-only logging. For ApplyLogSize, using both the hybrid logging and the log applying scheme reduces the size of the log data to be applied for recovery by more than half.

5.1.2. With hotspot

[Fig. 9. Effects of reduced log under hotspots: ratio of log size versus the number of segments, for hotspot rates of 20%, 30%, 40% and 50%.]

Table 2
Parameters and their defaults

Symbol      Meaning                         Default
S_db        Database size                   512M words
S_lpg       Log page size                   1024 words
S_page      Page size                       8K words
S_rec       Record size                     32 words
S_op        Logical log entry size          32 words
S_init      Log header size                 32 words
M           Concurrent transactions         100
T_rate      Transaction arrival rate        1000 TPS
T_seek      Average seek time               s
T_latency   Average rotation time           s
T_transfer  Average transfer time           s/page
N_bdisks    Number of backup disks          20
N_act       Actions per transaction         5
P_abort     Abort probability               0.05
f_H         Fraction of hotspot             0.2
f_hotact    Fraction of actions to hotspot  1 - f_H
R_spa       Pages per action                1.1

In this section, we consider hotspots in the database and measure only ChkptLogSize. We assume that f_H of all database pages receive (1 - f_H) of the accesses, e.g. the 80-20 rule. Let H be the number of log records generated during a complete checkpoint of all segments. First, suppose the database is partitioned into two segments, hotspot and non-hotspot. Then the generation rate of dirty pages in the hotspot segment may be taken equal to the access rate of the hotspot segment, (1 - f_H). Thus, the number of log records generated during the checkpoint of the hotspot segment is (1 - f_H)H. Assuming a uniform distribution over accessed positions, a (1 - f_H) portion of these (1 - f_H)H log records relates to objects in the hotspot segment. Thus, (1 - f_H)(1 - f_H)H log records are physical, and the remaining f_H(1 - f_H)H log records are logical. A similar computation applies to the non-hotspot segment. Then, ChkptLogSize with the two segments is

    ChkptLogSize_2seg = (1 - f_H) H [(1 - f_H) P + f_H L] + f_H H [f_H P + (1 - f_H) L].

We expand the above idea to N segments: N_H = N f_H hotspot segments and (N - N_H) non-hotspot segments. We assume that every hotspot (or non-hotspot) segment has the same access rate. Thus, the number of log records generated during the checkpoint of a hotspot segment is (1 - f_H) H / N_H. Among these log records, only a (1 - f_H)/N_H portion relates to objects in the segment that is under checkpointing, and these log records are physical. The log records for the remaining hotspot segments and all non-hotspot segments are logical.
Thus, the size of the log data generated during the checkpoint of a hotspot segment is

    [(1 - f_H) H / N_H] * [ ((1 - f_H)/N_H) P + ( (1 - f_H)(N_H - 1)/N_H + f_H ) L ].    (4)

Similarly, the size of the log data for a non-hotspot segment is

    [f_H H / (N - N_H)] * [ (f_H/(N - N_H)) P + ( f_H (N - N_H - 1)/(N - N_H) + (1 - f_H) ) L ].    (5)

Since there are N_H hotspot segments and (N - N_H) non-hotspot segments, the size of the log data generated during the last complete checkpoint in the segmented MMDB, ChkptLogSize_Nseg, is the sum of Eqs. (4) and (5) multiplied by N_H and (N - N_H), respectively. The size of the log data under physical-only logging is HP.
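Summing N_H copies of Eq. (4) and (N - N_H) copies of Eq. (5) and dividing by HP gives the ratio stated as Eq. (6) just below. A small numeric sketch (our coding of the formula):

    def chkpt_ratio(N, f_H, L=64.0, P=192.0):
        """ChkptLogSize_Nseg / (H*P): hybrid-to-physical log-size ratio, Eq. (6)."""
        N_H = N * f_H                    # number of hotspot segments
        N_C = N - N_H                    # number of non-hotspot segments
        hot  = (1-f_H) * ((1-f_H)/N_H * P + ((1-f_H)*(N_H-1)/N_H + f_H) * L)
        cold = f_H * (f_H/N_C * P + (f_H*(N_C-1)/N_C + (1-f_H)) * L)
        return (hot + cold) / P

    for f_H in (0.2, 0.3, 0.4, 0.5):
        print(f"f_H={f_H}: " + "  ".join(f"N={N}:{chkpt_ratio(N, f_H):.3f}"
                                         for N in (10, 20, 50)))
    # at f_H = 0.5 the ratio reduces to 1/N + (L/P)(1 - 1/N): Table 1, case 3

The ratio comes out smallest at f_H = 0.5, i.e. uniform access, in line with the discussion of Fig. 9 below.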

Thus, the ratio of the log size under hybrid logging to that under physical-only logging is

    (1 - f_H) [ (1 - f_H)/N_H + ( (1 - f_H)(N_H - 1)/N_H + f_H ) L/P ]
      + f_H [ f_H/(N - N_H) + ( f_H (N - N_H - 1)/(N - N_H) + (1 - f_H) ) L/P ].    (6)

If we consider an equal access rate for each segment, f_H = 1/2, Eq. (6) reduces to 1/N + L/P - L/(NP), which is the same as the ChkptLogSize of case 3 in Table 1. Fig. 9 shows the ratio ChkptLogSize_Nseg with varying N for several hotspot rates, with L of 64 words and P of 192 words. With more segments, the portion of physical logging becomes smaller, and thus the overall size of the log data is reduced. The results also show that the size of the log data is smallest at a 50% hotspot rate. A 50% hotspot rate means that all segments are accessed uniformly by transactions, so this result indicates that for MMDB segmentation, the access rate of the segments is a more important partitioning factor than their size.

[Fig. 10. Recovery times of the three logging schemes (Physical; Hybrid at 20% and 50% hotspot rates; Logical) versus the number of segments.]

5.2. Recovery time

5.2.1. Parameters

The parameters for analyzing the recovery time comprise the size of the database, the sizes of the log-related structures, the disk I/O times, the abort ratio, etc. Table 2 shows these parameters and their default values, derived from Refs. [5] and [10].

5.2.2. Delay time

In our recovery method, the checkpointing time increases by the delay of the delayed backup strategy, so before measuring the recovery time we analyze the delay time. After writing the BC record to the current log page, the checkpointer waits until all transactions in the updating step of the commit_work finish their MMDB updates, rather than waiting until they commit. Hence no disk I/O is involved in the delay, and the delay time is proportional to the amount of updating work and the number of transactions. The maximum delay time is therefore bounded by the sum of the updating times in a serial execution of the maximum number of concurrent transactions. Given the parameters in Table 2, the number of words updated by these transactions is S_rec * N_act * M. If reading or writing a word takes one CPU instruction, the maximum delay time equals the time for S_rec * N_act * M instructions. Thus, with 100 MIPS CPU processing power, the maximum delay time is about 1.6 x 10^-4 s (16 000 instructions), and no disk I/O occurs during the delay. Compared with the inter-checkpoint interval, this delay is so small that it has little influence on the interval. We therefore do not consider the delay time in the checkpoint interval for the recovery-time analysis.

5.2.3. Analysis

The recovery time consists of the backup database reloading time, the log page reading time, and the log applying time. For simplicity, only the recovery of the last complete checkpoint is considered. The time to read the backup database, T_back, is

    T_back = (S_db / S_page) * (T_seek + T_latency + T_transfer).

The size of the log data for recovering the last complete checkpoint, S_log, is

    S_log = (1 - P_abort) * T_rate * t_icp * D_redo,

where t_icp is the inter-checkpoint interval and D_redo is the size of the redo log data per transaction. If only physical logging is used, D_redo = S_init + S_rec * N_act. When only logical logging is used, we simply assume D_redo = S_op + S_init. The D_redo of hybrid logging is calculated according to the result of Fig. 9. The interval t_icp is the period between the beginnings of consecutive checkpoints and is determined by the number of dirty pages and the I/O capability.
According to Ref. [10], the expected number of dirty pages generated during time t, N_dirty(t), is

    N_dirty(t) = [ 1 - (1 - R_spa / (f_H N_page))^(f_hotact N_act T_rate t) ] f_H N_page
               + [ 1 - (1 - R_spa / ((1 - f_H) N_page))^((1 - f_hotact) N_act T_rate t) ] (1 - f_H) N_page,

where N_page is S_db / S_page. Since we use the ping-pong backup policy, the number of dirty pages to be flushed during time t, N_flush(t), is N_dirty(2 t_icp). According to Ref. [5], the number of pages that can be written out to the disks during time t, N_io(t), is given by

    N_io(t) = N_bdisks t / (T_seek + T_latency + T_transfer).    (7)
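To illustrate how t_icp is obtained (the condition N_flush(t_icp) = N_io(t_icp) used in the next paragraph), here is a numeric sketch. The disk-timing defaults of Table 2 are not available in this transcription, so the T_seek, T_latency and T_transfer values below are stated assumptions, not the paper's; the resulting t_icp is only illustrative.

    S_db, S_page = 512 * 2**20, 8 * 2**10     # words (Table 2, taking M = 2**20)
    N_page = S_db // S_page                   # = 65 536 pages
    N_bdisks, N_act, T_rate = 20, 5, 1000.0
    R_spa, f_H, f_hotact = 1.1, 0.2, 0.8      # f_hotact = 1 - f_H

    # ASSUMED disk timings (seconds); the paper's defaults were lost.
    T_seek, T_latency, T_transfer = 0.010, 0.008, 0.002
    T_io = T_seek + T_latency + T_transfer

    def n_dirty(t):
        hot = (1 - (1 - R_spa/(f_H*N_page)) ** (f_hotact*N_act*T_rate*t)) \
              * f_H * N_page
        cold = (1 - (1 - R_spa/((1-f_H)*N_page)) ** ((1-f_hotact)*N_act*T_rate*t)) \
               * (1-f_H) * N_page
        return hot + cold

    def n_io(t):
        return N_bdisks * t / T_io            # Eq. (7)

    # minimum t_icp solves N_dirty(2*t_icp) = N_io(t_icp); simple bisection
    lo, hi = 1e-3, 1e5
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if n_dirty(2*mid) > n_io(mid) else (lo, mid)
    print(f"minimum t_icp ~ {lo:.1f} s with the assumed disk timings")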

By setting N_flush(t) = N_io(t), we find the minimum t_icp. In general, since disk reading and CPU processing can be overlapped, and since disk I/O time is much larger than CPU processing time, the log page reading time may be regarded as the total log processing time. Moreover, due to the locality of the log, which is a sequential file, the actual average seek time may be only 25-33% of the nominal figure (T_seek) [20]. Thus, the time to read the log, T_readlog, is

    T_readlog = (S_log / S_lpg) * (0.3 T_seek + T_latency + T_transfer).

Fig. 10 shows the result of analyzing the recovery time, i.e. T_back + T_readlog. The recovery time of logical logging (Logical) is an ideal case, because logical-only logging cannot actually be used with fuzzy checkpointing in MMDB. Our hybrid logging scheme (Hybrid) performs better than physical logging (Physical). As more segments make the portion of physical log data smaller, the recovery time of hybrid logging converges to that of logical logging. With 20 segments, the recovery time gap between physical logging and hybrid logging is about 20 s at a 20% hotspot rate. This is not a small time in high transaction processing: at the 1000 TPS rate, some 20 000 transactions can be processed during the gap.

Fig. 11 shows the recovery time with varying transaction arrival rates (T_rate) and pages per action (R_spa) when the number of segments is 20. As the arrival rate of transactions and the number of pages per action increase, more dirty pages are produced, which increases the number of corresponding log records. In conventional fuzzy checkpointing these log records are physical, so the increased amount of log data affects the recovery time significantly. In our hybrid logging scheme, by contrast, the portion of physical logging is considerably reduced, so the influence of the increased amount of log data is much smaller than in conventional fuzzy checkpointing.

[Fig. 11. Recovery time at various T_rate and R_spa (curves: Physical; Hybrid at 20% and 50% hotspot rates; Logical).]

Finally, we have analyzed the log processing time with varying database sizes; Fig. 12 shows the result. Since the time to read the backup database is fixed for a given database size, we compare the recovery time with respect to the log processing time. As in the previous results, our proposed scheme performs better than conventional fuzzy checkpointing based on physical logging. Here, even though the database becomes larger, the number of pages to be flushed does not increase in proportion to the database size, because the parameters governing update rates, e.g. T_rate and R_spa, are fixed. However, even a small increase in the number of flushed pages has a large influence on t_icp as determined via Eq. (7). The related data are

[Fig. 12. Log processing time at various database sizes.]

described in Table 3, which shows the changes of t_icp and of the number of pages to be flushed per second with varying database sizes. Due to the increase of the checkpointing interval, the recovery time increases in proportion to the growth of the database size.

Table 3
Values of t_icp and N_dirty(1) for varying database sizes

Database size (G words)    t_icp (s)    N_dirty(1)

6. Concluding remarks

Fuzzy checkpointing is an efficient way to back up MMDB because of its asynchronous flushing feature. Most previous work on fuzzy checkpointing in MMDB has used physical logging. Although physical logging is relatively simple to apply, it incurs space and recovery-time overhead. In this paper, we have focused on reducing the size of the log data and have proposed a recovery method based on the hybrid logging scheme. The hybrid logging scheme accommodates logical logging under fuzzy checkpointing whenever applicable, and thus significantly reduces the size of the log data. We have also presented an efficient log applying rule for the segmented MMDB, which makes recovery processing efficient by reducing the number of log records to be applied.

We have performed analyses to evaluate the proposed method. The results show that our method reduces the size of the log data by more than half compared with physical-only logging, and that the size of the log data is inversely proportional to the number of segments. We have also shown that the recovery time of the proposed method can approach that of the ideal case in which only logical logging is used. The results for the log applying rule show that the number of log records to be applied for recovery can be reduced to less than half of the number generated in the segmented MMDB. Thus, the hybrid logging scheme together with the log applying rule makes ordinary log processing as well as database recovery quite efficient.

References

[1] H. Garcia-Molina, K. Salem, Main memory database systems: an overview, IEEE Transactions on Knowledge and Data Engineering 4 (6) (1992).
[2] D.J. DeWitt, R.H. Katz, F. Olken, L.D. Shapiro, M.R. Stonebraker, D. Wood, Implementation techniques for main memory database systems, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1984.
[3] M.H. Eich, A classification and comparison of main memory database recovery techniques, in: Proceedings of the International Conference on Data Engineering, IEEE, 1987.
[4] R.B. Hagmann, A crash recovery scheme for a memory-resident database system, IEEE Transactions on Computers C-35 (9) (1986).
[5] K. Salem, H. Garcia-Molina, Checkpointing memory-resident databases, in: Proceedings of the International Conference on Data Engineering, 1989.
[6] L. Gruenwald, M.H. Eich, MMDB reload algorithms, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM, 1991.
[7] H.V. Jagadish, A. Silberschatz, S. Sudarshan, Recovering from main-memory lapses, in: Proceedings of the 19th International Conference on Very Large Data Bases, 1993.
[8] T.J. Lehman, M.J. Carey, A recovery algorithm for a high-performance memory-resident database system, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1987.
[9] E. Levy, A. Silberschatz, Incremental recovery in main memory database systems, IEEE Transactions on Knowledge and Data Engineering 4 (6) (1992).
[10] X. Li, M.H. Eich, Post-crash log processing for fuzzy checkpointing main memory databases, in: Proceedings of the International Conference on Data Engineering, IEEE, 1993.
[11] P.A. Bernstein, V. Hadzilacos, N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, Reading, MA, 1987.
[12] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, P. Schwarz, ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging, ACM Transactions on Database Systems 17 (1) (1992).
[13] M.H. Eich, Main memory database research directions, Technical Report TR 88-CSE-35, Southern Methodist University, 1988.
[14] V. Kumar, A. Burger, Performance measurement of main memory database recovery algorithms based on update-in-place and shadow approaches, IEEE Transactions on Knowledge and Data Engineering 4 (6) (1992).
[15] K. Salem, H. Garcia-Molina, Crash recovery for memory-resident databases, Technical Report CS-TR, Department of Computer Science, Princeton University, November.
[16] J.-L. Lin, M.H. Dunham, Segmented fuzzy checkpointing for main memory databases, in: Proceedings of the ACM Symposium on Applied Computing, February 1996.
[17] J. Gray, A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann, Los Altos, CA, 1993.
[18] V. Kumar, Recovery in main memory database systems, in: Proceedings of Database and Expert Systems Applications, 1996.
[19] K. Elhardt, R. Bayer, A database cache for high performance and fast restart in database systems, ACM Transactions on Database Systems 9 (4) (1984).
[20] J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann, Los Altos, CA, 1996.


More information

Recovery. P.J. M c.brien. Imperial College London. P.J. M c.brien (Imperial College London) Recovery 1 / 1

Recovery. P.J. M c.brien. Imperial College London. P.J. M c.brien (Imperial College London) Recovery 1 / 1 Recovery P.J. M c.brien Imperial College London P.J. M c.brien (Imperial College London) Recovery 1 / 1 DBMS Architecture REDO and UNDO transaction manager result reject delay scheduler execute begin read

More information

Nimble Storage Best Practices for Microsoft SQL Server

Nimble Storage Best Practices for Microsoft SQL Server BEST PRACTICES GUIDE: Nimble Storage Best Practices for Microsoft SQL Server Summary Microsoft SQL Server databases provide the data storage back end for mission-critical applications. Therefore, it s

More information

Using Logs to Increase Availability in Real-Time. Tiina Niklander and Kimmo Raatikainen. University of Helsinki, Department of Computer Science

Using Logs to Increase Availability in Real-Time. Tiina Niklander and Kimmo Raatikainen. University of Helsinki, Department of Computer Science Using Logs to Increase Availability in Real-Time Main-Memory Tiina Niklander and Kimmo Raatikainen University of Helsinki, Department of Computer Science P.O. Box 26(Teollisuuskatu 23), FIN-14 University

More information

Recovery algorithms are techniques to ensure transaction atomicity and durability despite failures. Two main approaches in recovery process

Recovery algorithms are techniques to ensure transaction atomicity and durability despite failures. Two main approaches in recovery process Database recovery techniques Instructor: Mr Mourad Benchikh Text Books: Database fundamental -Elmesri & Navathe Chap. 21 Database systems the complete book Garcia, Ullman & Widow Chap. 17 Oracle9i Documentation

More information

Recovery Protocols For Flash File Systems

Recovery Protocols For Flash File Systems Recovery Protocols For Flash File Systems Ravi Tandon and Gautam Barua Indian Institute of Technology Guwahati, Department of Computer Science and Engineering, Guwahati - 781039, Assam, India {r.tandon}@alumni.iitg.ernet.in

More information

Recovery System C H A P T E R16. Practice Exercises

Recovery System C H A P T E R16. Practice Exercises C H A P T E R16 Recovery System Practice Exercises 16.1 Explain why log records for transactions on the undo-list must be processed in reverse order, whereas redo is performed in a forward direction. Answer:

More information

Availability Digest. MySQL Clusters Go Active/Active. December 2006

Availability Digest. MySQL Clusters Go Active/Active. December 2006 the Availability Digest MySQL Clusters Go Active/Active December 2006 Introduction MySQL (www.mysql.com) is without a doubt the most popular open source database in use today. Developed by MySQL AB of

More information

On the Ubiquity of Logging in Distributed File Systems

On the Ubiquity of Logging in Distributed File Systems On the Ubiquity of Logging in Distributed File Systems M. Satyanarayanan James J. Kistler Puneet Kumar Hank Mashburn School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Logging is

More information

Laszlo Boszormenyi, Johann Eder, Carsten Weich. Institut fur Informatik, Universitat Klagenfurt

Laszlo Boszormenyi, Johann Eder, Carsten Weich. Institut fur Informatik, Universitat Klagenfurt PPOST: A Parallel Database in Main Memory Laszlo Boszormenyi, Johann Eder, Carsten Weich Institut fur Informatik, Universitat Klagenfurt Universitatsstr. 65, A-9020 Klagenfurt, Austria e-mail: flaszlo,eder,carsteng@i.uni-klu.ac.at

More information

Udai Shankar 2 Deptt. of Computer Sc. & Engineering Madan Mohan Malaviya Engineering College, Gorakhpur, India

Udai Shankar 2 Deptt. of Computer Sc. & Engineering Madan Mohan Malaviya Engineering College, Gorakhpur, India A Protocol for Concurrency Control in Real-Time Replicated Databases System Ashish Srivastava 1 College, Gorakhpur. India Udai Shankar 2 College, Gorakhpur, India Sanjay Kumar Tiwari 3 College, Gorakhpur,

More information

Introduction to Database Management Systems

Introduction to Database Management Systems Database Administration Transaction Processing Why Concurrency Control? Locking Database Recovery Query Optimization DB Administration 1 Transactions Transaction -- A sequence of operations that is regarded

More information

DATABASDESIGN FÖR INGENJÖRER - 1DL124

DATABASDESIGN FÖR INGENJÖRER - 1DL124 1 DATABASDESIGN FÖR INGENJÖRER - 1DL124 Sommar 2005 En introduktionskurs i databassystem http://user.it.uu.se/~udbl/dbt-sommar05/ alt. http://www.it.uu.se/edu/course/homepage/dbdesign/st05/ Kjell Orsborn

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization

More information

Recovery: An Intro to ARIES Based on SKS 17. Instructor: Randal Burns Lecture for April 1, 2002 Computer Science 600.416 Johns Hopkins University

Recovery: An Intro to ARIES Based on SKS 17. Instructor: Randal Burns Lecture for April 1, 2002 Computer Science 600.416 Johns Hopkins University Recovery: An Intro to ARIES Based on SKS 17 Instructor: Randal Burns Lecture for April 1, 2002 Computer Science 600.416 Johns Hopkins University Log-based recovery Undo logging Redo logging Restart recovery

More information

Implementation of Buffer Cache Simulator for Hybrid Main Memory and Flash Memory Storages

Implementation of Buffer Cache Simulator for Hybrid Main Memory and Flash Memory Storages Implementation of Buffer Cache Simulator for Hybrid Main Memory and Flash Memory Storages Soohyun Yang and Yeonseung Ryu Department of Computer Engineering, Myongji University Yongin, Gyeonggi-do, Korea

More information

SOS: Software-Based Out-of-Order Scheduling for High-Performance NAND Flash-Based SSDs

SOS: Software-Based Out-of-Order Scheduling for High-Performance NAND Flash-Based SSDs SOS: Software-Based Out-of-Order Scheduling for High-Performance NAND -Based SSDs Sangwook Shane Hahn, Sungjin Lee, and Jihong Kim Department of Computer Science and Engineering, Seoul National University,

More information

Transactions and Recovery. Database Systems Lecture 15 Natasha Alechina

Transactions and Recovery. Database Systems Lecture 15 Natasha Alechina Database Systems Lecture 15 Natasha Alechina In This Lecture Transactions Recovery System and Media Failures Concurrency Concurrency problems For more information Connolly and Begg chapter 20 Ullmanand

More information

VERITAS Database Edition 2.1.2 for Oracle on HP-UX 11i. Performance Report

VERITAS Database Edition 2.1.2 for Oracle on HP-UX 11i. Performance Report VERITAS Database Edition 2.1.2 for Oracle on HP-UX 11i Performance Report V E R I T A S W H I T E P A P E R Table of Contents Introduction.................................................................................1

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Boost SQL Server Performance Buffer Pool Extensions & Delayed Durability

Boost SQL Server Performance Buffer Pool Extensions & Delayed Durability Boost SQL Server Performance Buffer Pool Extensions & Delayed Durability Manohar Punna President - SQLServerGeeks #509 Brisbane 2016 Agenda SQL Server Memory Buffer Pool Extensions Delayed Durability Analysis

More information

Recovery: Write-Ahead Logging

Recovery: Write-Ahead Logging Recovery: Write-Ahead Logging EN 600.316/416 Instructor: Randal Burns 4 March 2009 Department of Computer Science, Johns Hopkins University Overview Log-based recovery Undo logging Redo logging Restart

More information

ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging

ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging C. MOHAN IBM Almaden Research Center and DON HADERLE IBM Santa Teresa Laboratory

More information

Boosting Database Batch workloads using Flash Memory SSDs

Boosting Database Batch workloads using Flash Memory SSDs Boosting Database Batch workloads using Flash Memory SSDs Won-Gill Oh and Sang-Won Lee School of Information and Communication Engineering SungKyunKwan University, 27334 2066, Seobu-Ro, Jangan-Gu, Suwon-Si,

More information

Smooth and Flexible ERP Migration between both Homogeneous and Heterogeneous ERP Systems/ERP Modules

Smooth and Flexible ERP Migration between both Homogeneous and Heterogeneous ERP Systems/ERP Modules 28.8.2008 Smooth and Flexible ERP Migration between both Homogeneous and Heterogeneous ERP Systems/ERP Modules Lars Frank Department of Informatics, Copenhagen Business School, Howitzvej 60, DK-2000 Frederiksberg,

More information

A Simple and Efficient Implementation for. Small Databases. Andrew D. Birreil Michael B. Jones Edward P. Wobber* 1. THE PROBLEM

A Simple and Efficient Implementation for. Small Databases. Andrew D. Birreil Michael B. Jones Edward P. Wobber* 1. THE PROBLEM A Simple and Efficient Implementation for Small Databases Andrew D Birreil Michael B Jones Edward P Wobber* ABSTRACT: This paper describes a technique for implementing the sort of small databases that

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 5 - DBMS Architecture

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 5 - DBMS Architecture CSE 544 Principles of Database Management Systems Magdalena Balazinska Fall 2007 Lecture 5 - DBMS Architecture References Anatomy of a database system. J. Hellerstein and M. Stonebraker. In Red Book (4th

More information

Recovery Guarantees in Mobile Systems

Recovery Guarantees in Mobile Systems Accepted for the International Workshop on Data Engineering for Wireless and Mobile Access, Seattle, 20 August 1999 Recovery Guarantees in Mobile Systems Cris Pedregal Martin and Krithi Ramamritham Computer

More information

Energy aware RAID Configuration for Large Storage Systems

Energy aware RAID Configuration for Large Storage Systems Energy aware RAID Configuration for Large Storage Systems Norifumi Nishikawa norifumi@tkl.iis.u-tokyo.ac.jp Miyuki Nakano miyuki@tkl.iis.u-tokyo.ac.jp Masaru Kitsuregawa kitsure@tkl.iis.u-tokyo.ac.jp Abstract

More information

Design of a NAND Flash Memory File System to Improve System Boot Time

Design of a NAND Flash Memory File System to Improve System Boot Time International Journal of Information Processing Systems, Vol.2, No.3, December 2006 147 Design of a NAND Flash Memory File System to Improve System Boot Time Song-Hwa Park*, Tae-Hoon Lee*, and Ki-Dong

More information

How To Improve Performance On A Single Chip Computer

How To Improve Performance On A Single Chip Computer : Redundant Arrays of Inexpensive Disks this discussion is based on the paper:» A Case for Redundant Arrays of Inexpensive Disks (),» David A Patterson, Garth Gibson, and Randy H Katz,» In Proceedings

More information

Microkernels & Database OSs. Recovery Management in QuickSilver. DB folks: Stonebraker81. Very different philosophies

Microkernels & Database OSs. Recovery Management in QuickSilver. DB folks: Stonebraker81. Very different philosophies Microkernels & Database OSs Recovery Management in QuickSilver. Haskin88: Roger Haskin, Yoni Malachi, Wayne Sawdon, Gregory Chan, ACM Trans. On Computer Systems, vol 6, no 1, Feb 1988. Stonebraker81 OS/FS

More information

(Pessimistic) Timestamp Ordering. Rules for read and write Operations. Pessimistic Timestamp Ordering. Write Operations and Timestamps

(Pessimistic) Timestamp Ordering. Rules for read and write Operations. Pessimistic Timestamp Ordering. Write Operations and Timestamps (Pessimistic) stamp Ordering Another approach to concurrency control: Assign a timestamp ts(t) to transaction T at the moment it starts Using Lamport's timestamps: total order is given. In distributed

More information

Chapter 10: Distributed DBMS Reliability

Chapter 10: Distributed DBMS Reliability Chapter 10: Distributed DBMS Reliability Definitions and Basic Concepts Local Recovery Management In-place update, out-of-place update Distributed Reliability Protocols Two phase commit protocol Three

More information

Los Angeles, CA, USA 90089-2561 [kunfu, rzimmerm]@usc.edu

Los Angeles, CA, USA 90089-2561 [kunfu, rzimmerm]@usc.edu !"$#% &' ($)+*,#% *.- Kun Fu a and Roger Zimmermann a a Integrated Media Systems Center, University of Southern California Los Angeles, CA, USA 90089-56 [kunfu, rzimmerm]@usc.edu ABSTRACT Presently, IP-networked

More information

Course Content. Transactions and Concurrency Control. Objectives of Lecture 4 Transactions and Concurrency Control

Course Content. Transactions and Concurrency Control. Objectives of Lecture 4 Transactions and Concurrency Control Database Management Systems Fall 2001 CMPUT 391: Transactions & Concurrency Control Dr. Osmar R. Zaïane University of Alberta Chapters 18 and 19 of Textbook Course Content Introduction Database Design

More information

Reliable Systolic Computing through Redundancy

Reliable Systolic Computing through Redundancy Reliable Systolic Computing through Redundancy Kunio Okuda 1, Siang Wun Song 1, and Marcos Tatsuo Yamamoto 1 Universidade de São Paulo, Brazil, {kunio,song,mty}@ime.usp.br, http://www.ime.usp.br/ song/

More information

An Implementation of Transaction Logging and Recovery in a Main Memory Resident Database System

An Implementation of Transaction Logging and Recovery in a Main Memory Resident Database System An Implementation of Transaction Logging and Recovery in a Main Memory Resident Database System Abstract Keywords This report describes an implementation of Transaction Logging and Recovery using Unix

More information

Recovery Principles of MySQL Cluster 5.1

Recovery Principles of MySQL Cluster 5.1 Recovery Principles of MySQL Cluster 5.1 Mikael Ronström Jonas Oreland MySQL AB Bangårdsgatan 8 753 20 Uppsala Sweden {mikael, jonas}@mysql.com Abstract MySQL Cluster is a parallel main memory database.

More information

How To Write A Transaction System

How To Write A Transaction System Chapter 20: Advanced Transaction Processing Remote Backup Systems Transaction-Processing Monitors High-Performance Transaction Systems Long-Duration Transactions Real-Time Transaction Systems Weak Levels

More information

Data Storage - II: Efficient Usage & Errors

Data Storage - II: Efficient Usage & Errors Data Storage - II: Efficient Usage & Errors Week 10, Spring 2005 Updated by M. Naci Akkøk, 27.02.2004, 03.03.2005 based upon slides by Pål Halvorsen, 12.3.2002. Contains slides from: Hector Garcia-Molina

More information

Windows NT File System. Outline. Hardware Basics. Ausgewählte Betriebssysteme Institut Betriebssysteme Fakultät Informatik

Windows NT File System. Outline. Hardware Basics. Ausgewählte Betriebssysteme Institut Betriebssysteme Fakultät Informatik Windows Ausgewählte Betriebssysteme Institut Betriebssysteme Fakultät Informatik Outline NTFS File System Formats File System Driver Architecture Advanced Features NTFS Driver On-Disk Structure (MFT,...)

More information

D-ARIES: A Distributed Version of the ARIES Recovery Algorithm

D-ARIES: A Distributed Version of the ARIES Recovery Algorithm D-ARIES: A Distributed Version of the ARIES Recovery Algorithm Jayson Speer and Markus Kirchberg Information Science Research Centre, Department of Information Systems, Massey University, Private Bag 11

More information

Outline. Windows NT File System. Hardware Basics. Win2K File System Formats. NTFS Cluster Sizes NTFS

Outline. Windows NT File System. Hardware Basics. Win2K File System Formats. NTFS Cluster Sizes NTFS Windows Ausgewählte Betriebssysteme Institut Betriebssysteme Fakultät Informatik 2 Hardware Basics Win2K File System Formats Sector: addressable block on storage medium usually 512 bytes (x86 disks) Cluster:

More information

Data Backup and Archiving with Enterprise Storage Systems

Data Backup and Archiving with Enterprise Storage Systems Data Backup and Archiving with Enterprise Storage Systems Slavjan Ivanov 1, Igor Mishkovski 1 1 Faculty of Computer Science and Engineering Ss. Cyril and Methodius University Skopje, Macedonia slavjan_ivanov@yahoo.com,

More information

File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System

File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System CS341: Operating System Lect 36: 1 st Nov 2014 Dr. A. Sahu Dept of Comp. Sc. & Engg. Indian Institute of Technology Guwahati File System & Device Drive Mass Storage Disk Structure Disk Arm Scheduling RAID

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operatin g Systems: Internals and Design Principle s Chapter 11 I/O Management and Disk Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles An artifact can

More information

Database Replication with Oracle 11g and MS SQL Server 2008

Database Replication with Oracle 11g and MS SQL Server 2008 Database Replication with Oracle 11g and MS SQL Server 2008 Flavio Bolfing Software and Systems University of Applied Sciences Chur, Switzerland www.hsr.ch/mse Abstract Database replication is used widely

More information

arxiv:1409.3682v1 [cs.db] 12 Sep 2014

arxiv:1409.3682v1 [cs.db] 12 Sep 2014 A novel recovery mechanism enabling fine-granularity locking and fast, REDO-only recovery Caetano Sauer University of Kaiserslautern Germany csauer@cs.uni-kl.de Theo Härder University of Kaiserslautern

More information

technology brief RAID Levels March 1997 Introduction Characteristics of RAID Levels

technology brief RAID Levels March 1997 Introduction Characteristics of RAID Levels technology brief RAID Levels March 1997 Introduction RAID is an acronym for Redundant Array of Independent Disks (originally Redundant Array of Inexpensive Disks) coined in a 1987 University of California

More information

Crashes and Recovery. Write-ahead logging

Crashes and Recovery. Write-ahead logging Crashes and Recovery Write-ahead logging Announcements Exams back at the end of class Project 2, part 1 grades tags/part1/grades.txt Last time Transactions and distributed transactions The ACID properties

More information

A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture

A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture Yangsuk Kee Department of Computer Engineering Seoul National University Seoul, 151-742, Korea Soonhoi

More information

Datenbanksysteme II: Implementation of Database Systems Recovery Undo / Redo

Datenbanksysteme II: Implementation of Database Systems Recovery Undo / Redo Datenbanksysteme II: Implementation of Database Systems Recovery Undo / Redo Material von Prof. Johann Christoph Freytag Prof. Kai-Uwe Sattler Prof. Alfons Kemper, Dr. Eickler Prof. Hector Garcia-Molina

More information

Areas Covered. Chapter 1 Features (Overview/Note) Chapter 2 How to Use WebBIOS. Chapter 3 Installing Global Array Manager (GAM)

Areas Covered. Chapter 1 Features (Overview/Note) Chapter 2 How to Use WebBIOS. Chapter 3 Installing Global Array Manager (GAM) PRIMERGY RX300 S2 Onboard SCSI RAID User s Guide Areas Covered Chapter 1 Features (Overview/Note) This chapter explains the overview of the disk array and features of the SCSI array controller. Chapter

More information

Fast Checkpoint and Recovery Techniques for an In-Memory Database. Wenting Zheng

Fast Checkpoint and Recovery Techniques for an In-Memory Database. Wenting Zheng Fast Checkpoint and Recovery Techniques for an In-Memory Database by Wenting Zheng S.B., Massachusetts Institute of Technology (2013) Submitted to the Department of Electrical Engineering and Computer

More information

Data Compression Management Mechanism for Real-Time Main Memory Database Systems

Data Compression Management Mechanism for Real-Time Main Memory Database Systems Data Compression Management Mechanism for Real-Time Main Memory Database Systems Soon-Jo Lee and -be-young Bae Department of Computer Science and Engineering nha University, n&on 402-751. KOREA email:

More information

Lesson 12: Recovery System DBMS Architectures

Lesson 12: Recovery System DBMS Architectures Lesson 12: Recovery System DBMS Architectures Contents Recovery after transactions failure Data access and physical disk operations Log-Based Recovery Checkpoints Recovery With Concurrent Transactions

More information

Response time behavior of distributed voting algorithms for managing replicated data

Response time behavior of distributed voting algorithms for managing replicated data Information Processing Letters 75 (2000) 247 253 Response time behavior of distributed voting algorithms for managing replicated data Ing-Ray Chen a,, Ding-Chau Wang b, Chih-Ping Chu b a Department of

More information

Realization of Continuously Backed-up RAMS for High-Speed Database Recovery

Realization of Continuously Backed-up RAMS for High-Speed Database Recovery Realization of Continuously Backed-up RAMS for High-Speed Database Recovery Yahiko KAMBAYASHI+ Hiroki TAKAKURA* +Integrated Media Environment Experimental Laboratory Faculty of Engineering Kyoto University

More information

Distributed Database Management Systems

Distributed Database Management Systems Distributed Database Management Systems (Distributed, Multi-database, Parallel, Networked and Replicated DBMSs) Terms of reference: Distributed Database: A logically interrelated collection of shared data

More information

Database Tuning and Physical Design: Execution of Transactions

Database Tuning and Physical Design: Execution of Transactions Database Tuning and Physical Design: Execution of Transactions David Toman School of Computer Science University of Waterloo Introduction to Databases CS348 David Toman (University of Waterloo) Transaction

More information

The Oracle Universal Server Buffer Manager

The Oracle Universal Server Buffer Manager The Oracle Universal Server Buffer Manager W. Bridge, A. Joshi, M. Keihl, T. Lahiri, J. Loaiza, N. Macnaughton Oracle Corporation, 500 Oracle Parkway, Box 4OP13, Redwood Shores, CA 94065 { wbridge, ajoshi,

More information

FairCom c-tree Server System Support Guide

FairCom c-tree Server System Support Guide FairCom c-tree Server System Support Guide Copyright 2001-2003 FairCom Corporation ALL RIGHTS RESERVED. Published by FairCom Corporation 2100 Forum Blvd., Suite C Columbia, MO 65203 USA Telephone: (573)

More information

SQL Server Transaction Log from A to Z

SQL Server Transaction Log from A to Z Media Partners SQL Server Transaction Log from A to Z Paweł Potasiński Product Manager Data Insights pawelpo@microsoft.com http://blogs.technet.com/b/sqlblog_pl/ Why About Transaction Log (Again)? http://zine.net.pl/blogs/sqlgeek/archive/2008/07/25/pl-m-j-log-jest-za-du-y.aspx

More information

Oracle Architecture. Overview

Oracle Architecture. Overview Oracle Architecture Overview The Oracle Server Oracle ser ver Instance Architecture Instance SGA Shared pool Database Cache Redo Log Library Cache Data Dictionary Cache DBWR LGWR SMON PMON ARCn RECO CKPT

More information

Textbook and References

Textbook and References Transactions Qin Xu 4-323A Life Science Building, Shanghai Jiao Tong University Email: xuqin523@sjtu.edu.cn Tel: 34204573(O) Webpage: http://cbb.sjtu.edu.cn/~qinxu/ Webpage for DBMS Textbook and References

More information

EZManage V4.0 Release Notes. Document revision 1.08 (15.12.2013)

EZManage V4.0 Release Notes. Document revision 1.08 (15.12.2013) EZManage V4.0 Release Notes Document revision 1.08 (15.12.2013) Release Features Feature #1- New UI New User Interface for every form including the ribbon controls that are similar to the Microsoft office

More information

Physical Data Organization

Physical Data Organization Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor

More information

INTRODUCTION TO DATABASE SYSTEMS

INTRODUCTION TO DATABASE SYSTEMS 1 INTRODUCTION TO DATABASE SYSTEMS Exercise 1.1 Why would you choose a database system instead of simply storing data in operating system files? When would it make sense not to use a database system? Answer

More information

Tivoli Storage Manager Explained

Tivoli Storage Manager Explained IBM Software Group Dave Cannon IBM Tivoli Storage Management Development Oxford University TSM Symposium 2003 Presentation Objectives Explain TSM behavior for selected operations Describe design goals

More information