Tashkent: Uniting Durability with Transaction Ordering for High-Performance Scalable Database Replication


Sameh Elnikety (School of Computer and Communication Sciences, EPFL, Switzerland)
Steven Dropsho (School of Computer and Communication Sciences, EPFL, Switzerland)
Fernando Pedone (Faculty of Informatics, Università della Svizzera Italiana (USI), Switzerland)

Abstract

In stand-alone databases, the functions of ordering the transaction commits and making the effects of transactions durable are performed in one single action, namely the writing of the commit record to disk. For efficiency, many of these writes are grouped into a single disk operation. In replicated databases in which all replicas agree on the commit order of update transactions, these two functions are typically separated. Specifically, the replication middleware determines the global commit order, while the database replicas make the transactions durable. The contribution of this paper is to demonstrate that this separation causes a significant scalability bottleneck. It forces some of the commit records to be written to disk serially, where in a standalone system they could have been grouped together in a single disk write. Two solutions are possible: (1) move durability from the database to the replication middleware, or (2) keep durability in the database and pass the global commit order from the replication middleware to the database. We implement these two solutions. Tashkent-MW is a pure middleware solution that combines durability and ordering in the middleware, and treats an unmodified database as a black box. In Tashkent-API, we modify the database API so that the middleware can specify the commit order to the database, thus combining ordering and durability inside the database. We compare both Tashkent systems to an otherwise identical replicated system, called Base, in which ordering and durability remain separated. Under high update transaction loads, both Tashkent systems greatly outperform Base in throughput and response time.

Categories and Subject Descriptors: H.2.4 Systems: distributed databases, concurrency.

General Terms: Measurement, Performance, Design, Reliability.

Keywords: Database replication, generalized snapshot isolation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EuroSys'06, April 18-21, 2006, Leuven, Belgium. Copyright 2006 ACM.

1 Introduction

Database replication is a cost-effective technique to improve performance and to enhance availability. We are concerned here with fully replicated designs in which any transaction, read-only or update, may execute on any (single) replica and all replicas agree on the commit order of the updates. The primary contribution of this paper is to reveal a dependency between durability and commit ordering of update transactions in such designs. In particular, we show that committing transactions in a global order and relying on an off-the-shelf database to make transactions durable (i.e., write their effects to disk) requires commit records to be written to disk serially, a significant scalability bottleneck.
The root cause is the separation in replicated systems of commit ordering from the disk writes that ensure durability. In standalone databases, these two functions are performed together, permitting group commits to minimize the number of synchronous writes to disk. In replicated databases, however, ordering is determined in the replication middleware rather than at the database to ensure a consistent global order. Durability has typically been left in the database.

We propose two approaches to unite these functions. The first approach is to move durability to the middleware layer where ordering is determined. This solution can be pure middleware, so that it can be used with off-the-shelf databases. The second approach is to extend the database API so that the replication middleware can specify a commit order for update transactions, thus combining durability and ordering in the database. We implement instances of both approaches, called Tashkent-MW and Tashkent-API, respectively, and compare them to an otherwise identical replication system, called Base, in which ordering and durability remain separated. All systems use generalized snapshot isolation (GSI) [3] for concurrency control, and are derived from PostgreSQL [19]. Both Tashkent systems greatly improve scalability under high update transaction loads. For instance, at 15 replicas under an update-intensive workload, the Tashkent systems outperform Base by factors of five and three in throughput, while also providing lower response times.

This paper makes the following contributions:
1. We identify the dependency between durability and ordering and demonstrate the performance impact if they are not united.
2. We design and implement Tashkent-MW, a pure middleware solution that unites durability and ordering in the middleware.

3. We extend the database API and implement Tashkent-API, which unites durability and ordering in the database.
4. We assess both systems for their benefits and costs.

The rest of the paper is structured as follows. Section 2 gives the necessary background on generalized snapshot isolation. In Section 3 we detail the issue of separating durability from ordering and justify uniting the two. In Section 4 we present the design of the Base replication system. In Section 5 we present Tashkent-MW and Tashkent-API. We discuss the implementation of the replication middleware in Section 6. In Section 7 we discuss fault-tolerance of the systems. We discuss the replication middleware interface to PostgreSQL in Section 8. In Section 9 we experimentally compare the three systems and analyze their relative performance. Section 10 contrasts this research to related work. Finally, in Section 11 we summarize the main conclusions.

2 Background

Snapshot Isolation (SI) [1] is a concurrency control model for centralized multi-version databases. In SI, when a transaction begins it receives a snapshot of the database. After the snapshot is assigned, it is unaffected by concurrently running transactions. A read-only transaction reads from the snapshot and can always commit. An update transaction reads from and writes to its snapshot, and can commit if it has no write-write conflict with any committed update transaction that ran concurrently with it. Many database vendors use SI, e.g., Oracle, PostgreSQL, Microsoft SQL Server, and InterBase [1, 7, 10, 16]. SI is weaker than serializability, but in practice most applications run serializably under SI, including the most widely used database benchmarks TPC-A, TPC-B, TPC-C, and TPC-W. SI has attractive performance properties. Most notably, read-only transactions never block or abort, and they never cause update transactions to block or abort.

Generalized Snapshot Isolation (GSI) [3] extends SI to replicated databases such that the performance properties of SI in a centralized setting are maintained in a replicated setting. In particular, read-only transactions do not block or abort, and they do not cause update transactions to block or abort. Both of these properties are important for scalability. In addition, workloads that are serializable under SI are also serializable under GSI [3].

Informally, a replica using GSI works as illustrated in Figure 1. When a transaction starts at a database replica, the replica assigns its latest snapshot to the transaction. All transaction read and write operations are executed locally on the replica against the assigned snapshot. At commit, the replica extracts the transaction's modifications to the database into a writeset. A writeset captures the minimal set of actions necessary to recreate a transaction's modifications. If the writeset is empty (i.e., the transaction is read-only), the transaction commits immediately. Otherwise, a certification check is performed to detect write-write conflicts among update transactions in the replicated system. If no write-write conflict is found, then the transaction commits; otherwise the transaction aborts. The certifier performs certification and assigns a global total order to the commits of update transactions. Since committing an update transaction creates a new version (snapshot) of the database, the total order defines the sequence of snapshots the database replicas go through. Writesets are propagated to all replicas to update their state.
Figure 1: Processing an Update Transaction in GSI

We refer to writesets of remote transactions during the propagation phase as remote writesets. Before committing a local update transaction, if the replica is not up-to-date, it must apply remote writesets to bring that replica's state up-to-date.

3 How Durability and Ordering Affect Scalability

In the GSI algorithm, there is a dependency between maintaining the global commit order and writing state to disk for durability. Identifying this dependency is one of the core contributions of this paper. In this section, we explain the problem and its underlying causes via an example, and discuss its impact on scalability.

Centralized SI database. The initial case is a centralized system to which we will contrast a replicated system. In a centralized system, if clients concurrently submit two update transactions T4 and T9 whose writesets, W4 and W9, do not conflict, the database can commit them in any order: T4 then T9, or T9 then T4. However, to guarantee durability, each commit requires a disk write, one for T4 and another for T9. The IO subsystem can group the commits of T4 and T9 into a single disk write, which greatly improves performance.

One GSI replica. Next, consider a replicated GSI database consisting of one replica and one certifier, both at version zero. If clients submit T4 and T9 concurrently to the replica, it executes them and sends their writesets, W4 and W9, to the certifier. The certifier checks the writesets, determines that there are no conflicts, and assigns them an order. Let us assume that the certifier orders W4 at version 1 and W9 at version 2, and sends these results back to the replica. Upon receiving the responses from the certifier, the replica must commit T4 first to reach version 1, then commit T9 to reach version 2, the same sequence that the certifier determined. Under GSI, the replica cannot change this order; otherwise, a new transaction could receive a snapshot containing the effects of T9 but not T4, a snapshot that never existed globally (i.e., at the certifier). A typical database offers no mechanism to specify a commit order externally, yet the middleware must not allow the replica to swap the order in which the commits of T4 and T9 occur. Without an external mechanism to enforce a particular commit order, the middleware must submit each commit serially, waiting for each commit (which includes a disk write) to complete. Thus, two disk writes, one to commit T4 and another for T9, are required. This serializes a costly component of executing update transactions at the replica.
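To make the serial-commit constraint concrete, the following minimal JDBC-style sketch shows what a Base-style proxy is reduced to. It is an illustration under assumptions, not the authors' implementation: c4 and c9 are hypothetical open connections holding the already-executed, already-certified transactions T4 and T9.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Illustrative only: a Base-style proxy has no way to hand the database the
// order "T4 is version 1, T9 is version 2", so it must serialize the commits.
final class SerialCommitExample {
    static void commitInCertifierOrder(Connection c4, Connection c9) throws SQLException {
        c4.commit();   // first synchronous disk write; must complete before T9 is even submitted
        c9.commit();   // second synchronous disk write; cannot be grouped with the first
    }
}
```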

Multiple GSI replicas. We continue the example with a replicated GSI system having N replicas. One replica receives T4 and T9 and then sends W4 and W9 to the certifier. The certifier receives the two writesets, receives writesets from other replicas as well, and creates the following total order: T1, T2, T3, T4, T5, T6, T7, T8, T9. Under GSI, the replica must observe this total order. Not only must the replica commit transaction T4 before T9, but the remote transactions T1, T2, T3 must commit before T4, and T5, T6, T7, T8 before T9. In a naïve implementation, the replica would go through the version sequence 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, which would require 9 synchronous disk writes, but this can be improved as follows.

Grouping remote writesets. We batch the writesets of several remote transactions by combining their effects into one transaction T1_2_3 with writeset {W1, W2, W3}, and similarly another transaction T5_6_7_8 with writeset {W5, W6, W7, W8}. The replica version then follows the sequence 0, 3, 4, 8, 9, which requires 4 disk writes (2*M disk writes in general for M local update transactions at each replica). It is important to note that the middleware is outside the database and cannot submit the commits of T1_2_3, T4, T5_6_7_8, T9 concurrently to a standard database and force committing T1_2_3 first, T4 second, T5_6_7_8 third, and finally T9. Databases do not provide such low-level mechanisms to the client interface. Nor can the replication middleware allow the database to commit the transactions in any order. The commits must follow the global order because, under certain conditions, changing the commit order can change the final state of the database. For example, if T1_2_3 and T5_6_7_8 both modify a common database item, then the order in which they commit is significant. Furthermore, allowing transactions to commit in any order would require the middleware to block the start of all new transactions (including read-only transactions), lest they observe an inconsistent snapshot (e.g., T9 commits internally before T4). This approach has a major performance drawback, as read-only transactions would block waiting for other transactions to finish, voiding the main performance benefit of GSI. Therefore, we propose two solutions for uniting ordering and durability.

Solution 1: Move durability to the middleware. One solution is to have the middleware, which decides the ordering, be responsible for making transaction modifications durable. The middleware can then batch all available writesets at certification. In the above example, the middleware batches all nine writesets into a single disk write. At the replicas, commits become in-memory actions and serializing them is not a performance issue.

Solution 2: Pass the ordering information to the database. The database API is extended such that the middleware can specify the commit order. With this API, the replication middleware submits all transactions concurrently to the database with their required commit order. In the above example, the database makes T1_2_3, T4, T5_6_7_8, and T9 durable in one disk write and commits them in the correct order.

The key insight is that in both solutions durability is united with ordering. Regardless of whether the two are united in the replication middleware or in the database, both solutions permit IO subsystem optimizations that greatly improve throughput.
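The grouping of remote writesets described above can be sketched as follows. This is a hedged illustration, not code from the paper: writeset contents are represented simply as SQL statements, and replicaConn and localConn are assumed JDBC connections, the first used to replay the grouped remote writesets and the second holding the pending local transaction T4. In Base, the two commit() calls below still happen serially and each costs a synchronous log write; Solution 1 makes them cheap in-memory operations, while Solution 2 lets the database group them with other ordered commits.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Illustrative sketch: apply the remote writesets W1..W3 as one grouped
// transaction T1_2_3, then commit the pending local transaction T4.
final class GroupedRemoteApply {
    static void applyGroupThenCommitLocal(Connection replicaConn, Connection localConn,
                                          List<String> groupedWritesetSql) throws SQLException {
        replicaConn.setAutoCommit(false);
        try (Statement s = replicaConn.createStatement()) {
            for (String sql : groupedWritesetSql) {
                s.executeUpdate(sql);    // replay one remote modification
            }
        }
        replicaConn.commit();            // one disk write for the whole group T1_2_3
        localConn.commit();              // second disk write for the local T4
    }
}
```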
For this paper, the two example solutions assume that the standalone database (1) supports the SI concurrency control model, (2) has the ability to capture and extract the writesets of update transactions, and (3) has the ability to enable/disable synchronous writes to disk.

4 Architecture and Design of Base

Base represents a traditional replicated database solution in which the middleware performs global ordering but relies on the database replicas for durability. In Section 5, we derive from Base two systems, Tashkent-MW and Tashkent-API, that unite ordering and durability. We discuss fault tolerance and recovery for each of the three replicated designs in Section 7.

4.1 Architecture

Base is a replicated database system that uses GSI for concurrency control. Base consists of two main logical components, both of which are replicated: (1) database replica and (2) certifier. When a replica receives a read-only transaction, the replica executes it entirely locally. When a replica receives an update transaction, it executes it locally, except the commit operation, which requires certification. Replicas communicate only with the certifier component, not directly with each other. The certifier certifies update transactions from all replicas and orders them. The Base design is a pure middleware solution since no modification to the database source code is required. Attached to each replica is a transparent proxy that intercepts database requests. The proxy appears as the database to clients, and appears as a client to the database. The proxies and certifiers are the replication middleware.

Figure 2: Architecture of Base

Figure 2 shows the architecture of the Base system. The two main components, database replica and certifier, are replicated asymmetrically with different replication degrees. Database replicas are replicated mainly for performance, whereas the certifier is replicated mainly for availability. For this study, we simply assume a separate certifier component which is itself replicated, though our conclusions apply to other configurations. For example, the certifier component could be implemented via an atomic broadcast mechanism incorporated into the proxy at every replica.

For update propagation, we use writesets rather than the original SQL text of update transactions. Although for some transactions propagating the original SQL text may be smaller in size than the writeset, it is generally more expensive to re-execute the SQL text at the certifier and then at each replica when propagating the effects of the transaction.
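To keep the later sketches concrete, here is one hypothetical in-memory representation of a writeset. It is an assumption for illustration only; the real systems extract this information with the database-specific mechanisms of Section 8, and the exact fields (modified rows identified by table and primary key, plus the SQL needed to replay the change) are not taken from the paper.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory writeset used only to make the discussion concrete.
// It records (a) which rows were modified, identified as "table:primaryKey",
// for write-write conflict checks, and (b) the SQL statements needed to
// replay the modification at another replica.
final class Writeset {
    private final Set<String> modifiedRows = new HashSet<>();
    private final List<String> replaySql = new ArrayList<>();

    void add(String table, String primaryKey, String sqlToReplay) {
        modifiedRows.add(table + ":" + primaryKey);
        replaySql.add(sqlToReplay);
    }

    // Certification primitive: two writesets conflict iff they modify a common row.
    boolean intersects(Writeset other) {
        for (String row : modifiedRows) {
            if (other.modifiedRows.contains(row)) return true;
        }
        return false;
    }

    boolean isEmpty() { return modifiedRows.isEmpty(); }

    List<String> replayStatements() { return replaySql; }
}
```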

4.2 Processing Transactions

Each transaction is assigned a snapshot. The transaction operations are executed locally against the snapshot. The commit of a read-only transaction is simple as it always commits successfully. However, processing the commit of an update transaction is more complex, and we informally present here the steps under GSI. We provide additional implementation details in Sections 6 and 8.

We use the following terminology. We use versions to count database snapshots. The version at a particular database replica is called replica_version. The database starts at version zero, i.e., in the initial state replica_version=0. When an update transaction commits, replica_version is incremented. Each transaction has two numbers: its version at start, tx_start_version, and the version created at its commit, tx_commit_version (tx_commit_version is valid only for update transactions).

Certification. When an update transaction attempts to commit, the proxy at the replica makes a request for certification to the certifier. The certifier performs writeset intersection, a fast main-memory operation that compares table and field identifiers for matches against recent writesets to detect write-write conflicts. For the committing transaction Tc with writeset Wc, the certifier compares Wc to the set of writesets committed at versions more recent (greater) than Tc.tx_start_version. Successfully certified writesets (i.e., with no conflicts) are recorded in a persistent log, thus creating a global commit order. The log is necessary to allow the certifier to recover because we use the crash-recovery model. The state of any replica is always a consistent prefix of the certifier's log. Writeset intersection is a quick operation and writing to the persistent log is efficient because multiple writesets can be batched. Thus, certification is lightweight.

Responding to replicas. The certifier responds to the replica with the following: (1) the result of the requested certification test, (2) any remote writesets for the intervening update transactions at other replicas, and (3) the version at which the update transaction commits (if the certification test succeeds). The middleware proxy at the replica applies the remote writesets to the database before it commits the local update.

Durability. The durability function is performed in the databases. Since the certifier determines the global order of commits, proxies commit update transactions and remote writesets serially to ensure the same global order is followed at each replica.

5 Tashkent-MW and Tashkent-API

Tashkent-MW and Tashkent-API are derived from Base to combine durability and commit ordering.

5.1 Tashkent-MW

To unite durability with ordering in the middleware, we disable the synchronous writes that guarantee durability in the database replicas. Instead, the replication middleware logs the database modifications (i.e., writesets) for durability. However, the certifier already logs this information via its persistent log to allow certifier recovery. Thus, it is relatively simple to transform the Base design into Tashkent-MW, because enabling/disabling synchronous writes is a standard feature in many databases. Furthermore, since Base is a pure middleware solution, so too is Tashkent-MW. The commit operations at each database replica are still serial, but they are now fast in-memory operations.
System throughput and scalability are greatly improved because the durability function is performed at the certifier, which efficiently groups the writesets into a much smaller number of synchronous writes. There is, however, a cost to moving durability out of the databases, and it appears in the recovery procedure. We address recovery in Section 7.

5.2 Tashkent-API

To unite ordering with durability in the database, the API available to the middleware is extended so that the commit order of update transactions can be specified. The change is to the SQL COMMIT command, which is extended to accept an optional integer parameter giving the sequence number of the commit (e.g., COMMIT 9 to commit the current transaction at version 9). Since we change the interface of the database, Tashkent-API is not a pure replication middleware solution and the source code of the database must be available.

With the new API, upon receipt of the commit decision from the certifier, the Tashkent-API proxy forwards to the database both the commit command and the version number returned from the certifier. Similarly, remote writesets are applied in separate transactions with their appropriate commit sequence numbers. The proxy applies remote writesets and local commits to the database concurrently. Thus, the database can group the commit records as usual for efficient disk IO, but internally the database enforces the supplied commit ordering. This extended API is a simple yet powerful interface; its use should be restricted to the replication middleware. Normal clients are unaware of system-wide commit ordering. If the interface is abused (e.g., issuing COMMIT 9 without ever providing COMMIT 1-8), the database may deadlock, potentially aborting the committing transaction to resolve the deadlock.

5.2.1 Constraints under Tashkent-API

With the new API, it is not always possible for the middleware to submit transaction commits concurrently. At these times, some commit operations must be serialized, and this reduces the performance gain. This section describes the condition under which a commit must be serialized and how to detect if this condition arises.

Conflicts may arise when committing two different local update transactions. Should their accompanying two remote writesets modify a shared element, they generate an artificial local write-write conflict at the database replica if the middleware submits these commits concurrently. Under snapshot isolation, a write-write conflict results in one of the transactions being aborted. Therefore, the middleware must avoid creating these artificial conflicts. The root of the problem is that the artificially conflicting remote writesets, which were generated by remote transactions that did not run concurrently, now appear as concurrent to the database replica when applied. This situation can only occur in Tashkent-API, not in Tashkent-MW or Base, as they serialize commits. We illustrate this case with a concrete example and present a detection scheme.

Artificial Conflict Example. Figure 3 shows two local concurrent transactions, T44 and T46, at a replica and two remote writesets, W43 and W45, that are applied as local transactions, T43 and T45. The transactions are labeled with the version at which they commit globally. For example, T44 starts from a snapshot with version 30 and its writeset is W44. T44 commits as version 44 globally. But before T44 can commit, remote writeset W43 is received from the certifier and applied to the replica as local transaction T43. T43 must commit before T44 commits, as explained in Section 3, to follow the global commit order.

Figure 3: Example of concurrent commits

The proxy can safely submit the commits of transactions T44 and T46 concurrently at the database, along with transactions T43 and T45, if none of the remote writesets (W43 and W45) conflict. If there is no conflict, then the database can write the commit records of T43, T44, T45, and T46 in one disk write. This is the desired behavior. If an artificial conflict exists between T43 and T45, e.g., W43 sets x=17 and W45 sets x=39, then this artificial conflict is detected and T45 is serialized, waiting for T43 to commit. In this case, at least two disk writes are needed. For example, the proxy submits transaction T43 and the commit of T44 concurrently and waits for the acknowledgement from the database. Then the proxy submits transaction T45 and the commit of T46 concurrently. The best the database can do in this case is to group the commits of T43 and T44 into one disk write and the commits of T45 and T46 into a second disk write. Therefore, artificial conflicts decrease opportunities to group commits. If artificial conflicts between remote writesets are frequent, then in the worst case all commits are forced to be serialized and Tashkent-API degrades towards the performance of Base.

Artificial Conflict Detection. The proxy in Tashkent-API needs to know if it is safe to apply a remote writeset concurrently with previous remote writesets. Concurrency is safe if the intersection between the remote writeset and the previous remote writesets is empty (e.g., W43 and W45 do not modify a shared item). The certifier may already have done this intersection testing in the normal certification process at their original commit time, if the remote writesets were from concurrent transactions. Our solution is to have the proxy tell the certifier how far back (i.e., to what prior version) it would like the remote writesets to have been tested for write-write conflicts (i.e., intersection testing). If one or more of the remote writesets has not already been tested that far back, the certifier performs additional intersection testing for those writesets. With the commit/abort decision, the certifier also returns the version down to which the remote writesets are conflict-free. The proxy uses this information to determine which remote writesets can be submitted concurrently. If two remote writesets conflict, then the proxy must commit the earlier writeset before submitting the later one. The version the proxy submits to the certifier need only be the current version of the database, as no remote writesets will be returned prior to this version (recall that remote writesets carry updates the replica has not seen yet). The proxy stores the current version of a replica in replica_version, and it is this version the proxy submits with a transaction to the certifier. In the example of Figure 3, to certify T46, W46 is certified back to version 42 (T46.tx_start_version) in normal certification.
Next, assuming that the accompanying remote writeset W45 was only certified to version 44 at the time of its commit, W45 is additionally certified back to replica_version=42 if it is to be committed concurrently with T43 and T44. If there is a conflict, then the proxy delays submitting W45 until the conflicting transaction T43 commits. The overhead of the additional certification checks at the certifier is minimal. The certifier records for each writeset the point to which it has been (further) certified and avoids repeating checks in responses to other replicas.

6 Implementation of the Replication Middleware

In this section, we provide the pseudo-code of GSI and show how it is implemented as pure middleware. In Section 8 we discuss the mechanisms used by the proxy to interface to PostgreSQL.

6.1 Certifier

The certifier maintains (1) a persistent log recording tuples of each writeset and its commit version (writeset, tx_commit_version) for all committed update transactions, and (2) the global system version (system_version). A certification request provides the certifier with the writeset and start version of the transaction (T.writeset, T.tx_start_version). The certifier returns the remote writesets, the certification result (commit or abort), and the commit version of the transaction (tx_commit_version). The pseudo-code of the actions at the certifier for a transaction T is the following:

On a certification request for transaction T having (T.tx_start_version, T.writeset):
1. The input T.writeset is tested for intersection against entries in LOG whose tx_commit_version is greater than T.tx_start_version. An intersection occurs if two writesets overlap (signaling a write-write conflict).
2. IF there is no intersection, THEN { decision ← commit; increment system_version; T.tx_commit_version ← system_version; append (T.writeset, T.tx_commit_version) to the persistent LOG } ELSE decision ← abort.
3. The output of the procedure contains: the remote writesets that the replica has not received yet between T.tx_start_version and T.tx_commit_version, decision (either commit or abort), and T.tx_commit_version.
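As a companion to the pseudo-code above, here is a minimal single-node Java sketch of the certification step. It is an assumption-laden illustration, not the C implementation described next: it reuses the hypothetical Writeset class sketched in Section 4, keeps the log in memory, and omits the forced disk write (and its batching across many writesets) that the real certifier performs before declaring a transaction committed, as well as the certifier replication of Section 7.3.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Single-node, in-memory sketch of GSI certification.
final class Certifier {
    // log.get(i) holds the writeset committed as version i+1.
    private final List<Writeset> log = new ArrayList<>();
    private long systemVersion = 0;

    static final class Result {
        final boolean commit;
        final long commitVersion;
        final List<Writeset> remoteWritesets;  // committed writesets the replica has not seen
        Result(boolean commit, long commitVersion, List<Writeset> remoteWritesets) {
            this.commit = commit;
            this.commitVersion = commitVersion;
            this.remoteWritesets = remoteWritesets;
        }
    }

    synchronized Result certify(long txStartVersion, Writeset ws) {
        // 1. Test ws against every writeset committed after txStartVersion.
        for (long v = txStartVersion + 1; v <= systemVersion; v++) {
            if (log.get((int) (v - 1)).intersects(ws)) {
                return new Result(false, -1, Collections.emptyList());  // write-write conflict: abort
            }
        }
        // 2. No conflict: assign the next version and append to the log.
        //    (The real certifier also forces the log to disk here, batching
        //    all outstanding writesets into a single fsync.)
        systemVersion++;
        log.add(ws);
        // 3. Remote writesets for versions txStartVersion+1 .. systemVersion-1,
        //    i.e., everything the replica missed, excluding ws itself.
        List<Writeset> remote =
            new ArrayList<>(log.subList((int) txStartVersion, (int) (systemVersion - 1)));
        return new Result(true, systemVersion, remote);
    }
}
```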

We implement the certifier as a multi-threaded server in C with worker threads to receive and process certification requests and send responses. The certifier has a single writer thread for writing the certifier log to disk, which makes the certifier very efficient at batching all outstanding writesets to disk via a single fsync call.

6.2 Transparent Proxy

A proxy in front of each database intercepts incoming database requests. The proxy tracks the database version (replica_version), maintains a small amount of state for each active transaction, invokes certification, and applies the remote writesets. The pseudo-code for the actions at the proxy is the following:

A- On the proxy intercepting T's BEGIN:
1. T receives a snapshot of the database.
2. T.tx_start_version ← replica_version.

B- On T's read and write operations:
1. Read and write operations are executed on T's snapshot.

C- On the proxy intercepting T's COMMIT:
1. T.writeset is extracted.
2. IF T.writeset is empty (i.e., T is read-only), THEN T commits, ELSE invoke certification(T.tx_start_version, T.writeset).
3. Receive three outputs from certification: (a) remote writesets, (b) commit or abort, (c) T.tx_commit_version.
4. The replica applies the remote writesets (a) in their own transaction.
5. IF the second output (b) is commit, THEN { T commits, and replica_version ← T.tx_commit_version (c) }, ELSE T aborts.

The proxy intercepts database requests without any modification to the database, as explained below. The labels refer to the corresponding parts of the pseudo-code.

[A1] Intercepting BEGIN. The proxy creates a new record (tx_start_version, tx_commit_version), and assigns tx_start_version = replica_version. The proxy forwards BEGIN to the database.

[C2] Intercepting COMMIT. The proxy extracts the writeset of the transaction. If the writeset is empty (i.e., a read-only transaction), COMMIT is forwarded to the database; else the proxy invokes certification by sending T.writeset and T.tx_start_version to the certifier, and waits for the output from the certifier.

[C3] Receiving certifier output. The proxy receives the remote writesets, the certification decision, and T.tx_commit_version.

[C4] Remote writesets. Remote writesets are first stored in a memory-mapped file, called proxy_log; a new transaction containing the remote writesets is started and sent to the database; and the proxy waits for the reply. The database executes the transaction, applies the updates, and acknowledges the commit to the proxy.

[C5] Finalizing the COMMIT. If the certifier decision is abort, the proxy forwards ABORT to the database, aborting transaction T. If the certifier decision is commit, the proxy forwards COMMIT to the database and awaits the reply. In Tashkent-API, T.tx_commit_version is added as a parameter to the COMMIT. The database commits the transaction. Then, the proxy updates replica_version = T.tx_commit_version.

Steps [C4] and [C5] are serialized in both Base and Tashkent-MW. The proxy applies the remote writesets, waits for the reply from the database, sends the transaction commit, and again waits for the reply from the database. For Tashkent-MW, this serialization is quick since the database acts essentially as an in-memory database. For Tashkent-API, steps [C4] and [C5] can be concurrent since the commit command includes the commit order.
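The steps [C2]-[C5] above can be sketched in Java as follows. This is a simplified illustration under assumptions, not the actual proxy: it reuses the hypothetical Certifier and Writeset classes from the earlier sketches; omits the proxy_log, local certification, and error handling; and, for brevity, performs steps [C4] and [C5] sequentially even in the Tashkent-API case, where the real proxy can submit them concurrently. The literal statement text "COMMIT n" is an assumed syntax for the extended API.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Illustrative proxy-side commit handling for one client transaction T.
final class ProxyCommit {
    private final Certifier certifier;
    private final Connection writesetConn;   // connection used to replay remote writesets
    private final boolean tashkentApi;       // true: database accepts an ordered COMMIT n
    private long replicaVersion = 0;

    ProxyCommit(Certifier certifier, Connection writesetConn, boolean tashkentApi) {
        this.certifier = certifier;
        this.writesetConn = writesetConn;
        this.tashkentApi = tashkentApi;
    }

    // clientConn holds T (autocommit off); ws is T's extracted writeset.
    void onCommit(Connection clientConn, long txStartVersion, Writeset ws) throws SQLException {
        if (ws.isEmpty()) {                    // [C2] read-only: commit immediately
            clientConn.commit();
            return;
        }
        Certifier.Result r = certifier.certify(txStartVersion, ws);   // [C2]/[C3]
        applyRemoteWritesets(r.remoteWritesets);                      // [C4]
        if (!r.commit) {                                              // [C5] abort
            clientConn.rollback();
            return;
        }
        if (tashkentApi) {
            // Extended API: pass the global commit order so the database can
            // group this commit record with others in a single disk write.
            try (Statement s = clientConn.createStatement()) {
                s.execute("COMMIT " + r.commitVersion);
            }
        } else {
            clientConn.commit();               // Base / Tashkent-MW: plain serial commit
        }
        replicaVersion = r.commitVersion;      // [C5] advance the replica version
    }

    // Remote writesets are replayed in their own transaction before the local commit.
    private void applyRemoteWritesets(List<Writeset> remote) throws SQLException {
        if (remote.isEmpty()) return;
        try (Statement s = writesetConn.createStatement()) {
            for (Writeset w : remote) {
                for (String sql : w.replayStatements()) {
                    s.executeUpdate(sql);
                }
            }
        }
        writesetConn.commit();
    }
}
```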
Local certification. Local certification is a performance optimization in which the proxy performs partial certification against a local copy of the remote writesets it has seen so far. By locally certifying a writeset, the proxy increments the effective start version of a transaction, which reduces the work at the certifier. If local certification fails, the transaction is aborted, obviating unnecessary processing at the certifier.

Conservative assignment of versions is safe under GSI. The proxy always assigns its most recent value of replica_version to new transactions (step [A1] above). However, during the execution of steps [C4] and [C5], the database commits the transaction before informing the proxy. Therefore, it is possible that the database has already performed the commit and gives a new transaction a newer snapshot of the database. This is safe under GSI because certification is correct as long as the new transaction is labeled with a version that is the same as or earlier than its actual version. The certifier still detects all write-write conflicts.

Bounding staleness. Remote writesets are sent in response to update transaction commits. If a replica does not receive an update transaction for a period of time (e.g., a few seconds), its proxy proactively requests remote writesets from the certifier to bring the replica up-to-date.

Proxy implementation. The proxy is implemented as a multi-threaded Java program. The proxy provides a JDBC interface to clients, and it uses a JDBC driver suitable for the database. All access to the replicated tables must go through the proxy, and not directly to the database.

7 Fault Tolerance

In all systems, we use the crash-recovery model: a node may crash and subsequently recover an unbounded number of times. Safety is always guaranteed. If any number of replica or certifier nodes fail, no transaction is lost, as the effects of every committed update transaction are durable in the system. Progress depends on the number of failures. Read-only transactions can be processed if at least one replica is up. Update transactions can be processed if a majority of certifier nodes are up and at least one replica is up.

7.1 Replica Recovery for Tashkent-MW

Moving durability outside the database may interfere with the database recovery subsystem. Informally, turning off durability in a database should not affect physical data integrity (the correctness of data pages on disk). In some databases, however, when durability is disabled, physical data integrity may be violated as well, producing corrupt data pages in case of a database crash. In this subsection we explain this further and outline a middleware recovery scheme.

Modern databases use write-ahead logging (WAL), which improves their IO performance. To guarantee physical data integrity, a database can write a dirty data page only if the corresponding record in the WAL is stable on disk. Therefore, if the database crashes while writing a dirty data page and the data page becomes corrupt, the page can still be recovered using the undo/redo information in the WAL. Thus, the WAL is written to disk for two reasons: (1) to commit update transactions and make their effects durable (i.e., for durability), and (2) to allow dirty data pages to be safely written to disk without violating integrity (i.e., for physical data integrity). Databases typically offer one of two options for disabling WAL activity, which generates two cases for how recovery is performed.

Case 1: Disable both integrity and durability. Most databases offer the option of disabling all WAL synchronous writes, voiding both the durability and physical data integrity guarantees; that is, if the database crashes, some committed update transactions may be lost and a data page may be corrupt. When we deploy Tashkent-MW on such databases, we disable all WAL synchronous writes for performance, which voids physical data integrity (but durability is still guaranteed in the middleware). Therefore, to guarantee physical data integrity, the Tashkent-MW middleware periodically asks the database to make a copy, with the middleware recording the database version at the point of its request. Most databases have direct support for taking such a copy even while the database is processing transactions normally. In SI databases, taking a database copy is logically equivalent to reading a database snapshot. The Tashkent-MW middleware maintains two complete copies of the database. If the database crashes, the middleware restarts the database with the last copy, or the second-to-last copy (in the case where the database crashed while dumping the last copy). Next, the middleware updates the state of the database by applying all remote writesets that have occurred since the version of the copy used for recovery. We use this alternative with PostgreSQL because it uses write-ahead logging and can either enable or disable all WAL synchronous writes.

Case 2: Disable only durability. The second case is where the database offers the option of enabling WAL synchronous writes in general, but disabling WAL synchronous writes on commits of update transactions. This option guarantees data integrity but does not guarantee durability. In other words, if the database crashes, it can recover to a previous committed state, but some update transactions that committed after the last WAL synchronous write may be lost. If the database offers this feature, Tashkent-MW does not need to take database copies. The database uses its recovery mechanism and then the middleware applies the necessary remote writesets to bring the database up-to-date.

7.2 Replica Recovery for Base and Tashkent-API

In both Base and Tashkent-API, the database uses its standard recovery scheme, redoing/undoing transactions in the database log as necessary. The middleware proxy re-applies update transactions whose commits were forwarded to the database but were not acknowledged. In Base, this is at most one transaction because update transactions commit serially.
In Tashkent-API, this is at most the set of update transactions whose commit operations were concurrent at the time of the crash. If the set of uncommitted update transactions is not known (e.g., the proxy crashes with the database), the proxy can obtain the missing writesets from the certifier's log and apply them. Reapplying writesets in the global order is always safe. At the end of this step, the proxy knows the version of the database. After the database recovers and the proxy updates the database, the replica resumes normal operations.

7.3 Certifier Replication and Recovery

The certifier is identical in all three systems. The certifier state is replicated for availability across a small set of nodes using Paxos [13]. The replication algorithm uses a leader elected from the set of certifiers. The leader is responsible for receiving all certification requests.

Normal Case. When the leader receives certification requests, it performs normal GSI certification and selects which transactions may commit. Then, it sends the new state (i.e., the log records containing the writesets of the selected transactions) to all certifiers, including itself. All certifiers write the new state to disk and reply to the leader. When a majority of certifiers reply, the leader declares those transactions as committed.

On Failure. When a certifier crashes, a new leader is elected (if necessary) and certification continues making progress whenever a majority of certifiers are up.

On Recovery. When a certifier recovers from a crash, it requests an update via a state transfer from another up certifier, participates in electing a new leader (if necessary), and logs certification requests to disk.

8 Middleware Interface to PostgreSQL

8.1 Proxy-Database Interface

The interaction between the proxy and the database is database-specific. We describe here how to interface to PostgreSQL. For another database, the specifics would change, but the underlying mechanisms are common in most databases.

Writeset Extraction. Extracting the writeset of a transaction is not defined in the SQL standard, but many databases provide mechanisms to do so, such as triggers in PostgreSQL, direct support in Oracle, and log sniffing in Microsoft SQL Server. In PostgreSQL, we define triggers on the INSERT, UPDATE, and DELETE events for replicated tables. When a trigger fires, it (1) captures the new row in the case of INSERT, (2) captures the primary key(s) and the modified columns for UPDATE, or (3) captures the primary key for DELETE. The trigger also records the table name and operation type. These changes are stored in a memory-mapped file in order to give the proxy access to partial writesets (see Eager Pre-certification below).

Recovery. For recovery in Tashkent-MW, the proxy periodically sends a DUMP DATA command to the database to create the backup copy. The proxy stores replica_version, a timestamp, the dump data, and an end-of-file marker with a checksum in a file. If the database crashes, the proxy restarts the database using the appropriate dump file.

Soft Recovery. The proxy may send COMMIT to the database, and the database may decide to abort the transaction due to exceptional circumstances, such as running out of disk space, performing garbage collection to delete old snapshots, or a crash of one database process. In such cases, the proxy aborts all active transactions, and re-applies sequentially the remote writesets and the previously aborted transaction. If soft recovery fails, the proxy performs normal system recovery.

8.2 Deadlock Avoidance

Because PostgreSQL uses write locks, a deadlock may develop between the writeset of a local transaction and the writeset from a remote transaction. This deadlock scenario exists in all three designs: Base, Tashkent-MW, and Tashkent-API.

Write Locks in PostgreSQL. Since PostgreSQL (and other centralized SI databases) immediately sees all partial writesets of active update transactions, it uses write locks to eagerly test for write-write conflicts during transaction execution rather than at commit time. The first transaction that acquires a write lock on a database item may proceed, blocking competitors that want to update the same database item. If the lock holder commits, all competitors abort. If the lock holder aborts, one of the competitors may proceed.

Traditional Deadlock Scenario. In a centralized database, a deadlock may develop between two update transactions T1 and T2. Suppose T1 holds a write lock on x and T2 holds a write lock on y. Then, T2 tries to update x and T1 tries to update y. Neither T1 nor T2 can proceed, as each is waiting for the other.

Deadlock between a Local Writeset and a Remote Writeset. In a replicated system, a similar deadlock situation can arise if T2 is on a remote replica, has been certified, and is being applied as a remote writeset to the replica where T1 is currently executing and holding a lock. Unlike in the standalone system, in the replicated system T2 must eventually be permitted to commit at the replica, and T1 will eventually be aborted, either at the replica or at the certifier. Some databases allow tagging transactions with priorities. If such a mechanism is available, then avoiding a deadlock is straightforward: we mark remote writesets with high priority, aborting any conflicting local transaction. However, PostgreSQL does not have priorities. Therefore, we can either (a) do nothing, letting PostgreSQL handle deadlocks between remote writesets and local update transactions (we run soft recovery if PostgreSQL aborts a remote writeset), or (b) detect and prevent deadlocks eagerly in the middleware as an optimization, using eager pre-certification.

Eager Pre-certification. The proxy can avoid deadlocks by eagerly detecting write-write conflicts between remote writesets and local writesets. Conceptually, we leverage the local certification functionality at the proxy (Section 6.2) to eagerly certify every write in a transaction as it occurs against the pending remote writesets. Similarly, remote writesets received at the replica are verified against the current group of partial writesets. If a write-write conflict is found in either case, the proxy aborts the conflicting local update transaction, which allows the remote writeset to be executed. Neither local certification nor eager pre-certification increases the total number of low-level certification comparisons. They change when (earlier than commit time) and where (on the replica rather than on the certifier) some certification comparisons are performed, so they have the benefit of distributing the certification load to the replicas (though the time for certifying a writeset is typically an order of magnitude less than that of executing the transaction itself).
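A minimal sketch of the eager pre-certification idea follows. It again reuses the hypothetical Writeset class and makes no claim to match the real proxy: the proxy keeps the set of remote writesets that have been received but not yet committed locally, and checks every partial local writeset against that set, aborting the local transaction on a conflict.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of eager pre-certification: check each partial local writeset against
// remote writesets that have been received but not yet committed at this replica.
final class EagerPreCertifier {
    private final Deque<Writeset> pendingRemote = new ArrayDeque<>();

    synchronized void remoteWritesetReceived(Writeset w)  { pendingRemote.add(w); }
    synchronized void remoteWritesetCommitted(Writeset w) { pendingRemote.remove(w); }

    // Called as a local transaction performs writes. Returns false if the write
    // conflicts with a pending remote writeset, in which case the caller aborts
    // the local transaction so the remote writeset cannot deadlock behind it.
    synchronized boolean localWriteIsSafe(Writeset partialLocalWriteset) {
        for (Writeset remote : pendingRemote) {
            if (remote.intersects(partialLocalWriteset)) {
                return false;
            }
        }
        return true;
    }
}
```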
8.3 Extending the Database API

For Tashkent-API, the database must enforce a specified commit ordering. In PostgreSQL, the modifications of a transaction are visible after (1) the commit record is written to disk and (2) the transaction is announced as committed. Thus, to control the sequence of commits we need only control the order in which transactions are announced as committed. We modify PostgreSQL so that all pending commit operations wait upon a semaphore after writing their commit records to disk, i.e., we control only the second step, when a transaction is announced as committed. The semaphore is initialized to 0 at system start and incremented after each commit. Each commit requests the semaphore with a count matching its specified position in the ordering sequence and blocks until the semaphore count progresses sufficiently. This method requires minimal changes to PostgreSQL (20 lines of source code). The database may write the commit records to disk in an order different from the order in which transactions are announced as committed. Since this behavior is identical to the original standalone PostgreSQL system, database recovery is not affected by this change.

9 Performance Evaluation

9.1 Methodology

Under GSI, read-only transactions are executed locally on the receiving replica. The scalability of the replicated system is therefore limited by the rate of updates and the overhead of maintaining consistency. Thus, we assess the performance of uniting durability with ordering by comparing Tashkent-MW and Tashkent-API to Base using three benchmarks that vary widely in their update rates.

AllUpdates Benchmark. We developed a benchmark, called AllUpdates, in which clients rapidly generate back-to-back short update transactions that do not conflict. The average writeset size is 54 bytes for each update transaction. AllUpdates represents a worst-case workload for a replicated system.

TPC-B Benchmark. TPC-B is a benchmark from the Transaction Processing Council [23] that uses transactions containing small writes and one read. The average writeset size is 158 bytes. In contrast to the AllUpdates benchmark, TPC-B transactions have both reads and writes, and its workload contains write-write conflicts.

TPC-W Benchmark. TPC-W is a benchmark from the Transaction Processing Council designed to evaluate e-commerce systems. It implements an on-line bookstore and has three workload mixes. We report results from the most commonly used shopping mix workload (with 20% updates). The average writeset size is 275 bytes. In contrast to the other two benchmarks, the relatively heavyweight transactions of TPC-W make CPU processing the bottleneck.

Thus, the three benchmarks AllUpdates, TPC-B, and TPC-W represent a spectrum of workloads differing in the fraction of writes, transaction complexity, and conflict rate. We measure the performance of a single standalone database and determine the number of clients needed to generate 85% of the peak throughput. In the following experiments, each replica is driven at this load.

Figure 4: Throughput for AllUpdates (shared IO).
Figure 5: Resp. Time for AllUpdates (shared IO).
Figure 6: Throughput for AllUpdates (dedicated IO).
Figure 7: Resp. Time for AllUpdates (dedicated IO).

System specification. Each machine in our cluster runs Linux on a single Intel Xeon 2.4GHz CPU with 1GB ECC SDRAM and a 120GB 7200rpm disk drive. The machines are connected through a switched 1Gbps Ethernet LAN. We use a leader certifier and two backups for fault tolerance (see Section 7.3). We use the PostgreSQL database configured to run transactions at the snapshot isolation level (which is the strictest isolation level in PostgreSQL, where it is called the "serializable" transaction isolation level).

IO channel for durability. To guarantee durability we use synchronous disk writes with the Linux system calls write() and fsync(), such that the fsync call returns only after the data has been flushed to disk (i.e., to the disk media rather than to the disk cache). On our system fsync takes about 8 ms, but the actual time varies depending on where the data resides on disk (6-12 ms).

9.2 AllUpdates

Figure 4 shows the throughput results for Base, Tashkent-MW, and Tashkent-API running AllUpdates. The x-axis is the number of replicas and the y-axis is the requests per second (i.e., transactions per second). The corresponding response times are shown in Figure 5.

In Figure 4, the throughput curve for Base is at the bottom and grows linearly at a rate of just over 49 req/sec per replica, up to 735 req/sec at 15 replicas. This matches the capacity of each replica's IO channel to process synchronous writes. At 8 ms per fsync call, when commits are serialized the upper limit on throughput is 125 req/sec. If remote writesets are returned in the certification process (which is the case even at two replicas), then two writes are required to commit each local update transaction, one for the grouped remote writesets and one for the local transaction, for an expected limit of about half that (roughly 60 local transaction commits per second), depending on the time to complete the fsync.

In contrast, Tashkent-MW achieves a total throughput of 3657 req/sec at 15 replicas, or five times that of Base. This high throughput is achieved because, with durability united with ordering at the certifier, the certifier is able to group an average of 29 writesets per fsync. Tashkent-API also performs much better than Base, achieving 2240 req/sec, or three times that of Base. While its throughput is also very high, Tashkent-API does not achieve the performance of Tashkent-MW, for three reasons. First, Tashkent-API has an inherently longer round-trip latency for each client request than Tashkent-MW. There are two disk writes in the path of each update transaction: one at the certifier for middleware recovery and one in the database for durability. This extra fsync delay extends the response time and depresses the throughput somewhat (Base is not likewise limited, as commit serialization at the database is by far its bottleneck). Second, each of our servers has only one disk. With a single disk, the IO channel is shared between three different IO streams: reading database pages, writing back dirty database pages, and durability logging. Third, Tashkent-MW and Tashkent-API write somewhat different log information in different ways.

To explore the effects of these factors on Tashkent-API, we first eliminate the extra latency at the certifier. While durability in the certifier is necessary for middleware recovery, we run with it off and show the resulting throughput in the fourth curve, tashapinocert, in Figure 4. In this case the certifier performs certification as usual, but it does not write information to disk. Without the IO delay at the certifier, the throughput increases to 2901 req/sec, but is still less than Tashkent-MW's. Next, since logging IO is in the critical path when durability is in the database, we create a dedicated logging channel by putting the database in ramdisk. Therefore, the IO channel is dedicated to logging, and reading and writing database pages occurs in main memory. We show in Figures 6 and 7 the results of using ramdisk. The relative throughput behavior in Figure 6 is similar to that in Figure 4, but all the curves shift up somewhat because less contention improves throughput. However, the effect is minor, as AllUpdates runs essentially from memory, resulting in very little activity for reading and writing database pages in the steady state. We attribute the remaining performance difference between Tashkent-MW and Tashkent-API (i.e., between the curves tashmw and tashapinocert in Figure 6) to differences between PostgreSQL and the certifier. The certifier logs only the writeset, whereas PostgreSQL logs before/after images of data pages. In addition, PostgreSQL has a heavier-weight multi-process design compared to the multi-threaded design of the certifier for Tashkent-MW. We expect the differences would disappear with proper changes to PostgreSQL.

The response times in Figures 5 and 7 both show a jump in the response time for Base between one replica and two replicas. At one replica there is only one fsync to commit each local transaction (there are no remote writesets). At two or more replicas, an additional fsync call per local transaction is necessary for the remote writesets. More specifically, for Base in Figure 7, where the logging channel is dedicated, there are 10 clients at each replica, which results in approximately 10 concurrent local requests at each replica. Thus, the response time at one replica is about 90 msec (each client request waits for the other 9 fsync calls plus its own fsync). At two replicas and onwards, the response time rises to nearly 180 msec. This delay includes the 10 fsyncs to commit the local transactions, plus another 10 fsyncs to commit the grouped remote writesets for each local transaction. For the Tashkent systems, the delay increases slowly, reflecting the overhead of processing more remote writesets at each replica. The certifier is lightweight.
For example, at 15 replicas in the Tashkent-MW system, the certifier certifies 3657 req/sec while its disk is less than 50% utilized and its CPU utilization is below 20%.

The replication middleware does not add significant overhead. The throughput of a 1-replica Tashkent-MW system running the full replication protocol is within 5% of a standalone system. For example, under the AllUpdates benchmark using a shared IO channel, a standalone database gives a throughput of 517 req/sec with a 17 msec average response time, compared to 490 req/sec with 18 msec for a 1-replica Tashkent-MW system. The performance is similar when using a dedicated IO channel (515 req/sec with 17 msec for standalone versus 491 req/sec with 18 msec for 1-replica Tashkent-MW). Moreover, we see the same behavior under TPC-B and TPC-W; a 1-replica Tashkent-MW gives performance close to that of a standalone database.

In summary, under this worst-case update-intensive workload, both Tashkent-MW and Tashkent-API show impressive scalability improvements. Tashkent-MW and Tashkent-API have improvements of 5.0x/3.0x (shared IO) and 5.0x/3.2x (dedicated IO) over Base.

9.3 TPC-B

TPC-B is also an update-intensive benchmark, but the transactions include reads from the database as well. The throughput and response times are shown in Figures 8 and 9 for the case of a shared IO channel. We see the same relative performance of the systems: Tashkent-MW gives the highest performance (2.6x Base), Base the lowest, and Tashkent-API (1.3x Base) performs in between the other two. Again, to explore the difference between Tashkent-MW and Tashkent-API, we remove the fsync in the certifier and show its improvement in the curve tashapinocert, resulting in a somewhat higher throughput. We also plot in Figures 10 and 11 the throughput and response time results for the case of a dedicated IO channel (i.e., the database is in ramdisk). Here all curves are higher, but there is still a significant difference between Tashkent-MW and Tashkent-API. This difference is not due only to the additional latency at the certifier, as the curve tashapinocert shows little additional throughput with the dedicated logging channel. Unlike AllUpdates, TPC-B generates artificial conflicts among remote writesets (Section 5.2.1). An artificial conflict among remote writesets prevents Tashkent-API from submitting them concurrently to the database. This reduces opportunities to unite durability and ordering. When such an artificial conflict is detected, some remote writesets must be submitted serially in separate fsync calls. These additional fsyncs degrade Tashkent-API's performance. The artificial conflict rate between remote writeset groups is 35% in TPC-B. Finally, the response times in Figures 9 and 11 increase steadily as replicas are added. This reflects the overhead of applying writesets at the replicas.

9.4 TPC-W

In Figure 12 we show the throughput of TPC-W with the shopping mix (20% updates) for a shared IO channel. There is no difference between Tashkent-API and Base because the update rate is very low. At the maximum throughput of 240 tps, there are only 48 updates per second system-wide. At 15 replicas, this means each replica generates only about 3 updates per second on average (requiring 3*2 fsync() calls per second at each replica), much lower than what is needed to saturate the local logging channel. Thus, Tashkent-API has no opportunity to group commits to disk. At very low update rates, separating ordering and durability does not create a bottleneck.

Figure 8: Throughput for TPC-B (shared IO).
Figure 9: Resp. Time for TPC-B (shared IO).
Figure 10: Throughput for TPC-B (dedicated IO).
Figure 11: Resp. Time for TPC-B (dedicated IO).

The response times of read-only and update transactions are shown in Figure 13. Update transaction response times are similar for Tashkent-API and Base. Furthermore, the response times for read-only transactions are similar for all systems, as read-only transactions are handled in the same way in all three systems. However, the performance of Tashkent-MW is better than both Tashkent-API and Base. The reason is the shared IO channel in Tashkent-API and Base: with a shared IO channel and the larger database, Tashkent-API and Base experience significantly higher critical-path fsync delays due to non-logging IO congestion. Using a dedicated logging channel would alleviate this congestion, but the large TPC-W environment does not fit in ramdisk on our machines. We expect that both Tashkent-API and Base would match the performance of Tashkent-MW if the logging channel were dedicated and the update rate remained low.

9.5 The effect of aborts

One may wonder whether the benefits of Tashkent remain under workloads with a substantial number of aborts. In Figure 14 we demonstrate that this is the case. We show the effects of aborts using the AllUpdates benchmark because TPC-B and TPC-W have very few (non-artificial) conflicts and consequently low abort rates. In AllUpdates, we force a system-wide abort rate by having the certifier randomly abort requests at that rate. If a certification request is selected to be aborted, the abort occurs after the full certification check, so that all computational overhead at the certifier is incurred. In Figure 14 we use exaggerated abort rates to demonstrate how the goodput (the throughput of non-aborted requests) is affected. We force three different abort rates: 0%, 20%, and 40%. The top three curves are for Tashkent-MW, the middle three for Tashkent-API, and the bottom three for Base. Even under high abort rates, the Tashkent systems have significantly higher performance than Base.
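To make the forced-abort mechanism concrete, the sketch below shows one way such an experiment could be wired into a certifier; the names (certify, forced_abort_rate) and structure are our assumptions, not the actual certifier code. The essential point is that the full certification check runs before the random abort, so the certifier's computational cost is unchanged.

```python
import random

def certify_with_forced_aborts(request, certify, forced_abort_rate):
    """Run the full certification check, then possibly force an abort.

    `certify` is a hypothetical function returning True (commit) or
    False (genuine certification abort)."""
    verdict = certify(request)      # full check always runs: normal CPU cost
    if verdict and random.random() < forced_abort_rate:
        return False                # forced abort, counted against goodput
    return verdict

def goodput(throughput_rps, forced_abort_rate):
    """Approximate goodput (non-aborted requests/sec), ignoring genuine aborts."""
    return throughput_rps * (1.0 - forced_abort_rate)

# Example: a hypothetical 3000 req/sec certification rate under the three
# forced abort rates used in Figure 14.
print(goodput(3000, 0.0), goodput(3000, 0.2), goodput(3000, 0.4))
```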

9.6 Recovery time

We report the recovery times for TPC-W because it is the largest of the three benchmarks (the database size is ~700 MB) and the most costly to recover.

Figure 12: Throughput for TPC-W shopping mix (shared IO).
Figure 13: Resp. Time for TPC-W shopping mix (shared IO).
Figure 14: Certifier goodput under different abort rates (dedicated IO).

Tashkent-MW database recovery. Tashkent-MW requires taking periodic copies of the database for its recovery. In a 15-replica system, a replica requires about 230 seconds to dump a complete copy of the database while it continues processing transactions. That replica's throughput degrades by 13% during the 230 seconds. After a crash, a replica requires 140 seconds to restore the database from the dump file.

Tashkent-API and Base database recovery. In Tashkent-API and Base, a database replica recovers exactly as a standalone database does. The database reaches a consistent state using its own internal recovery procedure in a few seconds (2-4 seconds).

Certifier recovery. A certifier recovers much faster than a database. Its recovery time is dominated by the time required to update its log. The growth rate of the certifier log at 15 replicas is 201,600 writesets per hour, at an average of 275 bytes per writeset, or 56 MB per hour. The log is transferred from one of the up certifiers; this is essentially a file transfer that takes about 1 second in our LAN for each hour of down time.

Applying writesets (all systems). After a database recovers, the proxy applies the remote writesets that occurred during the down time and recovery time of the database. The proxy batches the remote writesets and applies them to the database at a rate of 900 writesets per second. At 15 replicas, the rate of updates is 56 writesets per second (20% updates of 280 req/sec in Tashkent-MW). For H hours of down time, catching up therefore requires approximately 222*H seconds.

9.7 Summary

Under high update workloads, separating durability from ordering creates a significant scalability bottleneck. Combining these two functions, either in the middleware or in the database, alleviates this bottleneck and greatly improves performance. The Tashkent-MW solution is more robust than Tashkent-API if the workload has a significant fraction of artificial conflicts. Such a workload forces Tashkent-API to serialize some commits and behave more like Base.

10 Related Work

10.1 SI, GSI, and Serializability

Serializability [17, 2] is the most common database correctness criterion. Snapshot isolation (SI) provides a weaker form of consistency than serializability. Several researchers [5, 7, 4] have recently demonstrated that, under certain conditions on the workload, transactions executing on a database with SI produce serializable histories. Those conditions hold for many applications (including the TPC-A, TPC-B, TPC-C, and TPC-W benchmarks). Even if these conditions do not hold for an application that uses a fixed set of transaction templates, the transaction templates can be modified automatically to be serializable under SI. In practice, database developers understand SI and are capable of using it.

The notion of generalized snapshot isolation (GSI) is introduced by Elnikety et al. [3]. Generalized snapshot isolation preserves SI's isolation guarantees. In particular, the authors prove that if an application runs serializably under SI, it also runs serializably under GSI. In our experimental results section, the workload assigned to the databases is serializable under GSI, as previously indicated [3, 4].
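To make the certification step concrete, the following is a minimal sketch of the first-committer-wins write-write check that SI/GSI certification performs, with hypothetical names and an in-memory log; it illustrates the rule from [3], not the Tashkent certifier's implementation.

```python
from typing import List, Set, Tuple

def certify(snapshot_version: int,
            writeset: Set[str],
            committed_log: List[Tuple[int, Set[str]]]) -> bool:
    """Return True (commit) iff the writeset does not overlap the writeset of
    any transaction that committed after the requesting transaction's snapshot."""
    for commit_version, committed_ws in committed_log:
        if commit_version > snapshot_version and committed_ws & writeset:
            return False    # write-write conflict with a concurrent transaction
    return True             # no conflict: the transaction may commit in order

# Example: a transaction reads from snapshot 5 and writes {"x"}; another
# transaction that committed at version 7 also wrote "x", so it must abort.
print(certify(5, {"x"}, [(3, {"y"}), (7, {"x", "z"})]))   # False
```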

10.2 Database Replication

Gray et al. [9] have classified database replication into two schemes: eager and lazy. Eager replication provides consistency, usually at the cost of limited scalability. Lazy replication increases performance by allowing replicas to diverge, possibly exposing clients to inconsistent states of the database. Tashkent has both eager and lazy features. Tashkent uses GSI and avoids global write-write conflicts eagerly during certification. It appears to clients as a single snapshot-isolated database. It does not need update reconciliation, and the effects of committed update transactions are not lost due to a replica failure. However, replicas are not updated in lockstep; if a replica does not receive update transactions, it may become stale. Tashkent bounds staleness using a simple mechanism to fetch updates from the certifier (Section 6.2).

10.3 Replication in SI Databases

Kemme et al. [12] discuss how to implement different isolation levels (including serializability and snapshot isolation) in replicated databases using group communication primitives. In addition, they implement Postgres-R [11] and integrate the replica management with the database concurrency control [24, 14] in Postgres-R(SI). Postgres-R(SI) uses a replicated form of snapshot isolation. In contrast to Postgres-R(SI) [24, 14], the two Tashkent systems have the key feature of uniting ordering and durability, they never block read-only transactions, and Tashkent-MW is a pure middleware solution. Postgres-R(SI) commits update transactions sequentially. This limits scalability when durability is guaranteed by the database but ordering is determined in the middleware: the database needs one fsync call for each commit. The Tashkent systems remove this fsync bottleneck.

In Postgres-R(SI) the replication protocol is tightly coupled with the concurrency control. For example, in one of the prototypes, the replication middleware accesses PostgreSQL lock tables and replicas map internal transaction IDs to global transaction IDs. The validation function in Postgres-R(SI) (similar to our certification) is replicated with each database, whereas in our work the certifier and replica components are replicated asymmetrically to enhance certifier availability and replica performance. In large-scale systems under heavy loads, co-locating the certifier with each database replica marginally improves certifier availability (over asymmetric replication) but makes the certifier compete for resources with the database at each replica, possibly reducing replica performance.

Plattner et al. [18] present Ganymed, a master-slave scheme for replicating snapshot-isolated databases in clusters of machines. In Ganymed, all update transactions are handled by a master and read-only transactions are distributed among slaves. Ganymed serializes commits on the master using one fsync for each commit. During update propagation, transactions commit serially at the slaves as well. Both the master and the slaves could benefit from the extended API described in this paper. However, a single-master system such as Ganymed is limited to the throughput of the single machine that must fully process all update transactions. The Tashkent systems process most of each update transaction locally at the receiving replica and only certify the core writesets at the certifier. Certifying the writesets is an order of magnitude less work than executing the transaction. Thus, the Tashkent designs effectively distribute much of the update transaction workload.
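Before turning to the conclusions, it is worth making concrete the commit grouping that distinguishes the Tashkent systems from the one-fsync-per-commit behavior of Postgres-R(SI) and Ganymed noted above. The sketch below is purely illustrative: the log-file layout, record format, and function name are our assumptions, not PostgreSQL's write-ahead log or Tashkent's actual code.

```python
import os
from typing import List

def group_commit(log_path: str, ordered_commit_records: List[bytes]) -> None:
    """Append commit records in the globally agreed order and make the whole
    group durable with a single fsync, instead of one fsync per commit."""
    fd = os.open(log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        for record in ordered_commit_records:   # global commit order preserved
            os.write(fd, record)
        os.fsync(fd)                            # one disk write for the group
    finally:
        os.close(fd)

# Usage sketch: commit records arrive already ordered by the certifier.
group_commit("/tmp/commit.log", [b"commit:t17\n", b"commit:t18\n", b"commit:t19\n"])
```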
11 Conclusions

This paper identifies a limitation to scalability in replicated database designs concerning durability and commit ordering. We analyze the dependency between guaranteeing durability and maintaining a global commit order. This dependency appears in replicated database systems in which replicas agree both on which update transactions commit and on the order of their commits. By uniting durability with ordering, the two actions are performed in one phase, which greatly improves scalability.

We present Tashkent-MW and Tashkent-API as two example solutions that unite durability and ordering. In Tashkent-MW, durability is united with commit ordering in the replication middleware; Tashkent-MW is a pure middleware solution for high-performance replicated databases. Tashkent-API unites durability and commit ordering inside the database: the replication middleware passes the commit order information to the database. In this solution, additional care must be taken to ensure the middleware proxy does not generate artificial conflicts between remote writesets that are submitted concurrently. This constraint prevents some opportunities to combine durability and ordering. We also present Base, a replicated database system in which durability and commit ordering remain separate.

We experimentally compare Tashkent-MW and Tashkent-API to Base to assess the benefits of uniting durability and commit ordering. We implement the Base, Tashkent-MW, and Tashkent-API systems on top of PostgreSQL and analyze their performance using our AllUpdates benchmark as well as the standard TPC-B and TPC-W benchmarks. Under high update transaction loads the advantages of uniting durability and ordering become significant: both versions of Tashkent greatly improve scalability under high update loads and outperform Base in throughput by factors of 5 and 3, respectively, while also providing lower response times.

12 Acknowledgments

We thank Willy Zwaenepoel from EPFL for his insightful suggestions and valuable feedback. We also thank the anonymous reviewers for their constructive feedback. Finally, we thank Anastassia Ailamaki for her guidance in preparing the final draft. This research was partially supported by a Swiss National Science Foundation grant.

References

[1] Hal Berenson, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O'Neil, and Patrick O'Neil. A critique of ANSI SQL isolation levels. In Proceedings of the ACM SIGMOD International Conference on Management of Data, May 1995.
[2] Philip Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
[3] Sameh Elnikety, Fernando Pedone, and Willy Zwaenepoel. Database replication using generalized snapshot isolation. In IEEE Symposium on Reliable Distributed Systems (SRDS 2005), Orlando, Florida, October 2005.
[4] Alan Fekete. Allocating isolation levels to transactions. ACM SIGMOD, Baltimore, Maryland, June 2005.

[5] Alan Fekete. Serialisability and snapshot isolation. In Proceedings of the Australian Database Conference, Auckland, New Zealand, January 1999.
[6] Lars Frank. Evaluation of the basic remote backup and replication methods for high availability databases. Software Practice and Experience, 29, 1999.
[7] Alan Fekete, Dimitrios Liarokapis, Elizabeth O'Neil, Patrick O'Neil, and Dennis Shasha. Making snapshot isolation serializable. ACM Transactions on Database Systems, 30(2), June 2005.
[8] Lei Gao, Mike Dahlin, Amol Nayate, Jiandan Zheng, and Arun Iyengar. Application specific data replication for edge services. In Proceedings of the Twelfth International Conference on World Wide Web, ACM Press, 2003.
[9] Jim Gray, Pat Helland, Patrick O'Neil, and Dennis Shasha. The dangers of replication and a solution. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, June 1996.
[10] K. Jacobs. Concurrency control, transaction isolation and serializability in SQL92 and Oracle7. Technical report number A33745, Oracle Corporation, Redwood City, CA, July 1995.
[11] Bettina Kemme and Gustavo Alonso. Don't be lazy, be consistent: Postgres-R, a new way to implement database replication. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB 2000), Cairo, Egypt, September 2000.
[12] Bettina Kemme and Gustavo Alonso. A suite of database replication protocols based on group communication primitives. In Proceedings of the 18th International Conference on Distributed Computing Systems (ICDCS), Amsterdam, The Netherlands, May 1998.
[13] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2), May 1998.
[14] Yi Lin, Bettina Kemme, Marta Patiño-Martínez, and Ricardo Jiménez-Peris. Middleware based data replication providing snapshot isolation. In ACM International Conference on Management of Data (SIGMOD), Baltimore, Maryland, June 2005.
[15] Oracle parallel server for Windows NT clusters. Online white paper.
[16] Data concurrency and consistency, Oracle8 Concepts, Release 8.0: Chapter 23. Technical report, Oracle Corporation.
[17] Christos Papadimitriou. The Theory of Database Concurrency Control. Computer Science Press, 1986.
[18] Christian Plattner and Gustavo Alonso. Ganymed: Scalable replication for transactional web applications. In Proceedings of the 5th ACM/IFIP/USENIX International Middleware Conference, Toronto, Canada, October 2004.
[19] PostgreSQL, SQL compliant, open source object-relational database management system.
[20] Calton Pu and Avraham Leff. Replica control in distributed systems: an asynchronous approach. SIGMOD Record (ACM Special Interest Group on Management of Data), 20(2), June 1991.
[21] Robbert van Renesse and Fred B. Schneider. Chain replication for supporting high throughput and availability. In Sixth Symposium on Operating Systems Design and Implementation (OSDI '04), San Francisco, California, December 2004.
[22] Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys, 22(4), December 1990.
[23] Transaction Processing Performance Council.
[24] Shuqing Wu and Bettina Kemme. Postgres-R(SI): Combining replica control with concurrency control based on snapshot isolation. In Proceedings of the International Conference on Data Engineering (ICDE), April 2005.
[25] Matthias Wiesmann, Fernando Pedone, André Schiper, Bettina Kemme, and Gustavo Alonso. Understanding replication in databases and distributed systems. In Proceedings of the 20th International Conference on Distributed Computing Systems (ICDCS 2000), Taipei, Taiwan, April 2000.
