The Cost of Increased Transactional Correctness and Durability in Distributed Databases


Aspen Olmsted (olmsted@email.sc.edu), Csilla Farkas (farkas@cec.sc.edu)
Center for Information Assurance Engineering, Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208

Abstract

In this paper we investigate the problem of providing durability for Web Service transactions in the presence of system failures. We show that the popular lazy replica update propagation method is vulnerable to loss of transactional updates in the presence of hardware failures. We propose an extension to the lazy update propagation approach to reduce the risk of data loss. Our approach is based on the buddy system, requiring that updates are preserved synchronously in two replicas, called buddies. The rest of the replicas are updated using lazy update propagation protocols. Our method provides a balance between durability (i.e., effects of the transaction are preserved even if the server executing the transaction crashes before the update can be propagated to the other replicas) and efficiency (i.e., our approach requires a synchronous update between two replicas only, adding minimal overhead to the lazy replication protocol). Moreover, we show that our method of selecting the buddies ensures correct execution and can be easily extended to balance workload and reduce the latency observable by the client.

1 Introduction

Service Oriented Architecture (SOA) supports the composition of multiple Web Services (WSs) to perform complex business processes. One of the important aspects of SOA applications is to provide a high level of service availability. Replication of the services and their corresponding resources, such as databases, is frequently used to increase availability. In this paper we address issues related to durability and consistency of replicated databases in the context of SOA. A Web Service (WS) farm may host multiple replicas of the services and their resources. Service requests are distributed among the replicas within a WS farm to ensure efficient response capabilities. Usually, a WS farm is placed behind a dispatcher. Clients send service requests to the dispatcher, and the dispatcher distributes the requests to one of the redundant services. In a simple architecture, the redundant web servers will share a single database, so all replicas will have access to the same data. It is often required to replicate the database to support high availability, geographic distribution for low-latency response time, and efficient performance. Figure 1 shows an example of a WS farm with replicated databases in two WS clusters.

Figure 1: Web Service farm with two clusters. Each cluster contains a load balancer, a single database, and replicated services.

A WS farm also needs to ensure that the replicated databases are consistent. Several approaches have been proposed to support mutual consistency of replicated databases (see Ozsu and Valduriez [14] for an overview). However, the risk of lost updates in the presence of hardware failures has not been addressed sufficiently. Consequences of lost updates range from inconvenience, e.g., a consumer needs to reissue a service request, to major losses, e.g., legal liability for lost financial data. A related event was recently reported by CBC News [5]. According to CBC, Google's Gmail service had a replication issue that affected 38,000 user accounts by removing emails stored on replicated servers. Google recovered lost replicas by restoring from tape and manually refreshing the replicas. This slow and manual process caused an outage from February 27th, 2011 at 9 pm until February 28th, 2011 at 5 pm. During this manual operation, any inbound emails to the 38,000 user accounts were not written to a persistent store and were lost. Google has not released enough information for us to fully understand the cause of this particular problem, but it highlights the issues with durability in today's SOA architectures.

In this paper we focus on the effect of server failures in the presence of lazy update propagation strategies. We show that, using the popular WS farm architecture, it is possible that updates are lost and cannot be recovered by current update propagation protocols. For example, assume that the database server crashes after the transaction commits but before the updates are propagated to any other replicas. Current WS farms are often constructed with shared-nothing [13] architectures to enable assembly and replacement using commodity hardware. Service applications are implemented on the multiple copies of the commodity hardware. Similarly, replicated databases are implemented on commodity servers. If a database server crashes, technicians often replace the crashed server with a commodity replacement and never recover the original server. However, if the server crashes before it has propagated the update of a transaction, then its update will be lost.

We propose an extension to the lazy replica update propagation method to reduce the risk of data loss. Our approach is based on a buddy system. The transaction executes on a primary replica. However, the transaction cannot commit until a secondary replica, called a buddy, also preserves the effects of the transaction. The rest of the replicas are updated using one of the standard lazy update propagation protocols. Our method provides a balance between a guarantee of transactional durability (i.e., effects of the transaction are preserved even if the server hosting the primary replica crashes before the update can be propagated to the other replicas) and efficient update propagation (i.e., our approach requires the synchronized update between two replicas only, therefore adding minimal overhead to the lazy replication protocol).

Our solution uses an application-layer dispatcher [1] to select the buddies based on the data items and the operations of the transactions, the data versions available, and the network characteristics of the WS farm. The main criterion is that the selection of the buddies ensures freshness of the read operations, that is, read requests must access the latest committed data version, and serializability of the update operations. We show that our approach reduces the window of vulnerability, that is, the time period after a transaction is committed but before the updates are propagated to the other replicas.
Although our solution can be viewed as a special case of a hybrid update propagation method, it provides several advantages not addressed in traditional distributed database replica update protocols. First, our approach provides the scalability required by modern applications, such as web services, and is suitable for the architectures and technologies implementing these applications. Second, the buddy-selection algorithm supports dynamic master-slave site selection for data items and ensures correct transaction execution. Third, we show that our method can be easily extended to incorporate network-specific characteristics, such as distance and bandwidth, that further reduce the latency observed by the client and to provide load balancing among the replicas. Our empirical results support our hypothesis that, in the presence of large data sets, the efficiency of our approach is comparable to the efficiency of the lazy update propagation method.

The organization of the paper is as follows. Section 2 describes the related work and the limitations of current methods. In Section 3 we give an overview of the proposed Buddy System and present our algorithms. Section 4 contains the performance analysis of our method and compares it with the eager and lazy update propagation methods. In Section 5 we present our empirical results. We conclude and discuss future work in Section 6.

2 Related Work

Several methods have been developed to ensure mutual consistency in replicated databases. The aim of these methods is to eventually provide one-copy serializability (1SR). Transactions on traditional replicated databases are based on reading any copy and writing (updating) all copies of data items. Based on the time of the update propagation, two main approaches have been proposed. Approaches that update all replicas before the transaction can commit are called eager update propagation protocols; approaches that allow the propagation of the update after the transaction is committed are called lazy update propagation. While eager update propagation guarantees mutual consistency among the replicas, this approach is not scalable. Lazy update propagation is efficient, but it may result in violation of mutual consistency. During the last decade, several methods have been proposed to ensure mutual consistency in the presence of lazy update propagation (see [14] for an overview). More recently, Snapshot Isolation (SI) [10, 8] has been proposed to provide concurrency control in replicated databases. The aim of this approach is to provide global one-copy serializability using SI at each replica. The advantage is that SI provides scalability and is supported by most database management systems.

One of the current application areas for replicated databases is Web Services applications. Luo and Yang [12] study the two primary replica update protocols in the context of web services. The authors state that eager replication has a problem of increasing latency as the number of replicas increases. This increasing latency diminishes the availability gains from introducing replicas. Most commercial implementations use lazy replication because of its efficiency and scalability. Lazy replication methods are also more partition tolerant than eager replication methods. However, lazy replication protocols require additional considerations to ensure consistency.

Breitbart and Korth [2] and Daudjee et al. [3] propose frameworks for master-slave lazy replication updates with consistency guarantees. These approaches are based on requiring all writes to be performed on the master replica. Updates are propagated to the other sites after the updating transaction is committed. Their framework provides a distributed serializable schedule where the ordering of updates is not guaranteed. The approach proposed by Daudjee et al. provides multiversion serializability, where different versions of data can be returned for read requests during the period that replication has not completed.

Jajodia and Mutchler [7] and Long et al. [11] both define forms of hybrid replication that reduce the requirement that all replicas participate in eager update propagation. The proposed methods aim to increase availability in the presence of network isolations or hardware failures. Both approaches have limited scalability because they require a majority of replicas to participate in eager update propagation. Most recently, Garcia-Munos et al. [6] proposed a hybrid replication protocol that can be configured to behave as an eager or a lazy update propagation protocol. The authors provide empirical data and show that their protocol provides scalability and reduced communication cost over other hybrid update protocols.

In addition to academic research, several database management systems have been developed that support some form of replicated data management. For example, Lakshman and Malik [9] describe a hybrid system, called Cassandra, that was built by Facebook to handle their inbox search. Cassandra allows a configuration parameter that controls the number of nodes that must be updated synchronously. The Cassandra system can be configured so that nodes chosen for synchronous inclusion cross data center boundaries to increase durability and availability.
3 Proposed System

Our proposed system addresses three problems: decreasing the risk of losing committed transactional data in case of a site failure, increasing the consistency of transactions, and increasing the availability of read requests. The three main components of our proposed system are: 1) Synchronous Transactional Buddy System, 2) Version Master-Slave Lazy Replication, and 3) Serializable Snapshot Isolation Schedule. To support these components, the dispatcher will operate at layer 7 of the OSI model (the application layer). This will allow the dispatcher to use application-specific data for transaction distribution and buddy selection. The dispatcher receives the requests from clients and distributes them to the WS clusters. Each WS cluster contains a load balancer, a single database, and replicated services. The load balancer receives the service requests from the dispatcher and distributes them among the service replicas. Within a WS cluster, each service shares the same database. Database updates among the clusters are propagated using lazy replication propagation.

3.1 Buddy System

As we have shown in Section 2, lazy update propagation is vulnerable to loss of updates in the presence of a database server failure. This is a particularly serious problem in the context of WS farms, where efficiency and availability are often prioritized over consistency. The window of vulnerability for this loss is after the transaction has committed but before the replica updates are initiated. To guarantee data persistence even in the presence of hardware failures, we propose to form strict replication between pairs of replica clusters, called buddies. Our aim is to ensure that at least one replica in addition to the primary replica is updated and, therefore, preserves the updates.
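As a first approximation, the durability rule can be pictured as follows; this is a minimal Java sketch under the assumption of a simple key-value write, with illustrative names rather than the paper's implementation.

import java.util.List;

/**
 * Minimal sketch of the buddy invariant described above: an update is
 * acknowledged only after TWO replicas (the primary and its buddy) have
 * persisted it; all other replicas are refreshed lazily.
 * The Replica interface and all names here are illustrative, not taken
 * from the paper's implementation.
 */
class BuddyInvariant {

    interface Replica {
        boolean persist(String key, String value); // durable local write
    }

    /** Returns true only if both buddies hold the update durably. */
    static boolean commit(String key, String value,
                          Replica primary, Replica buddy,
                          List<Replica> lazyReplicas) {
        if (!primary.persist(key, value) || !buddy.persist(key, value)) {
            return false; // abort: the pair could not both persist the update
        }
        // Lazy propagation can now proceed in the background; losing any one
        // replica no longer loses the committed update.
        lazyReplicas.forEach(r -> r.persist(key, value));
        return true;
    }
}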

Figure 1 shows a WS farm architecture where each cluster has a load balancer. After receiving a transaction, the dispatcher picks the two clusters that form the buddy system. The selection is based on versioning history. The primary buddy will receive the transaction along with its buddy's IP address. The primary buddy will become the coordinator in a simplified commit protocol between the two buddies. Both buddies perform the transaction and will commit or abort together. Figure 2 shows the flowchart of the transaction processing by the buddies.

Figure 2: Flowchart of communication between the primary and secondary buddies (the dispatcher sends the request to the primary buddy; the primary acquires locks on the transaction resources; the secondary performs the transaction; the primary finishes the transaction and sends a confirmation or an error to the client).

The dispatcher maintains metadata about the freshness of data items in the different clusters. The dispatcher will increment a version counter for each data item after it has been modified. Any two service providers (clusters) with the latest version of the requested data items can be selected as buddies. Note that the databases maintained by the two clusters must agree on the requested data items but may differ for the other data items.

3.2 Dispatcher Data Structures

The dispatcher will maintain a version table for every object modified by web services. Each service request may include modification and read requests for several objects. When a service request is received, the dispatcher ensures that the request is delivered to the appropriate cluster. If the request is read-only, the primary buddy must have the latest committed version of all objects in the request. If the request includes writes, the dispatcher needs to determine whether there is any uncommitted transaction accessing the requested data items. If it finds such active transactions, then the request is sent to the web service cluster where the active transaction is being executed. If the dispatcher cannot find a cluster with the latest version due to the distribution of the requested objects, then the request is queued until the currently active transactions complete or the updates are propagated.

The dispatcher must also ensure that snapshot isolation anomalies are avoided. For this we address blind writes and analyse the read log to determine whether an anomaly could take place. Operationally, blind writes are writes that follow an earlier read operation and update a value that was read earlier. Fekete [4] documented the anomalies that must be avoided to turn a snapshot isolation schedule into a serializable schedule. We incorporate these results to support serializability.

The dispatcher will maintain the following data structures for processing the algorithms:

a.) Cluster List - contains the names of the clusters and their IP addresses.
b.) Object Version Table - contains the names of the data items and their version numbers, corresponding to the completed and in-progress transactions.
c.) Cluster Object Table - contains the cluster names, stored objects, and the version number of the objects at that cluster.

Table 1: Example Data Structures

(a) Example Cluster List
    Cluster  IP
    1        192.168.1.1
    2        192.168.1.2
    3        192.168.1.3

(b) Example Object Version Table
    Object  Completed  In-Progress
    A       1012       1014
    B       954        954
    C       2054       2054

(c) Example Cluster Object Table
    Cluster  Object  Version
    1        A       1014
    2        A       1014
    3        A       1012
    1        B       954
    2        B       954
    3        B       954
    1        C       2054
    2        C       2054
    3        C       2054

For example, the example data structure tables (Table 1a, Table 1b, Table 1c) show that clusters 1 and 2 have two update operations on object A sent to them that are still in progress.

3.3 Dispatcher Service Request Algorithm

The dispatcher service request algorithm (Algorithm 1) is executed by the dispatcher for every incoming request containing write operations. The goal of the service request algorithm is to find a pair of buddies that have the correct versions for the incoming request. If a pair cannot be found, then the request is added to a queue for later processing. For read-only requests the dispatcher will execute the read-only version of the algorithm (Algorithm 2). This version requires only a single cluster to respond to the request. The cluster must have completed versions for each object in the request.
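To make these structures concrete, the following is a minimal Java sketch of the dispatcher's bookkeeping, populated with the example values of Table 1. The class and field names are illustrative assumptions, not taken from the paper's implementation (which keeps these as in-memory hash tables, see Section 5).

import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of the dispatcher's three data structures (Table 1).
 * All names are hypothetical.
 */
class DispatcherState {

    /** (a) Cluster List: cluster id -> IP address. */
    final Map<Integer, String> clusterList = new HashMap<>();

    /** Version pair tracked per object: latest completed and in-progress. */
    static final class ObjectVersion {
        long completed;
        long inProgress;
        ObjectVersion(long completed, long inProgress) {
            this.completed = completed;
            this.inProgress = inProgress;
        }
    }

    /** (b) Object Version Table: object name -> (completed, in-progress). */
    final Map<String, ObjectVersion> objectVersions = new HashMap<>();

    /** (c) Cluster Object Table: cluster id -> (object name -> version held). */
    final Map<Integer, Map<String, Long>> clusterObjects = new HashMap<>();

    /** Populate the example state of Table 1. */
    static DispatcherState example() {
        DispatcherState s = new DispatcherState();
        s.clusterList.put(1, "192.168.1.1");
        s.clusterList.put(2, "192.168.1.2");
        s.clusterList.put(3, "192.168.1.3");
        s.objectVersions.put("A", new ObjectVersion(1012, 1014));
        s.objectVersions.put("B", new ObjectVersion(954, 954));
        s.objectVersions.put("C", new ObjectVersion(2054, 2054));
        s.clusterObjects.put(1, new HashMap<>(Map.of("A", 1014L, "B", 954L, "C", 2054L)));
        s.clusterObjects.put(2, new HashMap<>(Map.of("A", 1014L, "B", 954L, "C", 2054L)));
        s.clusterObjects.put(3, new HashMap<>(Map.of("A", 1012L, "B", 954L, "C", 2054L)));
        return s;
    }
}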

Theorem 1: The Dispatcher Service Request Algorithms (Algorithms 1 and 2) guarantee one-copy serializability.

Proof sketch: Our proof is based on the following claim. Let H be a history over a set of transactions T, such that each transaction T_i, i = 1, ..., n, is made up of a set of read R_i(A) and write W_i(A) operations on elements from a data set. H is one-copy serializable if the following three conditions hold:

1. Each request (transaction) is an atomic transaction,
2. Concurrent writes on the same data item are sent to the same cluster, and
3. Each cluster guarantees a serializable transaction history on its local database.

To show that the claim holds, assume, by contradiction, that H is not one-copy serializable. Then there must exist a cycle among the committed transactions in the serialization graph of H. Let T_i and T_j be the two transactions responsible for the cycle. We show that the serialization graph cannot contain a cycle for the three potential scenarios of these transactions. The three scenarios are: Read Set/Write Set overlap, Write Set/Write Set overlap, and Read Set/Read Set overlap.

Load Balancing: Algorithms 1 and 2 choose the first available cluster for read-only requests, and the first available pair of clusters for requests containing write operations. The selection can be improved by decorating the Cluster List table (Table 1a) with properties that represent system properties (e.g., processing power, available applications, process wait-time, etc.) and network-related information (e.g., link properties, hop distances, etc.) that can influence buddy selection. For example, buddies may be selected based on their geographical location and the reliability of the communication network. Our current work extends Algorithm 1 with the capability of incorporating these semantics.

3.4 Dispatcher Version Update Algorithm

The Dispatcher Version Update algorithm (Algorithm 3) is executed by the dispatcher when a data item is updated. When a primary buddy or any lazy-update cluster completes a transaction, it will send a version update request to the dispatcher. The dispatcher will update the latest completed version value for these clusters. After the version is updated, any requests in the queue will be reprocessed in the hope that the dispatcher can now find a pair of buddies with the correct versions.

Algorithm 1: Dispatcher Service Request Algorithm - Request Contains Writes
INPUT: requestObjects (request containing objects to be read and written), clusterObjects (Table 1c), objectVersions (Table 1b)
OUTPUT: buddyList (pair of buddies, or an empty list if no pair is available), clusterObjects (Table 1c), objectVersions (Table 1b)

    add all clusters to available
    for each object O.o, action O.a in requestObjects
        if O in objectVersions
            get completed OV.c, in-progress OV.i from objectVersions
        else
            OV.c = 1, OV.i = 1
        for each cluster CO.c, version CO.v in clusterObjects
            if CO.v > OV.c && O.a == READ
                available.remove(CO)
            else if CO.v < OV.i && O.a == WRITE
                available.remove(CO)
    for each cluster CO.c in clusterObjects
        if available(CO.c) and count(buddyList) < 2
            add CO.c to buddyList
    if count(buddyList) > 1
        choose B1, B2 from buddyList
        for each object O.o, action O.a in requestObjects
            get in-progress OV1.i from objectVersions for B1
            get in-progress OV2.i from objectVersions for B2
            if O.a == WRITE
                increment OV1.i
                increment OV2.i
        send buddyList, requestObjects to B1
    else
        enqueue(requestObjects)

Algorithm 2: Dispatcher Service Request Algorithm - Read Only
INPUT: requestObjects (request containing objects to be read), clusterObjects (Table 1c), objectVersions (Table 1b)
OUTPUT: buddyList (single cluster, or an empty list if none is available)

    add all clusters to available
    for each object O.o, action O.a in requestObjects
        if O in objectVersions
            get completed OV.c, in-progress OV.i from objectVersions
        else
            OV.c = 1, OV.i = 1
        for each cluster CO.c, version CO.v in clusterObjects
            if OV.c > CO.v
                available.remove(CO)
    for each cluster CO.c in clusterObjects
        if available(CO.c)
            add CO.c to buddyList
    if count(buddyList) > 0
        choose B1 from buddyList
        send buddyList, requestObjects to B1
    else
        enqueue(requestObjects)
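As a companion to the pseudocode, here is a hedged Java sketch of the freshness filter at the core of Algorithm 2: a cluster is eligible for a read-only request only if it holds at least the latest completed version of every requested object. Method and parameter names are illustrative assumptions, not the paper's implementation.

import java.util.*;

/**
 * Hedged sketch of the read-only freshness check behind Algorithm 2.
 */
class ReadOnlySelector {

    /**
     * @param requestObjects   objects the read-only request touches
     * @param completedVersion object name -> latest completed version (Table 1b)
     * @param clusterObjects   cluster id -> (object name -> version held) (Table 1c)
     * @return an eligible cluster id, or empty if the request must be queued
     */
    static Optional<Integer> pickCluster(Set<String> requestObjects,
                                         Map<String, Long> completedVersion,
                                         Map<Integer, Map<String, Long>> clusterObjects) {
        for (Map.Entry<Integer, Map<String, Long>> cluster : clusterObjects.entrySet()) {
            boolean fresh = true;
            for (String obj : requestObjects) {
                long needed = completedVersion.getOrDefault(obj, 1L);
                long held = cluster.getValue().getOrDefault(obj, 0L);
                if (held < needed) {          // stale copy: cluster not eligible
                    fresh = false;
                    break;
                }
            }
            if (fresh) {
                return Optional.of(cluster.getKey()); // first eligible cluster wins
            }
        }
        return Optional.empty();                      // caller enqueues the request
    }
}

A request containing writes would apply the analogous filter from Algorithm 1 and keep the first two eligible clusters as the buddy pair.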

Algorithm 3: Dispatcher Version Update Algorithm
INPUT: versionUpdates (triples of cluster, object, version)
OUTPUT: buddyList (pair of buddies, or an empty list if no pair is available), objectVersions (Table 1b)

    for each object, version in versionUpdates
        update completed = version in objectVersions
    process requests from queue

Algorithm 4: Primary Buddy Service Algorithm
INPUT: requestObjects (request containing objects to be read and written)
OUTPUT: dataSet (data requested in read operations), objectVersions (Table 1b)

    initialize writeList to an empty list
    for each object, action in requestObjects
        update the latest completed version
        if action == WRITE
            add object to writeList
            lock object
            log write operation
            write undo log for write operation
        else if action == READ
            add data to dataSet

    if there are items in writeList
        send writeList to secondary buddy
        if secondary buddy committed properly
            for each object in requestObjects
                complete write on object
                release lock on object
        else if secondary buddy aborted
            for each object in requestObjects
                undo write on object
                release lock on object

    send response to client
    send version update to dispatcher

3.5 Primary Buddy Service Algorithm

This section describes the interaction between the primary and secondary buddies. The primary buddy service algorithm (Algorithm 4) is executed on the primary buddy for every incoming request from the dispatcher. The goal of the primary buddy algorithm is to prepare the request on its cluster by locking resources. If the request includes write operations, then the request is sent to the secondary buddy. If the secondary buddy can execute the transaction, then the primary will finish the transaction and send a response to the client and a version update to the dispatcher.

4 Analysis of the Buddy System

In this section we study specific aspects of our proposed system. First, we evaluate the performance of our system in high-volume scenarios. Next, we compare our approach with eager and lazy replica update propagation in the presence of hardware failures.

4.1 Performance Analysis in High-Volume Scenarios

Some Web Service transactions involve large volumes of data items of the same type. For example, if a client is purchasing a concert ticket, multiple tickets have the same characteristics but different row and seat numbers. If we study a high-volume scenario in which a large number of tickets are being purchased, then three types of consumption patterns are exposed:

Anonymous Item Consumption - In this pattern each ticket is interchangeable, for example all seats are general admission. The buddy system would not improve latency over simple master-slave replication, since all concurrent resource requests would need to be sent to the same buddy pair.

Attribute Item Consumption - In this pattern each client's request has attribute filters, such as main floor or balcony. The buddy system would improve latency over simple master-slave replication because each set of attributes could be sent to a different buddy pair.

Serialized Item Consumption - In this pattern each client's consumption request is for a specific seat. The buddy system would greatly improve latency over simple master-slave replication because each seat request could be sent to a different buddy pair.

4.2 Analysis of Lost Updates in the Presence of Failures

Lazy Replication Durability: In each proposed lazy replication scenario, there is one master for a particular data item. After a transaction has committed, there is a period of time during which a lost update can occur if the hardware hosting the master replica fails before the lazy update propagation is initiated.

Eager Replication Durability: In eager replication the window of vulnerability for lost updates is removed because the updating transaction cannot commit until all other replicas are also updated. Generally, the two-phase commit (2PC) protocol is used across replicas to achieve this goal. However, the update cost of eager update propagation is high, and, therefore, it is not used frequently.
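Before turning to the buddy system's durability, the following hedged Java sketch illustrates the primary-side handshake of Algorithm 4 (the workflow shown in Figure 3): the primary applies its writes tentatively, forwards the write list to the secondary buddy, and either completes and reports new versions to the dispatcher or rolls back from its undo information. The interfaces and names are illustrative assumptions, not the paper's implementation.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hedged sketch of the primary/secondary handshake in Algorithm 4.
 */
class PrimaryBuddy {

    /** Abstraction of the secondary buddy's synchronous commit step. */
    interface SecondaryBuddy {
        boolean apply(Map<String, String> writeList); // true = committed, false = aborted
    }

    /** Abstraction of the dispatcher endpoint that tracks versions. */
    interface Dispatcher {
        void versionUpdate(Map<String, Long> newVersions);
    }

    private final Map<String, String> store = new ConcurrentHashMap<>();
    private final Map<String, Long> versions = new ConcurrentHashMap<>();

    /** Executes a write-only request; returns true iff both buddies committed. */
    boolean handleWrites(Map<String, String> writeList,
                         SecondaryBuddy buddy, Dispatcher dispatcher) {
        // In the real algorithm each object is locked and an undo log entry is
        // written before the tentative update; locking is elided here for brevity.
        Map<String, String> undo = new ConcurrentHashMap<>();
        writeList.forEach((k, v) -> {
            if (store.containsKey(k)) undo.put(k, store.get(k));
            store.put(k, v);                       // tentative write
        });

        if (!buddy.apply(writeList)) {             // secondary aborted
            undo.forEach(store::put);              // roll back via undo information
            writeList.keySet().stream()
                     .filter(k -> !undo.containsKey(k))
                     .forEach(store::remove);
            return false;                          // client receives an error
        }

        // Secondary committed: finish locally and report new versions.
        writeList.keySet().forEach(k -> versions.merge(k, 1L, Long::sum));
        dispatcher.versionUpdate(Map.copyOf(versions));
        return true;                               // client receives a confirmation
    }
}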

Figure 3: Buddy System Workflow (messages exchanged among the client, dispatcher, cluster load balancers, and the primary and secondary web services: Request, Request w/buddy, Writelist, Success/Failure, Commit/Abort, Version Update).

Buddy System Durability: Using the buddy system, we can guarantee durability. The weakest point of the buddy system is the durability of the dispatcher. If the dispatcher fails, the data structures may be lost and recovery activities must be performed. Figure 3 shows the workflow of the hybrid eager and lazy solution we propose. This solution has higher durability than lazy propagation because two replicas receive the original transaction, so a hardware error on one replica will not result in a lost update.

Table 2 presents our analysis of hardware failures at the different stages of transaction execution. The rows identify the failed hardware; the columns detail the stages: before the transaction started (inactive), during execution, and after the transaction committed but before the update is propagated.

Table 2: Windows of Vulnerability

Web server hosting the primary buddy's web service
  - Inactive: No issue because of redundancy.
  - During transaction: 2PC will roll back the transaction; the client will get an error from the primary.
  - After transaction committed: No issue because of redundancy.

Web server hosting the secondary buddy's web service
  - Inactive: No issue because of redundancy.
  - During transaction: 2PC will roll back the transaction; the client will get an error from the primary.
  - After transaction committed: No issue because of redundancy.

Database server
  - Inactive: No issue; the dispatcher needs to be notified by the web service that it is unavailable.
  - During transaction: 2PC will roll back the transaction; the client will get an error from the primary.
  - After transaction committed: No issue because of redundancy.

Dispatcher
  - Inactive: New dispatcher; each cluster will update the dispatcher's versions.
  - During transaction: After dispatch to the primary there is no issue, because the primary responds directly to the client.
  - After transaction committed: New dispatcher; each cluster will update the dispatcher's versions.

5 Implementation

We tested the performance of our Buddy system against the lazy and eager replica update protocols. We also considered two possible communication architectures: synchronous and asynchronous communication. Using asynchronous communication, the client sends a request and waits for a response to be sent asynchronously. In synchronous communication the client waits until the response is received. The major difference between these two methods is how the enqueue process is handled when the dispatcher cannot fulfill the request with the current state of the clusters.

Figure 4 shows the empirical data from an implementation using synchronous requests from a Java desktop application. The dispatcher is written in Java EE using a Tomcat servlet container. The dispatcher uses class attributes to share hash tables, the internal data structures, across all request threads. Each cluster is also implemented in Java EE using a Tomcat servlet container. A separate MySQL database is used by each cluster, running at the serializable isolation level. Each cluster uses a JDBC connection pool to communicate with its individual database. Datasets of different sizes were generated, with each transaction randomly selecting two items to read and one item to write. Buddy-100, Buddy-1000 and Buddy-10000 represent the performance of the Buddy algorithm with a dataset size of 100, 1000, and 10000 items, respectively. The same transactions were run against a single master cluster system with lazy replication and a two-cluster system with strict replication.

Figure 4: Implementation Data.

Figure 4 shows that once the dataset size grew to 10,000 items, the performance of the Buddy algorithm matches the performance of lazy replication while increasing durability. The severe performance penalty observed with small datasets is the result of the enqueue process and the overhead of selecting buddies. Our ongoing work aims to improve the buddy selection algorithm and to reduce the number of transactions that cannot be processed concurrently. Also, in the current implementation the dispatcher stores the version data structures in memory. Our future implementation will store these tables in secondary storage to increase the redundancy and durability of the dispatcher's data.

6 Conclusion

In this paper we propose an extension to the lazy replica update propagation methods. Our solution is based on using an application-layer dispatcher [1] to select the buddies based on the data items and the operations of the transactions, the data versions available, and the network characteristics of the WS farm. We show that although our solution can be viewed as a special case of a hybrid update propagation method, it provides several advantages not addressed in traditional distributed database replica update protocols. First, our approach provides the scalability required by modern applications, such as web services, and is suitable for the architectures and technologies implementing these applications. The buddy-selection algorithm supports dynamic master-slave site selection for data items, ensures correct transaction execution, and aids load balancing among the replicas. Finally, incorporating network-specific characteristics, such as distance and bandwidth, reduces the latency observed by the client. Our approach has message complexity similar to that of traditional lazy master-slave replication using group communication, and lower message complexity than lazy master-slave replication without group communication. The limitation of our work is that the dispatcher represents a single point of failure. Our ongoing work extends our solution to incorporate multiple dispatchers to eliminate this limitation.

References

[1] M. Aron, D. Sanders, P. Druschel, and W. Zwaenepoel. Scalable content-aware request distribution in cluster-based network servers. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC '00, pages 26-26, Berkeley, CA, USA, 2000. USENIX Association.

[2] Y. Breitbart and H. F. Korth. Replication and consistency: being lazy helps sometimes. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS '97, pages 173-184, New York, NY, USA, 1997. ACM.

[3] K. Daudjee and K. Salem. Lazy database replication with ordering guarantees. In Proceedings of the International Conference on Data Engineering (ICDE), page 424, 2004.

[4] A. Fekete. Allocating isolation levels to transactions. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '05, pages 206-215, New York, NY, USA, 2005. ACM.

[5] CBC News. Gmail messages being restored after bug, Mar. 2011.

[6] L. Irun-Briz, F. Castro-Company, F. Garcia-Neiva, A. Calero-Monteagudo, and F. D. Munoz-Escoi. Lazy recovery in a hybrid database replication protocol. In Proc. of XII Jornadas de Concurrencia y Sistemas Distribuidos, 2005.

[7] S. Jajodia and D. Mutchler. A hybrid replica control algorithm combining static and dynamic voting. IEEE Transactions on Knowledge and Data Engineering, 1:459-469, 1989.

[8] H. Jung, H. Han, A. Fekete, and U. Röhm. Serializable snapshot isolation for replicated databases in high-update scenarios. In PVLDB, pages 783-794, 2011.

[9] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. SIGOPS Operating Systems Review, 44(2):35-40, Apr. 2010.

[10] Y. Lin, B. Kemme, M. Patiño-Martínez, and R. Jiménez-Peris. Middleware based data replication providing snapshot isolation. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD '05, pages 419-430, New York, NY, USA, 2005. ACM.

[11] D. Long, J. Carroll, and K. Stewart. Estimating the reliability of regeneration-based replica control protocols. IEEE Transactions on Computers, 38:1691-1702, 1989.

[12] M.-Y. Luo and C.-S. Yang. Constructing zero-loss web services. In INFOCOM, pages 1781-1790, 2001.

[13] M. G. Norman, T. Zurek, and P. Thanisch. Much ado about shared-nothing. SIGMOD Record, 25:16-21, September 1996.

[14] M. T. Ozsu and P. Valduriez. Principles of Distributed Database Systems. Springer, 3rd edition, 2011.