Consistency of Replicated Datasets in Grid Computing




Gianni Pucciani, Flavia Donno
CERN, European Organization for Nuclear Research, CH-1211 Geneva 23, Switzerland

Andrea Domenici
DIIEIT, University of Pisa, v. Diotisalvi 2, I-56122 Pisa, Italy

Heinz Stockinger
Swiss Institute of Bioinformatics, Quartier Sorge, CH-1015 Lausanne, Switzerland

INTRODUCTION

Replica consistency is the property exhibited by a set of data items, such as files or databases located at different nodes of a Grid, that contain the same information; when these data items are modifiable, all of them should be updated (or synchronized) so that consistency is maintained. Replica consistency is a well-studied research topic with roots in distributed systems as well as in distributed database management systems, where it is sometimes referred to as external consistency (Cellary et al., 1988).

Replica consistency is closely related to data replication, a technique used pervasively in Grids to achieve fast data access, high availability, increased fault tolerance, and better load balancing. Data replication involves databases, files, and possibly other units of information, such as objects or records, and relies on the functions provided by plain file systems, storage systems, database management systems, and middleware services. Currently, existing Grids offer scarce support, if any, for data consistency. Often, data is considered to be read-only, i.e. it is consistent by definition since no updates are allowed on existing data items.

The rest of this entry introduces the problem in the Background section, where the key concepts are defined. Furthermore, the data management capabilities provided by middleware services currently used in some of the largest Grids are reviewed, pointing out their approach to replica management and their support for replica synchronization. The core analysis of the problem is presented in the Main Focus section, which discusses the main issues in the development of a Replica Consistency Service for Data Grids.

BACKGROUND

Data replication

Data replication is a technique most commonly used in distributed database management systems, where it is tightly coupled with the transaction system. For example, a relational database management system can have identical copies at three geographically distinct sites, each holding a full copy of the data. End users can issue their SQL queries against any of the replicas: distributed transactions are then used to make sure that data does not get corrupted by multiple writers. Simply put, data consistency makes sure that different data copies are synchronized, i.e., have the same values.

In Grid computing, data replication is done at a different level of granularity than in traditional distributed relational database management systems. In particular, Grids often replicate entire files rather than database objects. Furthermore, data synchronization, and therefore consistency, has to be managed by external services which often do not provide a unique interface for reading and writing data based on traditional database transactions. In the rest of this article we concentrate on the specific issues of Data Grids. However, before we go into the details of replica consistency in Grids, let us first review typical data replication components and services that are commonly used in Grid computing.

A Data Grid typically offers a Replica Management Service (RMS), a middleware component that creates replicas of files (rather than relational databases) on request by applications, or possibly in a transparent way in order to optimize data access. This service uses a Replica Catalogue (RC) to keep track of the replicas. The RMS may also rely on a Replica Optimization Service (ROS) to dynamically select the best replicas to be accessed by a given application (the best replica being chosen on the basis of access speed and supported protocols). Such file replication tools must then implement policies concerning the following major issues:

1. When and where to create or remove replicas? A replication service should perform dynamic replication (Ranganathan, 2001), that is, the automatic creation and removal of replicas based on different system parameters and/or user needs.

2. Data location and cataloguing. Replicas can be created and removed in the course of time: they are created somewhere when needed and they must be deleted when they are no longer used. How does a user, an application, or the RMS itself know where a replica is at a certain point in time? To this end, replica catalogues (Chervenak et al., 2002) are normally used.

3. Replica synchronization. When a replica must be updated, how are the other replicas synchronized with the new content? How is replica consistency enforced throughout the system? This is the topic of this entry.

Users of an RMS need not be aware of the existence of replicas. Normally, they refer to a file by a logical name that identifies the information carried by the file, independently of the physical location of its replicas. Applications relying on the RMS pass the logical name to the service, which retrieves from the RC the physical names specifying the actual locations of the replicas. Replicas are created using RMS client tools, such as the Globus Data Replication Service (Chervenak et al., 2005) or the LCG Data Management tools (Peris et al., 2004), which most of the time rely on lower-level services like GridFTP, and are kept on Storage Elements (SEs) of different types. A storage element is a complex system that may support a hierarchy of storage systems, such as fast disk caches, long-term and high-capacity disks, and tapes.
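As an illustration of the lookup path just described, the following Python sketch models a minimal replica catalogue that maps logical file names to physical replica locations, together with a naive "best replica" choice. All names (ReplicaCatalogue, select_best_replica, the protocol and speed attributes) are illustrative assumptions, not the API of any real RMS such as the Globus RLS or the LCG tools.

```python
# Minimal, illustrative sketch of the RMS/RC lookup path described above.
# Names and the selection criterion are assumptions, not a real Grid API.

from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ReplicaEntry:
    physical_name: str        # e.g. "gsiftp://se1.example.org/data/run42.dat"
    protocol: str             # access protocol offered by the Storage Element
    access_speed_mbps: float  # crude stand-in for the ROS cost metric

class ReplicaCatalogue:
    """Maps a logical file name to the physical names of its replicas."""
    def __init__(self) -> None:
        self._entries: Dict[str, List[ReplicaEntry]] = {}

    def register(self, logical_name: str, replica: ReplicaEntry) -> None:
        self._entries.setdefault(logical_name, []).append(replica)

    def lookup(self, logical_name: str) -> List[ReplicaEntry]:
        return self._entries.get(logical_name, [])

def select_best_replica(replicas: List[ReplicaEntry],
                        supported_protocols: List[str]) -> Optional[ReplicaEntry]:
    """Naive ROS-like choice: among replicas reachable with a supported
    protocol, pick the one with the highest nominal access speed."""
    usable = [r for r in replicas if r.protocol in supported_protocols]
    return max(usable, key=lambda r: r.access_speed_mbps, default=None)

# Example: an application resolves a logical name and picks one replica.
rc = ReplicaCatalogue()
rc.register("lfn:/grid/exp1/run42.dat",
            ReplicaEntry("gsiftp://se1.example.org/run42.dat", "gsiftp", 800.0))
rc.register("lfn:/grid/exp1/run42.dat",
            ReplicaEntry("srm://se2.example.org/run42.dat", "srm", 300.0))
best = select_best_replica(rc.lookup("lfn:/grid/exp1/run42.dat"), ["gsiftp"])
```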

Different types of storage elements exist, providing support for different access protocols. Efforts to promote a standard interface for heterogeneous SEs are under way (Shoshani, 2003). A storage element may replicate data internally to optimize file access, but this kind of replication is independent of the RMS and will not be discussed further.

Replica management services usually replicate files and possibly file collections, but they offer no support for the replication of data stored in relational or object-oriented databases. Database replication relies on the proprietary mechanisms provided by the database management systems. For example, in the WLCG project, the Worldwide LHC Computing Grid (WLCG, 2007), Oracle Streams is used for unidirectional replication of Oracle databases. Replication of databases is especially important for the availability and reliability of Grid middleware services, since most of them use databases to keep track of frequently updated, service-specific metadata. Database replication has different and usually more complex requirements than file replication. Databases can be large, they must be accessed through their management systems, and they cannot simply be copied as a whole but need to be installed with a rather complex procedure. Furthermore, different sites may want to keep copies of the same data in database systems provided by different vendors: this is called heterogeneous replication. In the remainder of this article we will discuss the requirements and features of both file and database replication and their synchronization.

Key concepts

Depending on the application (counting also middleware services as applications), data to be replicated may be stored in a file or a file collection, a database or a database table, or an object stored in a file or in a database. We will use the term dataset to cover these possibilities. Datasets may be structured or unstructured. A dataset is structured if a user/application accesses it by means of record-oriented operations. A relational database is a typical example of a structured dataset that is accessed through SQL commands. Unstructured datasets are those whose internal structure is either unknown or ignored for the purposes of replication; they are accessed by users/applications by means of file management operations or local or remote file I/O protocols. Unstructured datasets will also be referred to as flat files.

We distinguish between the logical contents of a dataset, i.e., the information it carries, and its physical instances, called replicas. A logical dataset (or dataset for short) may then be defined as an abstract entity composed of its contents and its logical name. With each logical dataset is associated a set of replicas, each identified by a physical name. Replicas are stored at a particular location (on a file system, database or mass storage system) and accessed by users/applications with some sort of access protocol. A replica contains a physical representation of the dataset contents and, as observed before, different representations are possible for a given dataset. A semantic function maps a representation to the contents: for flat files the semantic function is the identity function, while for structured datasets it is the mechanism that extracts information from the replica (in a database, it is simply the query processing interpreter).
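The terminology above can be made concrete with a small model. The sketch below uses illustrative names only (LogicalDataset, Replica, the identity default for the semantic function); it is not a standard Grid data structure, just a way to show how a logical dataset, its replicas, and the semantic function relate.

```python
# Illustrative model of the dataset/replica terminology above (names are
# assumptions): a logical dataset has a logical name, a set of replicas
# identified by physical names, and a semantic function that maps a replica's
# physical representation to the logical contents.

from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class Replica:
    physical_name: str   # e.g. a storage URL or a database connection string
    representation: Any  # raw bytes for a flat file, rows for a database table

@dataclass
class LogicalDataset:
    logical_name: str
    replicas: List[Replica] = field(default_factory=list)
    # Semantic function: identity for flat files; for a structured dataset it
    # would instead be a query/extraction step.
    semantics: Callable[[Any], Any] = lambda representation: representation

def contents(dataset: LogicalDataset, replica: Replica) -> Any:
    """Logical contents carried by one replica of the dataset."""
    return dataset.semantics(replica.representation)

# Two consistent replicas of the same flat file carry the same logical contents.
ds = LogicalDataset("lfn:/grid/exp1/config.txt")
ds.replicas.append(Replica("gsiftp://se1.example.org/config.txt", b"threshold=5"))
ds.replicas.append(Replica("srm://se2.example.org/config.txt", b"threshold=5"))
assert contents(ds, ds.replicas[0]) == contents(ds, ds.replicas[1])
```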

Replica Synchronization Protocols

A replica synchronization protocol is performed by any system (such as a distributed database manager or a Grid middleware service for file replication) whose purpose is to keep a set of replicas consistent. Several such protocols have been proposed, each aimed at satisfying the different sets of requirements that arise in different environments. For example, in Grid environments it is often not possible to keep all replicas up to date, and at any given time one or more replicas might be outdated. Depending on the application, more relaxed consistency requirements and states can be defined (Düllmann et al., 2002; Gray et al., 1997; Breitbart & Korth, 1997). For instance, certain applications can deal with datasets that are outdated for several minutes, sometimes even hours. If this is known a priori, adequate consistency models can be chosen. In particular, we can distinguish between two main approaches:

Synchronous, or eager synchronization. In this approach, all replicas of a given logical dataset are updated within the same transaction, with a protocol that is usually a variation of the basic two-phase commit protocol (Özsu et al., 1999). As a consequence, no single replica can be accessed during the update process, but after the transaction all replicas have the same physical state and they are consistent. Although high data consistency is a desirable feature, this approach has important limitations for distributed systems, and in particular for Data Grids: replicas need to be locked, which can result in long replica down times due to update contention. When no timeout or quorum system is used, disconnected sites can block an update operation indefinitely.

Asynchronous, or lazy synchronization. The second approach tries to overcome the problem of distributed locks by updating only a subset of replicas during an update transaction and propagating the update to the other replicas at a later time. Some of the replicas will therefore be outdated for a certain period, which is the price for speeding up write access and increasing data availability.

In order to further characterize replica consistency mechanisms, we introduce a few more definitions. When using lazy synchronization, a simple and reliable solution is to designate one replica as the master or primary replica. In single-master systems, the unique master replica is the only one that can be modified by users, while the other replicas (slave or secondary replicas) are updated by the replica synchronization protocol. Secondary replicas are useful to speed up read operations. In case of failures at the master site that compromise the use of the master replica, an election algorithm (Garcia-Molina, 1982) can be used among secondary replicas to elect a new master. Multi-master solutions, by contrast, can expose the system to update conflicts. Conflict resolution is a highly application-specific problem. With low write access rates, or when the semantics of the application allows a conflict to be resolved without affecting the normal behavior and performance of the system, multi-master solutions can be implemented, increasing data availability and speeding up both read and write operations.
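To make the single-master, lazy approach more tangible, here is a compressed Python sketch; all names (MasterReplica, SecondaryReplica, propagate) are illustrative assumptions and not part of any Grid middleware. Writes go only to the master, which bumps a version number; propagation to secondaries happens later, so a secondary can legitimately be stale in between, and disconnected sites simply miss the push and must be retried.

```python
# Hedged sketch of lazy, single-master synchronization: the master accepts
# writes and versions them; secondaries are refreshed asynchronously (here an
# explicit propagate() call stands in for a background, push-based transfer).

from dataclasses import dataclass, field
from typing import List

@dataclass
class SecondaryReplica:
    site: str
    content: bytes = b""
    version: int = 0
    available: bool = True   # a disconnected site misses the push

@dataclass
class MasterReplica:
    content: bytes = b""
    version: int = 0
    secondaries: List[SecondaryReplica] = field(default_factory=list)

    def write(self, new_content: bytes) -> None:
        """Only the master is writable in a single-master scheme."""
        self.content = new_content
        self.version += 1

    def propagate(self) -> List[str]:
        """Push the latest version to reachable secondaries (content transfer).
        Returns the sites that remain stale and must be retried later."""
        stale: List[str] = []
        for sec in self.secondaries:
            if sec.available:
                sec.content, sec.version = self.content, self.version
            else:
                stale.append(sec.site)
        return stale

# Usage: a write followed by deferred propagation; se2 stays stale until retried.
master = MasterReplica(secondaries=[SecondaryReplica("se1"),
                                    SecondaryReplica("se2", available=False)])
master.write(b"calibration v2")
pending = master.propagate()   # -> ["se2"]
```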

Depending on how the update of a replica is performed, we can further classify synchronization protocols, distinguishing between push-based versus pull-based and log-transfer versus content-transfer systems (Saito et al., 2005).

Existing support for replica consistency

Today, most commercial database management systems provide replication features with mechanisms enforcing consistency; Oracle Streams, IBM DB2 replication and Microsoft SQL Server replication are some of the best known. However, in this case the replication is homogeneous, in that it involves databases of the same vendor, with a few exceptions (Oracle Streams can use the Oracle Heterogeneous Connectivity technology to replicate data from an Oracle system to a non-Oracle system such as Informix, MS SQL Server or Sybase, and IBM DB2 can share and replicate data with an Informix database). As regards Grid environments, no consistency service has yet been developed in important middleware solutions such as the Globus Toolkit and the LHC Computing Grid. Both these solutions do provide file replication features, but the automatic management of replica consistency is not supported. A prototype Grid service for maintaining consistency of replicated files and databases can be found in (Domenici et al., 2006). The SDSC Storage Resource Broker (SRB, 2007), instead, provides a rather complete set of replication and consistency management features, including synchronous and asynchronous approaches. Other studies in replica consistency management can be found in (Yu et al., 2002) and (Susarla et al., 2005), but their application in a real Grid environment has not yet been considered.

MAIN FOCUS

The need for replica consistency mechanisms in Grid environments was pointed out early on in (Stockinger, 2001), (Düllmann et al., 2002), and (Casey et al., 2003), but few solutions have been proposed so far. This is partly due to the fact that many applications driving the development of Grid middleware expect to use modifiable datasets in the future but currently use mostly read-only data (this is the case, for example, in the WLCG middleware, where High Energy Physics experiments mainly use the Grid to perform analysis on read-only files). As a consequence, requirements for replica consistency are still unclear. More precise requirements will be defined when users begin to try new models of computation and data access.

Issues in designing a Replica Consistency Service

The design of a Replica Consistency Service (RCS) as part of a Grid middleware is faced with many difficult issues that derive from specific properties of a Grid environment.

In general, since replica consistency is a highly application-specific problem, designing one consistency management mechanism for different applications requires finding trade-offs among many different design choices. In the next paragraphs we review some of the most important issues that need to be dealt with, providing hints for the design of a Replica Consistency Service.

Scalability

First of all, any Grid infrastructure involves the management of many sites; hence, in the case of flat files, it is likely that several thousands of replicas have to be dealt with, some of which may not be continuously available (for replicated databases, by contrast, the number of replicas typically ranges from a few units to a few tens). Update propagation algorithms must therefore be designed to provide good performance even with large numbers of replicas. Keeping the design simple can be the key to success; whenever possible, single-master solutions are the recommended way to provide fast read access and high data availability.

Security

Security issues must be considered in the development of an RCS. Communication with the service should be secure; this means that the service should deal with authentication, authorization, privacy, and integrity issues. The Grid Security Infrastructure (GSI) provided by the Globus Toolkit is widely adopted as an integrated solution to security problems, and it is based on a public key infrastructure. The GSI can be easily integrated in a Grid service.

Replica Location

Replica location services and replica catalogues are used in Grid middleware to store the association between a logical dataset and all its replicas. Among the most used implementations we cite the Globus Replica Location Service (RLS) (Chervenak et al., 2002) and the LCG File Catalogue (LFC). The RCS has two options: interfacing with such a catalogue or implementing its own replica catalogue. Both options have advantages and disadvantages. Using an external replica catalogue would avoid duplicating information and complicating the system. On the other hand, the integration with an external service should be carefully planned and would require such catalogues to be modified. For example, not all the logical datasets registered in a replica catalogue need consistency management (read-only datasets, for instance, do not). For datasets that do require consistency management, some new attributes (e.g. master/slave role, fresh/stale state, version number) should be added to each replica's metadata.

Efficient file transfer

An efficient file transfer tool should be used for update propagation. File transfer services for Grid computing are normally built on top of the GridFTP protocol. The RCS should use either GridFTP or higher-level services to efficiently propagate updates to possibly thousands of replicas. Most Storage Elements support the GridFTP protocol, making it a good choice to solve the file transfer issue in the RCS. Note that GridFTP is optimized for transferring rather large amounts of data with relatively large file sizes. This is partially due to the fact that TCP works more efficiently with larger transfers, because of TCP window-size tuning and the slow start-up with small window sizes. Performance tests have shown that transfers of smaller data items (up to about 5 MB) can be performed more efficiently using alternative approaches such as SOAP with attachments (Sciolla, 2007).
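The size-based choice between GridFTP and a SOAP-style transfer can be expressed as a simple dispatch rule. The sketch below is only illustrative: the 5 MB threshold is the figure quoted above, while the function names (transfer_gridftp, transfer_soap_attachment, choose_transfer) and their print-based bodies are placeholders rather than real client APIs.

```python
# Illustrative dispatch of update propagation by payload size, following the
# observation above: bulk data over GridFTP, small items (<~5 MB) over a
# SOAP-with-attachment style channel. Function names are placeholders.

from typing import Callable

SMALL_ITEM_THRESHOLD = 5 * 1024 * 1024  # ~5 MB, as reported by (Sciolla, 2007)

def transfer_gridftp(source_url: str, dest_url: str) -> None:
    print(f"[gridftp] {source_url} -> {dest_url}")   # stand-in for a real transfer

def transfer_soap_attachment(source_url: str, dest_url: str) -> None:
    print(f"[soap]    {source_url} -> {dest_url}")   # stand-in for a real transfer

def choose_transfer(size_bytes: int) -> Callable[[str, str], None]:
    """Pick the propagation channel for one replica update."""
    return (transfer_soap_attachment if size_bytes < SMALL_ITEM_THRESHOLD
            else transfer_gridftp)

# Usage: propagate a 2 MB delta and a 2 GB file to a secondary replica.
choose_transfer(2 * 1024 * 1024)("gsiftp://master/delta.bin",
                                 "gsiftp://se1.example.org/delta.bin")
choose_transfer(2 * 1024**3)("gsiftp://master/run42.dat",
                             "gsiftp://se1.example.org/run42.dat")
```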

SE heterogeneity

A Grid connects many different resources. Storage Elements, where datasets are stored, can have different implementations and different access protocols. Although a standard interface could become available in the next few years (Shoshani, 2003), an RCS should interface with different SEs. Lock management functionality should be provided by the SE since, in certain scenarios, access to a replica may need to be blocked to avoid concurrent accesses.

Disconnected nodes

The RCS should be able to complete the synchronization of replicated datasets even when some of them are not available. Quorum mechanisms could be used to ensure that an update propagation process can execute when at least a given number of replicas are available, and it should be possible to select this number depending on the application requirements. Synchronization of unavailable replicas should be retried as soon as they become available.

Metadata Consistency

The RCS should provide synchronization capabilities both for applications and for middleware services. Many middleware services in fact use replication for fault tolerance and reliability. One example can be found in the Globus RLS, where catalogues are replicated but consistency management is not supported. This leads us to consider, as already stated in this article, the consistency of both files and databases, which is the subject of the next paragraph.

Database Consistency

A Replica Consistency Service to be used in a Grid middleware should be able to manage the consistency of both applications' data and middleware services' data. Many Grid services in fact use persistent data stores, usually relational databases, to save critical information. In order to provide fault tolerance and increase the performance of these services, such data are often replicated over several sites, and hence a consistency mechanism is needed to enforce consistency among the replicas. Practical examples of replicated services that use relational databases are the Globus Replica Location Service and the LCG File Catalogue. Such services can usually be implemented using backend databases from different vendors. Oracle databases are a common choice for larger sites; in other cases, open source databases (often MySQL and PostgreSQL) are good alternatives. Thus, cross-vendor replication also needs to be supported by a Replica Consistency Service. Cross-vendor, or heterogeneous, database synchronization requires that the RCS be built using pluggable modules to interface with many different software packages. Differences in the SQL dialects used by different database vendors must be handled, both by limiting the use of non-standard SQL and by providing some translation capabilities (a minimal cross-vendor propagation step is sketched below). Unidirectional Oracle-to-MySQL synchronization has been tested in the CONStanza project (Domenici et al., 2006). Another open source software package that provides heterogeneous replication, through a Java-based data extraction, transformation and loading tool, is Enhydra Octopus (Octopus, 2007). A third example of a Grid database replication system is presented in (Chen, 2007). The problem of concurrency control in distributed heterogeneous databases in a Grid environment is studied in (Taniar, 2007).
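The sketch below illustrates, under stated assumptions, the kind of pluggable, standard-SQL propagation step discussed above: changed rows are read from a master database and applied to a slave of a different vendor through Python DB-API 2.0 connections. The change_log and replica_table schemas, the version watermark, and the explicit parameter markers are all illustrative choices for this entry, not the CONStanza design or any vendor's replication API.

```python
# Hedged sketch of one lazy, cross-vendor propagation step over two DB-API 2.0
# connections (for instance an Oracle master and a MySQL slave). Schemas and
# parameter markers are assumptions; only vendor-neutral SQL is used, as
# recommended above.

from typing import Any

def propagate_changes(master_conn: Any, slave_conn: Any,
                      since_version: int,
                      master_marker: str = ":1",   # positional marker assumed for the master driver
                      slave_marker: str = "%s") -> int:  # marker assumed for the slave driver
    """Copy rows newer than `since_version` from a hypothetical change_log table
    on the master into a key/value table on the slave; return the new watermark."""
    mcur, scur = master_conn.cursor(), slave_conn.cursor()
    mcur.execute("SELECT version, item_key, item_value FROM change_log "
                 f"WHERE version > {master_marker} ORDER BY version",
                 (since_version,))
    last = since_version
    for version, key, value in mcur.fetchall():
        # Delete-then-insert keeps the statements portable (no vendor UPSERT dialect).
        scur.execute(f"DELETE FROM replica_table WHERE item_key = {slave_marker}",
                     (key,))
        scur.execute("INSERT INTO replica_table (item_key, item_value) "
                     f"VALUES ({slave_marker}, {slave_marker})", (key, value))
        last = version
    slave_conn.commit()
    return last
```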

Although they present different characteristics, file synchronization and database synchronization have common features that should be exploited to provide a general and flexible Replica Consistency Service.

FUTURE TRENDS

In general, Grid software developers deal more with the efficient replication and replica selection of read-only datasets than with update synchronization and consistency. One of the reasons is that there are not many use cases of the latter kind in classical Grid applications. On the other hand, database research has shown that update replication comes at some cost in terms of data availability, so that only certain applications can fully profit from replicated data with update features. Just as replica consistency has become an essential property in distributed databases and certain file systems, the same will occur in Grid infrastructures. Further, considering that Grid computing is a rapidly emerging domain, it is likely that new applications, outside the scientific field, will arise in the next few years, providing more requirements for the implementation of a Replica Consistency Service.

CONCLUSION

We have presented the problem of replica consistency in Grid environments and discussed possible solutions to be considered when implementing such a system. Nowadays many Grid applications deal with read-only replicas; for this reason Grid middleware frameworks do not provide any support for replica synchronization. Another reason is that replica synchronization is a highly application-specific domain, and providing a universal solution suitable for multiple dataset types and access patterns is very difficult. In this entry we analyzed the main issues in developing a Replica Consistency Service (RCS) in a Grid environment, suggesting practical approaches. Some of these approaches have been implemented and tested in a prototype service described in (Domenici et al., 2006), which allows for the synchronization of both files and heterogeneous database replicas. We expect that future Grid applications will have more stringent requirements for replica consistency; this will help to better characterize the design of the RCS and will also speed up the implementation of reliable solutions.

REFERENCES

Baud, J.P., Casey, J., Lemaitre, S., Nicholson, C., Smith, D., & Stewart, G. (2005). LCG Data Management: from EDG to EGEE. GLAS-PPE/2005-06.

Breitbart, Y., & Korth, H. F. (1997). Replication and consistency: Being lazy helps sometimes. In Proc. of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems.

Casey, J. et al. (2003). Next Generation EU DataGrid Data Management Services. In Proc. of the Conference for Computing in High Energy and Nuclear Physics (CHEP 2003), La Jolla, California.

Cellary, W., Gelenbe, E., & Morzy, T. (1988). Concurrency Control in Distributed Database Systems. Amsterdam: North-Holland.

Chen, Y., Berry, D., & Dantressangle, P. (2007). Transaction-Based Grid Database Replication. In Proc. of the UK e-Science All Hands Meeting 2007.

Chervenak, A., Deelman, E., Foster, I., Guy, L., Hoschek, W., Iamnitchi, A., Kesselman, C., Kunszt, P., Ripeanu, M., Schwarz, B., Stockinger, H., Stockinger, K., & Tierney, B. (2002). Giggle: A Framework for Constructing Scalable Replica Location Services. In Proc. of the Int'l ACM/IEEE Supercomputing Conference (SC 2002), IEEE Computer Society Press.

Chervenak, A., Schuler, R., Kesselman, C., Koranda, S., & Moe, B. (2005). Wide Area Data Replication for Scientific Collaboration. In Proc. of the 6th IEEE/ACM Int'l Workshop on Grid Computing (Grid2005).

Domenici, A., Donno, F., Pucciani, G., & Stockinger, H. (2006). Relaxed Data Consistency with CONStanza. In Proc. of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid06), Singapore, 16-19 May 2006, IEEE Computer Society.

Domenici, A., Donno, F., Pucciani, G., Stockinger, H., & Stockinger, K. (2003). Replica consistency in a Data Grid. In Proc. of the IX International Workshop on Advanced Computing and Analysis Techniques in Physics Research, Tsukuba, Japan.

Düllmann, D., Hoschek, W., Jean-Martinez, J., Samar, A., Stockinger, H., & Stockinger, K. (2002). Models for Replica Synchronisation and Consistency in a Data Grid. In Proc. of the 10th IEEE Symposium on High Performance and Distributed Computing (HPDC-10), IEEE Computer Society Press.

Garcia-Molina, H. (1982). Elections in a Distributed Computing System. IEEE Transactions on Computers, vol. 32.

Gray, J., Helland, P., O'Neil, P., & Shasha, D. (1997). The dangers of replication and a solution. In Proc. of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 173-182.

LCG 3D (2007). Distributed Deployment of Databases for LCG. From https://twiki.cern.ch/twiki/bin/view/pssgroup/lcg3dwiki

Octopus (2007). Enhydra Octopus, JDBC Data Transformation. From http://www.enhydra.org/tech/octopus/

Özsu, M. T., & Valduriez, P. (1999). Principles of Distributed Database Systems. Prentice Hall.

Peris, A. D., Lorenzo, P. M., Donno, F., Sciabà, A., Campana, S., & Santinelli, R. (2004). LCG-2 User Guide, v2.1.

Ranganathan, K., & Foster, I. (2001). Identifying Dynamic Replication Strategies for a High Performance Data Grid. In Proc. of the International Grid Computing Workshop, Denver, CO.

RLS (2007). Data Management: Key Concepts of RLS. From http://www.globus.org/toolkit/docs/4.0/data/key/rls.html

Saito, Y., & Shapiro, M. (2005). Optimistic Replication. ACM Computing Surveys.

Sciolla, C. (2007). Implementazione e valutazione di un sistema di trasferimento file basato su SOAP in ambiente GRID [Implementation and evaluation of a SOAP-based file transfer system in a Grid environment]. In Italian. Master's Thesis, University of Pisa.

Shoshani, A. (2003). Storage Resource Managers: Essential Components for the Grid. In Grid Resource Management: State of the Art and Future Trends, edited by Jarek Nabrzyski, Jennifer M. Schopf, & Jan Weglarz. Kluwer Academic Publishers.

SRB (2007). The SDSC Storage Resource Broker. From http://www.sdsc.edu/srb/index.php/main_page

Stockinger, H. (2001). Database Replication in World-wide Distributed Data Grids. Ph.D. Thesis, Institute of Computer Science and Business Informatics, University of Vienna, Austria.

Susarla, S., & Carter, J. (2005). Flexible Consistency for Wide-area Peer Replication. In Proc. of the 25th International Conference on Distributed Computing Systems.

Taniar, D., & Goel, S. (2007). Concurrency control issues in grid databases. Future Generation Computer Systems, 23(1).

WLCG (2007). Worldwide LHC Computing Grid. From http://lcg.web.cern.ch/lcg/

Yu, H., & Vahdat, A. (2002). Design and Evaluation of a Conit-based Continuous Consistency Model for Replicated Services. ACM Transactions on Computer Systems (TOCS).

Terms and Definitions

Data Replication: Creating and managing multiple copies of datasets. These copies are typically synchronized.

Replica Catalogue: A catalogue used to locate replicas (physical locations), which are mapped to logical file names.

Logical File Name: A name used to identify a set of replicated files.

Physical File Name: The name of a replicated file, which defines its location.

Replica Management System: A Grid service that takes care of replicating datasets and keeping track of their locations in a Replica Catalogue.

Replica Consistency: The property exhibited by a set of replicas that contain the same information.

Replica Synchronization: The task of updating replicas in order to enforce their consistency.

Strict Synchronization: Updating all the replicas of the same dataset in a single transaction, so that replicas are never outdated.

Lazy Synchronization: Allowing for certain delays in the update process, i.e. replicas can be outdated for a certain time.

Heterogeneous Database Synchronization: Used to enforce consistency among replicated databases of different vendors.