An Architecture for Replica Management in Grid Computing Environments


Abstract

We present the architecture of a replica management service that manages the copying and placement of files in a high-performance, distributed computing environment so as to optimize the performance of data-intensive applications. This architecture consists of two parts: a replica catalog or repository, where information can be registered about logical files, collections of files, and the physical locations where subsets of collections are stored; and a set of registration and query operations that are supported by the replica management service. The replica management service can be used by higher-level services such as replica selection and automatic creation of new replicas to satisfy application performance requirements. We describe important design decisions and implementation issues for the replica management service. Design decisions include a strict separation between file metadata and replication information, no enforcement of replica semantics or file consistency, and support for rollback after failures of complex operations. Implementation issues include options for the underlying technology of the replica catalog and the tradeoff between reliability and complexity.

1 Introduction

Data-intensive, high-performance computing applications require the efficient management and transfer of terabytes or petabytes of information in wide-area, distributed computing environments. Examples of such applications include experimental analyses and simulations in scientific disciplines such as high-energy physics, climate modeling, earthquake engineering, and astronomy. In such applications, massive datasets must be shared by a community of hundreds or thousands of researchers distributed worldwide. These researchers need to transfer large subsets of these datasets to local sites or other remote resources for processing. They may create local copies or replicas to overcome long wide-area data transfer latencies. The data management environment must provide security services such as authentication of users and control over who is allowed to access the data. In addition, once multiple copies of files are distributed at multiple locations, researchers need to be able to locate copies and determine whether to access an existing copy or create a new one to meet the performance needs of their applications.

We argue that the requirements of such distributed, data-intensive applications are best met by the creation of a Data Grid infrastructure that provides a set of orthogonal, application-independent services that can then be combined and specialized in different ways to meet the needs of specific applications. These services include a metadata management service that records information about the contents of files and the experimental conditions under which they were created; a replica management service that registers multiple copies of files at different physical locations and allows users to discover where files are located; a replica selection service that chooses the best replica for a data transfer based on predicted performance; and a secure, reliable, efficient data transfer protocol.

In this paper, we present the architecture of a replica management service charged with managing the copying and placement of files in a distributed computing system so as to optimize the performance of the data analysis process. Our goal in designing this service is not to provide a complete solution to this problem, but rather to provide a set of basic mechanisms that make it easy for users or higher-level tools to manage the replication process. Our proposed replica management service provides the following basic functions, sketched in code at the end of this section:

- Registration of files with the replica management service
- Creation and deletion of replicas for previously registered files
- Enquiries concerning the location of replicas

In turn, the basic functions provided by the replica management service can be used by higher-level services: for example, by replica selection services that select among available replicas based on the predicted performance of data transfers, and by replica creation services that automatically generate and register new replicas in response to data access patterns and the current state of the computational grid.

In this paper, we present the basic components of our architecture for a replica management service. To register replicas, users create entries in a replica catalog or repository. There are three types of entries: logical files, logical collections, and locations. We describe these entries and the registration and query operations that are supported by the replica management service. We also present important design decisions for the replica management architecture. These include:

- Separation of replication and file metadata information: Our architecture assumes a strict separation between metadata information, which describes the contents of files, and replication information, which is used to map logical file and collection names to physical locations. Metadata and replica management are orthogonal services.

- Replication semantics: Our architecture enforces no replica semantics. Files registered with the replica management service are asserted by the user to be replicas of one another, but the service does not make guarantees about file consistency.

- Rollback: If a failure occurs during a complex, multi-part operation, we roll back the state of the replica management service to the consistent state that existed before the operation began.

- No distributed locking: Our architecture does not assume the existence of a distributed locking mechanism. Because of this, it is possible for users to corrupt the replica management service by changing or deleting files on registered storage systems without informing the replica management service.

The paper concludes with a discussion of implementation issues for replica management.
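To make these basic functions concrete, the following minimal sketch shows how they might appear as a programmatic interface. It is illustrative only: the class and method names (ReplicaManagementService, register_file, and so on) are hypothetical and are not part of the architecture described in this paper.

class ReplicaManagementService:
    """Hypothetical client interface for the three basic functions."""

    def register_file(self, collection: str, logical_file: str, attributes: dict) -> None:
        """Register a logical file, with descriptive attributes such as size,
        as a member of a logical collection."""
        ...

    def create_replica(self, logical_files: list, source: str, destination: str) -> None:
        """Copy previously registered files to a destination storage system
        and register the new location with the service."""
        ...

    def delete_replica(self, logical_files: list, location: str) -> None:
        """Remove files from a registered location entry."""
        ...

    def list_locations(self, logical_file: str) -> list:
        """Enquiry: return all registered physical locations of a logical file."""
        ...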

2 A Motivating Example: High-Energy Physics Applications

We use high-energy physics experiments to motivate the design of our replica management architecture. We characterize the application with respect to parameters such as average file sizes, total data volume, rate of data creation, type of file access (write-once or write-many), expected access rates, type of storage system (file system or database), and consistency requirements for multiple copies of data. In this application, as well as in others that we have examined, such as climate modeling, earthquake engineering, and astronomy, we see a common requirement for two basic data management services: efficient access to, and transfer of, large files; and a mechanism for creating and managing multiple copies of files.

Experimental physics applications operate on and generate large amounts of data. For example, beginning in 2005, the Large Hadron Collider (LHC) at the European physics center CERN will produce several petabytes of raw and derived data per year for approximately 15 years. The data generated by physics experiments are of two types: experimental data, or information collected by the experiment; and metadata, or information about the experiment, such as the number of events and the results of analysis. File sizes and numbers of files are determined to some extent by the type of software used to store experimental data and metadata. For example, several experiments have chosen to use the object-oriented Objectivity database. Current experimental data files (e.g., within the BaBar experiment) range from 2 to 10 gigabytes, while metadata files are approximately 2 gigabytes. Objectivity currently limits database federations to 64K files. However, future versions of Objectivity will support more files, allowing average file sizes to be reduced.

Access patterns vary for experimental data files and metadata. Experimental data files typically have a single creator. During an initial production period lasting several weeks, these files are modified as new objects are added. After data production is complete, files are not modified. In contrast, metadata files may be created by multiple individuals and may be modified or augmented over time, even after the initial period of data production. For example, some experiments continue to modify metadata files to reflect the increasing number of total events in the database. The volume of metadata is typically smaller than that of experimental data.

The consumers of experimental physics data and metadata will number in the hundreds or thousands. These users are distributed at many sites worldwide. Hence, it is often desirable to make copies or replicas of the data being analyzed to minimize access time and network load. For example, Figure 1 shows the expected replication scheme for LHC physics datasets. Files are replicated in a hierarchical manner, with all files stored at a central location (CERN) and decreasing subsets of the data stored at national and regional data centers.

Figure 1: Scheme for hierarchical replication of physics data (Tier 0: CERN; Tier 1: national centers such as France, Italy, and England; Tier 2: regional centers such as Bologna, Pisa, and Padova).

Replication of physics datasets is complicated by several factors. First, security services are required to authenticate the user and control access to storage systems. Next, because datasets are so large, it may be desirable to replicate only interesting subsets of the data. Finally, replication of data subject to modification implies a need for a mechanism for propagating updates to all replicas. For example, consider the initial period of data production, during which files are modified for several weeks. During this period, users want their local replicas to be updated periodically to reflect the experimental data being produced. Typically, updates are batched and performed every few days. Since metadata updates take place over an indefinite period, these changes must also be propagated periodically to all replicas. In Table 1, we summarize the characteristics of high-energy physics applications.

Table 1: Characteristics of high-energy physics applications

Rate of data generation (starting 2005):        Several petabytes per year
Typical experimental database file sizes:       2 to 10 gigabytes
Typical metadata database file sizes:           2 gigabytes
Maximum number of database files in federation: Currently 64K; eventually millions
Period of updates to experimental data:         Several weeks
Period of updates to metadata:                  Indefinite
Type of storage system:                         Object-oriented database
Number of data consumers:                       Hundreds to thousands

3 Data Model

We assume the following data model. Data are organized into files. For convenience, users group files into collections. A replica or location is a subset of a collection that is stored on a particular physical storage system. There may be multiple, possibly overlapping subsets of a collection stored on multiple storage systems in a data grid. These grid storage systems may use a variety of underlying storage technologies and data movement protocols, which are independent of replica management.

We distinguish between logical file names and physical file names. A logical file name is a globally unique identifier for a file within the data grid's namespace. The logical file name may or may not have meaning for a human, for example, by recording information about the contents of a file. However, the replica management service does not use any semantic information contained in logical file names. The purpose of the replica management service is to map a unique logical file name to a possibly different physical name for the file on a particular storage device, as illustrated in the sketch below.
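As an illustration of this mapping, the following sketch resolves one logical file name to several physical instances. The representation, host names, and URLs are hypothetical, intended only to make the distinction between logical and physical names concrete; the architecture does not prescribe any particular representation or transfer protocol.

# A logical file name is globally unique within the data grid's namespace;
# each registered location may map it to a different physical name.
replica_mappings = {
    "precipitation98/jan98": [
        "gsiftp://storage.site1.example.org:2811/archive/climate/jan98.dat",
        "gsiftp://storage.site2.example.org:2811/data/jan98.dat",
    ],
}

def resolve(logical_name: str) -> list:
    """Return all known physical instances of a logical file (empty if none registered)."""
    return replica_mappings.get(logical_name, [])

print(resolve("precipitation98/jan98"))  # two physical instances of one logical file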

4 Replica Management in Grid Computing Environments

The replica management service is just one component in a computational grid environment that provides support for high-performance, data-intensive applications. A recently proposed architecture for computational grids [1] includes four levels:

- Fabric: At the lowest level of the grid architecture are the basic components and resources from which a computational grid is constructed. These include storage systems, networks, and catalogs.
- Connectivity: At the next level of the architecture are services concerned with communication and authentication. Typically, these are standard protocols.
- Resource: Services at the next highest level are concerned with providing secure, remote access to individual resources.
- Collective: Services at the collective level support the coordinated management of multiple resources.

Figure 2 shows a partial list of components at each level of the proposed grid architecture, with particular emphasis on components related to replica management. At the lowest, fabric level of the architecture are the basic components that make up the Grid, including storage systems, networks, and computational systems. In addition, the figure includes two catalogs: a metadata catalog that contains descriptive information about files and a replica catalog where information is stored about registered replicas. At the connectivity layer are various standard protocols for communication and security. At the resource level are services associated with managing individual resources, for example, storage and catalog management protocols as well as protocols for network and computation resource management. Finally, at the collective layer of the architecture are higher-level services that manage multiple underlying resources, including the replica management service that is the focus of this paper. Other services at the collective layer include services for replica selection, metadata management, management of replicated and distributed catalogs, and information services that provide resource discovery or performance estimation.

Figure 2: A partial list of elements of the Data Grid Reference Architecture [1] that are relevant to replica management. Application: particle physics application, climate modeling application, etc. Collective: replica management, replica selection, metadata, distributed catalog, and information services. Resource: storage, catalog, network, and compute management protocols. Connectivity: communication, service discovery (DNS), authentication, delegation. Fabric: storage systems, networks, compute systems, replica catalog, metadata catalog.

One of the key features of our architecture is that the replica management service is orthogonal to other services such as replica selection and metadata management. Figure 3 shows a scenario where an application accesses several of these orthogonal services to identify the best location for a desired data transfer. For example, consider a climate modeling simulation that will be run on precipitation data collected in 1998. The scientist running the simulation does not know the exact file names or locations of the data required for this analysis. Instead, the application specifies the characteristics of the desired data at a high level and passes this attribute description to a metadata catalog (1). The metadata catalog queries its attribute-based indexes and produces a list of logical files that contain data with the specified characteristics. The metadata catalog returns this list of logical files to the application (2). The application passes these logical file names to the replica management service (3), which returns to the application a list of physical locations for all registered copies of the desired logical files (4). Next, the application passes this list of replica locations (5) to a replica selection service, which identifies the source and destination storage system locations for all candidate data transfer operations. In our example, the source locations contain files with 1998 precipitation measurements, and the destination location is where the application will access the data. The replica selection service sends the candidate source and destination locations to one or more information services (6), which provide estimates of candidate transfer performance based on grid measurements and/or predictions (7). Based on these estimates, the replica selection service chooses the best location for a particular transfer and returns location information for the selected replica to the application (8). Following this selection process, the application performs the data transfer operations.

Figure 3: A data selection scenario in which the application consults the metadata service, replica management service, and replica selection service to determine the best source of data matching a set of desired data attributes.
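The eight-step scenario of Figure 3 is essentially a pipeline of service calls. The sketch below restates it as straight-line code; the service objects and method names are hypothetical stand-ins for the metadata, replica management, replica selection, and information services, not a defined API.

def select_best_replica(metadata_catalog, replica_mgmt, replica_selection, destination, attributes):
    """Walk the data selection scenario of Figure 3 (steps 1-8)."""
    # (1)-(2): describe the desired data by attributes; get back logical file names.
    logical_files = metadata_catalog.query(attributes)
    # (3)-(4): ask the replica management service for all registered physical copies.
    locations = {name: replica_mgmt.list_locations(name) for name in logical_files}
    # (5)-(8): the replica selection service ranks candidate transfers to the
    # destination, internally consulting information services for performance
    # estimates, and returns the selected source replica for each file.
    return replica_selection.choose(locations, destination)

# Example invocation (all objects and values hypothetical):
# best = select_best_replica(mc, rms, rss, "gsiftp://local.example.org/scratch",
#                            {"measurement": "precipitation", "year": 1998})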

5 The Replica Management Service Architecture

The architecture of the replica management service consists of a replica catalog or repository, where information about registered replicas is stored, and a set of registration and query operations that are supported by the service. Our architecture does not require a specific implementation for the replica catalog. In this section, we begin by defining the objects that are registered with the service. Next, we present important architecture design decisions that clarify the functionality provided by the replica management service. Finally, we briefly describe the operations supported by the service.

5.1 Managed Objects

As already discussed, the purpose of the replica management service is to allow users to register files with the service, create and delete replicas of previously registered files, and make enquiries about the location and performance characteristics of replicas. The replica management service must register three types of entries in a replica catalog or repository:

- Logical files
- Logical collections
- Locations

Logical files are entities with globally unique names that may have one or more physical instances. Users characterize individual files by registering them with the replica management service. A logical collection is a user-defined group of files. We expect that users will often find it convenient and intuitive to register and manipulate groups of files as a collection, rather than requiring that every file be registered and manipulated individually. A logical collection is simply a list of files and contains no information about the physical locations where files are stored.

Location entries in the replica management system contain all information required to map a logical collection to a particular physical instance of that collection. This might include such information as the hostname, port number, and access protocol of the physical storage system where the files are stored. Each location object represents a complete or partial copy of a logical collection on a storage system. One location entry corresponds to exactly one physical storage system location. Each logical collection may have an arbitrary number of associated location objects, each of which contains mapping information for a (possibly overlapping) subset of the files in the collection.

To illustrate the use of these objects for registering and querying replica information, we again use the example of precipitation measurements for the year 1998. Suppose that files contain one month of measurements, and that file names are jan98, feb98, mar98, etc. The manager of a climate modeling catalog could register all these files as belonging to a logical collection called precipitation98. In addition, the manager could register information, such as file size, about each file in separate logical file entries. If a storage system at site 1 stores a complete copy of the files in this logical collection, the manager would register a location entry in the catalog that contains all information needed to map from logical file names to physical storage locations at site 1. Similarly, if a storage system at site 2 stores only the files jan98, feb98, and mar98, this list of files as well as mapping information would be registered with the replica management service using a location entry. Subsequently, if a user queries the replica management service to determine all locations of the logical file feb98, the service will respond with physical storage locations for the file at sites 1 and 2. A query for the file jun98 would return only information about the location of the file at site 1. The sketch below walks through this example.
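The following sketch walks through the precipitation98 example with simple in-memory catalog entries. The dictionary representation and the find_locations helper are hypothetical; they illustrate how logical collection, logical file, and location entries relate to one another, not how a replica catalog must be implemented.

# Logical collection entry: a list of logical file names, no physical information.
collection = {"name": "precipitation98",
              "files": ["jan98", "feb98", "mar98", "apr98", "may98", "jun98"]}

# Logical file entries: per-file attributes such as size (values illustrative).
logical_files = {"jan98": {"size_gb": 2}, "feb98": {"size_gb": 2}}

# Location entries: one per physical storage system, each holding the mapping
# information for a (possibly partial) copy of the collection.
locations = [
    {"site": "site1", "host": "storage.site1.example.org", "protocol": "gsiftp",
     "path": "/archive/climate/", "files": collection["files"]},   # complete copy
    {"site": "site2", "host": "storage.site2.example.org", "protocol": "gsiftp",
     "path": "/data/", "files": ["jan98", "feb98", "mar98"]},       # partial copy
]

def find_locations(logical_file):
    """Return every location entry holding a physical copy of the logical file."""
    return [loc for loc in locations if logical_file in loc["files"]]

print([loc["site"] for loc in find_locations("feb98")])  # ['site1', 'site2']
print([loc["site"] for loc in find_locations("jun98")])  # ['site1']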

5.2 Architecture Design Decisions

Next, we discuss several important design decisions for the replica management service. Our motivation for several of these decisions was to clearly define the role of the service and to limit its complexity.

5.2.1 Separation of Replication and Metadata Information

One key observation is that the objects that can be registered with the replica management service contain only the information required to map logical file and collection names to physical locations. Any other information that might be associated with files or collections, such as descriptions of file contents or the experimental conditions under which files were created, should be stored in an orthogonal metadata management service. Our architecture places no constraints on the design or the contents of the metadata service. Typically, a user might first consult the metadata management service to select logical files based on metadata attributes such as the type of experimental results needed or the time when data were collected. Once the necessary logical files are identified, the user consults the replica management service to find one or more physical locations where copies of the desired logical files are stored.

5.2.2 Replication Semantics

The word replica has been used in a variety of contexts with a variety of meanings. At one extreme, the word replica is sometimes used to mean a copy of a file that is guaranteed to be consistent with the original, despite updates to the latter. A replica management architecture that supports this definition of replication would be required to implement the full functionality of a wide-area, distributed database, with locking of files during modification and atomic updates of all replicas. Because of the difficulty of implementing such a distributed database, our architecture operates at the other extreme: our replica management service explicitly does not enforce any replica semantics. In other words, for multiple replicas (locations) of a logical collection, we make no guarantees about file consistency, nor do we maintain any information on which was the original or source location from which one or more copies were made. When users register files as replicas of a logical collection, they assert that these files are replicas under a user-specific definition of replication. Our replica management service does not perform any operations to check, guarantee, or enforce the user's assertion.

5.2.3 Replica Management Service Consistency

Although our architecture makes no guarantees about consistency among registered file replicas, we must make certain guarantees about the consistency of information stored in the replica management service itself. Since computational and network failures are inevitable in distributed computing environments, the replica management service must be able to recover and return to a consistent state despite conflicting or failed operations. One way our architecture remains consistent is to guarantee that no file registration operation completes successfully unless the file exists completely on the corresponding storage system.

Consider a replica copy operation that includes copying a file from a source to a destination storage system and registering the new file in a location entry in the replica service. We must enforce an ordering on these operations, requiring that the copy operation completes successfully before registration of the file with the replica management service is allowed to complete. If failures occur and the state of the replica management service is corrupted, we must roll back the replica management service to a consistent state.

5.2.4 Rollback

Certain operations on the replica management service are atomic. If they complete, the state of the replica management service is updated. If they fail, the state of the replica management service is unchanged. Examples of atomic operations include adding a new entry to the replica management service, deleting an entry, or adding an attribute to an existing entry. Other operations on the replica management service consist of multiple parts. For example, consider an operation that copies a file to a storage system and registers the file with the replica management service. Our architecture does not assume that complex, multi-part operations are atomic. Depending on when a failure occurs during a multi-part operation, the information registered in the replica management service may become corrupted. We guarantee that if failures occur during complex operations, we will roll back the state of the replica management service to the previously consistent state before the operation began. This requires us to save sufficient state about outstanding complex operations to revert to a consistent state after failures.
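As a sketch of the ordering and rollback guarantees described in Sections 5.2.3 and 5.2.4, the following hypothetical copy-and-register operation registers a file only after the copy has completed, and on failure undoes whichever parts finished. The storage and catalog objects and their methods are illustrative stand-ins, not a defined API; a real service would persist this bookkeeping rather than keep it in local variables, since it must recover after crashes.

def copy_and_register(storage, catalog, source_url, dest_url, collection, logical_file):
    """Multi-part operation: copy a file, then register it; roll back on failure."""
    copied = registered = False
    try:
        # Ordering guarantee: the copy must complete successfully before
        # registration with the replica management service may complete.
        storage.copy_file(source_url, dest_url)
        copied = True
        catalog.add_to_location(collection, logical_file, dest_url)
        registered = True
    except Exception:
        # Roll back to the previously consistent state, undoing whichever
        # parts of the complex operation completed before the failure.
        if registered:
            catalog.remove_from_location(collection, logical_file, dest_url)
        if copied:
            storage.delete_file(dest_url)
        raise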

5.2.5 No Distributed Locking Mechanism

It is possible for users to corrupt our replica management service by changing or deleting files on an underlying storage system without informing the replica management service. We strongly discourage such operations, but the architecture does not prevent them. After such operations, information registered in the replica catalog may not be consistent with the actual contents of the corresponding storage systems. The replica management service could avoid such corruption if it could enforce that all changes to storage systems be made via calls to the replica management service. Enforcing this requirement would require a distributed locking mechanism that prevents changes to registered storage locations except via authorized replica management operations. Because of the difficulty of implementing such a distributed locking mechanism, our architecture does not assume that locking is available and does not guarantee that catalog corruption will not occur.

5.2.6 Requirements for Logical Files and Locations

We require that files that are registered in location or logical file entries also be registered in the corresponding logical collection entry. Conversely, we do not require that every file in a logical collection entry be registered in a location or logical file entry. In other words, there may be logical files associated with a logical collection that currently have no registered physical instances in the catalog.

5.2.7 Post-Processing Files After Data Transfer

Our architecture provides limited support for post-processing operations on transferred data. Certain applications would like to perform post-processing after a file is transferred to a destination storage system but before the file is registered in the replica management service. Examples of post-processing include decryption of data, running verification operations such as a checksum to confirm that the file was not corrupted during transfer, or attaching the transferred file to an object-oriented database. We limit the nature of allowed post-processing operations to maintain our consistency guarantees for the replica management service. In particular, we allow only those post-processing operations that do not alter file contents. Reading the contents of a transferred file to perform verification (checksum) calculations or registering the file in an external database would be allowed. However, decrypting a data file would not be allowed, since the contents of the file would change. These restrictions make it possible for us to roll back failed post-processing operations and restart them, if necessary.
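The restriction to operations that do not alter file contents admits post-processing such as the checksum verification mentioned above. The following is a minimal sketch, assuming the expected checksum was recorded alongside the logical file entry; the file path and the choice of MD5 are illustrative.

import hashlib

def verify_transfer(local_path: str, expected_md5: str) -> bool:
    """Read (but never modify) the transferred file and compare checksums.

    Because the file contents are unchanged, a failed or interrupted
    verification can simply be rolled back and restarted."""
    digest = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
            digest.update(chunk)
    return digest.hexdigest() == expected_md5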

5.3 Replica Management Operations

Our replica management architecture includes support for the following operations:

- Register a new entry in the replica management service:
  - A new logical collection, consisting of a list of logical file names
  - A new location, containing mapping information for a subset of files in an existing logical collection
  - A new logical file entry with specific information, such as size, describing a single file in an existing logical collection
- Modify an existing entry in the replica management service:
  - Add or delete a file from an existing logical collection or location entry
  - Add or delete a descriptive attribute of an existing entry
- Query the replica management service:
  - Find an entry, if it exists, for a specified logical file, logical collection, or location
  - Find all locations that include a physical copy of a specified logical file
  - Return requested attributes associated with an entry. For a logical collection entry, return the names of files in the collection. For a location entry, return attributes used to map logical names to physical names. For a logical file entry, return attributes that describe the logical file.
- Combined storage and registration operations:
  - Copy a file registered in an existing location entry from a source to a destination storage system and register the file in the corresponding location entry
  - Publish a file that is not currently represented in the replica catalog by copying it to a storage system and registering the file in corresponding location and logical collection entries
- Delete entries from the replica management service

6 Implementation Questions

For the replica management service architecture we have described, there are many possible implementations. In this section, we discuss a few implementation issues.

6.1 Storing and Querying Replica Information

A variety of technologies can store and query replica management service information. Two possibilities are relational databases and LDAP directories. A relational database provides support for indexing replica information efficiently and for database-language queries (e.g., SQL) of replica management information. An LDAP (Lightweight Directory Access Protocol) directory has a simple protocol for entering and querying information in a directory; an LDAP directory can use a variety of storage systems or back-ends to hold data, ranging from a relational database to a file system. A relational sketch of such a catalog appears below.
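To make the relational option concrete, the following sketch stores the three entry types in an SQLite database and performs the standard query that finds all locations holding a copy of a given logical file. The schema is hypothetical, chosen only to mirror the managed objects of Section 5.1; a production catalog would use a server-backed database, and an LDAP-based catalog would organize the same information as directory entries instead.

import sqlite3

con = sqlite3.connect(":memory:")  # illustrative; a real catalog would be persistent
con.executescript("""
    CREATE TABLE logical_collection (name TEXT PRIMARY KEY);
    CREATE TABLE logical_file (            -- per-file attributes such as size
        collection TEXT REFERENCES logical_collection(name),
        name TEXT,
        size_gb REAL,
        PRIMARY KEY (collection, name));
    CREATE TABLE location (                -- one entry per physical storage system
        id INTEGER PRIMARY KEY,
        collection TEXT REFERENCES logical_collection(name),
        host TEXT, port INTEGER, protocol TEXT);
    CREATE TABLE location_file (           -- the (possibly partial) subset each location holds
        location_id INTEGER REFERENCES location(id),
        file_name TEXT);
""")

# Indexed query: find all physical locations holding a copy of logical file 'feb98'.
rows = con.execute("""
    SELECT l.host, l.port, l.protocol
    FROM location AS l JOIN location_file AS lf ON lf.location_id = l.id
    WHERE lf.file_name = ?""", ("feb98",)).fetchall()
print(rows)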

6.2 Reliability and Availability

If the replica management service fails during an operation, it will be unavailable for some time, and its state upon recovery will be indeterminate. The degree of reliability and availability provided by a replica management service is an implementation decision. As with any system design, there is an inevitable tradeoff between the level of reliability and consistency after failures and the system's cost and complexity. Reliability and availability will be greatly improved in implementations that replicate and/or distribute the replica management service. Our architecture allows implementers to use services provided by relational databases and LDAP directories for distributing and replicating information. If high reliability and availability are required, then system builders must devote adequate resources to replicating the service, performing frequent checkpoint operations to facilitate quick recovery, and avoiding single hardware and software points of failure. This robustness must be engineered into the replica management service.

7 Conclusions

We have argued that data-intensive, high-performance computing applications such as high-energy physics require the efficient management and transfer of terabytes or petabytes of information in wide-area, distributed environments. Researchers performing these analyses create local copies or replicas of large subsets of these datasets to overcome long wide-area data transfer latencies. A Data Grid infrastructure to support these applications must provide a set of orthogonal, application-independent services that can then be combined and specialized in different ways to meet the needs of specific applications. These services include metadata management, replica management, replica selection, and secure, reliable, efficient data transfer.

We have presented an architecture for a replica management service. This architecture consists of two parts: a replica catalog or repository, where information can be registered about logical files, collections of files, and the physical locations where subsets of collections are stored; and a set of registration and query operations that are supported by the replica management service. The replica management service can be used by higher-level services such as replica selection and automatic creation of new replicas to satisfy application performance requirements.

In addition to describing the basic entities that are registered with the replica catalog, we presented several important design decisions for the replica management service. To make implementation of the replica management service feasible, we have limited its functionality. For example, our service does not guarantee or enforce any replica semantics or replica consistency. When users register files as replicas of a logical collection, they assert that these files are replicas under a user-specific definition of replication. We do not enforce consistency among replicas because doing so would require us to implement a wide-area, distributed database with a distributed locking mechanism and atomic updates of all replicas. Efficient implementation of such a wide-area distributed database remains a difficult, open research problem. Several other architecture decisions designed to clearly define the role and limit the complexity of the replica management service include the separation of replication and file metadata information, support for rollback of complex operations, and limits on the types of post-processing that can be performed on transferred files.

Finally, the paper presented a few implementation issues for the replica management service. One issue is the technology used to implement the replica catalog and the query protocol for that catalog. Another is the degree of reliability and availability provided by the replica management service. As with any system implementation, there is a tradeoff of cost and complexity when providing higher reliability and availability. A reliable system would include distribution and replication of the replica catalog, frequent checkpoint operations for fast recovery after failures, and hardware and software redundancy to avoid single points of failure.

References

[1] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations," International Journal of Supercomputer Applications, 15(3), 2001.


More information

ADDING A NEW SITE IN AN EXISTING ORACLE MULTIMASTER REPLICATION WITHOUT QUIESCING THE REPLICATION

ADDING A NEW SITE IN AN EXISTING ORACLE MULTIMASTER REPLICATION WITHOUT QUIESCING THE REPLICATION ADDING A NEW SITE IN AN EXISTING ORACLE MULTIMASTER REPLICATION WITHOUT QUIESCING THE REPLICATION Hakik Paci 1, Elinda Kajo 2, Igli Tafa 3 and Aleksander Xhuvani 4 1 Department of Computer Engineering,

More information

Data Storage in Clouds

Data Storage in Clouds Data Storage in Clouds Jan Stender Zuse Institute Berlin contrail is co-funded by the EC 7th Framework Programme 1 Overview Introduction Motivation Challenges Requirements Cloud Storage Systems XtreemFS

More information

Big Data Challenges in Bioinformatics

Big Data Challenges in Bioinformatics Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?

More information

The CMS analysis chain in a distributed environment

The CMS analysis chain in a distributed environment The CMS analysis chain in a distributed environment on behalf of the CMS collaboration DESY, Zeuthen,, Germany 22 nd 27 th May, 2005 1 The CMS experiment 2 The CMS Computing Model (1) The CMS collaboration

More information

Distributed Data Management

Distributed Data Management Introduction Distributed Data Management Involves the distribution of data and work among more than one machine in the network. Distributed computing is more broad than canonical client/server, in that

More information

Massive Data Storage

Massive Data Storage Massive Data Storage Storage on the "Cloud" and the Google File System paper by: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung presentation by: Joshua Michalczak COP 4810 - Topics in Computer Science

More information

Deploying Exchange Server 2007 SP1 on Windows Server 2008

Deploying Exchange Server 2007 SP1 on Windows Server 2008 Deploying Exchange Server 2007 SP1 on Windows Server 2008 Product Group - Enterprise Dell White Paper By Ananda Sankaran Andrew Bachler April 2008 Contents Introduction... 3 Deployment Considerations...

More information

Prepared by Enea S.Teresa (Italy) Version 1.0 2006-October 24

Prepared by Enea S.Teresa (Italy) Version 1.0 2006-October 24 Mersea Information System: an Authentication and Authorization System to access distributed oceanographic data. Prepared by Enea S.Teresa (Italy) Version 1.0 2006-October 24 Revision History Date Version

More information

Analisi di un servizio SRM: StoRM

Analisi di un servizio SRM: StoRM 27 November 2007 General Parallel File System (GPFS) The StoRM service Deployment configuration Authorization and ACLs Conclusions. Definition of terms Definition of terms 1/2 Distributed File System The

More information

Microsoft 6436 - Design Windows Server 2008 Active Directory

Microsoft 6436 - Design Windows Server 2008 Active Directory 1800 ULEARN (853 276) www.ddls.com.au Microsoft 6436 - Design Windows Server 2008 Active Directory Length 5 days Price $4169.00 (inc GST) Overview During this five-day course, students will learn how to

More information

On the Cost of Reliability in Large Data Grids

On the Cost of Reliability in Large Data Grids Konrad-Zuse-Zentrum für Informationstechnik Berlin Takustraße 7 D-14195 Berlin-Dahlem Germany FLORIAN SCHINTKE, ALEXANDER REINEFELD On the Cost of Reliability in Large Data Grids ZIB-Report 02-52 (December

More information

WHITEPAPER. A Technical Perspective on the Talena Data Availability Management Solution

WHITEPAPER. A Technical Perspective on the Talena Data Availability Management Solution WHITEPAPER A Technical Perspective on the Talena Data Availability Management Solution BIG DATA TECHNOLOGY LANDSCAPE Over the past decade, the emergence of social media, mobile, and cloud technologies

More information

Status and Evolution of ATLAS Workload Management System PanDA

Status and Evolution of ATLAS Workload Management System PanDA Status and Evolution of ATLAS Workload Management System PanDA Univ. of Texas at Arlington GRID 2012, Dubna Outline Overview PanDA design PanDA performance Recent Improvements Future Plans Why PanDA The

More information

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &

More information

Data Storage Security in Cloud Computing for Ensuring Effective and Flexible Distributed System

Data Storage Security in Cloud Computing for Ensuring Effective and Flexible Distributed System Data Storage Security in Cloud Computing for Ensuring Effective and Flexible Distributed System 1 K.Valli Madhavi A.P vallimb@yahoo.com Mobile: 9866034900 2 R.Tamilkodi A.P tamil_kodiin@yahoo.co.in Mobile:

More information

Protecting Big Data Data Protection Solutions for the Business Data Lake

Protecting Big Data Data Protection Solutions for the Business Data Lake White Paper Protecting Big Data Data Protection Solutions for the Business Data Lake Abstract Big Data use cases are maturing and customers are using Big Data to improve top and bottom line revenues. With

More information

SOLUTION BRIEF KEY CONSIDERATIONS FOR LONG-TERM, BULK STORAGE

SOLUTION BRIEF KEY CONSIDERATIONS FOR LONG-TERM, BULK STORAGE SOLUTION BRIEF KEY CONSIDERATIONS FOR LONG-TERM, BULK STORAGE IT organizations must store exponentially increasing amounts of data for long periods while ensuring its accessibility. The expense of keeping

More information

Diagram 1: Islands of storage across a digital broadcast workflow

Diagram 1: Islands of storage across a digital broadcast workflow XOR MEDIA CLOUD AQUA Big Data and Traditional Storage The era of big data imposes new challenges on the storage technology industry. As companies accumulate massive amounts of data from video, sound, database,

More information

DFSgc. Distributed File System for Multipurpose Grid Applications and Cloud Computing

DFSgc. Distributed File System for Multipurpose Grid Applications and Cloud Computing DFSgc Distributed File System for Multipurpose Grid Applications and Cloud Computing Introduction to DFSgc. Motivation: Grid Computing currently needs support for managing huge quantities of storage. Lacks

More information

Reconciliation and best practices in a configuration management system. White paper

Reconciliation and best practices in a configuration management system. White paper Reconciliation and best practices in a configuration management system White paper Table of contents Introduction... 3 A reconciliation analogy: automobile manufacturing assembly... 3 Conflict resolution...

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

An IDL for Web Services

An IDL for Web Services An IDL for Web Services Interface definitions are needed to allow clients to communicate with web services Interface definitions need to be provided as part of a more general web service description Web

More information

Forests, trees, and domains

Forests, trees, and domains Active Directory is a directory service used to store information about the network resources across a. An Active Directory (AD) structure is a hierarchical framework of objects. The objects fall into

More information

THE WINDOWS AZURE PROGRAMMING MODEL

THE WINDOWS AZURE PROGRAMMING MODEL THE WINDOWS AZURE PROGRAMMING MODEL DAVID CHAPPELL OCTOBER 2010 SPONSORED BY MICROSOFT CORPORATION CONTENTS Why Create a New Programming Model?... 3 The Three Rules of the Windows Azure Programming Model...

More information

Planning Domain Controller Capacity

Planning Domain Controller Capacity C H A P T E R 4 Planning Domain Controller Capacity Planning domain controller capacity helps you determine the appropriate number of domain controllers to place in each domain that is represented in a

More information

TECHNIQUES FOR DATA REPLICATION ON DISTRIBUTED DATABASES

TECHNIQUES FOR DATA REPLICATION ON DISTRIBUTED DATABASES Constantin Brâncuşi University of Târgu Jiu ENGINEERING FACULTY SCIENTIFIC CONFERENCE 13 th edition with international participation November 07-08, 2008 Târgu Jiu TECHNIQUES FOR DATA REPLICATION ON DISTRIBUTED

More information