Scaling Distributed Database Management Systems by using a Grid-based Storage Service

Master Thesis

Silviu-Marius Moldovan
Marius.Moldovan@irisa.fr

Supervisors: Gabriel Antoniu, Luc Bougé
{Gabriel.Antoniu,Luc.Bouge}@irisa.fr

Keywords: databases, grids, scalability, performance, fault tolerance, consistency.

Abstract. This report deals with databases with distributed storage, focusing on the problem of insufficient storage space. In order to make these systems more scalable, the advantages offered by grids can be taken into consideration. Thus, an approach to creating an interface between a database system and a grid-based data storage service is presented.

Research Master Degree in Computer Science
IRISA, Paris Team
Rennes, June 2008

Contents

1 Introduction
   1.1 Motivation
   1.2 Current scenario
   1.3 Proposal
2 Storing data in databases: state of the art
   2.1 Efficiency: main-memory databases
       Main-memory databases: a few new issues
       Case study: The DERBY data storage system
   2.2 Scalability: distributed main-memory databases
       Motivation
       Case study: Sprint [2007]
   2.3 Going larger: using grids
       Context
       Grid data sharing services
       Grid-based databases in practice
3 Contribution: interfacing a database system with a grid-based data storage service
   3.1 The database system: Berkeley DB
   3.2 The grid-based data storage service: BlobTamer
   3.3 Storing the data of Berkeley DB using BlobTamer
       Selecting the level to interface
       Methodology applied
       How Berkeley DB reads and writes
4 Implementation details
       Mapping files into memory
       Re-implementing the read and write operations
       Re-implementing collateral operations
5 Conclusion
       Contribution
       Other approaches
       Future work

1 Introduction

1.1 Motivation

One of the most important aspects related to data is the way it is stored. The most common way to store data nowadays relies on databases. Almost every type of application uses them, ranging from those performing intensive scientific computation to those storing the personal data of employees in a factory. Initially, databases stored their data on the disk space of one machine. The evolution of the database domain has been driven by the need for greater efficiency and larger storage space. In-memory databases are one way to increase access performance. In this approach, the data is stored in main memory and a backup copy may be kept on disk. Since memory access is faster than disk access, access times can be reduced. But the storage capacity of the database is still limited to the memory of a single machine. A simple approach to extend databases and increase the storage capacity is to distribute their data over more machines. Thus, one can use the storage capacities and the computing power of different nodes in the same cluster. Of course, additional problems may arise in the management of such systems. For example, in order to ensure data availability, a piece of data is replicated over several different nodes; the coherence of the multiple copies must then be maintained. Databases with distributed data are the subject of this study.

1.2 Current scenario

Nowadays, the amount of data to be stored is getting larger. Scientific applications, for example, perform numerical simulations and need large storage capacities and computing power. A larger space than the one offered by a single-node database is required. Distributing the data of databases over a cluster of computers is one solution. But clusters typically have only a few tens of nodes. If the data is to be stored in the memory of the machines in the cluster, for the sake of efficiency, then the space might not be sufficient. For example, if each node has 2 Gigabytes of RAM and there are 30 nodes in the cluster, only 60 Gigabytes of storage will be available. Grids might represent the solution for further extending the storage capacity of databases. But resources in grids are heterogeneous, and new problems related to data management might arise.

1.3 Proposal

The most convenient approach to using grids relies on grid data-sharing services [2]. Besides transparent access to data, these services also provide persistence and consistency of data, in a fault-tolerant way. Applications can thus benefit from these properties and let the service mediate all operations between them and the grids that store their data. An approach enabling databases to use grids for storing their data already exists [1]. It uses the JuxMem ([3], [2]) grid data-sharing service and allows performing some basic operations like table creation, record insertion or simple select queries. But this approach assumes extending the grid data-sharing service by adding a new API on top of it. It does not take into consideration the database management system that will use this API. The problem has not been studied from the perspective of a database system. It would be interesting to study an open-source database system and try to find an appropriate level at which to redirect the data onto a grid-based data storage service.

Thus, an interface could be created between the database engine and the above-mentioned service, at a level which is convenient for the database engine. The rest of the report is structured as follows: Section 2 presents the newest concepts in database design and analyzes the possibilities of using grids for storing the data of databases; Section 3 presents a database engine and a grid-based storage service, and analyzes the possibility of creating an interface between the two systems; Section 4 presents the most important implementation details of the interface; finally, Section 5 concludes on the contributions of this work and on the possibilities of continuing it.

2 Storing data in databases: state of the art

Old-fashioned databases store their data on the disk space of one machine. But that is not practical any more: databases must now be more efficient, and the amounts of data to be stored keep increasing. This has led to a revolution in database design. The main ideas and concepts that have appeared lately are presented in this section.

2.1 Efficiency: main-memory databases

One of the first changes that occurred was using main memory as the storage medium. In main-memory database systems data is stored in the main physical memory, as opposed to the conventional approach in which data is stored on disk. Even if the data may have a copy on disk, the primary copy lives permanently in memory, and this has a deep influence on the design and performance of this type of database system. Storing large databases in memory has become a reality nowadays, since memory is getting cheaper and chip capacity keeps increasing.

Main-memory databases: a few new issues

The properties that distinguish main memory from magnetic disks lead to optimizations in most aspects of database management. Using memory-resident data has an impact on several functional components of database management systems, as shown in [6]. These new issues are illustrated below.

Concurrency Control

The most commonly used concurrency control methods in practice are lock-based. In conventional database systems, where data are disk-resident, small locking granules (fields or records) are chosen to reduce contention. In main-memory database systems, on the other hand, contention is already low because data are memory-resident. Thus, for these systems it has been suggested that very large lock grains (e.g., relations) are more suitable. In the extreme case, the lock granule can be the entire database, which leads to a desirable, serial execution of transactions. In a conventional system, locks are implemented through a hash table that contains entries for the locked objects; the objects themselves, located on disk, contain no lock information. In main-memory systems, in contrast, a few bits in the objects can be used to represent lock status. The overhead of a small number of bits is not significant for records of reasonable length, and lookups through the entire hash table are avoided.

Persistence

There are several steps to achieve persistence in a database system.

Logging

First of all, a log of transaction activity is kept. This log must reside on stable storage. In conventional systems, it is copied directly to the disk. But logging can affect response time, as well as throughput, if the log becomes a bottleneck. These two problems exist in main-memory systems too, but their impact on performance is bigger, since logging is the only disk operation required by each transaction. If logging takes too much time, the overall performance of the system will decrease. To eliminate the response time problem, one solution is to use a small amount of main memory to hold a portion of the log (the log tail).

After its log information is written into memory, a transaction is considered committed. If there is not enough memory for the log tail, the solution is to pre-commit transactions. This implies releasing the transaction's locks as soon as its log record is placed in the log, without waiting for the information to be propagated to the disk. To relieve a log bottleneck, group commits can be used: the log records of several transactions are allowed to accumulate in memory, and they are flushed to the disk in a single disk operation.

Checkpointing

The second step in achieving persistence is to keep the disk-resident copy of the database up-to-date. This is done by checkpointing, which is one of the few reasons to access the disk-resident copy of the database in a main-memory database system. Thus, disk access can be tailored to the needs of the checkpointer: disk I/O is performed using very large blocks, which are written more efficiently.

Recovery

The third step to achieve persistence is recovery from failures, which implies restoring the data from its disk-resident backup. To do that in main-memory systems, one solution that increases performance is to load blocks of the database on demand until all data has been loaded. Another solution is to use disk striping: the database is spread across multiple disks and read in parallel.

Access methods

In conventional systems, index structures like B-Trees are used for data access. They have a short, bushy structure, and the data values on which the index is built need to be stored in the index itself. In main-memory systems, hashing is one solution to access data. It provides fast lookup and update, but is not as space-efficient as a tree, according to [6]. Thus, tree structures such as the T-Tree have been designed explicitly for memory-resident databases [8]. As opposed to B-Trees, these trees can have a deeper, less complicated structure, so that traversing them is much faster. Also, because random access is fast in main memory, pointers can be followed efficiently. Therefore, the index structures can store pointers to the indexed data, rather than the data itself or block identifiers, which is more efficient [6].

Data representation

In main-memory databases, pointers are used for data representation: relational tuples can be represented as a set of pointers to data values. Using pointers has two main advantages, as stated in [6]: it is space-efficient (if large values appear more than once in the database) and it simplifies the handling of variable-length fields.

Case study: The DERBY data storage system

The DERBY data storage system, described in [7], is used to support a distributed, memory-resident database system. DERBY consists of general-purpose workstations, connected through a high-speed network, and makes use of them to create a massive, cost-effective main-memory storage system. Workstations in DERBY are classified into servers and clients. Servers run on machines with idle resources and provide the storage space for the DERBY memory storage system. They have the lowest priority and are "guests" of the machines they borrow resources from. One important advantage of DERBY is that, at any moment, each workstation can be operating as a client, a server, both or neither (as can be seen in Figure 1), and this may change over time.

Figure 1: The DERBY architecture (node 5 serves as the UPS for nodes 1, 2, 3 and 4)

Another advantage is that the system configuration models a dynamic, realistic data processing environment where workstations come and go over time. The primary role of a DERBY client is to forward read/write requests from the database application to the DERBY server where the record is located. DERBY's basic data storage abstraction assumes there is exactly one server responsible for each record. A server where the record is stored is called the primary location of the record; this location may change over time, due to the dynamic nature of the system.

The primary role of a server in DERBY is to keep all records in memory and to avoid disk accesses when satisfying client requests. Servers guarantee long-term data persistence (they eventually propagate modified records to the disk), but also short-term persistence without disk storage. The latter is achieved through a new and interesting approach: a part of the nodes in the network are supplied with Uninterruptible Power Supplies (UPS) which temporarily hold data. This is a cost-effective way to achieve reasonably large amounts of persistent storage. To provide high-speed persistent storage, each server is associated with a number of workstations equipped with these UPSs. Also, each workstation with a UPS can provide service to several servers. In Figure 1, for example, node 5 is a UPS for the other nodes. In the case of failures, the lost data is recovered from the logs kept on disks, or from recently modified log buffers stored in the UPS workstations.

To guarantee data consistency and regulate concurrent access, DERBY provides a basic locking mechanism, with a lock granularity of a single record. This is quite unusual, since large granularities are considered most appropriate for memory-resident systems. Every client must hold locks for the records it operates on. For each record, the server maintains a list of the clients caching the record as read-only and as write-locked.

One important contribution of DERBY is that, during initial testing, it has been shown that load balancing has little effect on the performance of memory-based systems, in contrast to disk-based storage systems. Memory-based systems need to consider migrating load only when a limited resource is close to saturation. Thus, DERBY dynamically redistributes the load via a saturation prevention algorithm that is different from the conventional approach to load balancing. The algorithm attempts to optimize processing load constrained by memory space availability; more exactly, it finds the first acceptable distribution of load that does not exceed a predefined threshold of available memory space on any machine. This way, the servers are kept away from reaching their saturation points.
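The saturation prevention idea can be made concrete with a small sketch: place each piece of load on the first server whose projected memory usage stays below a fixed threshold. The C program below is only an illustration of such a first-acceptable placement; the server structure, sizes and threshold value are hypothetical and are not taken from the DERBY implementation.

    /*
     * Minimal sketch of saturation prevention: assign load to the first
     * server that stays below a memory-usage threshold. All values are
     * hypothetical illustrations, not DERBY code.
     */
    #include <stdio.h>

    #define NUM_SERVERS 4
    #define THRESHOLD   0.85   /* fraction of memory a server may use */

    typedef struct {
        double capacity;  /* total memory (MB)            */
        double used;      /* memory currently in use (MB) */
    } server_t;

    /* Return the index of the first server that can accept `size` MB
     * without crossing the saturation threshold, or -1 if none can. */
    static int place_load(const server_t *servers, int n, double size)
    {
        for (int i = 0; i < n; i++) {
            if ((servers[i].used + size) / servers[i].capacity <= THRESHOLD)
                return i;
        }
        return -1;  /* every server would saturate */
    }

    int main(void)
    {
        server_t servers[NUM_SERVERS] = {
            {2048, 1900}, {2048, 1200}, {2048, 600}, {2048, 100}
        };
        double record_set = 300;  /* MB of records to migrate */

        int target = place_load(servers, NUM_SERVERS, record_set);
        if (target >= 0)
            printf("placing %.0f MB on server %d\n", record_set, target);
        else
            printf("no server can accept the load without saturating\n");
        return 0;
    }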

2.2 Scalability: distributed main-memory databases

Motivation

Applications accessing an in-memory database are usually limited by the memory capacity of the machine hosting the database. Having a database whose data is distributed over a cluster of workstations brings several advantages. To begin with, the application can take advantage of the aggregated memory of all the machines in the cluster, so the available storage capacity will be larger. Then, by distributing the data over more nodes, fault tolerance can be achieved: copies of the same piece of data can be replicated over multiple nodes, so if one node fails, the data can be recovered from the other nodes. Finally, failure recovery will also be efficient, since the backup copies of the data that are closest to the failed node can be used to restore it.

Case study: Sprint [2007]

Sprint is a middleware infrastructure for high-performance and high-availability data management. It manages commodity in-memory databases running in a cluster of shared-nothing servers, as stated in [5]. Applications are then limited only by the aggregated capacity of the servers in the cluster. The hardware infrastructure of Sprint is represented by its physical servers, while the software infrastructure consists of logical servers: edge servers, data servers and durability servers. The advantage of this decoupled design is that each type of server can handle a different aspect of database management. Edge servers receive client queries and execute them against data servers. Data servers run a local, in-memory, off-the-shelf database management system and execute transactions without accessing the disk. Durability servers ensure transaction persistence and handle recovery. The server types can be identified in Sprint's architecture, illustrated in Figure 2.

Physical servers communicate by message-passing only, while logical ones can use point-to-point or total-order multicast communication. Physical servers can fail by crashing and may recover after the failure, but they lose all information stored in main memory before the crash. The failure of a physical server implies the failure of all the logical servers it hosts. Sprint tolerates unreliable failure detection, which relies on an entity providing a list of nodes that may have failed, not all of which have actually failed. The advantages of this are fast reaction to failures and the certainty that failed servers are eventually detected by operational servers. The limitation of this approach is illustrated by the case of an operational server that may be mistakenly suspected to have failed. But even if that happens, the system remains consistent: the falsely suspected server is replaced by another one, and the two exist simultaneously for a certain time.

All permanent state is stored by the durability servers, which periodically create an on-disk image of the current database state. In the case of a failure, new instances of edge servers and data servers are created on operational physical servers, using the state stored by the durability servers. If a durability server fails, the information needed for recovery is retrieved from operational durability servers. According to [5], the usual ACID (atomicity, consistency, isolation, durability) properties of a transaction are guaranteed by Sprint. The system distinguishes between two types of transactions: local transactions that only access data stored on a single data server, and global transactions that access data on multiple servers.

Figure 2: The Sprint architecture

Both types of transactions are supported and respect the ACID properties. Database tables are partitioned over the data servers. Data items can be replicated on multiple data servers, which brings several benefits. First of all, the failure recovery mechanism is more efficient. Then, replication allows parallel execution of read operations, even though it makes write operations more difficult, since all replicas of a data item must be modified. One important contribution of Sprint is its approach to distributed query processing. In most distributed database architectures, high-level client queries are translated into lower-level internal requests. In Sprint, however, a middleware solution is adopted: queries are decomposed into internal ones according to the way the database is fragmented and replicated. Also, the distributed query decomposition and merging are simple, since Sprint was designed for multi-tier architectures. The experiments conducted showed that Sprint has good performance and scalability on clusters of up to 64 nodes (among which 32 data servers). Experiments at larger scales (e.g., at grid scale) have not been performed, however.

2.3 Going larger: using grids

Context

Sometimes, the amount of data in a database can be too large to store even for a single cluster of computers. Grid computing has emerged as a response to the growing demand for resources. A grid is composed of a federation of clusters.

A grid system seems to offer the necessary infrastructure for storing databases: grids offer much larger storage capacities, spread over many nodes. This could allow storing larger volumes of data. But new problems arise, related to grid infrastructures. To begin with, since there are several clusters in the grid, there will be a hierarchy of latencies in the system: some latencies will be greater than others. When a message is sent between two nodes of the same cluster, it will cost 100 or 1000 times less than if the two nodes were in different clusters. A possible optimization for this problem is to use hierarchical communication protocols. Then, a grid system is composed of many hosts, from many administrative domains, with heterogeneous resource capabilities (computing power, storage, operating system). Finally, together with the number of nodes, the number of failures will increase. Thus, special care must be taken when managing data in such a system.

Grid data sharing services

A data sharing service for grid computing opens an alternative approach to the problem of grid data management. This concept decouples data management from computation.

Main features

The main goal of such a system, as stated in [2], is to provide transparent access to data. The most widely used approach to data management in grid environments is explicit data transfers: the user has to locate the data he is interested in and perform the desired transfer. Thanks to transparent access to remote data through an external data-sharing service, the client does not have to handle data transfers and does not have to care where the data are. There are three other properties that a data-sharing service provides.

Persistence

First of all, the service provides persistent data storage, to save data transfers. Since large masses of data are to be handled, data transfers between different grid components can be costly, and it is desirable to avoid repeating them.

Fault tolerance

Second, the service is fault-tolerant. Data remains available despite events that can occur because of the dynamic character of the grid infrastructure, like resources joining and leaving, or unexpected failures. To this end, replication techniques and failure detection mechanisms are provided.

Consistency

Finally, the consistency of replicated data is guaranteed. Since data manipulated by grid applications are mutable, and data are often replicated to enhance access locality, the service must ensure the consistency of the different replicas. To achieve this, the service relies on consistency models, implemented by consistency protocols.

According to [3], a data sharing service for grid computing can be thought of as a hybrid between distributed shared memory (DSM) systems and peer-to-peer (P2P) systems, because it benefits from the advantages provided by both types of systems. It provides transparent data sharing and consistency models and protocols, just like a DSM system. Meanwhile, it provides fault-tolerance mechanisms and manages heterogeneous resources in a very scalable environment, just like a P2P system.

An example: JuxMem

An architecture was proposed ([3], [2]) for a data-sharing service, based on the observations above. The software architecture of JuxMem (for Juxtaposed Memory) reflects the hardware architecture of a grid: a hierarchical model consisting of a federation of distributed clusters of computers. This architecture is made up of a network of peer groups, which can correspond to clusters at the physical level, to a subset of the same physical cluster, or to nodes spread over several physical clusters. In each such group there are nodes that provide memory for data storage, nodes that simply use the service to allocate and access data blocks, and one node that manages the available memory. In order to allocate memory, the client must specify on how many clusters the data should be replicated, and on how many nodes in each cluster. In response, a set of data replicas is instantiated and an ID is returned. To read or write a data block, the client only needs to specify this ID, and JuxMem transparently locates the corresponding data block and performs the necessary data transfers. Each block of data stored in the system is replicated and associated with a group of peers, each peer in the group hosting a copy of the same data block. These peers can belong to different clusters and, thus, the data can be replicated on several physical clusters. JuxMem's approach to maintaining the consistency between the different copies of the same piece of data is based on home-based protocols: for each piece of data there is a home entity in charge of maintaining a reference data copy. A protocol implementing the entry consistency model in a fault-tolerant way has been developed. To limit inter-cluster communications, the home entity is organized in a hierarchical way: local homes, at cluster level, are the clients of a global home, at grid level ([4], [9]).

Grid-based databases in practice

Grids have been developed in the context of high-performance computing (HPC) applications, like numerical simulations. The use of these infrastructures has been very little explored in the context of databases. Two examples are presented below.

DB/JuxMem

This approach [1] extends the JuxMem grid data-sharing service with a database-oriented API that allows performing basic operations like table creation, record insertion and simple select queries. In order to achieve this, a layered software architecture is added on top of JuxMem, each layer having a precise role in database management: data storage, indexing, table fragmentation, etc. The highest layer is the database-oriented API. These high-level layers over JuxMem are necessary because JuxMem initially manipulated data only as byte sequences. One advantage of this approach is that it provides structured data management to applications running over JuxMem. Another advantage is that database management systems can benefit from the properties of JuxMem: the data and metadata of databases are handled using JuxMem, which transparently allocates, locates, replicates and transfers data in a fault-tolerant and consistent way. Finally, the approach benefits from a grid-scale computing infrastructure, as opposed to previous efforts, which provided a distributed main-memory database management system relying only on a cluster of computers.

Oracle 11g

The Oracle Database [14] was designed for enterprise grid computing. The Oracle grid architecture creates large resource pools, which are shared by different applications.

Data processing and storage capacity can then be dynamically provisioned to applications as needed. One of the most important features for providing resource provisioning is Real Application Clusters (RAC). A RAC is a cluster database with a shared-cache architecture that runs on multiple machines. These machines are attached through a cluster interconnect and a shared storage subsystem. A RAC database appears like a single database to users, and the same maintenance tools and approaches used for a single database can be used on the entire cluster. One important role of RAC is its ability to manage workload: it can add or remove nodes on demand, based on the processing requirements. RAC also plays an important role in ensuring data availability. All the data in the database are replicated on all the nodes in the cluster. RAC exploits this redundancy: users have access to all data as long as there is one available node in the cluster, even if all the other nodes have failed. Even though the Oracle database is self-managing and provides automatic resource allocation, as mentioned above, administrators are allowed to influence how the database resources are allocated to users. This is done through another feature, called Resource Manager. The system also provides capabilities to schedule and perform jobs in the grid, through the Scheduler feature.

3 Contribution: interfacing a database system with a grid-based data storage service

3.1 The database system: Berkeley DB

Berkeley DB [11] is a database library-style toolkit written completely in the C programming language. It contains almost 1,800,000 lines of code, structured in many APIs. The library provides a broad base of functionality to application developers. An overview of its features and provided services is illustrated in Figure 3.

Main features

The system uses a simple function-call interface for all operations; there is no query language to parse and no execution plan to produce. One big advantage of this library is that it is open-source, so the complete source code is available and can be modified according to one's needs. The library is embedded, since it runs in the address space of the application that uses it, inside the same process. As a result, the database operations happen inside the library and require no inter-process communication. Another advantage of Berkeley DB is that it is scalable. Firstly, even though the library is quite small, it can manage databases of up to 256 Terabytes and records of up to 4 Gigabytes. Secondly, it supports high concurrency, thousands of users being able to perform operations on the same database at the same time. Another interesting feature of Berkeley DB is its configurability: applications can select the storage structure that provides the fastest access to their data, as well as the database services they need (e.g., the degree of logging, locking, concurrency or recoverability). Moreover, applications can choose whether to store database pages on the hard disk or in Berkeley DB's page cache.

Record structure and storage

Records in Berkeley DB are (key, value) pairs. Some simple operations on records are supported: inserting records in tables, deleting them from tables, searching records by their key and updating found records. Values of any data type can be stored in a Berkeley DB database, no matter how complex they are. Berkeley DB does not operate on the value part of a record: the system cannot decompose the value into constituent parts that it could further use and analyze. Thus, it can provide no information about the contents or structure of the stored value; the application must know the structure of the keys and values that it uses. The data of Berkeley DB databases are stored on disk. For the sake of efficiency, Berkeley DB uses an in-memory cache which allows for grouped flushing onto disk.

Access methods

Berkeley DB supports four types of storage structures. Hash tables are suitable for very large databases where the time necessary to do a search or an update operation can easily be predicted. They help fetch records for which the exact key is provided, but not records with similar keys. Btrees are the structures suitable for range searches, when the application needs to find all the records with keys between two known values. Since in this structure similar keys are stored close to one another, it is very convenient to fetch the values related to nearby keys. This type of structure is the default one in Berkeley DB. For applications that need to store and fetch records, but cannot easily generate keys by themselves, the best choice is record-number-based storage. In this approach, the record numbers, generated automatically by Berkeley DB, represent the keys for the records.

Figure 3: Berkeley DB features

Queues are suitable for applications that create a lot of records and then must process them in the creation order. These structures store fixed-length records and they use record numbers as keys, too. They are designed for fast record insertions at the tail of the queue and retrieval at the head. In this access method, locking at the record level is used.

Data management services provided

Berkeley DB offers several data management services which work with all available storage structures. To begin with, the system supports concurrency, allowing multiple users to work on the same record without interfering with one another. Simultaneous readers and writers are supported thanks to the locking system, which is used by the access methods to acquire the right to read or write database pages. Transactions are also supported in Berkeley DB. The system uses two-phase locking to ensure that concurrent transactions are isolated from one another. The transaction system uses write-ahead logging protocols to guarantee the recoverability of the changes performed, in the case of a failure. Thus, when an application starts, it can ask Berkeley DB to run recovery, which will restore the database to a clean, consistent state.

Difference to relational databases

Berkeley DB is not a relational database. One important difference to these latter systems is that Berkeley DB does not support SQL queries; access to data is done through the API provided. The advantage of a relational database is that it knows everything about the data and can execute queries in a high-level language, without any programming being required. In Berkeley DB, on the other hand, the application developer must understand how the data is represented and accessed, and must write the code that will get and store records. The advantage of systems like Berkeley DB is that the overhead of query parsing, optimization and execution is eliminated; thus, a low-level program can be very fast. Another difference to relational databases is that Berkeley DB has no notion of schema (i.e., structure of records in tables, relationships among the tables of a database, etc.) or data types. An interesting point is that relational databases can be built on top of Berkeley DB. For example, the MySQL relational database system [13] does the SQL parsing and execution by itself, but relies on Berkeley DB for the storage level.
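To illustrate the function-call interface described above, the following minimal sketch uses the standard Berkeley DB C API to open a database with the Btree access method and to store and fetch one (key, value) pair. The file name and record contents are example values only, and error handling is kept to a minimum.

    /*
     * Minimal sketch of the Berkeley DB function-call interface: create a
     * handle, select the Btree access method, then store and fetch one
     * (key, value) pair. File and key names are example values.
     */
    #include <string.h>
    #include <stdio.h>
    #include <db.h>

    int main(void)
    {
        DB *dbp;
        DBT key, data;
        char *k = "sku-1001", *v = "10 boxes of widgets";

        /* Create the database handle and open it as a Btree. */
        if (db_create(&dbp, NULL, 0) != 0)
            return 1;
        if (dbp->open(dbp, NULL, "inventory.db", NULL,
                      DB_BTREE, DB_CREATE, 0664) != 0)
            return 1;

        /* Records are (key, value) pairs passed as DBTs. */
        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = k;   key.size  = strlen(k) + 1;
        data.data = v;  data.size = strlen(v) + 1;

        dbp->put(dbp, NULL, &key, &data, 0);            /* insert the record */

        memset(&data, 0, sizeof(data));
        if (dbp->get(dbp, NULL, &key, &data, 0) == 0)   /* fetch it back */
            printf("%s -> %s\n", k, (char *)data.data);

        dbp->close(dbp, 0);
        return 0;
    }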

Figure 4: Relationships between the roles of BlobTamer

3.2 The grid-based data storage service: BlobTamer

BlobTamer is a system developed within the Paris Team, at IRISA. It is written in the C++ programming language and has approximately 23,000 lines of code. The name reflects the fact that it manages blobs (binary large objects) efficiently and makes them more user-friendly.

Managing massive data in large-scale distributed environments

In this model, the data considered are strings whose size is in the order of Terabytes, which cannot fit in the memory of a single node. Storing such large data naturally requires data fragmentation and distributed storage, which are offered by grids. It is assumed that access to the data is fine-grain: each individual read or write operation concerns only a segment of the string, in the order of Megabytes, microscopic with respect to the whole string. Also, the environment considered is highly concurrent: the write and read accesses are concurrent, unpredictable and very frequent. The strings are fragmented into small, equally-sized pages, which are distributed in the local memory of a large number of nodes. Upon creation, a page is labeled with the version number at which it has been created. A concatenation of consecutive pages is called a segment. A set of metadata makes the connection between an access request and the list of pages that store the corresponding data.

The roles in the system

The system consists of distributed processes that communicate through remote procedure calls (RPCs). A physical node can run one or more processes and, at the same time, may play multiple roles from the ones mentioned below. There are five types of processes in the system. There may be one or more concurrent clients that issue READ and WRITE requests. The system is not aware of their number, which may vary in time. The pages created by WRITE operations are physically stored in the local memory of data providers. On entering the system, each data provider registers with the provider manager. This entity is responsible for providing a list of available data providers to clients who issue WRITE requests. For each request, the provider manager decides which data providers should be used, based on a strategy that ensures load balancing. It periodically receives updates from the data providers regarding their available space.

Figure 5: Interactions between the actors: reads (left) and writes (right)

The list returned will therefore contain providers with larger available space and lower load. Also, as many distinct providers as possible are enlisted, which allows efficient parallel access to the pages. The metadata generated upon the creation of new pages by WRITE requests are physically stored by the metadata provider. Its purpose is to help clients who issue READ requests locate the providers that store the pages corresponding to the required segment of the string. To allow concurrent access to metadata, the metadata provider is implemented on top of an off-the-shelf, stable and scalable distributed hash table: BambooDHT [10]. The version manager stores the number of the last published version of a data string. It serializes WRITE requests to each string and supplies the latest published string version to READ requests. All operations on the version manager are atomic, since it is protected by a lock. The relationships between these types of processes are illustrated in Figure 4.

How writes and reads are performed

A WRITE request begins with the client contacting the provider manager to obtain a list of providers, one for each page of the segment. Then, the client contacts, in parallel, the providers in the list and requests them to store the pages. After executing the request, each provider sends an acknowledgement to the client. Only when it has received all the acknowledgements, and so is sure all the pages are written on data providers, does the client contact the version manager, requesting a new version number. If an error occurs while writing the pages, the version manager is not contacted at all. The version number is used by the client to generate the metadata corresponding to the data already written, which it sends to the metadata provider, in parallel. After receiving the acknowledgement, the client reports the success to the version manager. The typical scenario for a READ request begins when the client contacts the version manager to get the last version of the corresponding data string. If the version specified is larger than the latest available version, the READ will fail. Otherwise, the client contacts the metadata provider and retrieves, in parallel, the metadata describing the pages of the requested segment. After gathering all the metadata, it contacts, in parallel, the data providers that store the corresponding pages. The two scenarios are illustrated in Figure 5.

The function calls provided

The service provides three primitives: one for allocating memory and two for manipulating strings. The ALLOC primitive takes two parameters (pagesize and stringsize) and creates an all-zero string of the provided size. The pagesize parameter specifies the size of the pages that the string will be fragmented into. The primitive generates a unique id for the string being allocated, an id which must be specified by clients as an input parameter to the other two primitives.

    id = ALLOC(pagesize, stringsize)

The WRITE primitive modifies a string given by its id with the contents of a buffer of length size at a specified offset, all these parameters being provided by the client. The call generates a new version number, corresponding to the new (modified) version of the data string.

    vw = WRITE(id, buffer, offset, size)

A READ primitive takes a segment (specified by an offset and a size) from a string (specified by its id) and puts it into a buffer. The version v of the string from which the segment must be taken is also provided. The READ fails if the specified version of the string is not yet available.

    vr = READ(id, v, buffer, offset, size)

Experimental results

Evaluations of the system have been performed using 100 nodes, taken from 2 sites of the Grid 5000 [12] testbed. Two experimental settings were used: one in which the client was located in the same cluster as the data and metadata providers, and one in which the client was located in a different, remote grid cluster. The latency between the client and the data providers is much higher in the second setting (25 ms) than in the first (0.1 ms). Two experiments were performed. The purpose of the first was to evaluate how the metadata scheme influences the performance of data accesses. The time required for metadata to be completely read, respectively written, was measured. In the first setting it was observed that increasing the number of providers did not impact the time required to perform a READ operation, whereas it improved the time required to perform a WRITE operation. For the latter, this advantage is more visible when writing larger segments. In the second setting it was concluded that the higher latency had a significant impact on the cost of reading the metadata, while this impact is much lower for WRITE operations. The second experiment aimed at evaluating how efficiently the lock-free scheme supports highly concurrent data accesses. The average bandwidth per client was measured for READ and WRITE requests, while increasing the number of clients. In both settings it was noticed that the bandwidth per client decreases very slowly even when the number of concurrent accesses increases significantly. Moreover, the decrease in the read bandwidth is even smaller if client-side caching is used. The two experiments thus showed that the system scales well, both in terms of storage providers and of concurrent clients, without significantly affecting performance.
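The call sequence of the three primitives can be illustrated with a small sketch. The C signatures below are assumptions inferred from the prototypes above, and the primitive bodies are a trivial single-process mock (one in-memory string and one version counter) included only so that the example runs; they are not the distributed service itself.

    /*
     * Sketch of the ALLOC / WRITE / READ call sequence. The signatures are
     * assumptions inferred from the prototypes in the text; the bodies are
     * a single-process mock, NOT the distributed service.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef long blob_id_t;
    typedef long version_t;

    static char     *g_string;   /* mock: the single allocated string   */
    static size_t    g_size;
    static version_t g_version;  /* mock: last published version number */

    blob_id_t ALLOC(size_t pagesize, size_t stringsize)
    {
        (void)pagesize;                    /* fragmentation not mocked */
        g_string = calloc(stringsize, 1);  /* all-zero string          */
        g_size = stringsize;
        g_version = 0;
        return 1;                          /* unique id of the string  */
    }

    version_t WRITE(blob_id_t id, const char *buf, size_t offset, size_t size)
    {
        (void)id;
        memcpy(g_string + offset, buf, size);
        return ++g_version;                /* new published version    */
    }

    version_t READ(blob_id_t id, version_t v, char *buf, size_t offset, size_t size)
    {
        (void)id;
        if (v > g_version) return -1;      /* version not yet available */
        memcpy(buf, g_string + offset, size);
        return v;
    }

    int main(void)
    {
        char out[16] = {0};

        blob_id_t id = ALLOC(65536, 1 << 20);      /* 64 KB pages, 1 MB string */
        version_t vw = WRITE(id, "hello blob", 4096, 11);
        version_t vr = READ(id, vw, out, 4096, 11);

        printf("wrote version %ld, read back \"%s\" (version %ld)\n",
               (long)vw, out, (long)vr);
        return 0;
    }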

Discussion

One of the most important advantages of the system is that it allows efficient, large-scale concurrent access to the data strings, without locking them. The versioning technique allows this: concurrent writes to the same page can be performed in parallel, because they access different versions of that page. Read operations can also be performed in parallel, once each client has received the latest version from the version manager. The system also provides some fault tolerance mechanisms, through the off-the-shelf DHT on top of which the metadata provider is implemented.

3.3 Storing the data of Berkeley DB using BlobTamer

Selecting the level to interface

In order to achieve the interfacing of the two systems described above and, thus, to leave the job of storing Berkeley DB's data to BlobTamer, the database system's architecture had to be taken into consideration. The Berkeley DB library has a layered architecture, composed of five major subsystems.

Access Method. The Access Method subsystem provides general-purpose support for creating and accessing database files formatted as btrees, hashed files, and fixed- and variable-length records.

Memory (Buffer) Pool. The Memory Pool subsystem (or buffer manager, as it is known in the literature) is the general-purpose shared memory buffer pool used by Berkeley DB. This is the shared memory cache that allows multiple processes, and threads within processes, to share access to databases.

Transaction. The Transaction subsystem allows a group of database modifications to be treated as an atomic unit, so that either all of the changes are done, or none of them are. It implements the Berkeley DB transaction model.

Locking. The Locking subsystem is the general-purpose lock manager used by Berkeley DB. This module is useful outside of the Berkeley DB package for processes that require a portable, fast, configurable lock manager.

Logging. The Logging subsystem is the write-ahead logging used to support the Berkeley DB transaction model. It is largely specific to the Berkeley DB package, and unlikely to be useful elsewhere except as a supporting module for the Berkeley DB transaction subsystem.

In addition to the above-mentioned subsystems, there is also a Storage layer, as in any other database management system. In this model, illustrated in Figure 6, the application makes calls to the access methods. When applications require recoverability, their calls to the Access Method subsystem must be wrapped in calls to the Transaction subsystem. The Access Method and Transaction subsystems in turn make calls into the Memory Pool, Locking and Logging subsystems on behalf of the application. The underlying subsystems can be used independently by applications. For example, the Memory Pool subsystem can be used apart from the rest of Berkeley DB by applications simply wanting a shared memory buffer pool.

Figure 6: The Berkeley DB architecture

Similarly, the Locking subsystem may be called directly by applications that do their own locking outside of Berkeley DB. However, this usage is not common, and most applications will either use only the Access Method subsystem, or the Access Method subsystem wrapped in calls to the Berkeley DB transaction interfaces. As stated above, the Access Method and Transaction subsystems use the underlying shared memory buffer pool (cache) to hold recently used file pages in main memory. The pages have to be in main memory for the database management system to operate on them. The Memory Pool subsystem receives page requests from the upper layers and provides handles for the underlying files. The handles are then used to retrieve pages from these files. When the pages are returned, if the requestor indicates that a page has been modified (i.e., the page is dirty), the page is written to the disk. This memory buffer pool handles all operations related to pages in a transparent way: the upper layers are not aware that not all data is in memory at one time. If the cache is full and a new page needs to be inserted, a page is selected and discarded from the pool. The selection is based on a least-recently-used algorithm: the page that has stayed the longest time in the cache without being accessed is replaced.

An important aspect at this point is selecting the layer to implement for a successful interfacing with BlobTamer. One natural choice is the Memory Pool, since page management support is provided by BlobTamer directly. An in-depth study not only of the layers, but also of the interactions between the layers, is necessary in order to provide a correct interfacing. There is a tight coupling of the Logging layer with the upper layers in order to provide recovery support. Because of that, this layer would have to be implemented as well, if the Memory Pool layer were chosen. On the other hand, the Storage layer acts as the backbone for both the Buffer Pool and the Logging layers: both of these layers use the Storage layer directly to store their data, as can be seen in Figure 6. Implementing the Storage layer is much simpler, because it implies just a file-system functionality on top of BlobTamer. This approach also makes debugging easier, because the implementation is at a lower layer.

Moreover, it enables the study of access patterns (reads and writes) at page level, which might lead to optimizations for the read/write operations in the new version of the Storage layer. The potential introduction of such optimizations justifies the choice of implementing the Storage layer instead of using a distributed file system (like NFS).

Methodology applied

Before implementing the Storage layer, a few things needed to be studied. First, it was important to know how and where Berkeley DB stored its data and metadata (if it created any), how many physical files were created for each table, and whether temporary files were created while writing the data. Testing some applications that use Berkeley DB was required to study these aspects. Second, it was important to see how the system uses some basic system calls related to file operations (read, write, open, close, flush, fsync, etc.), concentrating especially on the parameters used to read and write data (e.g., the offset in the file and the size of the operation). The possibility of changing the code in the wrappers of these system calls was analyzed, too.

The tested application

The application needed for the tests was an example written in C, provided in the Berkeley DB download toolkit. It concerns some products and the vendors that sell them. The input data were provided in text files (one for the vendors and another for the products), with one record per line. The application consists of two programs. One program creates the database, the tables (files on the hard disk) and loads the data from the input files into the tables. Another program searches for a specific product, reads the data from the tables and displays information about the product and the vendors that sell it. The first program corresponds to the CREATE DATABASE and INSERT commands in SQL, while the second program corresponds to the SELECT command. In these programs data can be easily manipulated, by means of the put and get methods (provided by the Berkeley DB library), as can be seen from the code fragments below.

    /* Define the data structures for the application */
    typedef struct stock_dbs {
        DB *inventory_dbp;
        DB *vendor_dbp;
        DB *itemname_sdbp;
    } STOCK_DBS;
    ...
    STOCK_DBS my_stock;
    ...
    /* Set the page size */
    my_stock.vendor_dbp->set_pagesize(my_stock.vendor_dbp, 65536);
    ...
    /* Put data into the database */
    my_stock.inventory_dbp->put(my_stock.inventory_dbp, NULL, &key, &data, 0);
    ...
    /* Get a record */
    my_stock.vendor_dbp->get(my_stock.vendor_dbp, NULL, &key, &data, 0);
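To give an idea of what a file-system-like Storage layer on top of BlobTamer could look like, the following sketch redirects page-level reads and writes at a file offset to the READ and WRITE primitives. The wrapper names, the file-to-blob table and the stubbed primitives are hypothetical illustrations under the assumptions above, not the implementation presented in Section 4.

    /*
     * Sketch of a file-system-like Storage layer on top of the grid service:
     * each database file maps to one blob id, and page reads/writes at a file
     * offset become READ/WRITE calls on that blob. The primitives are stubbed
     * here only so that the sketch compiles and runs.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    typedef long blob_id_t;
    typedef long version_t;

    /* Stand-ins for the grid service primitives (see the previous sketch). */
    static version_t WRITE(blob_id_t id, const void *buf, size_t off, size_t size)
    { (void)id; (void)buf; (void)off; return (version_t)size; }
    static version_t READ(blob_id_t id, version_t v, void *buf, size_t off, size_t size)
    { (void)id; (void)off; memset(buf, 0, size); return v; }

    #define MAX_FILES 64

    /* Hypothetical table mapping a database file handle to its blob. */
    static struct {
        blob_id_t blob;
        version_t last_version;   /* latest version published for this file */
    } file_table[MAX_FILES];

    /* Replacement for the storage layer's page write: redirect to the blob. */
    ssize_t grid_pwrite(int fd, const void *page, size_t pagesize, size_t offset)
    {
        file_table[fd].last_version =
            WRITE(file_table[fd].blob, page, offset, pagesize);
        return (ssize_t)pagesize;
    }

    /* Replacement for the storage layer's page read: fetch the latest version. */
    ssize_t grid_pread(int fd, void *page, size_t pagesize, size_t offset)
    {
        version_t v = READ(file_table[fd].blob, file_table[fd].last_version,
                           page, offset, pagesize);
        return v < 0 ? -1 : (ssize_t)pagesize;
    }

    int main(void)
    {
        char page[4096] = "a database page";
        file_table[3].blob = 1;                 /* file descriptor 3 -> blob 1 */
        grid_pwrite(3, page, sizeof(page), 0);
        grid_pread(3, page, sizeof(page), 0);
        printf("page round-trip through the grid wrappers done\n");
        return 0;
    }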


More information

Using Object Database db4o as Storage Provider in Voldemort

Using Object Database db4o as Storage Provider in Voldemort Using Object Database db4o as Storage Provider in Voldemort by German Viscuso db4objects (a division of Versant Corporation) September 2010 Abstract: In this article I will show you how

More information

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets!! Large data collections appear in many scientific domains like climate studies.!! Users and

More information

Tivoli Storage Manager Explained

Tivoli Storage Manager Explained IBM Software Group Dave Cannon IBM Tivoli Storage Management Development Oxford University TSM Symposium 2003 Presentation Objectives Explain TSM behavior for selected operations Describe design goals

More information

The Oracle Universal Server Buffer Manager

The Oracle Universal Server Buffer Manager The Oracle Universal Server Buffer Manager W. Bridge, A. Joshi, M. Keihl, T. Lahiri, J. Loaiza, N. Macnaughton Oracle Corporation, 500 Oracle Parkway, Box 4OP13, Redwood Shores, CA 94065 { wbridge, ajoshi,

More information

Review: The ACID properties

Review: The ACID properties Recovery Review: The ACID properties A tomicity: All actions in the Xaction happen, or none happen. C onsistency: If each Xaction is consistent, and the DB starts consistent, it ends up consistent. I solation:

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

The Service Availability Forum Specification for High Availability Middleware

The Service Availability Forum Specification for High Availability Middleware The Availability Forum Specification for High Availability Middleware Timo Jokiaho, Fred Herrmann, Dave Penkler, Manfred Reitenspiess, Louise Moser Availability Forum Timo.Jokiaho@nokia.com, Frederic.Herrmann@sun.com,

More information

CitusDB Architecture for Real-Time Big Data

CitusDB Architecture for Real-Time Big Data CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing

More information

CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL

CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL This chapter is to introduce the client-server model and its role in the development of distributed network systems. The chapter

More information

Client/Server and Distributed Computing

Client/Server and Distributed Computing Adapted from:operating Systems: Internals and Design Principles, 6/E William Stallings CS571 Fall 2010 Client/Server and Distributed Computing Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Traditional

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

MS-40074: Microsoft SQL Server 2014 for Oracle DBAs

MS-40074: Microsoft SQL Server 2014 for Oracle DBAs MS-40074: Microsoft SQL Server 2014 for Oracle DBAs Description This four-day instructor-led course provides students with the knowledge and skills to capitalize on their skills and experience as an Oracle

More information

MySQL Storage Engines

MySQL Storage Engines MySQL Storage Engines Data in MySQL is stored in files (or memory) using a variety of different techniques. Each of these techniques employs different storage mechanisms, indexing facilities, locking levels

More information

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

How To Make A Backup System More Efficient

How To Make A Backup System More Efficient Identifying the Hidden Risk of Data De-duplication: How the HYDRAstor Solution Proactively Solves the Problem October, 2006 Introduction Data de-duplication has recently gained significant industry attention,

More information

Base One's Rich Client Architecture

Base One's Rich Client Architecture Base One's Rich Client Architecture Base One provides a unique approach for developing Internet-enabled applications, combining both efficiency and ease of programming through its "Rich Client" architecture.

More information

HDFS Architecture Guide

HDFS Architecture Guide by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5

More information

Availability Digest. MySQL Clusters Go Active/Active. December 2006

Availability Digest. MySQL Clusters Go Active/Active. December 2006 the Availability Digest MySQL Clusters Go Active/Active December 2006 Introduction MySQL (www.mysql.com) is without a doubt the most popular open source database in use today. Developed by MySQL AB of

More information

Designing a Cloud Storage System

Designing a Cloud Storage System Designing a Cloud Storage System End to End Cloud Storage When designing a cloud storage system, there is value in decoupling the system s archival capacity (its ability to persistently store large volumes

More information

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters COSC 6374 Parallel I/O (I) I/O basics Fall 2012 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card

More information

FAWN - a Fast Array of Wimpy Nodes

FAWN - a Fast Array of Wimpy Nodes University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed

More information

Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

More information

In-memory databases and innovations in Business Intelligence

In-memory databases and innovations in Business Intelligence Database Systems Journal vol. VI, no. 1/2015 59 In-memory databases and innovations in Business Intelligence Ruxandra BĂBEANU, Marian CIOBANU University of Economic Studies, Bucharest, Romania babeanu.ruxandra@gmail.com,

More information

Big data management with IBM General Parallel File System

Big data management with IBM General Parallel File System Big data management with IBM General Parallel File System Optimize storage management and boost your return on investment Highlights Handles the explosive growth of structured and unstructured data Offers

More information

Network Attached Storage. Jinfeng Yang Oct/19/2015

Network Attached Storage. Jinfeng Yang Oct/19/2015 Network Attached Storage Jinfeng Yang Oct/19/2015 Outline Part A 1. What is the Network Attached Storage (NAS)? 2. What are the applications of NAS? 3. The benefits of NAS. 4. NAS s performance (Reliability

More information

COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters

COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters COMP5426 Parallel and Distributed Computing Distributed Systems: Client/Server and Clusters Client/Server Computing Client Client machines are generally single-user workstations providing a user-friendly

More information

Bigdata High Availability (HA) Architecture

Bigdata High Availability (HA) Architecture Bigdata High Availability (HA) Architecture Introduction This whitepaper describes an HA architecture based on a shared nothing design. Each node uses commodity hardware and has its own local resources

More information

Storage in Database Systems. CMPSCI 445 Fall 2010

Storage in Database Systems. CMPSCI 445 Fall 2010 Storage in Database Systems CMPSCI 445 Fall 2010 1 Storage Topics Architecture and Overview Disks Buffer management Files of records 2 DBMS Architecture Query Parser Query Rewriter Query Optimizer Query

More information

ORACLE INSTANCE ARCHITECTURE

ORACLE INSTANCE ARCHITECTURE ORACLE INSTANCE ARCHITECTURE ORACLE ARCHITECTURE Oracle Database Instance Memory Architecture Process Architecture Application and Networking Architecture 2 INTRODUCTION TO THE ORACLE DATABASE INSTANCE

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Client/Server Computing Distributed Processing, Client/Server, and Clusters

Client/Server Computing Distributed Processing, Client/Server, and Clusters Client/Server Computing Distributed Processing, Client/Server, and Clusters Chapter 13 Client machines are generally single-user PCs or workstations that provide a highly userfriendly interface to the

More information

Guide to Scaling OpenLDAP

Guide to Scaling OpenLDAP Guide to Scaling OpenLDAP MySQL Cluster as Data Store for OpenLDAP Directories An OpenLDAP Whitepaper by Symas Corporation Copyright 2009, Symas Corporation Table of Contents 1 INTRODUCTION...3 2 TRADITIONAL

More information

Physical Data Organization

Physical Data Organization Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor

More information

Operating Systems CSE 410, Spring 2004. File Management. Stephen Wagner Michigan State University

Operating Systems CSE 410, Spring 2004. File Management. Stephen Wagner Michigan State University Operating Systems CSE 410, Spring 2004 File Management Stephen Wagner Michigan State University File Management File management system has traditionally been considered part of the operating system. Applications

More information

InfiniteGraph: The Distributed Graph Database

InfiniteGraph: The Distributed Graph Database A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 1: Distributed File Systems Finding a needle in Haystack: Facebook

More information

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013 F1: A Distributed SQL Database That Scales Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013 What is F1? Distributed relational database Built to replace sharded MySQL back-end of AdWords

More information

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

More information

Using In-Memory Computing to Simplify Big Data Analytics

Using In-Memory Computing to Simplify Big Data Analytics SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

More information

Chapter 12 File Management

Chapter 12 File Management Operating Systems: Internals and Design Principles Chapter 12 File Management Eighth Edition By William Stallings Files Data collections created by users The File System is one of the most important parts

More information

BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency

BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency Gabriel Antoniu 1, Luc Bougé 2, Bogdan Nicolae 3 KerData research team 1 INRIA Rennes -

More information

Database System Architecture & System Catalog Instructor: Mourad Benchikh Text Books: Elmasri & Navathe Chap. 17 Silberschatz & Korth Chap.

Database System Architecture & System Catalog Instructor: Mourad Benchikh Text Books: Elmasri & Navathe Chap. 17 Silberschatz & Korth Chap. Database System Architecture & System Catalog Instructor: Mourad Benchikh Text Books: Elmasri & Navathe Chap. 17 Silberschatz & Korth Chap. 1 Oracle9i Documentation First-Semester 1427-1428 Definitions

More information

Microsoft SQL Server for Oracle DBAs Course 40045; 4 Days, Instructor-led

Microsoft SQL Server for Oracle DBAs Course 40045; 4 Days, Instructor-led Microsoft SQL Server for Oracle DBAs Course 40045; 4 Days, Instructor-led Course Description This four-day instructor-led course provides students with the knowledge and skills to capitalize on their skills

More information

In-memory Tables Technology overview and solutions

In-memory Tables Technology overview and solutions In-memory Tables Technology overview and solutions My mainframe is my business. My business relies on MIPS. Verna Bartlett Head of Marketing Gary Weinhold Systems Analyst Agenda Introduction to in-memory

More information

White Paper. Optimizing the Performance Of MySQL Cluster

White Paper. Optimizing the Performance Of MySQL Cluster White Paper Optimizing the Performance Of MySQL Cluster Table of Contents Introduction and Background Information... 2 Optimal Applications for MySQL Cluster... 3 Identifying the Performance Issues.....

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Storage Systems Autumn 2009. Chapter 6: Distributed Hash Tables and their Applications André Brinkmann

Storage Systems Autumn 2009. Chapter 6: Distributed Hash Tables and their Applications André Brinkmann Storage Systems Autumn 2009 Chapter 6: Distributed Hash Tables and their Applications André Brinkmann Scaling RAID architectures Using traditional RAID architecture does not scale Adding news disk implies

More information

SCALABILITY AND AVAILABILITY

SCALABILITY AND AVAILABILITY SCALABILITY AND AVAILABILITY Real Systems must be Scalable fast enough to handle the expected load and grow easily when the load grows Available available enough of the time Scalable Scale-up increase

More information

Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at

Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at distributing load b. QUESTION: What is the context? i. How

More information

Scala Storage Scale-Out Clustered Storage White Paper

Scala Storage Scale-Out Clustered Storage White Paper White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

XenData Archive Series Software Technical Overview

XenData Archive Series Software Technical Overview XenData White Paper XenData Archive Series Software Technical Overview Advanced and Video Editions, Version 4.0 December 2006 XenData Archive Series software manages digital assets on data tape and magnetic

More information

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Middleware for Heterogeneous and Distributed Information Systems

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Middleware for Heterogeneous and Distributed Information Systems Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 2 Architecture Chapter Outline Distributed transactions (quick

More information

IBM Tivoli Storage Manager Version 7.1.4. Introduction to Data Protection Solutions IBM

IBM Tivoli Storage Manager Version 7.1.4. Introduction to Data Protection Solutions IBM IBM Tivoli Storage Manager Version 7.1.4 Introduction to Data Protection Solutions IBM IBM Tivoli Storage Manager Version 7.1.4 Introduction to Data Protection Solutions IBM Note: Before you use this

More information

BookKeeper overview. Table of contents

BookKeeper overview. Table of contents by Table of contents 1 BookKeeper overview...2 1.1 BookKeeper introduction... 2 1.2 In slightly more detail...2 1.3 Bookkeeper elements and concepts...3 1.4 Bookkeeper initial design... 3 1.5 Bookkeeper

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

COS 318: Operating Systems

COS 318: Operating Systems COS 318: Operating Systems File Performance and Reliability Andy Bavier Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall10/cos318/ Topics File buffer cache

More information

Recovery Principles in MySQL Cluster 5.1

Recovery Principles in MySQL Cluster 5.1 Recovery Principles in MySQL Cluster 5.1 Mikael Ronström Senior Software Architect MySQL AB 1 Outline of Talk Introduction of MySQL Cluster in version 4.1 and 5.0 Discussion of requirements for MySQL Cluster

More information

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction

More information

<Insert Picture Here> Getting Coherence: Introduction to Data Grids South Florida User Group

<Insert Picture Here> Getting Coherence: Introduction to Data Grids South Florida User Group Getting Coherence: Introduction to Data Grids South Florida User Group Cameron Purdy Cameron Purdy Vice President of Development Speaker Cameron Purdy is Vice President of Development

More information

File-System Implementation

File-System Implementation File-System Implementation 11 CHAPTER In this chapter we discuss various methods for storing information on secondary storage. The basic issues are device directory, free space management, and space allocation

More information

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 ICOM 6005 Database Management Systems Design Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 Readings Read Chapter 1 of text book ICOM 6005 Dr. Manuel

More information