Scaling Distributed Database Management Systems by using a Grid-based Storage Service

Master Thesis

Silviu-Marius Moldovan
Marius.Moldovan@irisa.fr

Supervisors: Gabriel Antoniu, Luc Bougé
{Gabriel.Antoniu,Luc.Bouge}@irisa.fr

Keywords: databases, grids, scalability, performance, fault tolerance, consistency.

Abstract. This report deals with databases with distributed storage, focusing on the problem of insufficient storage space. In order to make these systems more scalable, the advantages offered by grids can be taken into consideration. Thus, an approach to creating an interface between a database system and a grid-based data storage service is presented.

Research Master Degree in Computer Science
IRISA, Paris Team
Rennes, June 2008

Contents

1 Introduction
   1.1 Motivation
   1.2 Current scenario
   1.3 Proposal
2 Storing data in databases: state of the art
   2.1 Efficiency: main-memory databases
       Main-memory databases: a few new issues
       Case study: The DERBY data storage system
   2.2 Scalability: distributed main-memory databases
       Motivation
       Case study: Sprint [2007]
   2.3 Going larger: using grids
       Context
       Grid data sharing services
       Grid-based databases in practice
3 Contribution: interfacing a database system with a grid-based data storage service
   3.1 The database system: Berkeley DB
   3.2 The grid-based data storage service: BlobTamer
   3.3 Storing the data of Berkeley DB using BlobTamer
       Selecting the level to interface
       Methodology applied
       How Berkeley DB reads and writes
4 Implementation details
       Mapping files into memory
       Re-implementing the read and write operations
       Re-implementing collateral operations
5 Conclusion
       Contribution
       Other approaches
       Future work

1 Introduction

1.1 Motivation

One of the most important aspects related to data is the way it is stored. The most common way to store data nowadays relies on databases. Almost every type of application uses them, ranging from those performing intensive scientific computation to those storing the personal data of employees in a factory. Initially, databases stored their data on the disk space of one machine. The evolution of the database domain has been driven by the need for greater efficiency and larger storage space. In-memory databases are one way to increase access performance. In this approach, the data is stored in main memory and a backup copy may be kept on disk. Since memory access is faster than disk access, access times can be reduced. But the storage capacity of the database is still limited to the memory of a single machine. A simple approach to extend databases and increase the storage capacity is to distribute their data over more machines. Thus, one can use the storage capacities and the computing power of different nodes in the same cluster. Of course, additional problems may arise in the management of such systems. For example, in order to ensure data availability, a piece of data is replicated over several different nodes; the coherence of the multiple copies must then be maintained. Databases with distributed data are the subject of this study.

1.2 Current scenario

Nowadays, the amount of data to be stored is getting larger. Scientific applications, for example, perform numerical simulations and need large storage capacities and computing power. A larger space than the one offered by a single-node database is required. Distributing the data of databases over a cluster of computers is one solution. But clusters typically have only a few tens of nodes. If the data is to be stored in the memory of the machines in the cluster, for the sake of efficiency, then the space might not be sufficient. For example, if each node has 2 Gigabytes of RAM and there are 30 nodes in the cluster, only 60 Gigabytes of storage will be available. Grids might represent the solution for further extending the storage capacity of databases. But resources in grids are heterogeneous, and new problems related to data management might arise.

1.3 Proposal

The most convenient approach to using grids relies on grid data-sharing services [2]. Besides transparent access to data, these services also provide persistence and consistency of data, in a fault-tolerant way. Applications can thus benefit from these properties and let the service mediate all operations between them and the grids that store their data. An approach enabling databases to use grids for storing their data already exists [1]. It uses the JuxMem ([3], [2]) grid data-sharing service and allows performing some basic operations like table creation, record insertion or simple select queries. But this approach assumes extending the grid data-sharing service by adding a new API on top of it. It does not take into consideration the database management system that will use this API. The problem has not been studied from the perspective of a database system. It would be interesting to study an open-source database system and try to find an appropriate level at which to redirect the data onto a grid-based data storage service.

Thus, an interface could be created between the database engine and the above-mentioned service, at a level which is convenient for the database engine. The rest of the report is structured as follows: Section 2 presents the newest concepts in database design and analyzes the possibilities of using grids for storing the data of databases; Section 3 presents a database engine and a grid-based storage service, and analyzes the possibility of creating an interface between the two systems; Section 4 presents the most important implementation details of the interface; finally, Section 5 concludes on the contributions of this work and on the possibilities of continuing it.

2 Storing data in databases: state of the art

Old-fashioned databases store their data on the disk space of one machine. But that is not practical any more: databases must now be more efficient, and the amounts of data to be stored keep increasing. This has led to a revolution in database design. The main ideas and concepts that have appeared lately are presented in this section.

2.1 Efficiency: main-memory databases

One of the first changes that occurred was using main memory as the storage medium. In main-memory database systems data is stored in the main physical memory, as opposed to the conventional approach in which data is stored on disk. Even if the data may have a copy on disk, the primary copy lives permanently in memory, and this has a deep influence on the design and performance of this type of database system. Storing large databases in memory has become a reality nowadays, since memory is getting cheaper and chip capacity keeps increasing.

Main-memory databases: a few new issues

The properties that distinguish main memory from magnetic disks lead to optimizations in most aspects of database management. Using memory-resident data has an impact on several functional components of database management systems, as shown in [6]. These new issues are illustrated below.

Concurrency Control

The most commonly used concurrency control methods in practice are lock-based. In conventional database systems, where data are disk-resident, small locking granules (fields or records) are chosen to reduce contention. In main-memory database systems, on the other hand, contention is already low because data are memory-resident. Thus, for these systems it has been suggested that very large lock grains (e.g., relations) are more suitable. In the extreme case, the lock granule can be the entire database, which leads to a desirable, serial execution of transactions. In a conventional system, locks are implemented through a hash table that contains entries for the locked objects; the objects themselves, located on disk, contain no lock information. In main-memory systems, in contrast, a few bits in the objects can be used to represent lock status. The overhead of a small number of bits is not significant for records of reasonable length, and lookups through the entire hash table are avoided.

Persistence

There are several steps to achieve persistence in a database system.

Logging

First of all, a log of transaction activity is kept. This log must reside on stable storage. In conventional systems, it is copied directly to the disk. But logging can affect response time, as well as throughput, if the log becomes a bottleneck. These two problems exist in main-memory systems too, but their impact on performance is bigger, since logging is the only disk operation required by each transaction. If logging takes too much time, the overall performance of the system will decrease. To eliminate the response time problem, one solution is to use a small amount of main memory to hold a portion of the log (the log tail).

After its log information is written into memory, a transaction is considered committed. If there is not enough memory for the log tail, the solution is to pre-commit transactions. This implies releasing the transaction's locks as soon as its log record is placed in the log, without waiting for the information to be propagated to the disk. To relieve a log bottleneck, group commits can be used: the log records of several transactions are allowed to accumulate in memory, and they are flushed to the disk in a single disk operation.

Checkpointing

The second step in achieving persistence is to keep the disk-resident copy of the database up-to-date. This is done by checkpointing, which is one of the few reasons to access the disk-resident copy of the database in a main-memory database system. Thus, disk access can be tailored to the needs of the checkpointer: disk I/O is performed using very large blocks, which are written more efficiently.

Recovery

The third step to achieve persistence is recovery from failures, which implies restoring the data from its disk-resident backup. To do that in main-memory systems, one solution that increases performance is to load blocks of the database on demand until all data has been loaded. Another solution is to use disk striping: the database is spread across multiple disks and read in parallel.

Access methods

In conventional systems, index structures like B-Trees are used for data access. They have a short, bushy structure, and the data values on which the index is built need to be stored in the index itself. In main-memory systems, hashing is one solution to access data. It provides fast lookup and update, but is not as space-efficient as a tree, according to [6]. Thus, tree structures such as the T-Tree have been designed explicitly for memory-resident databases [8]. As opposed to B-Trees, these trees can have a deeper, less complicated structure, so that traversing them is much faster. Also, because random access is fast in main memory, pointers can be followed efficiently. Therefore, the index structures can store pointers to the indexed data, rather than the data itself or block identifiers, which is more efficient [6].

Data representation

In main-memory databases, pointers are used for data representation: relational tuples can be represented as a set of pointers to data values. Using pointers has two main advantages, as stated in [6]: it is space-efficient (if large values appear more than once in the database) and it simplifies the handling of variable-length fields.

Case study: The DERBY data storage system

The DERBY data storage system, described in [7], is used to support a distributed, memory-resident database system. DERBY consists of general-purpose workstations, connected through a high-speed network, and makes use of them to create a massive, cost-effective main-memory storage system. Workstations in DERBY are classified into servers and clients. Servers run on machines with idle resources and provide the storage space for the DERBY memory storage system. They have the lowest priority and are "guests" of the machines they borrow resources from. One important advantage of DERBY is that, at any moment, each workstation can be operating as a client, a server, both or neither (as can be seen in Figure 1), and this may change over time.

Figure 1: The DERBY architecture (node 5 serves as the UPS for nodes 1, 2, 3 and 4)

Another advantage is that the system configuration models a dynamic, realistic data processing environment where workstations come and go over time. The primary role of a DERBY client is to forward read/write requests from the database application to the DERBY server where the record is located. DERBY's basic data storage abstraction assumes there is exactly one server responsible for each record. A server where the record is stored is called the primary location of the record; this location may change over time, due to the dynamic nature of the system.

The primary role of a server in DERBY is to keep all records in memory and to avoid disk accesses when satisfying client requests. Servers guarantee long-term data persistence (they eventually propagate modified records to the disk), but also short-term persistence without disk storage. The latter is achieved through a new and interesting approach: a part of the nodes in the network are supplied with Uninterruptible Power Supplies (UPS) which temporarily hold data. This is a cost-effective way to achieve reasonably large amounts of persistent storage. To provide high-speed persistent storage, each server is associated with a number of workstations equipped with these UPSs. Also, each workstation with a UPS can provide service to several servers. In Figure 1, for example, node 5 is a UPS for the other nodes. In the case of failures, the lost data is recovered from the logs kept on disks, or from recently modified log buffers stored in the UPS workstations.

To guarantee data consistency and regulate concurrent access, DERBY provides a basic locking mechanism, with a lock granularity of a single record. This is quite unusual, since large granularities are considered most appropriate for memory-resident systems. Every client must hold locks for the records it operates on. For each record, the server maintains a list of the clients caching the record as read-only and as write-locked.

One important contribution of DERBY is that, during initial testing, it has been shown that load balancing has little effect on the performance of memory-based systems, in contrast to disk-based storage systems. Memory-based systems need to consider migrating load only when a limited resource is close to saturation. Thus, DERBY dynamically redistributes the load via a saturation prevention algorithm that is different from the conventional approach to load balancing. The algorithm attempts to optimize processing load constrained by memory space availability; more exactly, it finds the first acceptable distribution of load that does not exceed a predefined threshold of available memory space on any machine. This way, the servers are kept away from reaching their saturation points.
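The saturation prevention idea can be made concrete with a small sketch: place each piece of load on the first server whose projected memory usage stays below a fixed threshold. The C program below is only an illustration of such a first-acceptable placement; the server structure, sizes and threshold value are hypothetical and are not taken from the DERBY implementation.

    /*
     * Minimal sketch of saturation prevention: assign load to the first
     * server that stays below a memory-usage threshold. All values are
     * hypothetical illustrations, not DERBY code.
     */
    #include <stdio.h>

    #define NUM_SERVERS 4
    #define THRESHOLD   0.85   /* fraction of memory a server may use */

    typedef struct {
        double capacity;  /* total memory (MB)            */
        double used;      /* memory currently in use (MB) */
    } server_t;

    /* Return the index of the first server that can accept `size` MB
     * without crossing the saturation threshold, or -1 if none can. */
    static int place_load(const server_t *servers, int n, double size)
    {
        for (int i = 0; i < n; i++) {
            if ((servers[i].used + size) / servers[i].capacity <= THRESHOLD)
                return i;
        }
        return -1;  /* every server would saturate */
    }

    int main(void)
    {
        server_t servers[NUM_SERVERS] = {
            {2048, 1900}, {2048, 1200}, {2048, 600}, {2048, 100}
        };
        double record_set = 300;  /* MB of records to migrate */

        int target = place_load(servers, NUM_SERVERS, record_set);
        if (target >= 0)
            printf("placing %.0f MB on server %d\n", record_set, target);
        else
            printf("no server can accept the load without saturating\n");
        return 0;
    }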

2.2 Scalability: distributed main-memory databases

Motivation

Applications accessing an in-memory database are usually limited by the memory capacity of the machine hosting the database. Having a database whose data is distributed over a cluster of workstations brings several advantages. To begin with, the application can take advantage of the aggregated memory of all the machines in the cluster, so the available storage capacity will be larger. Then, by distributing the data over more nodes, fault tolerance can be achieved: copies of the same piece of data can be replicated over multiple nodes, so if one node fails, the data can be recovered from the other nodes. Finally, failure recovery will also be efficient, since the backup copies of the data that are closest to the failed node can be used to restore it.

Case study: Sprint [2007]

Sprint is a middleware infrastructure for high-performance and high-availability data management. It manages commodity in-memory databases running in a cluster of shared-nothing servers, as stated in [5]. Applications are then limited only by the aggregated capacity of the servers in the cluster. The hardware infrastructure of Sprint is represented by its physical servers, while the software infrastructure consists of logical servers: edge servers, data servers and durability servers. The advantage of this decoupled design is that each type of server can handle a different aspect of database management. Edge servers receive client queries and execute them against data servers. Data servers run a local, in-memory, off-the-shelf database management system and execute transactions without accessing the disk. Durability servers ensure transaction persistence and handle recovery. The server types can be identified in Sprint's architecture, illustrated in Figure 2.

Physical servers communicate by message-passing only, while logical ones can use point-to-point or total-order multicast communication. Physical servers can fail by crashing and may recover after the failure, but they lose all information stored in main memory before the crash. The failure of a physical server implies the failure of all the logical servers it hosts. Sprint tolerates unreliable failure detection, which relies on an entity providing a list of nodes that may have failed, not all of which have actually failed. The advantages of this are fast reaction to failures and the certainty that failed servers are eventually detected by operational servers. The limitation of this approach is illustrated by the case of an operational server that may be mistakenly suspected to have failed. But even if that happens, the system remains consistent: the falsely suspected server is replaced by another one, and the two exist simultaneously for a certain time.

All permanent state is stored by the durability servers, which periodically create an on-disk image of the current database state. In the case of a failure, new instances of edge servers and data servers are created on operational physical servers, using the state stored by the durability servers. If a durability server fails, the information needed for recovery is retrieved from operational durability servers. According to [5], the usual ACID (atomicity, consistency, isolation, durability) properties of a transaction are guaranteed by Sprint. The system distinguishes between two types of transactions: local transactions that only access data stored on a single data server, and global transactions that access data on multiple servers.

Figure 2: The Sprint architecture

Both types of transactions are supported and respect the ACID properties. Database tables are partitioned over the data servers. Data items can be replicated on multiple data servers, which brings several benefits. First of all, the failure recovery mechanism is more efficient. Then, replication allows parallel execution of read operations, even though it makes write operations more difficult, since all replicas of a data item must be modified. One important contribution of Sprint is its approach to distributed query processing. In most distributed database architectures, high-level client queries are translated into lower-level internal requests. In Sprint, however, a middleware solution is adopted: queries are decomposed into internal ones according to the way the database is fragmented and replicated. Also, the distributed query decomposition and merging are simple, since Sprint was designed for multi-tier architectures. The experiments conducted showed that Sprint has good performance and scalability on clusters of up to 64 nodes (among which 32 data servers). Experiments at larger scales (e.g., at grid scale) have not been performed, however.

2.3 Going larger: using grids

Context

Sometimes, the amount of data in a database can be too large to store even for a single cluster of computers. Grid computing has emerged as a response to the growing demand for resources. A grid is composed of a federation of clusters.

A grid system seems to offer the necessary infrastructure for storing databases: grids offer much larger storage capacities, spread over many nodes. This could allow storing larger volumes of data. But new problems arise, related to grid infrastructures. To begin with, since there are several clusters in the grid, there will be a hierarchy of latencies in the system: some latencies will be greater than others. When a message is sent between two nodes of the same cluster, it will cost 100 or 1000 times less than if the two nodes were in different clusters. A possible optimization for this problem is to use hierarchical communication protocols. Then, a grid system is composed of many hosts, from many administrative domains, with heterogeneous resource capabilities (computing power, storage, operating system). Finally, together with the number of nodes, the number of failures will increase. Thus, special care must be taken when managing data in such a system.

Grid data sharing services

A data sharing service for grid computing opens an alternative approach to the problem of grid data management. This concept decouples data management from computation.

Main features

The main goal of such a system, as stated in [2], is to provide transparent access to data. The most widely used approach to data management in grid environments is explicit data transfers: the user has to locate the data he is interested in and perform the desired transfer. Thanks to transparent access to remote data through an external data-sharing service, the client does not have to handle data transfers and does not have to care where the data are. There are three other properties that a data-sharing service provides.

Persistence

First of all, the service provides persistent data storage, to save data transfers. Since large masses of data are to be handled, data transfers between different grid components can be costly, and it is desirable to avoid repeating them.

Fault tolerance

Second, the service is fault-tolerant. Data remains available despite events that can occur because of the dynamic character of the grid infrastructure, like resources joining and leaving, or unexpected failures. To this end, replication techniques and failure detection mechanisms are provided.

Consistency

Finally, the consistency of replicated data is guaranteed. Since data manipulated by grid applications are mutable, and data are often replicated to enhance access locality, the service must ensure the consistency of the different replicas. To achieve this, the service relies on consistency models, implemented by consistency protocols.

According to [3], a data sharing service for grid computing can be thought of as a hybrid between distributed shared memory (DSM) systems and peer-to-peer (P2P) systems, because it benefits from the advantages provided by both types of systems. It provides transparent data sharing and consistency models and protocols, just like a DSM system. Meanwhile, it provides fault-tolerance mechanisms and manages heterogeneous resources in a very scalable environment, just like a P2P system.

An example: JuxMem

An architecture was proposed ([3], [2]) for a data-sharing service, based on the observations above. The software architecture of JuxMem (for Juxtaposed Memory) reflects the hardware architecture of a grid: a hierarchical model consisting of a federation of distributed clusters of computers. This architecture is made up of a network of peer groups, which can correspond to clusters at the physical level, to a subset of the same physical cluster, or to nodes spread over several physical clusters. In each such group there are nodes that provide memory for data storage, nodes that simply use the service to allocate and access data blocks, and one node that manages the available memory. In order to allocate memory, the client must specify on how many clusters the data should be replicated, and on how many nodes in each cluster. In response, a set of data replicas is instantiated and an ID is returned. To read or write a data block, the client only needs to specify this ID, and JuxMem transparently locates the corresponding data block and performs the necessary data transfers. Each block of data stored in the system is replicated and associated with a group of peers, each peer in the group hosting a copy of the same data block. These peers can belong to different clusters and, thus, the data can be replicated on several physical clusters. JuxMem's approach to maintaining the consistency between the different copies of the same piece of data is based on home-based protocols: for each piece of data there is a home entity in charge of maintaining a reference data copy. A protocol implementing the entry consistency model in a fault-tolerant way has been developed. To limit inter-cluster communications, the home entity is organized in a hierarchical way: local homes, at cluster level, are the clients of a global home, at grid level ([4], [9]).

Grid-based databases in practice

Grids have been developed in the context of high-performance computing (HPC) applications, like numerical simulations. The use of these infrastructures has been very little explored in the context of databases. Two examples are presented below.

DB/JuxMem

This approach [1] extends the JuxMem grid data-sharing service with a database-oriented API that allows performing basic operations like table creation, record insertion and simple select queries. In order to achieve this, a layered software architecture is added on top of JuxMem, each layer having a precise role in database management: data storage, indexing, table fragmentation, etc. The highest layer is the database-oriented API. These high-level layers over JuxMem are necessary because JuxMem initially manipulated data only as byte sequences. One advantage of this approach is that it provides structured data management to applications running over JuxMem. Another advantage is that database management systems can benefit from the properties of JuxMem: the data and metadata of databases are handled using JuxMem, which transparently allocates, locates, replicates and transfers data in a fault-tolerant and consistent way. Finally, the approach benefits from a grid-scale computing infrastructure, as opposed to previous efforts, which provided a distributed main-memory database management system relying only on a cluster of computers.

Oracle 11g

The Oracle Database [14] was designed for enterprise grid computing. The Oracle grid architecture creates large resource pools, which are shared by different applications.

Data processing and storage capacity can then be dynamically provisioned to applications as needed. One of the most important features for providing resource provisioning is Real Application Clusters (RAC). A RAC is a cluster database with a shared-cache architecture that runs on multiple machines. These machines are attached through a cluster interconnect and a shared storage subsystem. A RAC database appears like a single database to users, and the same maintenance tools and approaches used for a single database can be used on the entire cluster. One important role of RAC is its ability to manage workload: it can add or remove nodes on demand, based on the processing requirements. RAC also plays an important role in ensuring data availability. All the data in the database are replicated on all the nodes in the cluster. RAC exploits this redundancy: users have access to all data as long as there is one available node in the cluster, even if all the other nodes have failed. Even though the Oracle database is self-managing and provides automatic resource allocation, as mentioned above, administrators are allowed to influence how the database resources are allocated to users. This is done through another feature, called Resource Manager. The system also provides capabilities to schedule and perform jobs in the grid, through the Scheduler feature.

3 Contribution: interfacing a database system with a grid-based data storage service

3.1 The database system: Berkeley DB

Berkeley DB [11] is a database library-style toolkit written completely in the C programming language. It contains almost 1,800,000 lines of code, structured in many APIs. The library provides a broad base of functionality to application developers. An overview of its features and provided services is illustrated in Figure 3.

Main features

The system uses a simple function-call interface for all operations; there is no query language to parse and no execution plan to produce. One big advantage of this library is that it is open-source, so the complete source code is available and can be modified according to one's needs. The library is embedded, since it runs in the address space of the application that uses it, inside the same process. As a result, the database operations happen inside the library and require no inter-process communication. Another advantage of Berkeley DB is that it is scalable. Firstly, even though the library is quite small, it can manage databases of up to 256 Terabytes and records of up to 4 Gigabytes. Secondly, it supports high concurrency, thousands of users being able to perform operations on the same database at the same time. Another interesting feature of Berkeley DB is its configurability: applications can select the storage structure that provides the fastest access to their data, as well as the database services they need (e.g., the degree of logging, locking, concurrency or recoverability). Moreover, applications can choose whether to store database pages on the hard disk or in Berkeley DB's page cache.

Record structure and storage

Records in Berkeley DB are (key, value) pairs. Some simple operations on records are supported: inserting records in tables, deleting them from tables, searching records by their key and updating found records. Values of any data type can be stored in a Berkeley DB database, no matter how complex they are. Berkeley DB does not operate on the value part of a record: the system cannot decompose the value into constituent parts that it could further use and analyze. Thus, it can provide no information about the contents or structure of the stored value; the application must know the structure of the keys and values that it uses. The data of Berkeley DB databases are stored on disk. For the sake of efficiency, Berkeley DB uses an in-memory cache which allows for grouped flushing onto disk.

Access methods

Berkeley DB supports four types of storage structures. Hash tables are suitable for very large databases where the time necessary to do a search or an update operation can easily be predicted. They help fetch records for which the exact key is provided, but not records with similar keys. Btrees are the structures suitable for range searches, when the application needs to find all the records with keys between two known values. Since in this structure similar keys are stored close to one another, it is very convenient to fetch the values related to nearby keys. This type of structure is the default one in Berkeley DB. For applications that need to store and fetch records, but cannot easily generate keys by themselves, the best choice is record-number-based storage. In this approach, the record numbers, generated automatically by Berkeley DB, represent the keys for the records.

Figure 3: Berkeley DB features

Queues are suitable for applications that create a lot of records and then must process them in the creation order. These structures store fixed-length records and they use record numbers as keys, too. They are designed for fast record insertions at the tail of the queue and retrieval at the head. In this access method, locking at the record level is used.

Data management services provided

Berkeley DB offers several data management services which work with all available storage structures. To begin with, the system supports concurrency, allowing multiple users to work on the same record without interfering with one another. Simultaneous readers and writers are supported thanks to the locking system, which is used by the access methods to acquire the right to read or write database pages. Transactions are also supported in Berkeley DB. The system uses two-phase locking to ensure that concurrent transactions are isolated from one another. The transaction system uses write-ahead logging protocols to guarantee the recoverability of the changes performed, in the case of a failure. Thus, when an application starts, it can ask Berkeley DB to run recovery, which will restore the database to a clean, consistent state.

Difference to relational databases

Berkeley DB is not a relational database. One important difference to these latter systems is that Berkeley DB does not support SQL queries; access to data is done through the API provided. The advantage of a relational database is that it knows everything about the data and can execute queries in a high-level language, without any programming being required. In Berkeley DB, on the other hand, the application developer must understand how the data is represented and accessed, and must write the code that will get and store records. The advantage of systems like Berkeley DB is that the overhead of query parsing, optimization and execution is eliminated; thus, a low-level program can be very fast. Another difference to relational databases is that Berkeley DB has no notion of schema (i.e., structure of records in tables, relationships among the tables of a database, etc.) or data types. An interesting point is that relational databases can be built on top of Berkeley DB. For example, the MySQL relational database system [13] does the SQL parsing and execution by itself, but relies on Berkeley DB for the storage level.
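To illustrate the function-call interface described above, the following minimal sketch uses the standard Berkeley DB C API to open a database with the Btree access method and to store and fetch one (key, value) pair. The file name and record contents are example values only, and error handling is kept to a minimum.

    /*
     * Minimal sketch of the Berkeley DB function-call interface: create a
     * handle, select the Btree access method, then store and fetch one
     * (key, value) pair. File and key names are example values.
     */
    #include <string.h>
    #include <stdio.h>
    #include <db.h>

    int main(void)
    {
        DB *dbp;
        DBT key, data;
        char *k = "sku-1001", *v = "10 boxes of widgets";

        /* Create the database handle and open it as a Btree. */
        if (db_create(&dbp, NULL, 0) != 0)
            return 1;
        if (dbp->open(dbp, NULL, "inventory.db", NULL,
                      DB_BTREE, DB_CREATE, 0664) != 0)
            return 1;

        /* Records are (key, value) pairs passed as DBTs. */
        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = k;   key.size  = strlen(k) + 1;
        data.data = v;  data.size = strlen(v) + 1;

        dbp->put(dbp, NULL, &key, &data, 0);            /* insert the record */

        memset(&data, 0, sizeof(data));
        if (dbp->get(dbp, NULL, &key, &data, 0) == 0)   /* fetch it back */
            printf("%s -> %s\n", k, (char *)data.data);

        dbp->close(dbp, 0);
        return 0;
    }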

Figure 4: Relationships between the roles of BlobTamer

3.2 The grid-based data storage service: BlobTamer

BlobTamer is a system developed within the Paris Team, at IRISA. It is written in the C++ programming language and has approximately 23,000 lines of code. The name reflects the fact that it manages blobs (binary large objects) efficiently and makes them more user-friendly.

Managing massive data in large-scale distributed environments

In this model, the data considered are strings whose size is in the order of Terabytes, which cannot fit in the memory of a single node. Storing such large data naturally requires data fragmentation and distributed storage, which are offered by grids. It is assumed that access to the data is fine-grain: each individual read or write operation concerns only a segment of the string, in the order of Megabytes, microscopic with respect to the whole string. Also, the environment considered is highly concurrent: the write and read accesses are concurrent, unpredictable and very frequent. The strings are fragmented into small, equally-sized pages, which are distributed in the local memory of a large number of nodes. Upon creation, a page is labeled with the version number at which it has been created. A concatenation of consecutive pages is called a segment. A set of metadata makes the connection between an access request and the list of pages that store the corresponding data.

The roles in the system

The system consists of distributed processes that communicate through remote procedure calls (RPCs). A physical node can run one or more processes and, at the same time, may play multiple roles from the ones mentioned below. There are five types of processes in the system. There may be one or more concurrent clients that issue READ and WRITE requests. The system is not aware of their number, which may vary in time. The pages created by WRITE operations are physically stored in the local memory of data providers. On entering the system, each data provider registers with the provider manager. This entity is responsible for providing a list of available data providers to clients who issue WRITE requests. For each request, the provider manager decides which data providers should be used, based on a strategy that ensures load balancing. It periodically receives updates from the data providers regarding their available space.

Figure 5: Interactions between the actors: reads (left) and writes (right)

The list returned will therefore contain providers with larger available space and lower load. Also, as many distinct providers as possible are enlisted, which allows efficient parallel access to the pages. The metadata generated upon the creation of new pages by WRITE requests are physically stored by the metadata provider. Its purpose is to help clients who issue READ requests locate the providers that store the pages corresponding to the required segment of the string. To allow concurrent access to metadata, the metadata provider is implemented on top of an off-the-shelf, stable and scalable distributed hash table: BambooDHT [10]. The version manager stores the number of the last published version of a data string. It serializes WRITE requests to each string and supplies the latest published string version to READ requests. All operations on the version manager are atomic, since it is protected by a lock. The relationships between these types of processes are illustrated in Figure 4.

How writes and reads are performed

A WRITE request begins with the client contacting the provider manager to obtain a list of providers, one for each page of the segment. Then, the client contacts, in parallel, the providers in the list and requests them to store the pages. After executing the request, each provider sends an acknowledgement to the client. Only when it has received all the acknowledgements, and so is sure all the pages are written on data providers, does the client contact the version manager, requesting a new version number. If an error occurs while writing the pages, the version manager is not contacted at all. The version number is used by the client to generate the metadata corresponding to the data already written, which it sends to the metadata provider, in parallel. After receiving the acknowledgement, the client reports the success to the version manager. The typical scenario for a READ request begins when the client contacts the version manager to get the last version of the corresponding data string. If the version specified is larger than the latest available version, the READ will fail. Otherwise, the client contacts the metadata provider and retrieves, in parallel, the metadata describing the pages of the requested segment. After gathering all the metadata, it contacts, in parallel, the data providers that store the corresponding pages. The two scenarios are illustrated in Figure 5.

The function calls provided

The service provides three primitives: one for allocating memory and two for manipulating strings. The ALLOC primitive takes two parameters (pagesize and stringsize) and creates an all-zero string of the provided size. The pagesize parameter specifies the size of the pages that the string will be fragmented into. The primitive generates a unique id for the string being allocated, an id which must be specified by clients as an input parameter to the other two primitives.

    id = ALLOC(pagesize, stringsize)

The WRITE primitive modifies a string given by its id with the contents of a buffer of length size at a specified offset, all these parameters being provided by the client. The call generates a new version number, corresponding to the new (modified) version of the data string.

    vw = WRITE(id, buffer, offset, size)

A READ primitive takes a segment (specified by an offset and a size) from a string (specified by its id) and puts it into a buffer. The version v of the string from which the segment must be taken is also provided. The READ fails if the specified version of the string is not yet available.

    vr = READ(id, v, buffer, offset, size)

Experimental results

Evaluations of the system have been performed using 100 nodes, taken from 2 sites of the Grid 5000 [12] testbed. Two experimental settings were used: one in which the client was located in the same cluster as the data and metadata providers, and one in which the client was located in a different, remote grid cluster. The latency between the client and the data providers is much higher in the second setting (25 ms) than in the first (0.1 ms). Two experiments were performed. The purpose of the first was to evaluate how the metadata scheme influences the performance of data accesses. The time required for metadata to be completely read, respectively written, was measured. In the first setting it was observed that increasing the number of providers did not impact the time required to perform a READ operation, whereas it improved the time required to perform a WRITE operation. For the latter, this advantage is more visible when writing larger segments. In the second setting it was concluded that the higher latency had a significant impact on the cost of reading the metadata, while this impact is much lower for WRITE operations. The second experiment aimed at evaluating how efficiently the lock-free scheme supports highly concurrent data accesses. The average bandwidth per client was measured for READ and WRITE requests, while increasing the number of clients. In both settings it was noticed that the bandwidth per client decreases very slowly even when the number of concurrent accesses increases significantly. Moreover, the decrease in the read bandwidth is even smaller if client-side caching is used. The two experiments thus showed that the system scales well, both in terms of storage providers and of concurrent clients, without significantly affecting performance.
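The call sequence of the three primitives can be illustrated with a small sketch. The C signatures below are assumptions inferred from the prototypes above, and the primitive bodies are a trivial single-process mock (one in-memory string and one version counter) included only so that the example runs; they are not the distributed service itself.

    /*
     * Sketch of the ALLOC / WRITE / READ call sequence. The signatures are
     * assumptions inferred from the prototypes in the text; the bodies are
     * a single-process mock, NOT the distributed service.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef long blob_id_t;
    typedef long version_t;

    static char     *g_string;   /* mock: the single allocated string   */
    static size_t    g_size;
    static version_t g_version;  /* mock: last published version number */

    blob_id_t ALLOC(size_t pagesize, size_t stringsize)
    {
        (void)pagesize;                    /* fragmentation not mocked */
        g_string = calloc(stringsize, 1);  /* all-zero string          */
        g_size = stringsize;
        g_version = 0;
        return 1;                          /* unique id of the string  */
    }

    version_t WRITE(blob_id_t id, const char *buf, size_t offset, size_t size)
    {
        (void)id;
        memcpy(g_string + offset, buf, size);
        return ++g_version;                /* new published version    */
    }

    version_t READ(blob_id_t id, version_t v, char *buf, size_t offset, size_t size)
    {
        (void)id;
        if (v > g_version) return -1;      /* version not yet available */
        memcpy(buf, g_string + offset, size);
        return v;
    }

    int main(void)
    {
        char out[16] = {0};

        blob_id_t id = ALLOC(65536, 1 << 20);      /* 64 KB pages, 1 MB string */
        version_t vw = WRITE(id, "hello blob", 4096, 11);
        version_t vr = READ(id, vw, out, 4096, 11);

        printf("wrote version %ld, read back \"%s\" (version %ld)\n",
               (long)vw, out, (long)vr);
        return 0;
    }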

Discussion

One of the most important advantages of the system is that it allows efficient, large-scale concurrent access to the data strings, without locking them. The versioning technique allows this: concurrent writes to the same page can be performed in parallel, because they access different versions of that page. Read operations can also be performed in parallel, once each client has received the latest version from the version manager. The system also provides some fault tolerance mechanisms, through the off-the-shelf DHT on top of which the metadata provider is implemented.

3.3 Storing the data of Berkeley DB using BlobTamer

Selecting the level to interface

In order to achieve the interfacing of the two systems described above and, thus, to leave the job of storing Berkeley DB's data to BlobTamer, the database system's architecture had to be taken into consideration. The Berkeley DB library has a layered architecture, composed of five major subsystems.

Access Method. The Access Method subsystem provides general-purpose support for creating and accessing database files formatted as btrees, hashed files, and fixed- and variable-length records.

Memory (Buffer) Pool. The Memory Pool subsystem (or buffer manager, as it is known in the literature) is the general-purpose shared memory buffer pool used by Berkeley DB. This is the shared memory cache that allows multiple processes, and threads within processes, to share access to databases.

Transaction. The Transaction subsystem allows a group of database modifications to be treated as an atomic unit, so that either all of the changes are done, or none of them are. It implements the Berkeley DB transaction model.

Locking. The Locking subsystem is the general-purpose lock manager used by Berkeley DB. This module is useful outside of the Berkeley DB package for processes that require a portable, fast, configurable lock manager.

Logging. The Logging subsystem is the write-ahead logging used to support the Berkeley DB transaction model. It is largely specific to the Berkeley DB package, and unlikely to be useful elsewhere except as a supporting module for the Berkeley DB transaction subsystem.

In addition to the above-mentioned subsystems, there is also a Storage layer, as in any other database management system. In this model, illustrated in Figure 6, the application makes calls to the access methods. When applications require recoverability, their calls to the Access Method subsystem must be wrapped in calls to the Transaction subsystem. The Access Method and Transaction subsystems in turn make calls into the Memory Pool, Locking and Logging subsystems on behalf of the application. The underlying subsystems can be used independently by applications. For example, the Memory Pool subsystem can be used apart from the rest of Berkeley DB by applications simply wanting a shared memory buffer pool.

Figure 6: The Berkeley DB architecture

Similarly, the Locking subsystem may be called directly by applications that do their own locking outside of Berkeley DB. However, this usage is not common, and most applications will either use only the Access Method subsystem, or the Access Method subsystem wrapped in calls to the Berkeley DB transaction interfaces. As stated above, the Access Method and Transaction subsystems use the underlying shared memory buffer pool (cache) to hold recently used file pages in main memory. The pages have to be in main memory for the database management system to operate on them. The Memory Pool subsystem receives page requests from the upper layers and provides handles for the underlying files. The handles are then used to retrieve pages from these files. When the pages are returned, if the requestor indicates that a page has been modified (i.e., the page is dirty), the page is written to the disk. This memory buffer pool handles all operations related to pages in a transparent way: the upper layers are not aware that not all data is in memory at one time. If the cache is full and a new page needs to be inserted, a page is selected and discarded from the pool. The selection is based on a least-recently-used algorithm: the page that has stayed the longest time in the cache without being accessed is replaced.

An important aspect at this point is selecting the layer to implement for a successful interfacing with BlobTamer. One natural choice is the Memory Pool, since page management support is provided by BlobTamer directly. An in-depth study not only of the layers, but also of the interactions between the layers, is necessary in order to provide a correct interfacing. There is a tight coupling of the Logging layer with the upper layers in order to provide recovery support. Because of that, this layer would have to be implemented as well, if the Memory Pool layer were chosen. On the other hand, the Storage layer acts as the backbone for both the Buffer Pool and the Logging layers: both of these layers use the Storage layer directly to store their data, as can be seen in Figure 6. Implementing the Storage layer is much simpler, because it implies just a file-system functionality on top of BlobTamer. This approach also makes debugging easier, because the implementation is at a lower layer.

Moreover, it enables the study of access patterns (reads and writes) at page level, which might lead to optimizations for the read/write operations in the new version of the Storage layer. The potential introduction of such optimizations justifies the choice of implementing the Storage layer instead of using a distributed file system (like NFS).

Methodology applied

Before implementing the Storage layer, a few things needed to be studied. First, it was important to know how and where Berkeley DB stored its data and metadata (if it created any), how many physical files were created for each table, and whether temporary files were created while writing the data. Testing some applications that use Berkeley DB was required to study these aspects. Second, it was important to see how the system uses some basic system calls related to file operations (read, write, open, close, flush, fsync, etc.), concentrating especially on the parameters used to read and write data (e.g., the offset in the file and the size of the operation). The possibility of changing the code in the wrappers of these system calls was analyzed, too.

The tested application

The application needed for the tests was an example written in C, provided in the Berkeley DB download toolkit. It concerns some products and the vendors that sell them. The input data were provided in text files (one for the vendors and another for the products), with one record per line. The application consists of two programs. One program creates the database, the tables (files on the hard disk) and loads the data from the input files into the tables. Another program searches for a specific product, reads the data from the tables and displays information about the product and the vendors that sell it. The first program corresponds to the CREATE DATABASE and INSERT commands in SQL, while the second program corresponds to the SELECT command. In these programs data can be easily manipulated, by means of the put and get methods (provided by the Berkeley DB library), as can be seen from the code fragments below.

    /* Define the data structures for the application */
    typedef struct stock_dbs {
        DB *inventory_dbp;
        DB *vendor_dbp;
        DB *itemname_sdbp;
    } STOCK_DBS;
    ...
    STOCK_DBS my_stock;
    ...
    /* Set the page size */
    my_stock.vendor_dbp->set_pagesize(my_stock.vendor_dbp, 65536);
    ...
    /* Put data into the database */
    my_stock.inventory_dbp->put(my_stock.inventory_dbp, NULL, &key, &data, 0);
    ...
    /* Get a record */
    my_stock.vendor_dbp->get(my_stock.vendor_dbp, NULL, &key, &data, 0);
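To give an idea of what a file-system-like Storage layer on top of BlobTamer could look like, the following sketch redirects page-level reads and writes at a file offset to the READ and WRITE primitives. The wrapper names, the file-to-blob table and the stubbed primitives are hypothetical illustrations under the assumptions above, not the implementation presented in Section 4.

    /*
     * Sketch of a file-system-like Storage layer on top of the grid service:
     * each database file maps to one blob id, and page reads/writes at a file
     * offset become READ/WRITE calls on that blob. The primitives are stubbed
     * here only so that the sketch compiles and runs.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    typedef long blob_id_t;
    typedef long version_t;

    /* Stand-ins for the grid service primitives (see the previous sketch). */
    static version_t WRITE(blob_id_t id, const void *buf, size_t off, size_t size)
    { (void)id; (void)buf; (void)off; return (version_t)size; }
    static version_t READ(blob_id_t id, version_t v, void *buf, size_t off, size_t size)
    { (void)id; (void)off; memset(buf, 0, size); return v; }

    #define MAX_FILES 64

    /* Hypothetical table mapping a database file handle to its blob. */
    static struct {
        blob_id_t blob;
        version_t last_version;   /* latest version published for this file */
    } file_table[MAX_FILES];

    /* Replacement for the storage layer's page write: redirect to the blob. */
    ssize_t grid_pwrite(int fd, const void *page, size_t pagesize, size_t offset)
    {
        file_table[fd].last_version =
            WRITE(file_table[fd].blob, page, offset, pagesize);
        return (ssize_t)pagesize;
    }

    /* Replacement for the storage layer's page read: fetch the latest version. */
    ssize_t grid_pread(int fd, void *page, size_t pagesize, size_t offset)
    {
        version_t v = READ(file_table[fd].blob, file_table[fd].last_version,
                           page, offset, pagesize);
        return v < 0 ? -1 : (ssize_t)pagesize;
    }

    int main(void)
    {
        char page[4096] = "a database page";
        file_table[3].blob = 1;                 /* file descriptor 3 -> blob 1 */
        grid_pwrite(3, page, sizeof(page), 0);
        grid_pread(3, page, sizeof(page), 0);
        printf("page round-trip through the grid wrappers done\n");
        return 0;
    }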


More information

Using Object Database db4o as Storage Provider in Voldemort

Using Object Database db4o as Storage Provider in Voldemort Using Object Database db4o as Storage Provider in Voldemort by German Viscuso db4objects (a division of Versant Corporation) September 2010 Abstract: In this article I will show you how

More information

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets!! Large data collections appear in many scientific domains like climate studies.!! Users and

More information

Tivoli Storage Manager Explained

Tivoli Storage Manager Explained IBM Software Group Dave Cannon IBM Tivoli Storage Management Development Oxford University TSM Symposium 2003 Presentation Objectives Explain TSM behavior for selected operations Describe design goals

More information

The Oracle Universal Server Buffer Manager

The Oracle Universal Server Buffer Manager The Oracle Universal Server Buffer Manager W. Bridge, A. Joshi, M. Keihl, T. Lahiri, J. Loaiza, N. Macnaughton Oracle Corporation, 500 Oracle Parkway, Box 4OP13, Redwood Shores, CA 94065 { wbridge, ajoshi,

More information

Review: The ACID properties

Review: The ACID properties Recovery Review: The ACID properties A tomicity: All actions in the Xaction happen, or none happen. C onsistency: If each Xaction is consistent, and the DB starts consistent, it ends up consistent. I solation:

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

The Service Availability Forum Specification for High Availability Middleware

The Service Availability Forum Specification for High Availability Middleware The Availability Forum Specification for High Availability Middleware Timo Jokiaho, Fred Herrmann, Dave Penkler, Manfred Reitenspiess, Louise Moser Availability Forum Timo.Jokiaho@nokia.com, Frederic.Herrmann@sun.com,

More information

CitusDB Architecture for Real-Time Big Data

CitusDB Architecture for Real-Time Big Data CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing

More information

CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL

CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL This chapter is to introduce the client-server model and its role in the development of distributed network systems. The chapter

More information

Client/Server and Distributed Computing

Client/Server and Distributed Computing Adapted from:operating Systems: Internals and Design Principles, 6/E William Stallings CS571 Fall 2010 Client/Server and Distributed Computing Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Traditional

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

MS-40074: Microsoft SQL Server 2014 for Oracle DBAs

MS-40074: Microsoft SQL Server 2014 for Oracle DBAs MS-40074: Microsoft SQL Server 2014 for Oracle DBAs Description This four-day instructor-led course provides students with the knowledge and skills to capitalize on their skills and experience as an Oracle

More information

MySQL Storage Engines

MySQL Storage Engines MySQL Storage Engines Data in MySQL is stored in files (or memory) using a variety of different techniques. Each of these techniques employs different storage mechanisms, indexing facilities, locking levels

More information

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

How To Make A Backup System More Efficient

How To Make A Backup System More Efficient Identifying the Hidden Risk of Data De-duplication: How the HYDRAstor Solution Proactively Solves the Problem October, 2006 Introduction Data de-duplication has recently gained significant industry attention,

More information

Base One's Rich Client Architecture

Base One's Rich Client Architecture Base One's Rich Client Architecture Base One provides a unique approach for developing Internet-enabled applications, combining both efficiency and ease of programming through its "Rich Client" architecture.

More information

HDFS Architecture Guide

HDFS Architecture Guide by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5

More information

Availability Digest. MySQL Clusters Go Active/Active. December 2006

Availability Digest. MySQL Clusters Go Active/Active. December 2006 the Availability Digest MySQL Clusters Go Active/Active December 2006 Introduction MySQL (www.mysql.com) is without a doubt the most popular open source database in use today. Developed by MySQL AB of

More information

Designing a Cloud Storage System

Designing a Cloud Storage System Designing a Cloud Storage System End to End Cloud Storage When designing a cloud storage system, there is value in decoupling the system s archival capacity (its ability to persistently store large volumes

More information

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters COSC 6374 Parallel I/O (I) I/O basics Fall 2012 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card

More information

FAWN - a Fast Array of Wimpy Nodes

FAWN - a Fast Array of Wimpy Nodes University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed

More information

Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

More information

In-memory databases and innovations in Business Intelligence

In-memory databases and innovations in Business Intelligence Database Systems Journal vol. VI, no. 1/2015 59 In-memory databases and innovations in Business Intelligence Ruxandra BĂBEANU, Marian CIOBANU University of Economic Studies, Bucharest, Romania babeanu.ruxandra@gmail.com,

More information

Big data management with IBM General Parallel File System

Big data management with IBM General Parallel File System Big data management with IBM General Parallel File System Optimize storage management and boost your return on investment Highlights Handles the explosive growth of structured and unstructured data Offers

More information

Network Attached Storage. Jinfeng Yang Oct/19/2015

Network Attached Storage. Jinfeng Yang Oct/19/2015 Network Attached Storage Jinfeng Yang Oct/19/2015 Outline Part A 1. What is the Network Attached Storage (NAS)? 2. What are the applications of NAS? 3. The benefits of NAS. 4. NAS s performance (Reliability

More information

COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters

COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters COMP5426 Parallel and Distributed Computing Distributed Systems: Client/Server and Clusters Client/Server Computing Client Client machines are generally single-user workstations providing a user-friendly

More information

Bigdata High Availability (HA) Architecture

Bigdata High Availability (HA) Architecture Bigdata High Availability (HA) Architecture Introduction This whitepaper describes an HA architecture based on a shared nothing design. Each node uses commodity hardware and has its own local resources

More information

Storage in Database Systems. CMPSCI 445 Fall 2010

Storage in Database Systems. CMPSCI 445 Fall 2010 Storage in Database Systems CMPSCI 445 Fall 2010 1 Storage Topics Architecture and Overview Disks Buffer management Files of records 2 DBMS Architecture Query Parser Query Rewriter Query Optimizer Query

More information

ORACLE INSTANCE ARCHITECTURE

ORACLE INSTANCE ARCHITECTURE ORACLE INSTANCE ARCHITECTURE ORACLE ARCHITECTURE Oracle Database Instance Memory Architecture Process Architecture Application and Networking Architecture 2 INTRODUCTION TO THE ORACLE DATABASE INSTANCE

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Client/Server Computing Distributed Processing, Client/Server, and Clusters

Client/Server Computing Distributed Processing, Client/Server, and Clusters Client/Server Computing Distributed Processing, Client/Server, and Clusters Chapter 13 Client machines are generally single-user PCs or workstations that provide a highly userfriendly interface to the

More information

Guide to Scaling OpenLDAP

Guide to Scaling OpenLDAP Guide to Scaling OpenLDAP MySQL Cluster as Data Store for OpenLDAP Directories An OpenLDAP Whitepaper by Symas Corporation Copyright 2009, Symas Corporation Table of Contents 1 INTRODUCTION...3 2 TRADITIONAL

More information

Physical Data Organization

Physical Data Organization Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor

More information

Operating Systems CSE 410, Spring 2004. File Management. Stephen Wagner Michigan State University

Operating Systems CSE 410, Spring 2004. File Management. Stephen Wagner Michigan State University Operating Systems CSE 410, Spring 2004 File Management Stephen Wagner Michigan State University File Management File management system has traditionally been considered part of the operating system. Applications

More information

InfiniteGraph: The Distributed Graph Database

InfiniteGraph: The Distributed Graph Database A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 1: Distributed File Systems Finding a needle in Haystack: Facebook

More information

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013 F1: A Distributed SQL Database That Scales Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013 What is F1? Distributed relational database Built to replace sharded MySQL back-end of AdWords

More information

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

More information

Using In-Memory Computing to Simplify Big Data Analytics

Using In-Memory Computing to Simplify Big Data Analytics SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

More information

Chapter 12 File Management

Chapter 12 File Management Operating Systems: Internals and Design Principles Chapter 12 File Management Eighth Edition By William Stallings Files Data collections created by users The File System is one of the most important parts

More information

BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency

BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency Gabriel Antoniu 1, Luc Bougé 2, Bogdan Nicolae 3 KerData research team 1 INRIA Rennes -

More information

Database System Architecture & System Catalog Instructor: Mourad Benchikh Text Books: Elmasri & Navathe Chap. 17 Silberschatz & Korth Chap.

Database System Architecture & System Catalog Instructor: Mourad Benchikh Text Books: Elmasri & Navathe Chap. 17 Silberschatz & Korth Chap. Database System Architecture & System Catalog Instructor: Mourad Benchikh Text Books: Elmasri & Navathe Chap. 17 Silberschatz & Korth Chap. 1 Oracle9i Documentation First-Semester 1427-1428 Definitions

More information

Microsoft SQL Server for Oracle DBAs Course 40045; 4 Days, Instructor-led

Microsoft SQL Server for Oracle DBAs Course 40045; 4 Days, Instructor-led Microsoft SQL Server for Oracle DBAs Course 40045; 4 Days, Instructor-led Course Description This four-day instructor-led course provides students with the knowledge and skills to capitalize on their skills

More information

In-memory Tables Technology overview and solutions

In-memory Tables Technology overview and solutions In-memory Tables Technology overview and solutions My mainframe is my business. My business relies on MIPS. Verna Bartlett Head of Marketing Gary Weinhold Systems Analyst Agenda Introduction to in-memory

More information

White Paper. Optimizing the Performance Of MySQL Cluster

White Paper. Optimizing the Performance Of MySQL Cluster White Paper Optimizing the Performance Of MySQL Cluster Table of Contents Introduction and Background Information... 2 Optimal Applications for MySQL Cluster... 3 Identifying the Performance Issues.....

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Storage Systems Autumn 2009. Chapter 6: Distributed Hash Tables and their Applications André Brinkmann

Storage Systems Autumn 2009. Chapter 6: Distributed Hash Tables and their Applications André Brinkmann Storage Systems Autumn 2009 Chapter 6: Distributed Hash Tables and their Applications André Brinkmann Scaling RAID architectures Using traditional RAID architecture does not scale Adding news disk implies

More information

SCALABILITY AND AVAILABILITY

SCALABILITY AND AVAILABILITY SCALABILITY AND AVAILABILITY Real Systems must be Scalable fast enough to handle the expected load and grow easily when the load grows Available available enough of the time Scalable Scale-up increase

More information

Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at

Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at distributing load b. QUESTION: What is the context? i. How

More information

Scala Storage Scale-Out Clustered Storage White Paper

Scala Storage Scale-Out Clustered Storage White Paper White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

XenData Archive Series Software Technical Overview

XenData Archive Series Software Technical Overview XenData White Paper XenData Archive Series Software Technical Overview Advanced and Video Editions, Version 4.0 December 2006 XenData Archive Series software manages digital assets on data tape and magnetic

More information

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Middleware for Heterogeneous and Distributed Information Systems

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Middleware for Heterogeneous and Distributed Information Systems Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 2 Architecture Chapter Outline Distributed transactions (quick

More information

IBM Tivoli Storage Manager Version 7.1.4. Introduction to Data Protection Solutions IBM

IBM Tivoli Storage Manager Version 7.1.4. Introduction to Data Protection Solutions IBM IBM Tivoli Storage Manager Version 7.1.4 Introduction to Data Protection Solutions IBM IBM Tivoli Storage Manager Version 7.1.4 Introduction to Data Protection Solutions IBM Note: Before you use this

More information

BookKeeper overview. Table of contents

BookKeeper overview. Table of contents by Table of contents 1 BookKeeper overview...2 1.1 BookKeeper introduction... 2 1.2 In slightly more detail...2 1.3 Bookkeeper elements and concepts...3 1.4 Bookkeeper initial design... 3 1.5 Bookkeeper

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

COS 318: Operating Systems

COS 318: Operating Systems COS 318: Operating Systems File Performance and Reliability Andy Bavier Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall10/cos318/ Topics File buffer cache

More information

Recovery Principles in MySQL Cluster 5.1

Recovery Principles in MySQL Cluster 5.1 Recovery Principles in MySQL Cluster 5.1 Mikael Ronström Senior Software Architect MySQL AB 1 Outline of Talk Introduction of MySQL Cluster in version 4.1 and 5.0 Discussion of requirements for MySQL Cluster

More information

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction

More information

<Insert Picture Here> Getting Coherence: Introduction to Data Grids South Florida User Group

<Insert Picture Here> Getting Coherence: Introduction to Data Grids South Florida User Group Getting Coherence: Introduction to Data Grids South Florida User Group Cameron Purdy Cameron Purdy Vice President of Development Speaker Cameron Purdy is Vice President of Development

More information

File-System Implementation

File-System Implementation File-System Implementation 11 CHAPTER In this chapter we discuss various methods for storing information on secondary storage. The basic issues are device directory, free space management, and space allocation

More information

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 ICOM 6005 Database Management Systems Design Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 Readings Read Chapter 1 of text book ICOM 6005 Dr. Manuel

More information