Survey of Technologies for Wide Area Distributed Storage

Project: GigaPort3
Project Year: 2010
Project Manager: Rogier Spoor
Author(s): Arjan Peddemors, Christiaan Kuun, Rogier Spoor, Paul Dekkers, Christiaan den Besten
Completion Date: 2010-06-29
Version: 1.0

Summary

This report gives an overview of existing technologies that may be used to offer distributed storage services to SURFnet and its connected institutes. It provides an analysis of the requirements that are relevant for such services, which is used to compare the different products. Furthermore, this report gives a description of the environment that will be used to evaluate candidate products.

This publication is licensed under Creative Commons Attribution 3.0 Unported. More information on this license can be found at http://creativecommons.org/licenses/by/3.0/
Colophon

Programme line: Enabling Dynamic Services
Part: Task 3 - Storage Clouds
Activity: Technology Scouting Storage Clouds
Deliverable: EDS-3R Report on scouting results on Storage Clouds Technology
Access rights: Public
External party: Novay, CSIR, Prolocation

This project was made possible by the Economic Structure Enhancing Fund (FES) of the Dutch Government.
Contents

1 Introduction
  1.1 Use cases
    1.1.1 Virtual machine distributed storage
    1.1.2 Database replication
    1.1.3 Scientific data storage
  1.2 Requirements
2 Overview of existing products
  2.1 Candidate products
    2.1.1 Lustre
    2.1.2 GlusterFS
    2.1.3 GPFS
    2.1.4 Ceph
    2.1.5 Candidate product comparison
  2.2 Non-candidate products
    2.2.1 XtreemFS
    2.2.2 MogileFS
    2.2.3 NFS version 4.1 (pNFS)
    2.2.4 ZFS
    2.2.5 VERITAS File System
    2.2.6 ParaScale
    2.2.7 CAStor
    2.2.8 Tahoe-LAFS
    2.2.9 DRBD
3 Future work
4 Conclusion
5 References
1 Introduction

Cheap storage technology and fast Internet access have brought online storage services within range of a large group of users and organizations. Where traditional storage capacity was closely tied to the computing facilities in use, it is now possible to store data in the cloud at reasonable prices. Current public cloud storage services, however, provide moderate levels of support for high performance and high availability applications running at the edges of the Internet. These applications come from various domains, ranging from those deployed in corporate datacenters to applications in research. They typically operate in an environment with local high-end storage services in the form of storage area network (SAN) services or network attached storage (NAS), which are not easily replaced by public cloud alternatives.

A different kind of online storage technology, a distributed file system with parallel access and fault tolerant features, may be better suited to support these applications. With such a system, local storage resources as well as resources at remote sites are joined to support concurrent access by multiple clients on copies of the same data. Additionally, such a system continues to serve data in case of failure of a component at one of the sites. Ideally, it regulates the placement of the data in such a way that access to that data by applications is optimal, i.e., by keeping the data close to where it is used, on storage resources that match the application needs (in terms of speed, latency, etc.). We expect that a distributed storage facility can be successfully implemented in the SURFnet context, because of the high-quality and high-speed network infrastructure offered to SURFnet participants.

This report provides an overview of existing products and building blocks that deliver this distributed file system functionality. We will use this survey to select a few candidate products, which will be further investigated and deployed in a distributed test environment (as part of the GigaPort3 project).

The outline of this document is as follows. The remainder of this section presents a number of use cases in which selected products are to be used, and gives a listing of relevant requirements. In section 2, we provide an overview of the products we considered and place them in two groups (depending on how well they match the requirements): candidate products that will be evaluated further, and non-candidate products that will not be considered further (but are included to put the candidates into context). Section 3 describes the environment in which candidate products will be further tested and evaluated. Section 4 provides the survey conclusions.
1.1 Use cases

To illustrate in which situations the distributed storage products will be used, we now describe a number of use cases. The main users of the system are SURFnet participants, although the use cases are also relevant for others. All use cases assume the availability of storage capacity at different locations, linked to each other through fast, wide-area connections such as SURFnet optical lightpaths [19].

1.1.1 Virtual machine distributed storage

In recent years, hardware virtualization has become a mainstream technology used in many server environments to consolidate resources and to allow for flexible server configurations. Virtual machines (VMs) can be moved from one physical server machine to another with minimal interruption, which is important for applications that have high-availability requirements. In case of hardware failure, VMs can also be powered up on different machines while using centralized storage.

In configurations where hardware resources are dispersed over a large geographical area, virtual machines may be moved to a new physical machine at a long distance from the initial physical machine. A common setup for virtual machines is to access storage through a storage area network (SAN). When moving over a long distance, access to such storage may suffer from high latency. A solution to this problem is to replicate the storage data accessed by a virtual machine to those locations where it may run in the future [23][25]. When a VM is moved to a new location, it will start using the replicated storage that is nearby (instead of the storage that was close to the initial location). In this use case, the selected distributed storage solution is applied to keep data stored at multiple locations in sync, to support local access to storage for virtual machines that migrate over a long distance. This mechanism also allows for easier disaster recovery between geographically dispersed locations.

1.1.2 Database replication

Databases play an important role in the IT infrastructure of many organizations. They offer core functionality to a wide range of software services that are crucial for day-to-day operation; they may store, for instance, email messages, transaction results, customer information, document management information, etc. Usually, great care is taken to keep the information stored in databases safe, i.e., to make sure that databases are stored in a durable fashion. Additionally, databases that are necessary for the core operation of an organization are often configured in such a way (e.g., through replication [24]) that high availability is guaranteed, so that, even in case of hardware failure, the database stays online.
In this use case, a database is replicated at different nodes in a distributed storage environment, at locations that are far apart. Contrary to situations where data is replicated at nearby locations (in the order of kilometers), the geographic distribution of data over a wide area makes sure that the data is highly durable. In case of calamities such as large-scale industrial accidents or natural disasters, a short distance between replicas may not be enough to prevent data loss. In case hardware is malfunctioning at one location and the network remains operational, access to the database may be relayed to a remote node, so that the availability of the database remains intact. Database access may be such that it requires fast local as well as distributed storage.

1.1.3 Scientific data storage

Scientific experiments generate data in the form of measurements. The instruments used in various scientific domains, such as DNA sequencers in biology, satellite remote sensors in geology, and particle colliders in experimental physics, are currently capable of generating vast amounts of data within a short timeframe [18]. Experiments using these instruments require substantial capacity to store the raw measurement data, but also to store the output of analysis and processing steps executed on this raw data after the experiment. Typically, once the experimental data is captured, storage performance and high availability are less important. Durability of the data, low storage costs, and the possibility to share experimental results between groups of scientists spread all over the world are important.

In this use case, we assume a setup where a wide range of different types of experimental data is stored online in a distributed manner. The raw measurement data is entered into the system at a single location, while it is read and expanded (i.e., processed to generate derived data) at various locations, by a group of scientists working together over a substantial distance. It must be easy for scientists and system administrators to start using the storage facility and to maintain and expand it over time. Additionally, it must be possible to indicate the required level of durability offered by the storage system.

1.2 Requirements

An e-science infrastructure demands a high performance, high volume and scalable data storage architecture. Many storage solutions that are currently used in the e-science infrastructure have a limited, non-scalable approach, such as storage area networks (SAN), or they use specially developed storage solutions that are only suitable in a specific research area like nuclear physics or astronomy. Scalable, high-capacity commercial solutions could be applied in these environments, but these are (too) expensive. Additionally, existing technologies, both commercial and non-commercial, often do not fully utilize the unique network infrastructure available in the SURFnet setting: a very high-speed, state-of-the-art network offered at many locations in the Netherlands, with very good connections to the rest of the Internet. This network supports configurations where data storage is realized and used in a distributed fashion, over a wider area than previously feasible.
Cloud storage services, such as those offered by Amazon and Google, are examples of a fully distributed storage facility. These services have a number of attractive features, such as pay per usage and (some) guarantees that data is safely stored (through redundancy). They do not, however, provide high speed access and support for a wide range of different applications. This is also true for storage solutions applied to grid computing; in a grid infrastructure, storage facilities are focused on supporting computation at grid nodes, not on providing high-speed data access to applications at the edges of the network. There is, however, some overlap in functionality, and products used in grid environments may also be suitable for our needs.

When data is stored in a distributed system, it is beneficial to place data where it is most often used, and to migrate less important or less frequently accessed data to places with cheap capacity. Such a system is likely to have different storage levels (or tiers) that form a storage hierarchy. The aim of this project is to determine the feasibility of building a high-capacity, fully distributed, hierarchical storage solution, exploiting the SURFnet infrastructure. Given the use cases and the aspects described above, it is clear that the system must be general purpose.

We identify the following high level requirements as relevant for the distributed storage system. They are mostly qualitative requirements, as these are sufficient to survey existing products and to make an initial product selection (which is the purpose of this report). We are aware that a (wide-area) distributed storage system may not be able to offer features in any combination, i.e., that the CAP theorem applies, which states that at most two of the three properties data consistency, system availability, and partition tolerance can be supported by a distributed system [6]. Ideally, products must be able to explicitly balance these requirements, such that in a future system a property trade-off is easy to configure.

- Scalable: The system must be scalable in terms of capacity, performance and concurrent access. For instance, it must be easy to expand the total amount of storage without degrading performance and concurrency. Additionally, it must be easy, when the need arises, to configure the system such that a large number of users may have concurrent access to individual storage objects without degrading performance too much.
- High Availability: The system must have high-availability functionality that keeps data available to applications and clients, even in the event of malfunctioning software or hardware. This implies that the system must be capable of replicating data at multiple locations. Also, it must be possible to maintain and reconfigure the system on-the-fly, i.e., while the system as a whole keeps running. In case of component failure, the system must support bringing the component back online after repair, which is likely to include synchronization actions and data consistency checks. It also implies that capacity and storage locations may be added or removed while the system is operating.
- Durability: The system must support the storage of data in a durable manner, i.e., when a single software or hardware component fails, no data is lost. Durability functionality that must be supported is replication of data to disks at
other (remote) locations, which includes maintenance to make sure that a minimum number of replicas is available. Additionally, the system may support backup of (parts of) the data on backup media such as tape.
- Performance at Traditional SAN/NAS Level: To support existing applications running in server environments at SURFnet institutes, the system must be able to support a level of performance (bandwidth and latency) comparable to that found in a traditional (non-distributed) SAN/NAS environment. Since this is a general purpose system, the performance for specific applications clearly cannot match that of dedicated storage solutions tailored to those applications. However, by offering different kinds of storage tiers within the distributed system, e.g., SSDs, RAID nodes, and SAS/SATA disks, it is possible to offer different levels of performance. Additionally, by extending existing systems based on slower hardware (e.g., SATA disks) with fast components (e.g., SSDs), system performance may be improved. The system must be able to incorporate different kinds of storage technologies and must be able to match these technologies to application requirements. It is clear that data read and write performance is closely linked with aspects such as replica management and the WAN bandwidth available to distribute replicas. We assume that by balancing parameters such as object placement strategy, level of consistency, and level of durability, it is possible to reach a high performance level.
- Dynamic Operation: The level of availability, durability and performance must be configurable per application. This prevents the system from always running at the highest supported level of functionality, which reduces costs. It also allows users, application developers and system administrators to balance cost versus features. Preferably, the system is self-configurable and self-tunable, in the sense that it changes parameters to optimize its own operation. The system must support moving data between different kinds of storage technologies, offering tiered functionality in this way, so that data objects that are accessed frequently are stored on the disks with the highest performance and those that are infrequently accessed are stored on slower disks. It must be possible to manually override this behavior (for instance, to force a database to be available on high performance storage, even when it is not accessed very frequently).
- Cost Effective: It must be possible to build, configure, run, and maintain the system in a cost-effective manner. The system must work with commodity hardware, which means that individual hardware components may not be as reliable as when high-end hardware is used; due to the requirements of scalability, availability, durability and performance, however, the system already must be able to cope with failures. The configuration of the system, as well as its maintenance (including the resolution of failure situations), must be easy and straightforward. Preferably, the operation of the system is energy efficient. License fees for software, when applicable, must be limited.
- Generic Interfaces: The system must offer generic interfaces to applications and clients. In particular, it preferably supports the POSIX file system interface [10] as closely as possible; in that way, a wide range of different applications can be supported. It is understood that the POSIX standard defines local file system operations and semantics and that these may be selectively extended and adapted to better support performance in a distributed setting (e.g., consistency semantics may be relaxed to obtain better write performance). An example of such an extension is the POSIX I/O extension for high performance computing (HPC) [16]. Alternatively, the system may support a block device interface, because such an interface can be used to implement arbitrary file systems. Note that it might be easier to apply smart placement policies with objects (files) than with raw blocks. A brief sketch of POSIX-style access to a distributed file system is given after this list.
- Protocols Based on Open Standards: The system must be built using protocols based on open standards as much as possible. This reduces the chance of vendor lock-in and improves extensibility. It is likely that a storage service based on open standards will, in the long run, be more economical to maintain than a proprietary service.
- Multi-Party Access: The system must support access by multiple, geographically dispersed parties at the same time. This enables collaboration between these parties over long distances on the same data.
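To make the Generic Interfaces requirement concrete, the sketch below shows how an unmodified application would use plain POSIX calls on a distributed file system that is mounted like any local file system. The mount point and file name are hypothetical assumptions for illustration; the point is that no storage-specific API is involved.

```python
import os

# Hypothetical mount point of the distributed file system (e.g., exposed by a
# kernel or FUSE client); the path is an assumption for illustration only.
MOUNT_POINT = "/mnt/distfs"


def write_and_read(name, payload):
    """Store and retrieve data using only standard POSIX-style calls."""
    path = os.path.join(MOUNT_POINT, name)

    # Ordinary open/write/fsync/close, exactly as an existing application
    # would issue them against a local file system.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # durability and replication are left to the storage system
    finally:
        os.close(fd)

    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    print(write_and_read("example.dat", b"measurement data"))
```

An application written against this interface needs no changes when the underlying storage moves from local disks to a wide-area distributed system; only the mount configuration differs.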
2 Overview of existing products

Different kinds of distributed storage systems have evolved over the years. Early products, with initial implementations appearing in the mid-1980s, were based on a client-server paradigm, with a server at a single location storing all the data and multiple clients accessing this data. These network file systems are now very widely used in local area networks, in particular systems based on the Common Internet File System (CIFS) and the Network File System (NFS). Other systems with a more explicit focus on multiple servers that store data are the Andrew File System (AFS) and the DCE Distributed File System (DFS). These early systems have in common that they do not operate well in a wide area network environment and that they are not fault tolerant.

More recently, cloud storage technologies have emerged that operate well in the Internet at large and, in many cases, also provide facilities for durability in the form of data replication. The best known cloud storage services, such as Amazon S3 and Google Storage, are not available as standalone software products and therefore cannot be deployed on one's own hardware. The Hadoop Distributed File System [8] and CloudStore [3] are open source products that can be used, but by default these provide interfaces at the application level (see footnote 1), and work with a single metadata server (which is a single point of failure). Therefore we do not consider these products here.

Footnote 1: Both HDFS and CloudStore have facilities to mount the file system under Linux using FUSE.

Besides traditional distributed file systems and cloud storage systems, a whole range of other products have interesting features and meet at least some of the requirements we described in the previous section. Many of them, however, focus on dedicated environments or are designed for specific applications. An example is OCFS2 [14], which provides a shared file system dedicated to operation in SANs. Another example is dCache [4], which has been developed to store the large amounts of data generated by experiments with the Large Hadron Collider at CERN.

As an initial filter for product selection, we picked those products that claim to be distributed and fault tolerant and to support parallel access. We observed that a broad classification of products can be made by looking at the data primitive used and at the handling of metadata. The data primitive is the form in which data is handled within the system at the most basic level, which can be either the block level or the object/file level. For the handling of metadata, there is a distinction between products that treat metadata separately from regular data by storing it on dedicated nodes, and products that treat metadata in the same way as regular data (i.e., storing metadata and regular data on the same nodes).

In this section we provide an overview of existing products and technologies that may be used to implement a distributed storage system that fulfills the requirements completely or to a large extent. The products are subdivided into those
that are most promising (the candidate products; section 2.1) and those that meet many requirements but are not further considered for evaluation (the non-candidate products; section 2.2). Furthermore, a product such as DRBD (section 2.2.9) may be used as a building block to extend or improve a selected product.

2.1 Candidate products

The candidate products described here match well with the requirements identified in the previous section. The list of candidates will be used to make a final selection of products that will be tested and evaluated in a wide-area distributed environment with different types of hardware and interconnects between the storage nodes. The order in which the candidates are described does not indicate an order of preference.

2.1.1 Lustre

Lustre [12] is a massively parallel distributed file system running on Linux and used at many high-performance computing (HPC) centres worldwide. Originally developed at CMU in 1999, it is now owned by Oracle and the software is available as open source (under a GNU GPL license). The Lustre architecture defines different kinds of roles and nodes, following an object storage paradigm where metadata is separated from the file data. A typical Lustre cluster can have tens of thousands of clients, thousands of object storage servers (OSDs) and a failover pair of metadata servers (clustered metadata is still a work in progress; in the future, metadata servers will be able to form a cluster comprising dozens of nodes). Lustre assumes that OSDs are reliable, i.e., that they use techniques such as RAID to prevent data loss. Servers can currently be added dynamically and the file system is POSIX compliant. Other features that Lustre provides are ADIO interfaces, the ability to disable locking, and direct I/O that is usable for databases. Lustre also has other tunable settings. It can currently be installed on Linux, where it interoperates across all supported processor architectures; support for other operating systems is still being developed.

The main advantage of Lustre is its very high parallel performance; it also has good file I/O and can handle requests for thousands of files. Lustre does not seem to be deployed frequently in clusters that stretch over large distances (see footnote 2), however, which raises questions about its performance and other characteristics when nodes are interconnected by a wide-area network. Additionally, Lustre appears to have little support for tiered operation and for setting policies for placement of files at particular tiers.

Footnote 2: Lustre operation in a wide-area network setting with sites at various locations in the US is currently explored at TeraGrid / PSC (focusing on high-performance computing applications), http://www.teragridforum.org/mediawiki/index.php?title=lustre-wan:_advanced_features_testing
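The object storage paradigm used by Lustre (metadata kept by dedicated servers, file data striped over many object servers) is what yields its parallel throughput. The toy sketch below illustrates the access pattern only: one metadata lookup for the file layout, then stripes fetched from object servers in parallel. All names and data structures here are invented for illustration and do not correspond to Lustre's actual interfaces.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model: a metadata service knows only the layout of a file (which object
# server holds which stripe), while the object servers hold the data itself.
# All identifiers below are illustrative, not real Lustre components.
LAYOUTS = {"results.dat": [("oss1", 0), ("oss2", 1), ("oss3", 2)]}
OBJECT_SERVERS = {
    "oss1": {0: b"stripe-0 "},
    "oss2": {1: b"stripe-1 "},
    "oss3": {2: b"stripe-2"},
}


def read_file(name):
    layout = LAYOUTS[name]  # one metadata lookup ...

    def fetch(entry):
        server, stripe = entry
        return OBJECT_SERVERS[server][stripe]

    # ... after which the stripes are fetched from the object servers in
    # parallel, which is where the aggregate bandwidth comes from.
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(fetch, layout))


print(read_file("results.dat"))
```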
2.1.2 GlusterFS

GlusterFS [7] is a parallel network file system developed by Gluster Inc., which is used primarily for clustered storage consisting of a (potentially large) number of simple storage servers (also referred to as "storage bricks"). GlusterFS is an open source product, available under the GNU GPL license.

GlusterFS stores data at the file level (not at the block level) and, contrary to many other similar products, does not use separate metadata nodes. Instead, the location of files is found through a hash algorithm, which maps file names to storage servers. This algorithm takes into account that storage nodes may join and leave the system dynamically and, according to Gluster, makes the system scale in a linear fashion. All storage aggregates into a single global namespace. GlusterFS has a POSIX file system interface for general purpose access and a dedicated client library for direct access to storage nodes. Most functionality of the system resides at the client, with the server nodes being relatively simple. The client software consists of modules with dedicated responsibilities such as volume management and file replication. All access to the storage nodes is done by clients, i.e., server nodes do not exchange data.

The typical GlusterFS environment seems to be a high-performance, high-capacity datacenter with a fast, low-latency local network (such as InfiniBand RDMA) between clients and servers. Any commodity hardware can be used to implement a storage brick and scale the system to several petabytes. GlusterFS can handle thousands of clients. It also includes a configuration utility. Like Lustre, it is unclear how well GlusterFS will operate in a wide area network. For instance, to which nodes will a client write when files must be stored redundantly? For the sake of durability, the replicas should be far apart, but for the sake of write performance, they should be written to storage nodes close to the client.

2.1.3 GPFS

The General Parallel File System (GPFS) [9] is a commercial shared-disk clustered file system from IBM. GPFS is used by many large supercomputer centers because of its distributed file system capabilities and high speed parallel file access. Other areas in which GPFS is used are streaming digital media, grid computing and scalable file storage.

The GPFS file system works like a traditional UNIX file system, which is very convenient from a user/application perspective. Internally, GPFS works with data blocks instead of objects. To grant concurrent access from multiple applications or nodes to the same file and to keep the data consistent, GPFS uses a token management system at the block level. The size of the blocks is configurable, and data blocks can be striped across multiple nodes and disks, which is beneficial for the throughput of the file system; a simple sketch of this striping idea is given below. To improve performance further, it is possible to connect GPFS to a SAN/NAS via Fibre Channel instead of using local disks. Other performance improvements include caching, read-ahead and write-behind mechanisms.
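The sketch below illustrates the general idea of block striping mentioned above: fixed-size blocks of a file are laid out round-robin over the available disks or nodes, so that large sequential transfers are spread over many devices. The block size and disk names are arbitrary examples for illustration, not GPFS defaults or internals.

```python
# Conceptual illustration of block striping: consecutive blocks of a file are
# placed on different disks/nodes in round-robin order. Block size and disk
# names are invented examples, not GPFS settings.
BLOCK_SIZE = 256 * 1024  # bytes
DISKS = ["disk01", "disk02", "disk03", "disk04"]


def block_location(file_offset):
    """Map a byte offset in a file to (disk, block number on that disk)."""
    block_index = file_offset // BLOCK_SIZE
    disk = DISKS[block_index % len(DISKS)]
    local_block = block_index // len(DISKS)
    return disk, local_block


for offset in (0, BLOCK_SIZE, 4 * BLOCK_SIZE + 10):
    print(offset, "->", block_location(offset))
```

Reading or writing a large file sequentially then touches all disks in turn, which is why striping improves aggregate throughput.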
A very interesting feature of GPFS is its capability to define storage tiers. Different tiers can be created based on the location or performance of the hardware. Data can be moved dynamically between tiers based on policies that have been created (a placement policy makes it possible to determine the best tier based on simple file characteristics). This gives the opportunity to shift data dynamically, depending on actual usage, between faster and slower tiers. When designing a multi-tier GPFS file system, the complexity of the policies and of the storage management increases as more tiers are added. Another interesting feature is the concept of failure groups, which allows GPFS to replicate data between hardware components that do not share a single point of failure, or, for instance, between geographically dispersed locations.

GPFS provides scalable metadata management instead of the centralized metadata server that is common in other file systems. All GPFS nodes of the cluster are involved in metadata management operations: in practice, when a file is accessed on a certain node, that particular node is responsible for the metadata management of that file. In case of parallel access, a dynamically selected node is made authoritative.

The features mentioned above have given GPFS a unique position in the wide-area file system research environment. An example of such usage is TeraGrid [22], in which the nodes are geographically spread across the USA. Such a large cluster setup can be built on top of commodity hardware, although commercial support is only available for specific hardware brands. Access to the data stored in a GPFS environment is granted via these interfaces: Fibre Channel, CIFS, NFS, HTTP, SCP and FTP. Unfortunately, a common protocol such as iSCSI is not officially supported for system-internal data storage.

2.1.4 Ceph

Ceph [2] is a distributed file system originally designed by the Storage Systems Research Center at the University of California, Santa Cruz. It is developed as an open source project, with code under the GNU LGPL license, and, since May 2010, has its client integrated in the Linux kernel. The objectives of Ceph are to provide a fully distributed file system without a single point of failure, with a POSIX-style interface. It claims high I/O performance and a high level of scalability.

Ceph is based on an object storage paradigm, where file data is stored by object storage devices (OSDs) and metadata is stored by metadata servers (MDSs). Contrary to some distributed file systems relying on dumb OSDs, the Ceph OSDs have responsibilities for data migration, replication and failure handling, and communicate with each other. Metadata management is completely distributed, using a cluster of MDSs to handle metadata requests from clients. The operation is adapted dynamically based on the workload generated by the clients (e.g., moving and replicating metadata depending on how often a file is accessed).

An MDS does not keep track of which OSDs store the data for a particular file. Instead, Ceph uses a special function called CRUSH to determine the location of
objects on storage nodes: it first maps an object to a placement group, and then calculates which OSDs belong to that placement group (and provides an ordering of the OSDs within a placement group). While doing so, it takes care of the replication of file data on different OSDs. CRUSH automatically takes into account that the set of storage nodes is dynamic over time. To clients, the data within a Ceph configuration (potentially consisting of thousands of OSDs) is presented as a single logical object store called RADOS.

Replication of data is organized by writing to the first OSD in the placement group, after which this OSD replicates the data to the others. The client receives an ack when all data has reached the buffer caches on all OSDs, and receives a commit when the data has been safely stored on all involved OSDs. RADOS has mechanisms for failure detection and automatic re-replication. Furthermore, Ceph implements a mechanism for recovery in case of system outages or large configuration changes.

The features of Ceph as described above provide a reasonably good match with the requirements. However, a number of issues remain unclear. A major drawback is the immaturity of Ceph as a platform: as far as we know, it has not been widely used in a production environment, and the Ceph documentation explicitly warns about the beta state of the code. Another issue is the uncertainty about the operation of Ceph in a WAN environment (as is the case for other products). The placement of the file data at OSDs does not take into account that links between storage nodes have variable quality (bandwidth, latency). Additionally, the mechanisms for automatic adaptation (adjustment of the placement group to OSD mapping, and failure detection) may operate suboptimally, or worse, in a WAN Ceph configuration.

2.1.5 Candidate product comparison

The comparison below summarizes the main characteristics of the four candidate products.

- Owner: Lustre: Oracle; GlusterFS: Gluster; GPFS: IBM; Ceph: Newdream (?)
- License: Lustre: GNU GPL; GlusterFS: GNU GPL; GPFS: commercial; Ceph: GNU LGPL
- Data primitive: Lustre: object (file); GlusterFS: object (file); GPFS: block; Ceph: object (file)
- Data placement strategy: Lustre: based on round robin and free space heuristics; GlusterFS: different strategies through plugin modules; GPFS: policy based; Ceph: placement groups with pseudo-random mapping
- Metadata handling: Lustre: maximum of 2 metadata servers (more than 2 in beta version); GlusterFS: stored with the file data on the storage servers; GPFS: distributed over the storage servers; Ceph: multiple metadata servers
- Storage tiers: Lustre: pools of object storage targets; GlusterFS: unknown; GPFS: policy defined; Ceph: defined through CRUSH rules
- Failure handling: Lustre: assuming reliable nodes; GlusterFS: assuming unreliable nodes; GPFS: assuming reliable nodes; Ceph: assuming unreliable nodes
- Replication: Lustre: server side (failover pairs); GlusterFS: client side; GPFS: server side; Ceph: server side
- WAN example deployment: Lustre: TeraGrid [22] (scientific data); GlusterFS: no known deployment; GPFS: TeraGrid [22] (scientific data); Ceph: no known deployment
- Client interfacing: Lustre: native client file system, FUSE, clients may export NFS or CIFS; GlusterFS: native library, FUSE; GPFS: native client file system, clients may export NFS, CIFS, etc.; Ceph: native client file system, FUSE
- Node types: Lustre: client, metadata, object; GlusterFS: client, data; GPFS: client, data; Ceph: client, metadata, object

2.2 Non-candidate products

2.2.1 XtreemFS

XtreemFS [26] is a globally distributed and replicated file system that has been developed to make grid data available in a distributed environment. It is developed as part of the XtreemOS EU project [27], which aims at creating an open source grid operating system. XtreemFS is a multi-platform file system, so client and server components can be installed on most common platforms (Linux, OS X, Windows). XtreemFS is an object-based file system, with metadata and regular data stored on different types of nodes. XtreemFS is POSIX compatible, failure tolerant and can be installed on commodity hardware. One big disadvantage of XtreemFS is the lack of support for a tiered storage approach: it will therefore not reach performance at the traditional SAN/NAS level.

2.2.2 MogileFS

MogileFS [13] is an open source distributed file system with a specific focus on data archiving and deployment on commodity hardware. It is fault tolerant (no single point of failure) by spreading data and metadata over different server nodes, where the replication level depends on the type of the file. MogileFS defines three different kinds of nodes (trackers, storage, and database), of which multiple instances may exist in a given configuration. A tracker is responsible for handling client sessions and requests, a database node stores file system metadata, and storage nodes store the actual data. Although MogileFS can be accessed through a variety of APIs and libraries, it does not provide a POSIX or block device interface to clients. It is therefore not suitable for our purposes.
2.2.3 NFS version 4.1 (pNFS)

NFS [17] is a network file system that has been in common use for many years. It allows users to access files over a network in a manner similar to accessing local storage. The protocol is an open protocol with many implementations from different vendors (also open source). NFS version 4 minor version 1 (NFSv4.1) has been approved by the IESG and received an RFC number in January 2010. Apart from bug fixes, the NFSv4.1 specification aims to provide further protocol support that enables users to take advantage of secure, clustered server deployments. It also supports parallel, scalable access to files that are distributed among multiple servers. Various pNFS implementations are currently available (Linux, Solaris, ...). Our impression is that these implementations are unfortunately not yet stable enough to use in our environment. Aside from this, pNFS looks promising and may be reconsidered as a candidate product.

2.2.4 ZFS

ZFS [28] has been designed by Sun Microsystems (now Oracle) and is both a file system and a logical volume manager. ZFS provides support for (amongst others) the following features: very high storage capacities, snapshots, copy-on-write clones, integrity checking, automatic repair, RAID-Z (RAID-5 and RAID-6 via RAID-Z2) and native NFSv4 ACLs. It has been implemented as an open source product and can be freely downloaded. ZFS has many features for error correction, handling hardware failure, etc. Another useful feature is that block devices can be grouped according to the physical implementation (i.e., chassis), which allows the file system to continue in the case of a failure of an entire chassis. The main disadvantage is that ZFS is a local file system, not designed to run in a clustered, widely distributed network environment. This makes ZFS unsuitable for our purposes.

2.2.5 VERITAS File System

The VERITAS File System [20] is a file system that was developed by VERITAS Software (now owned by Symantec) and is capable of running as a cluster storage system. It runs on a variety of operating systems. The file system can perform online defragmentation and resizing, and when running in clustered mode it supports failover between storage nodes. To ensure data consistency within the cluster, VERITAS uses a strict consistency model between storage nodes. A maximum of 32 storage nodes within the cluster is supported. VERITAS is focused on clustered storage within a local network. Due to its strict coherency model and its limit on the number of nodes within a single cluster, it is not suitable to run as a wide area distributed system.

2.2.6 ParaScale

ParaScale [15] is a private cloud solution. The basic belief is that a private cloud should be easy to manage and to scale. Scaling is easy because commodity
hardware can be added to or removed from the cloud when needed without much hassle. Files can be transferred into ParaScale via protocols like NFS, FTP, HTTP or WebDAV. The advantage of these protocols is that most applications support them.

ParaScale can provide massive write bandwidth in parallel across multiple storage nodes. This makes it an ideal solution for archiving, near-line storage and disk backup. It can cluster tens to hundreds of servers together, which can then be used to provide large file repositories with good parallel throughput. ParaScale is a typical cloud storage solution and thus uses only a single-tier architecture. As a consequence, this product is not capable of delivering SAN/NAS-like performance.

2.2.7 CAStor

CAStor [1] is an object-based storage software platform that can run on commodity hardware. It provides high performance and good scalability and is quite cost-effective. CAStor virtualizes storage capacity, creating a single pool of storage. It has the ability to scale easily and is thus able to meet dynamic capacity demands. It has a number of internal algorithms that perform self-management and self-healing. Additionally, CAStor supports operation in a WAN environment. CAStor has a flat, single-tier architecture, which is its main disadvantage considering our requirements. By not dealing with multiple tiers (each providing different levels of performance), it is unlikely that CAStor will be able to support the traditional SAN/NAS performance requirement.

2.2.8 Tahoe-LAFS

Tahoe [21] is an open source distributed file system that supports storing files in a network of peers. It is the enabling technology behind the Allmydata cloud storage company, where it is used to organize the back-end servers in a P2P manner. All data in a Tahoe configuration is written to the storage nodes in encrypted form. Tahoe uses erasure coding to spread the pieces of a file over a number of nodes in a redundant way and to improve data durability. It runs on various popular platforms such as Windows, OS X and Linux, and is now part of the Ubuntu Linux distribution. Tahoe provides very little control over the nodes on which data is stored, which makes it unsuitable for tiered functionality. Furthermore, Tahoe assumes a flat, local network environment and is therefore not suitable to run in a WAN. These aspects make Tahoe unsuitable for our purposes.

2.2.9 DRBD

DRBD (Distributed Replicated Block Device) [5] is a software-based storage replication program that allows a system administrator to mirror the content of block devices across different servers. It is released under the GNU GPL license. DRBD can do this replication in real time, transparently, and either synchronously or asynchronously. DRBD is implemented as a core Linux kernel module. It
exports a block device on top of which a file system can be configured, which makes DRBD quite flexible and versatile. Although DRBD does not offer the features of a complete distributed file system, and is therefore not suitable to build a system matching our requirements, it may be used as a building block that complements other products.
3 Future work

To further explore the characteristics of the candidate products, we will install and configure one or multiple products in a distributed test environment. This allows us to obtain hands-on experience with the chosen system(s) and provides us with markers to better assess the suitability for deployment on a larger scale. In this section we give a short description of the environment we consider suitable for a first-step evaluation.

Figure 1: Example evaluation environment consisting of two sites connected by a high-speed, low latency WAN. The nodes at each site have a heterogeneous configuration and are interconnected through different LAN types.

The objective is to obtain a number of basic performance indicators and to evaluate a few straightforward use cases in a wide area setup with a limited number of nodes. We realize that a limited test environment as proposed here is not capable of a
broad system test investigating aspects such as scalability, large system performance and fault tolerance; such a test would require a very large setup and considerable resources to execute.

In terms of distribution, the evaluation environment must consist of at least two sites, between which there is a reasonable distance, i.e., sites that are not part of the same metropolitan area. A distance of at least 100 km will introduce a packet delay that cannot be neglected when data is accessed and replicated between sites (a rough estimate is given at the end of this section). The WAN connection between the sites must be realized using the SURFnet infrastructure and must have a considerable capacity. Each site must consist of multiple nodes that can take on different kinds of roles in the larger storage system (object storage node, metadata node, client node, etc.). As is the case in data centers, the nodes within a site are connected through a high speed LAN. To explore system characteristics in different circumstances, multiple local interconnect technologies may be used (e.g., Gigabit Ethernet, InfiniBand, etc.). Nodes consist of different kinds of hardware, to reflect that the storage system runs in a heterogeneous setting: different kinds of storage technologies, such as SSDs, SAS disks and RAID controllers, are available in different kinds of nodes. At each site, a sufficient number of nodes can act as client nodes, to generate test load that can stretch the system to its limits.
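As a rough estimate of why the 100 km minimum distance matters: light in optical fibre propagates at roughly 200,000 km/s (about two thirds of the speed of light in vacuum), so the distance alone already adds a noticeable round-trip delay before any switching or protocol overhead is counted. The small calculation below illustrates this; the chosen distances are illustrative examples, not measurements on the SURFnet network.

```python
# Propagation delay only; switching, queueing and protocol overhead come on top.
FIBRE_SPEED_KM_PER_S = 200_000.0  # approximate speed of light in optical fibre


def round_trip_ms(distance_km):
    """Round-trip propagation delay in milliseconds for a given fibre distance."""
    return 2 * distance_km / FIBRE_SPEED_KM_PER_S * 1000


for d in (5, 100, 300):  # metro scale, the minimum test distance, cross-country
    print(f"{d:>4} km -> ~{round_trip_ms(d):.1f} ms RTT (propagation only)")
```

At 100 km the round-trip propagation delay is already about 1 ms, which is large compared to typical LAN latencies and therefore directly visible in synchronous replication and metadata operations between the two test sites.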
4 Conclusion

This report provides an overview of technologies for wide area distributed storage. It is a first step of a larger project that aims at designing a distributed storage facility that combines features of traditional SAN/NAS storage and cloud storage for applications in research and corporate datacenters. This report concludes that four products (Lustre, GlusterFS, GPFS and Ceph) are promising enough to investigate further. Surprisingly, only one of these products, namely GPFS, has a purely commercial license; the others are provided with an open-source license. GPFS and Lustre are the most mature products and have a known track record as file systems in high performance computing environments. GlusterFS is also reasonably widely deployed, but Ceph is still under development. Unfortunately, all promising products have been developed with a different mindset than what we are aiming for in this project, so it might be a challenge to tune and tweak these products to match our needs.

So far, only literature has been consulted to decide which products are promising. A next step in this project will be a more hands-on investigation of these products to obtain more in-depth knowledge. This investigation will be conducted in a distributed test environment within the SURFnet network. The outcome of this investigation, based on the requirements that have been set, will be a shortlist of two products that are most viable for an implementation phase. Finally, a proof of concept architecture will be built in cooperation with a pilot partner, and after a pilot phase the final product will be selected. A report on the experiences with this product and a proposal for a distributed storage design will also be delivered.
5 References

[1] CAStor storage software, http://www.caringo.com/products_castor.html
[2] Ceph open source distributed storage, http://ceph.newdream.net/
[3] CloudStore home page, http://kosmosfs.sourceforge.net/
[4] dCache home page, http://www.dcache.org/
[5] DRBD home page, http://www.drbd.org/
[6] S. Gilbert and N. Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services", ACM SIGACT News, Vol. 33, Issue 2, 2002
[7] Gluster Community, http://www.gluster.org/
[8] Hadoop Distributed File System, http://hadoop.apache.org/hdfs/
[9] IBM General Parallel File System, http://www-03.ibm.com/systems/software/gpfs/index.html
[10] Institute of Electrical and Electronics Engineers (IEEE) Standard 1003.1-2008 / POSIX:2008, Base Specifications, Issue 7, December 2008
[11] Isilon Systems, http://www.isilon.com/
[12] Lustre home page, http://wiki.lustre.org/
[13] MogileFS home page, http://danga.com/mogilefs/
[14] OCFS2 project home, http://oss.oracle.com/projects/ocfs2/
[15] ParaScale cloud storage software, http://www.parascale.com/
[16] POSIX Extensions for High-Performance Computing, http://www.pdl.cmu.edu/posix/
[17] S. Shepler, M. Eisler, and D. Noveck, "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, Internet Engineering Task Force, January 2010
[18] A. Shoshani and D. Rotem (Eds.), Scientific Data Management: Challenges, Technology and Deployment, Chapman & Hall/CRC, 2010
[19] SURFnet lightpaths, http://www.surfnet.nl/en/diensten/netwerkinfrastructuur/pages/lightpaths.aspx
[20] Symantec storage solutions, http://www.symantec.com/
[21] Tahoe: The Least-Authority Filesystem (LAFS), http://allmydata.org/
[22] TeraGrid, https://www.teragrid.org/
[23] F. Travostino, P. Dasit, L. Gommans, C. Jog, C. de Laat, J. Mambretti, I. Monga, B. van Oudenaarde, S. Raghunath, and P. Wang, "Seamless Live Migration of Virtual Machines over the MAN/WAN", Future Generation Computer Systems, Vol. 22, Issue 8, 2006
[24] M. Wiesmann, F. Pedone, A. Schiper, B. Kemme, and G. Alonso, "Understanding Replication in Databases and Distributed Systems", in Proceedings of the International Conference on Distributed Computing Systems (ICDCS '00), April 2000
[25] T. Wood, K. Ramakrishnan, J. van der Merwe, and P. Shenoy, "CloudNet: A Platform for Optimized WAN Migration of Virtual Machines", University of Massachusetts Technical Report TR-2010-002, January 2010
[26] XtreemFS: a cloud file system, http://www.xtreemfs.org/
[27] XtreemOS: a Linux-based Operating System to Support Virtual Organizations for Next Generation Grids, http://www.xtreemos.eu/
[28] ZFS home page, http://www.opensolaris.org/os/community/zfs/