A Comparison of Fault-Tolerant Cloud Storage File Systems

Steven Verkuil
University of Twente
P.O. Box 217, 7500 AE Enschede
The Netherlands

ABSTRACT
There are many cloud storage file systems that guarantee fault-tolerance, and fault-tolerance is implemented in several different ways. This paper aims to find the benefits and drawbacks of existing fault-tolerant file systems by defining criteria on which fault-tolerant file systems can be graded. Several distributed file systems are compared to discover how the criteria are satisfied. The research concludes with an overview of how the different file systems, each powered by their own distinct architecture, perform in an environment that is prone to errors. It is shown that not all file systems perform equally well regarding the criteria, underlining the need to evaluate fault-tolerance behavior when choosing a file system.

Keywords
Fault-tolerant, distributed, cloud, storage, comparison, file system, HDFS, GlusterFS, XtreemFS.

1. INTRODUCTION
Fault-tolerant data storage is becoming more relevant as businesses and individuals move their data to the cloud. Many cloud providers exist these days and provide out-of-the-box solutions for basic storage needs; examples of such cloud storage providers are Dropbox, box.net and Google Drive. However, not all situations allow for third-party hosting of data, for example when storing privacy-sensitive data that must by law be stored in a certain country, or when a third-party distributed network does not allow fast enough access times for data-intensive applications. Therefore, research into setting up robust cloud storage networks is relevant for many businesses and cloud operators.

Fault-tolerance is an important aspect of cloud storage because it concerns the robustness of the data that is stored [3]. Data is distributed over multiple machines which are prone to network failures. If a server containing data becomes unavailable in a cloud environment, it must be prevented that dependent services also become unavailable. This is the main reason that robustness in cloud storage file systems is actively researched. Although ample research has been done on cloud storage file systems [3, 4, 9], no detailed study comparing the fault-tolerance mechanisms used in cloud storage exists at the moment of writing.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 19th Twente Student Conference on IT, June 24th, 2013, Enschede, The Netherlands. Copyright 2013, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.

This paper aims to find out what fault-tolerant cloud storage file systems currently exist and to compare how they implement fault-tolerance. It describes a methodology to compare different fault-tolerant file systems, with the aim of facilitating the selection of file systems by cloud storage operators. This paper answers the following research question: What are the benefits and drawbacks of existing fault-tolerant cloud storage file systems? The main research question is answered by evaluating related sub-questions. These questions are:
1. Which fault-tolerant architectures for cloud storage file systems are currently available?
2. Which criteria can be used to compare fault-tolerant cloud storage file systems?
3. How do existing fault-tolerant cloud storage file systems satisfy these criteria?

The first research question is answered by performing a literature study on existing fault-tolerant file system architectures. Because of the vast number of available file systems, and due to the limited amount of time, this research focuses on three commonly used open-source fault-tolerant file systems. The following file systems are chosen to be included in the research:

- Apache Hadoop Distributed File System (HDFS), release 1.1.2
- GlusterFS
- XtreemFS, release 1.4

The above three file systems were chosen after a literature search for distributed file systems and selected for their popularity in usage. An extensive analysis of the documentation of these file systems and of relevant literature is carried out to answer question one. Several additional papers focusing on different fault-tolerant architectures are studied to understand the inner workings of the fault-tolerance mechanisms.

The literature analysis creates a basis for question two, in which we elaborate on the term "fault-tolerance". This paper focuses on the failure of one or multiple nodes in a distributed network at any given time. The possibility that the actual data is corrupted during storage transactions is not covered in this document due to time and resource constraints. Criteria that can be used for the comparison of different fault-tolerant cloud file systems are defined. These criteria are related, but not limited, to aspects such as (1) the ability to recover from the concurrent loss of multiple machines, (2) the ability to handle interruptions during a read or write operation on a file, or (3) the time it takes for a node to be synchronized after recovery from a network failure.

The third research question is answered by conducting a comparison on the defined criteria. How the criteria are satisfied by the file systems is derived by querying available documentation and referencing other research. Answering the third research question also involves a basic experiment in which some of the criteria are evaluated in order to compare the three file systems. The experiment setup involves four virtualized Linux machines which together run a distributed file system installation. A networking error is then simulated on one of the machines and it is investigated what impact this has on the files that are stored. The experiment is thus concerned with the difference between the files that are stored (block 1 in Figure 1) and with how long it takes for the files to become accessible again after a crash has occurred (block 3 in Figure 1).

Figure 1. Schematic test setup

Many research studies regarding fault-tolerance have already been done. Oriani et al. [12] have proposed a way to improve fault-tolerance for the Hadoop file system. Wang et al. [15] researched removing single points of failure in an effort to improve fault-tolerance for Hadoop-based file systems. General research on the fault-tolerance behavior of HDFS was done by Evans [4]. Hupfeld et al. [9] have researched object-based file systems in grids with regard to the XtreemFS architecture and how they benefit from fault-tolerance. Pardi et al. [13] researched GlusterFS scalability for large data processing. Chan et al. [3] proposed a coding-based fault-tolerant network storage system which is able to recover files after storage interruption. However, no detailed research has been done on comparing the fault-tolerance mechanisms used in cloud storage systems. This research contributes to obtaining an understanding of criteria that can be used by cloud providers to compare different fault-tolerant architectures. Furthermore, a comparison is provided of how the different file systems satisfy these criteria.

This paper is organized as follows. In Section 2 the different architectures for fault-tolerance are explored and explained for each of the three file systems, answering question 1. In Section 3 the criteria for comparison are described, answering question 2. A qualitative comparison between the different architectures is also done in that section, partially answering question 3. In Section 4 the setup for the experiments is discussed, which is needed to fully answer question 3 in Section 5. Section 6 contains the conclusions of the research and future work.

2. FAULT-TOLERANT ARCHITECTURES

2.1 HDFS
The Hadoop Distributed File System (HDFS) is a file system built upon the open-source Apache Hadoop framework [2]. The Apache Hadoop framework, together with MapReduce (a computational paradigm from Google), provides a basis for applications which need to process large amounts of data in a distributed manner. HDFS is capable of handling multiple petabytes of data [14] and is used by companies that require a large storage infrastructure, such as Facebook and Yahoo. It is important to understand the basic architecture of HDFS before investigating its fault-tolerance capabilities. A typical HDFS cluster contains two important types of nodes: the NameNode and the DataNodes [2].
A cluster contains one NameNode and one or more DataNode machines. The NameNode is a master server responsible for managing the file system namespace and regulating access to data by clients. When a client uploads a file, it is split into one or more blocks which are then stored on a set of DataNodes. These DataNodes are also responsible for servicing read and write requests from the file system's clients [2]. Figure 2 shows a schematic overview of the HDFS architecture.

Figure 2. HDFS architecture overview, based on [2]

The DataNodes shown in Figure 2 are responsible for replication, which is the primary fault-tolerance mechanism of HDFS. Each file can have a certain replication factor, indicating how many copies of all blocks associated with the file should be stored in the HDFS cluster.
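As a brief illustration, the default replication factor is typically configured in hdfs-site.xml and can also be changed per file from the command line; this is a minimal sketch and the file path used is hypothetical:

    # hdfs-site.xml (excerpt): default number of replicas kept for each block
    #   <property>
    #     <name>dfs.replication</name>
    #     <value>3</value>
    #   </property>

    # change the replication factor of an already stored file to 3;
    # -w waits until the requested factor has actually been reached
    hadoop fs -setrep -w 3 /data/example.bin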

Once a block is written to a DataNode it is instantly replicated to another node, until the replication factor is satisfied. For each 4 KB of data the DataNode receives, it writes the same 4 KB of data to another DataNode. The order of DataNodes is determined by the NameNode when setting up the initial file write transaction for the client. The process of passing on 4 KB chunks repeats until the replication factor is met. The data is effectively pipelined from one node to the next, a process called replication pipelining [2].

With this basic understanding of HDFS, we can also identify the first shortcoming of this architecture with regard to network fault-tolerance: the NameNode is a single node in the network for which there is no duplicate. Once the NameNode goes offline, no transactions can be made by any of the clients, effectively losing track of all stored data. This single point of failure is a drawback of the HDFS implementation [4]. However, it should be mentioned that some work has been done on replicating the NameNode to overcome this single point of failure [15].

2.2 GlusterFS
GlusterFS is an open-source distributed file system designed to handle enormous amounts of data storage [13]. GlusterFS is maintained by the multinational software company Red Hat and is actively deployed in cloud services. The GlusterFS architecture aggregates several disk and memory resources into a single global namespace with one common mount point on a Linux machine. Thousands of applications and clients can then connect to the GlusterFS file system via this mount point and interact with the stored data [6]. GlusterFS is built upon the FUSE project [5], which allows software to create virtual file systems and integrates with Linux kernels 2.4.x and 2.6.x. GlusterFS is essentially a layer on top of existing file systems such as ext4 and can scale to several petabytes of storage available under a single mount point [8].

In GlusterFS the elemental storage units are called bricks. Bricks store data via translators on lower-level file systems, and a server can host one or more bricks. A Trusted Storage Pool (TSP) can be created to combine multiple storage servers into a distributed volume. GlusterFS can work with three different types of volumes, which are discussed briefly below.

Distributed Volumes spread files randomly across the bricks in the volume. There can be multiple servers running one or more bricks; however, each file is only stored once, and therefore server failure can result in serious loss of data.

Replicated Volumes create copies of files across multiple bricks in the volume. They can be configured such that a file copy is always placed on a different server, and not on a different brick on the same server, in order to better protect against data loss.

Striped Volumes stripe file data across multiple bricks. A file is split up into segments and each segment is stored on a brick. This allows for a significant speedup in high-concurrency environments.

Several variations on the types discussed above are also possible, for example Distributed Striped Volumes or Striped Replicated Volumes. However, since this research is only concerned with fault-tolerance, the focus will be on Replicated Volumes for the rest of this research. Replicated Volumes are the most fault-tolerant type of storage structure that GlusterFS supports [8].
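As an illustration of this setup (the host names and brick paths below are hypothetical; the concrete command used in the experiments is listed in Appendix A.2), a three-way Replicated Volume could be created and mounted roughly as follows:

    # create a volume with three replicas, one brick on each of three servers
    gluster volume create gv0 replica 3 \
        server1:/export/brick1 server2:/export/brick1 server3:/export/brick1
    gluster volume start gv0

    # mount the volume on a client machine through the FUSE-based GlusterFS client
    mkdir -p /mnt/gv0
    mount -t glusterfs server1:/gv0 /mnt/gv0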
Figure 3 illustrates a typical setup using a Replicated Volume in GlusterFS, in which a client writes three files that are each stored on the bricks of two servers.

Figure 3. GlusterFS Replicated Volume, based on [8]

When a volume is created in GlusterFS, a so-called replica count is given as a parameter during setup. This replica count specifies how many copies of each file should be stored inside the Replicated Volume. The replica count should match the number of bricks inside the volume, so assigning one brick per server in a volume ensures maximum protection against server failure, since files are distributed amongst many different machines [8].

2.3 XtreemFS
The XtreemFS file system is an open-source object-based file system for wide-area infrastructures [9]. XtreemFS uses replication to implement fault-tolerance by maintaining replicas of files on physically different servers [1]. The XtreemFS architecture prescribes three types of servers which should all be present to form a working installation [1]. The first type is the Directory Service (DIR). The DIR is the central registry for all services in XtreemFS and contains the configuration settings and a list of peers. It is used by the Metadata and Replica Catalog (MRC) to discover storage servers. The MRC, the second type, contains the directory tree and file metadata, and is also responsible for authentication and authorization of file access. The third type is the Object Storage Device (OSD). The OSD stores the actual objects of files and interfaces with Linux/Windows clients for read and write operations [1]. Figure 4 illustrates the XtreemFS layout.

Figure 4. XtreemFS architecture, based on [1]

XtreemFS allows all three types of servers to be replicated [1]. Whereas OSD replication is the obvious way to implement fault-tolerance for stored files, MRC and DIR replication can also be used to significantly increase the file system's reliability. For example, the MRC server is to some degree similar to the NameNode in the HDFS file system (see Section 2.1): when the NameNode fails, all metadata is lost and HDFS cannot serve files any longer.

However, when replicas of the MRC server are maintained analogously, other MRC servers can take over in case of failure, avoiding file system disruptions.

The implementation of replication is similar for OSD, MRC and DIR servers. When an XtreemFS client connects, a lease is granted and a primary replica is identified for the transaction. This primary replica accepts all updates from the XtreemFS client and applies them to all other replica instances in the same order as received. When the primary node fails, for example due to a networking error, the lease will eventually expire and one of the other replica servers can become primary [1]. Once the lease is reassigned, the transaction is restored up to the point where the previous server left off.

Upon connection of a new user machine, the XtreemFS client software mounts one of the MRC volumes in a local directory. This is done via the FUSE user-level driver, which was briefly introduced in Section 2.2. The freshly mounted MRC volume is used to find the location of objects on the OSDs and their associated metadata, such as file size, file type, modification date and ownership information. After acquiring the location of the file that the user wishes to read or write, the XtreemFS client sets up a parallel read/write connection to the OSDs that correspond to the file being processed. XtreemFS uses striping for parallel reading and writing of the objects that are part of a file. This increases data throughput without compromising the available data operations [10].

XtreemFS allows three different policies to be used when storing files onto the file system. The first policy does not allow replicas to be made and simply stores each file only once. The second policy, WaR1, stands for "write all, read one": the number of replicas is configured by the replication count during setup, and before writing a file it is checked whether the required number of servers is online to guarantee the replication count. If this is not the case, the operation fails and the write is cancelled. The final policy, WqRq (Write Quorum, Read Quorum), applies majority voting to writing and reading a file: the majority of the servers has to be available when writing a file to the file system. This is the most fault-tolerant strategy [1] and is used for the rest of this research.
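The commands below, taken from the experiment configuration in Appendix A.3, sketch how a volume is created, mounted and switched to the WqRq policy with three replicas; the host name and volume name are placeholders:

    # create an XtreemFS volume on the MRC and mount it via the DIR
    # (DIR and MRC run on the same master node in the test setup)
    mkfs.xtreemfs master-node-ip/volume
    mkdir -p /xtreemfs
    mount.xtreemfs master-node-ip/volume /xtreemfs

    # enable quorum-based replication (WqRq) with three replicas per file
    xtfsutil --set-drp --replication-policy WqRq --replication-factor 3 /xtreemfs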
2.4 Architectural differences
As becomes clear from the previous sections, one of the most distinct differences between the architectures of the three selected file systems lies in how metadata is stored. The metadata is separated from the actual data storage in the HDFS and XtreemFS architectures, whilst it is kept together with the data in the GlusterFS file system. The metadata server implementations also differ: the XtreemFS architecture allows the metadata server to be replicated, whilst the HDFS architecture does not allow metadata replication, leaving a single point of failure in the file system. In all file systems the actual fault-tolerance behaviour is obtained by replicating the bytes of data being stored. When a single storage machine fails, other machines still contain a replica of the stored data and can serve that copy.

3. COMPARING FAULT-TOLERANT ARCHITECTURES
Several different approaches to fault-tolerance are deployed in the file systems Apache Hadoop (HDFS), GlusterFS and XtreemFS. In order to compare these three file systems, a set of criteria has to be defined. These criteria are used for a qualitative comparison in Section 3.2 and a quantitative comparison in Section 5.

3.1 Criteria
This research is concerned with the fault-tolerance mechanisms of distributed file systems regarding networking issues. Six criteria related to networking issues have been identified which can be used to compare fault-tolerant architectures. The first three criteria are related to node network failure, the subsequent two criteria are related to network interruptions during a transaction, and the last criterion is related to the file system's ability to recover from a network outage. These criteria were identified by evaluating the different phases of a network transaction in which the file system could be interrupted: a fault-tolerant file system should be able to handle networking-related errors before a transaction, during a transaction and after a transaction.

Some terminology is used to define the criteria. A node refers to a Linux machine running some (specific) file system software component. A storage node is a node which is dedicated to actually storing the bytes of a file. The term network relates to the network connection of a single node, which is used by that node to communicate in a distributed way with the other nodes that are part of the file system. The following six criteria are identified for this research:

1. Random node network failure. The file system's ability to handle the network loss of a single, randomly chosen node without compromising file availability.
2. Storage node network failure. The file system's ability to cope with the loss of a single storage node without losing the capability to restore the original file from other, still available, nodes.
3. Multiple storage node network failure. The file system's ability to successfully respond to client file requests whilst coping with the concurrent loss of two or more storage nodes.
4. Write interruption. The file system's ability to continue writing the file served by the client when the primary storage node becomes unavailable during the transaction.
5. Read interruption. The file system's ability to continue serving a read request when the primary storage node that serves the request becomes unavailable during the transaction.
6. Re-replication time. The file system's ability to create a new replica of a file on an available storage node when a (primary) node containing the file disconnects due to a networking failure.

The first three criteria are concerned with the network failure of specifically or randomly chosen nodes. A truly fault-tolerant file system should be able to handle the loss of one or more nodes without compromising system stability [4]. Criteria four and five are related to user interaction: applications that rely on files stored on a distributed cloud file system can be interrupted if a read or write operation on a specific file fails due to a node networking error. Interruption of applications can be prevented if a file system is able to handle node networking failures in real time. The last criterion is concerned with the data redundancy factor. Each file should have multiple copies to guarantee maximal availability and resistance against networking errors.

If a node designated to contain a copy of a file recovers from a network crash, the file system should update it with all files that need to be replicated. Maximizing the number of file copies in the system at any given point in time increases storage robustness [14].

Most of the six identified criteria regarding network fault-tolerance allow for a literature-study approach to compare how the different file systems behave in the different scenarios. File system architectures enforce a certain behavior upon the events that occur, and this behavior is thoroughly documented and tested by the designers of the file systems and by the users that have deployed them in their own computer systems. In Section 3.2, criteria 1-5 form the basis for a literature-based comparison between the three file systems. In Section 4 an experiment is described to discover how the file systems satisfy the sixth criterion.

3.2 Qualitative comparison
In this section we perform a qualitative comparison based on the first five criteria described in the previous section. For each of the criteria, numbered 1-5, we describe the behavior of each of the three file systems, and each file system is graded bad, fair, good or N/A. A bad grade means that the file system does not satisfy the criterion at all. A fair grade is given if the file system satisfies the criterion in most situations. A good grade means that the criterion is satisfied even in the worst case, in which a large part of the file system fails to operate due to network failures. The N/A grade is used when the criterion cannot be applied.

3.2.1 Random node network failure
As we recall from Section 2.1, the HDFS architecture consists of two types of nodes, the NameNode and the DataNodes, and a random node network failure means that either of these node types can fail. If the NameNode fails, the entire file system is compromised and will be unable to serve files [15]. This single point of failure is a weakness in the HDFS file system, and therefore HDFS is unable to guarantee correct functionality after a random node network failure.

Mikami et al. [11] explain in their research that GlusterFS does not store separate metadata and does not have a metadata server. All servers that host bricks are able to locate any piece of data without looking it up in an index or querying other servers. Therefore the failure of a randomly chosen node does not influence file availability if Replicated Volumes are used.

XtreemFS allows all three types of servers (DIR, MRC and OSD) to be replicated for redundancy [1]. Assuming that this functionality is exploited, the system allows for full fault-tolerance, and at any given point in time a random node can fail without dependent applications even noticing.

Concerning random node network failure, HDFS is rated bad because the failure of a single NameNode renders the entire file system useless. GlusterFS and XtreemFS are both considered good because they allow a random node to disconnect without disrupting the file system.

3.2.2 Storage node network failure
HDFS has DataNodes that are responsible for storing blocks of data. According to Oriani et al. [12], the HDFS architecture can handle the failure of a single DataNode without compromising file availability, due to the fact that HDFS replicates blocks to at least three different DataNodes.
GlusterFS distributes data over mirrors using synchronous writes [7]. A single storage server can fail without consequences for the rest of the file system; stored data remains available as long as at least one node containing the file is online.

Hupfeld et al. [9] show that the XtreemFS architecture is capable of replicating stored files across multiple OSD instances. If a single OSD fails, the data is still retrievable from a duplicate stored on a different OSD, as long as the majority of the servers remains online [1].

All three architectures allow data to be stored redundantly and are therefore resistant against the failure of a single storage node. All three file systems are consequently classified as good.

3.2.3 Multiple storage node network failure
HDFS is capable of handling multiple DataNode failures simultaneously. File availability after the failure of several DataNodes depends on the replication factor that is set when the file is first written [2]. Once a DataNode becomes unavailable, the NameNode is triggered to register a new copy of its blocks on a still available DataNode in an attempt to honor the replication count set by the user [14].

GlusterFS maintains a parallel connection to all storage servers in order to quickly read and write data. In a replicated environment each file is stored n times, where n is configured in the GlusterFS setup. As long as at least one storage server containing the file is online, the file can be retrieved without the dependent application being interrupted [8].

The availability of files in XtreemFS depends on the replication policy chosen during setup. The WqRq policy, described in Section 2.3, is the most fault-tolerant policy; file availability using the WqRq policy is only guaranteed when the majority of the replicas is available [1].

HDFS is classified as good because multiple storage nodes can disconnect, and the failure of a storage node triggers the re-replication process instantly. GlusterFS allows a large number of nodes to disconnect before dependent applications fail; assuming a replication count of three or more copies, GlusterFS is also considered good. XtreemFS is considered fair because it depends on the majority of the storage nodes being online to ensure data availability.

3.2.4 Write interruption
Riahi et al. [14] showed that HDFS is able to handle interruptions during the write operation of a file to the file system. If the primary DataNode to which the file is being written becomes unavailable during the write operation, another DataNode is selected and a new attempt to write the blocks of the file is made. An upper bound on the number of retries can be set in the HDFS configuration file, and the system will retry the write operation until the transfer has completed successfully.

The GlusterFS client maintains parallel connections to all GlusterFS servers. Data being written to a Replicated Volume is written to all associated servers at the same time. If a server disconnects during the write operation, all other parallel connections stay alive, so the file is still written to all other servers that are online. Applications depending on GlusterFS are not interrupted in any way [7].

XtreemFS utilizes the so-called hot-backup paradigm, which assigns a new server to the job if the primary server fails [1]. This behavior, combined with parallel writing to multiple OSDs, ensures that the primary server can fail without the transaction being invalidated.

All three of the evaluated file systems provide a mechanism, either at the server side or at the client side, which allows for real-time recovery of write operations, and they are therefore graded good.

3.2.5 Read interruption
HDFS is capable of handling DataNode failures during read operations. Riahi et al. [14] showed that when a disruption occurs on the primary DataNode during a read operation, the file system is able to automatically select a new DataNode that serves a copy of the requested data block.

GlusterFS is able to handle read interruption automatically and without disruption to the application served by the file system. GlusterFS utilizes parallel connections and dynamically requests data from another connected server if the I/O operation is disrupted [7].

The XtreemFS hot-backup functionality ensures that a read operation performed by an application is handled with a minimal amount of disruption when the primary storage node fails. XtreemFS maintains a lease for the transaction which expires when a server goes offline. The lease is then assigned to an OSD containing a copy of the file and the read operation is continued with minimal delay [1].

Each of the file systems provides a way to continue serving read requests for a specific file without interrupting the client that is reading the file. Therefore all of the tested file systems are considered good at handling read interruptions due to networking failures.

In the subsections above, the criteria regarding fault-tolerance were evaluated against the HDFS, GlusterFS and XtreemFS architectures. Table 2 in Section 6 summarizes the qualitative comparison results for criteria 1-5 given in Section 3.1. Keep in mind that only fault-tolerance regarding networking issues is evaluated in this comparison; different results might be obtained when testing against other issues, for example partial data corruption caused by disk errors.

4. EXPERIMENTS
An experiment has been performed to compare HDFS 1.1.2, GlusterFS and XtreemFS 1.4 on the remaining criterion, re-replication time. For this experiment, four virtual machines running Debian Squeeze are set up for each file system. Each virtual machine has one dedicated 2.53 GHz CPU core and 1 GB of RAM assigned to it. The virtual machines are connected locally via a virtual 1 Gbit network connection. One of the four virtual machines runs the file system client and other related services; the other three servers are part of the storage infrastructure. The virtualization software used to run and configure the virtual machines is Oracle VM VirtualBox 4.2, and the host machine is a 64-bit Windows 7 installation.

In order to test the re-replication time criterion, different sets of files containing randomly generated contents are written to the file system and thus distributed over the different storage machines. For the purpose of this experiment, 12 different sets of files are used, listed in Figure 5 (a sketch of how such sets can be generated follows the list). The sets vary in both file size and file count in order to identify how these parameters influence the re-replication time in comparison with the other file systems. Possible synchronization time differences are highlighted when reviewing the test results in later sections of this paper.

1. Set of 10 files, 1 Megabyte each
2. Set of 20 files, 1 Megabyte each
3. Set of 50 files, 1 Megabyte each
4. Set of 70 files, 1 Megabyte each
5. Set of 90 files, 1 Megabyte each
6. Set of 100 files, 1 Megabyte each
7. Set of 10 files, 5 Megabytes each
8. Set of 20 files, 5 Megabytes each
9. Set of 50 files, 5 Megabytes each
10. Set of 70 files, 5 Megabytes each
11. Set of 90 files, 5 Megabytes each
12. Set of 100 files, 5 Megabytes each

Figure 5. List of different file sets used
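As an illustration, a file set of this kind can be generated with a small shell loop; the output directory and file names below are hypothetical:

    # generate set 1: 10 files of 1 MB each, filled with random data
    mkdir -p /tmp/set01
    for i in $(seq 1 10); do
        dd if=/dev/urandom of=/tmp/set01/file_$i.bin bs=1M count=1
    done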
Each test run starts by storing half of the files while all three storage nodes are online. Then one storage machine is disconnected from the network, such that the number of reachable replicas falls below the configured minimum of three servers. After this single server is disconnected, the other half of the files is written to the file system. The server is then put back online and it is measured how long it takes for the file system to place the newly written files on the reconnected node, such that the minimum of three copies of each file is met. Measurements are done by regularly polling log files or status commands to see how long it takes for the synchronization to complete. The time is measured in seconds and gives a good indication of how fast a file system is able to bring a reconnected node up to date.

The following grading for the re-replication time criterion is used. A good grade is given when the file system is able to synchronize the files within an average time of ten minutes. The ten-minute bound is chosen because it allows for a few minutes of network recovery and message propagation, and gives room for possible timeouts used inside the file system. It is assumed that ten minutes is an acceptable amount of time when operating in a real-life environment with hundreds or even thousands of servers, in which a single node is never the critical factor for file availability. A fair grade is given when synchronization takes longer but still completes within 30 minutes. A bad grade is given when no synchronization has happened after 30 minutes of idle waiting time; it is then assumed that the file system is unable to provide the reconnected node with the files that were written during the network outage. The N/A grade is given when the criterion cannot be applied to the file system.

Each file system has many relevant configuration parameters and differs in installation procedure. For this research the default configuration is used and, where needed, adjusted for fault-tolerance purposes. A global overview of how the experiment is performed for each file system is provided in Section 4.1. Detailed installation procedures and relevant parameters for each file system are given in Appendix A. The replication count is set to three copies on all file systems to allow for a fair comparison.

4.1 File system setup

HDFS setup
The HDFS file system is installed and configured on all four nodes as described in Appendix A.1. For each test run, the node running the file system client uploads half of the file set and then disconnects a single DataNode slave; the disconnection is simulated by dropping the node's network traffic, as sketched below and detailed in Appendix A.1.
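A minimal sketch of this disconnect/reconnect mechanism, based on the commands listed in Appendix A.1:

    # on the selected slave node: simulate a network failure by dropping all traffic
    iptables -A INPUT -j DROP && iptables -A OUTPUT -j DROP

    # later, restore connectivity by flushing the firewall rules again
    iptables -F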

Once the second half of the files has been uploaded, the slave node is reconnected and the log file of the reconnected DataNode is polled every second. The time difference between the moment of reconnection and the moment of sync completion can be derived from the log file and is stored for further analysis once the count of received blocks matches the total number of blocks for the current run.

GlusterFS configuration
GlusterFS is installed and configured as described in Appendix A.2. The node running the GlusterFS client software uploads half of the files and then blocks the connection to a chosen storage brick. The connection is restored after the second half of the files has been stored on the file system. The GlusterFS heal info command, which reports information about the number of replicas, is then polled every second, and the time it takes for the replication process to complete is stored. Note that polling the heal information does not itself trigger the replication process [8].

XtreemFS configuration
The precise XtreemFS configuration parameters are given in Appendix A.3. For each test run, the master node running the client, MRC and DIR services starts by uploading half of the file set and then, analogously to HDFS and GlusterFS, disconnects a single slave node. The second half of the files is then written to the file system and the disconnected node is reconnected. The master node polls the log file on the slave node every second and stores the timestamp at which the newly written files are synchronized.

5. RESULTS AND ANALYSIS
This section provides the results of the experiment described in Section 4. The experiment defined 12 sets of test data to be used in test runs. Each test run is executed ten times in order to obtain an average, meaning that a total of 120 measurements is done for each file system. The average time in seconds it takes for a file system to finish replicating the files to a previously disconnected node after it comes back online is given in Table 1. Each row represents a set of files as previously listed in Figure 5. The XtreemFS architecture does not lend itself to testing the re-replication time criterion, for reasons explained later in this section, and did not produce any time measurements.

Table 1. HDFS and GlusterFS test run averages
            HDFS       GlusterFS
Set 1          sec     487 sec
Set 2          sec     505 sec
Set 3          sec     504 sec
Set 4          sec     490 sec
Set 5          sec     510 sec
Set 6          sec     502 sec
Set 7          sec     504 sec
Set 8          sec     503 sec
Set 9          sec     498 sec
Set 10         sec     504 sec
Set 11         sec     513 sec
Set 12         sec     488 sec

Overall it can be seen that the re-replication times are rather high, in the order of hundreds of seconds. One might expect the tested file systems to show timings in the order of seconds, given the relatively small number of files used in the experiment setup and the capability of the file systems to handle terabytes of data. It is however important to understand that this experiment measures the total time before the files are synchronized to the reconnected node; the largest part of the recorded timings is simply waiting time before the actual synchronization starts.

Figure 6 illustrates the average re-replication times for files of 1 MB and 5 MB when the number of files is increased; the 95% confidence intervals are also shown.

Figure 6. HDFS and GlusterFS test runs

For HDFS, the experiment does not show a linear relationship between increasing the file size or file count and the delay before the synchronization is completed. The HDFS architecture allows for a possible explanation for the absence of such a linear relationship.
Each DataNode which holds a copy of a file is able to synchronize with the DataNode that encountered the network disconnection, and each DataNode might have a different timeout value for checking file integrity via so-called heartbeat messages [2]. It could also be that HDFS synchronizes large file sets more efficiently than smaller sets, although this statement cannot be backed by literature or documentation. Further research is needed to investigate the observed behavior of the HDFS file system in more detail. Because the HDFS file system is capable of replicating files after a networking failure within a reasonable amount of time, it is graded good regarding the re-replication time criterion.

The GlusterFS experiment results show an almost constant synchronization time for Replicated Volumes after they recover from a networking error. An explanation for this behavior is that GlusterFS uses a constant timeout [8] to check for file integrity, and that internal transfer rates are high enough not to cause a significant time difference between the performed test runs. Because GlusterFS is able to synchronize data within an acceptable amount of time, it is graded good regarding the re-replication time criterion.
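As a side note, heal-related timeouts of this kind are exposed as volume options in recent GlusterFS releases and could in principle be inspected or tuned as sketched below; the option name cluster.heal-timeout is an assumption and may differ per version:

    # list heal-related options currently in effect for volume gv0
    gluster volume get gv0 all | grep -i heal

    # assumed option: interval in seconds between self-heal daemon crawls
    gluster volume set gv0 cluster.heal-timeout 600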

XtreemFS allows different policies to be used for file replication. As explained in Section 2.3, the Write Quorum Read Quorum (WqRq) policy was used for this experiment. All of the policies specify a minimum number of servers that has to be online for a file-write operation to succeed; for the WqRq policy, the majority of the servers has to be online before the file system allows files to be stored. When a single server out of the three available servers is taken offline, the majority, namely two servers, is still online. XtreemFS accepts the loss of a single server and writes the files to only two servers. No effort is made to synchronize the files to another node when it comes back online, since the policy only guarantees that the files are stored on a majority of the servers, which is already the case. This behavior earns XtreemFS an N/A grade for the re-replication time criterion.

6. CONCLUSIONS AND FUTURE WORK
Three architecturally different file systems have been compared on network-related fault-tolerance in this research. HDFS, GlusterFS and XtreemFS each proved to be capable of handling networking errors to some degree. A total of six criteria, each associated with a phase in a network transaction, have been identified and used for the comparison.

HDFS is overall very capable of delivering fault-tolerance for file storage. However, the architecture of HDFS has a single point of failure, the NameNode, and because of this it is uncertain what will happen if a random node fails. If the NameNode is put out of service, all files become unavailable and the file system halts until the NameNode is back online.

GlusterFS is a robust file system and the only file system to receive a good grade on all six criteria. Fault-tolerant data storage is one of the key aspects of the GlusterFS file system, and it proved to be reliable for resilient data storage. The absence of a metadata server allows a random node to fail without disturbing applications which depend on the contents served by GlusterFS.

The XtreemFS file system architecture is not quite as resilient as the two other tested file systems. The XtreemFS architecture does not allow replication to occur automatically once a node is reconnected after a failure: XtreemFS establishes whether there are enough storage nodes online at the time the file-write operation is performed, but it is not concerned with replicating data to match the desired replication count when other storage servers become available again after reconnection.

Table 2 summarizes the grades given for each file system based on the selected criteria. A bad grade is given when the file system does not satisfy the criterion at all. A fair grade is given if the file system satisfies the criterion in most situations. A good grade is only given when the criterion is satisfied in the worst possible scenario. An N/A grade is given if the criterion cannot be applied.

Table 2. Comparison of fault-tolerant behavior
                                            HDFS    GlusterFS   XtreemFS
1. Random node network failure              Bad     Good        Good
2. Storage node network failure             Good    Good        Good
3. Multiple storage node network failure    Good    Good        Fair
4. Write interruption                       Good    Good        Good
5. Read interruption                        Good    Good        Good
6. Re-replication time                      Good    Good        N/A

For future work, other types of fault-tolerance, such as data corruption, could be evaluated and used as a basis for file system comparison. The role of the file system clients could also be investigated, evaluating how the different file systems handle an error originating from the client machine. The experiments performed in this research were also limited by time and resource constraints; running the experiments with larger sets of data could provide more insight into the cause of the fluctuations in the synchronization times measured when evaluating criterion 6. Further research could also be directed at finding out why HDFS shows such large time differences compared to GlusterFS on the re-replication time metric.

7. REFERENCES
[1] J. Stender, B. Kolbeck, M. Berlin, M. Noack, F. Langner, F. Hupfeld, and J. Gonzales, The XtreemFS Installation and User Guide.
[2] D. Borthakur, "HDFS Architecture Guide," visited on 01-05-2013.
[3] M. C. Chan, J. R. Jiang, and S. T. Huang, "Fault-tolerant and secure networked storage," 7th International Conference on Digital Information Management (ICDIM).
[4] J. Evans, Fault Tolerance in Hadoop for Work Migration.
[5] FUSE, "Filesystem in Userspace," visited on 01-05-2013.
[6] Gluster, "About GlusterFS," visited on 04-05-2013.
[7] Gluster, "Introduction to Gluster," visited on 04-04-2013.
[8] Gluster, Gluster File System Administration Guide, gluster.org.
[9] F. Hupfeld, T. Cortes, B. Kolbeck et al., "The XtreemFS architecture - A case for object-based file systems in Grids," Concurrency and Computation: Practice and Experience, vol. 20, no. 17.
[10] J. Stender, B. Kolbeck, F. Hupfeld, E. Cesario, M. Hess, J. Malo et al., "Striping without Sacrifices: Maintaining POSIX Semantics in a Parallel File System," LASCO'08, First USENIX Workshop on Large-Scale Computing, no. 6.
[11] S. Mikami, K. Ohta, and O. Tatebe, "Using the Gfarm file system as a POSIX compatible storage platform for Hadoop MapReduce applications."
[12] A. Oriani and I. C. Garcia, "From backup to hot standby: High availability for HDFS."
[13] S. Pardi, A. Fella, F. Bianchi et al., "Testing and evaluating storage technology to build a distributed Tier1 for SuperB in Italy," Journal of Physics: Conference Series, vol. 396, part 4.
[14] H. Riahi, G. Donvito, L. Fanò et al., "Using Hadoop file system and MapReduce in a small/medium grid site," Journal of Physics: Conference Series, vol. 396, part 4.
[15] F. Wang, J. Qiu, J. Yang et al., "Hadoop high availability through metadata replication."

APPENDIX

A. FILE SYSTEM CONFIGURATIONS

A.1 HDFS configuration
For the Hadoop Distributed File System setup, HDFS was installed by downloading the stable release from the official repository and installing it with the `dpkg -i` Linux command on all four virtual machines. HDFS comes with a predefined set of scripts, readily executable by the root system user. The main configuration script delivered with the installation files is `hadoop-setup-conf.sh`. It was executed on all four servers, designating the first server for both the namenode and jobtracker services (master node). The other three servers (slave nodes) are listed by their IP address in the conf/slaves file of the Hadoop installation. A passphraseless SSH setup is configured using ssh-keygen, allowing the master node to connect to all slaves via SSH without entering a password. The master node is then started using the predefined `start-all.sh` command, which also automatically starts the DataNode daemon on all slaves via a secure shell connection.

For each test run, the master node uploads half of the file set and then disconnects a single slave node by executing `iptables -A INPUT -j DROP && iptables -A OUTPUT -j DROP` on the selected slave node. The connection is restored via the `iptables -F` command after the second half of the file set has been uploaded. The log file of the disconnected DataNode is then polled every second and the occurrences of the 'Received block' line are counted. The time difference between the moment of reconnection and the moment of sync completion is stored when the count matches the total number of blocks for the current run.

A.2 GlusterFS configuration
The GlusterFS setup required manually downloading the glusterfs-common, glusterfs-client and glusterfs-server amd64 .deb packages from the main GlusterFS repository. One virtual machine was designated to run the client software by installing the common and client packages via the `dpkg -i` command. The other three virtual machines were designated as GlusterFS slaves by installing the common, client and server packages. Installation of these packages automatically triggered the glusterd service to start on all servers. A Replicated Volume was created on the main node by executing the following command:

`gluster volume create gv0 replica 3 s1:/export/sdb1/brick s2:/export/sdb1/brick s3:/export/sdb1/brick`

In this command s1-s3 represent the IP addresses of the GlusterFS slaves, and /export/sdb1/brick is an XFS-formatted virtual drive, mounted on each GlusterFS slave node, that is used to store the actual files. The resulting GlusterFS volume gv0 was then mounted on the node running the client software by executing `mkdir /gfs && mount -t glusterfs localhost:/gv0 /gfs`. The contents of the Replicated Volume are then available on the client node at the /gfs mount location.

For each test run, the client node uploads half of the file set and then disconnects a single slave node by executing `iptables -A INPUT -s slave -j DROP && iptables -A OUTPUT -s slave -j DROP` on the client node itself, substituting "slave" with the IP address of the slave node that is about to be disconnected. The connection is restored by executing the `iptables -F` command after the second half of the file set has been uploaded. The GlusterFS heal information, available via the `gluster volume heal gv0 info` command, is then polled every second, and it is measured how long it takes for the heal information to show that all files have been replicated exactly three times.
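A measurement loop of the kind described above could look roughly as follows; the completion check (grepping for a non-zero "Number of entries" count in the heal output) is an assumption, since the exact output format of the heal command differs between GlusterFS versions:

    # record the moment the slave node is reconnected
    start=$(date +%s)

    # poll the self-heal status once per second until no entries remain to be healed
    while gluster volume heal gv0 info | grep -q "Number of entries: [1-9]"; do
        sleep 1
    done

    # elapsed time in seconds until re-replication completed
    echo "re-replication took $(( $(date +%s) - start )) seconds"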
A.3 XtreemFS configuration
XtreemFS 1.4 was installed via the `apt-get install` command (packages xtreemfs-client and xtreemfs-server) after adding the official Debian repository, which can be found on the xtreemfs.org website. Of the four available servers in the test setup, the first server was considered the master node and both the xtreemfs-client and xtreemfs-server packages were installed on it; on the remaining three servers only the xtreemfs-server package was set up. Installation of the XtreemFS packages delivered three types of services to the Linux installation, namely the xtreemfs-dir, xtreemfs-mrc and xtreemfs-osd services. The Directory service and Metadata service were started on the master node; the Storage service was started on the three slave nodes. For each of the slave nodes, the dir_service.host directive in osdconfig.properties was changed to match the IP address of the master node so that a connection could be made. Finally, the file system is mounted on the master node at /xtreemfs by executing the following command:

`mkfs.xtreemfs master-node-ip/volume && mkdir /xtreemfs && mount.xtreemfs master-node-ip/volume /xtreemfs`

The desired replication count is set to three replicas via the xtfsutil tool, which resides in the xtreemfs-tools package:

`xtfsutil --set-drp --replication-policy WqRq --replication-factor 3 /xtreemfs`

For each test run, the master node starts by uploading half of the file set and then disconnects a single slave node via SSH. Disconnecting is done by blocking all incoming and outgoing traffic via iptables, similar to the method described for HDFS. The second half of the files is then written to the file system and the disconnected node is reconnected. The master node polls /var/logs/xtreemfs/osd.log on the slave node every second and stores the timestamp at which the newly written files are synchronized.


More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 [email protected], 2 [email protected],

More information

INUVIKA TECHNICAL GUIDE

INUVIKA TECHNICAL GUIDE --------------------------------------------------------------------------------------------------- INUVIKA TECHNICAL GUIDE FILE SERVER HIGH AVAILABILITY OVD Enterprise External Document Version 1.0 Published

More information

Google File System. Web and scalability

Google File System. Web and scalability Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might

More information

The Google File System

The Google File System The Google File System By Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Presented at SOSP 2003) Introduction Google search engine. Applications process lots of data. Need good file system. Solution:

More information

Introduction to Gluster. Versions 3.0.x

Introduction to Gluster. Versions 3.0.x Introduction to Gluster Versions 3.0.x Table of Contents Table of Contents... 2 Overview... 3 Gluster File System... 3 Gluster Storage Platform... 3 No metadata with the Elastic Hash Algorithm... 4 A Gluster

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc [email protected]

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc [email protected] What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...

More information

<Insert Picture Here> Big Data

<Insert Picture Here> Big Data Big Data Kevin Kalmbach Principal Sales Consultant, Public Sector Engineered Systems Program Agenda What is Big Data and why it is important? What is your Big

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion

More information

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra

More information

BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011

BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011 BookKeeper Flavio Junqueira Yahoo! Research, Barcelona Hadoop in China 2011 What s BookKeeper? Shared storage for writing fast sequences of byte arrays Data is replicated Writes are striped Many processes

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Highly Available Hadoop Name Node Architecture-Using Replicas of Name Node with Time Synchronization among Replicas

Highly Available Hadoop Name Node Architecture-Using Replicas of Name Node with Time Synchronization among Replicas IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 3, Ver. II (May-Jun. 2014), PP 58-62 Highly Available Hadoop Name Node Architecture-Using Replicas

More information

XtreemFS - a distributed and replicated cloud file system

XtreemFS - a distributed and replicated cloud file system XtreemFS - a distributed and replicated cloud file system Michael Berlin Zuse Institute Berlin DESY Computing Seminar, 16.05.2011 Who we are Zuse Institute Berlin operates the HLRN supercomputer (#63+64)

More information

Design and Evolution of the Apache Hadoop File System(HDFS)

Design and Evolution of the Apache Hadoop File System(HDFS) Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

POSIX and Object Distributed Storage Systems

POSIX and Object Distributed Storage Systems 1 POSIX and Object Distributed Storage Systems Performance Comparison Studies With Real-Life Scenarios in an Experimental Data Taking Context Leveraging OpenStack Swift & Ceph by Michael Poat, Dr. Jerome

More information

Fault Tolerance Techniques in Big Data Tools: A Survey

Fault Tolerance Techniques in Big Data Tools: A Survey on 21 st & 22 nd April 2014, Organized by Fault Tolerance Techniques in Big Data Tools: A Survey Manjula Dyavanur 1, Kavita Kori 2 Asst. Professor, Dept. of CSE, SKSVMACET, Laxmeshwar-582116, India 1,2

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Hadoop Distributed File System. Dhruba Borthakur June, 2007

Hadoop Distributed File System. Dhruba Borthakur June, 2007 Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle

More information

HDFS Architecture Guide

HDFS Architecture Guide by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Snapshots in Hadoop Distributed File System

Snapshots in Hadoop Distributed File System Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any

More information

PARALLELS CLOUD STORAGE

PARALLELS CLOUD STORAGE PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...

More information

Optimize the execution of local physics analysis workflows using Hadoop

Optimize the execution of local physics analysis workflows using Hadoop Optimize the execution of local physics analysis workflows using Hadoop INFN CCR - GARR Workshop 14-17 May Napoli Hassen Riahi Giacinto Donvito Livio Fano Massimiliano Fasi Andrea Valentini INFN-PERUGIA

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: [email protected] & [email protected] Abstract : In the information industry,

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

High Availability Solutions for the MariaDB and MySQL Database

High Availability Solutions for the MariaDB and MySQL Database High Availability Solutions for the MariaDB and MySQL Database 1 Introduction This paper introduces recommendations and some of the solutions used to create an availability or high availability environment

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

www.basho.com Technical Overview Simple, Scalable, Object Storage Software

www.basho.com Technical Overview Simple, Scalable, Object Storage Software www.basho.com Technical Overview Simple, Scalable, Object Storage Software Table of Contents Table of Contents... 1 Introduction & Overview... 1 Architecture... 2 How it Works... 2 APIs and Interfaces...

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000 Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000 Alexandra Carpen-Amarie Diana Moise Bogdan Nicolae KerData Team, INRIA Outline

More information

Suresh Lakavath csir urdip Pune, India [email protected].

Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com. A Big Data Hadoop Architecture for Online Analysis. Suresh Lakavath csir urdip Pune, India [email protected]. Ramlal Naik L Acme Tele Power LTD Haryana, India [email protected]. Abstract Big Data

More information

Michael Thomas, Dorian Kcira California Institute of Technology. CMS Offline & Computing Week

Michael Thomas, Dorian Kcira California Institute of Technology. CMS Offline & Computing Week Michael Thomas, Dorian Kcira California Institute of Technology CMS Offline & Computing Week San Diego, April 20-24 th 2009 Map-Reduce plus the HDFS filesystem implemented in java Map-Reduce is a highly

More information

Real-time Protection for Hyper-V

Real-time Protection for Hyper-V 1-888-674-9495 www.doubletake.com Real-time Protection for Hyper-V Real-Time Protection for Hyper-V Computer virtualization has come a long way in a very short time, triggered primarily by the rapid rate

More information

Distributed Filesystems

Distributed Filesystems Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science [email protected] April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls

More information

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and

More information

ovirt and Gluster hyper-converged! HA solution for maximum resource utilization

ovirt and Gluster hyper-converged! HA solution for maximum resource utilization ovirt and Gluster hyper-converged! HA solution for maximum resource utilization 31 st of Jan 2016 Martin Sivák Senior Software Engineer Red Hat Czech FOSDEM, Jan 2016 1 Agenda (Storage) architecture of

More information

Sujee Maniyam, ElephantScale

Sujee Maniyam, ElephantScale Hadoop PRESENTATION 2 : New TITLE and GOES Noteworthy HERE Sujee Maniyam, ElephantScale SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member

More information

HADOOP MOCK TEST HADOOP MOCK TEST I

HADOOP MOCK TEST HADOOP MOCK TEST I http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

TP1: Getting Started with Hadoop

TP1: Getting Started with Hadoop TP1: Getting Started with Hadoop Alexandru Costan MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development of web

More information

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

Hadoop Distributed File System. Jordan Prosch, Matt Kipps Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?

More information

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Scheduler w i t h Deadline Constraint Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

More information

Hadoop Parallel Data Processing

Hadoop Parallel Data Processing MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

IMPLEMENTING PREDICTIVE ANALYTICS USING HADOOP FOR DOCUMENT CLASSIFICATION ON CRM SYSTEM

IMPLEMENTING PREDICTIVE ANALYTICS USING HADOOP FOR DOCUMENT CLASSIFICATION ON CRM SYSTEM IMPLEMENTING PREDICTIVE ANALYTICS USING HADOOP FOR DOCUMENT CLASSIFICATION ON CRM SYSTEM Sugandha Agarwal 1, Pragya Jain 2 1,2 Department of Computer Science & Engineering ASET, Amity University, Noida,

More information

Hadoop Basics with InfoSphere BigInsights

Hadoop Basics with InfoSphere BigInsights An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Unit 4: Hadoop Administration An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government Users Restricted

More information

ovirt and Gluster hyper-converged! HA solution for maximum resource utilization

ovirt and Gluster hyper-converged! HA solution for maximum resource utilization ovirt and Gluster hyper-converged! HA solution for maximum resource utilization 21 st of Aug 2015 Martin Sivák Senior Software Engineer Red Hat Czech KVM Forum Seattle, Aug 2015 1 Agenda (Storage) architecture

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

HDFS Under the Hood. Sanjay Radia. [email protected] Grid Computing, Hadoop Yahoo Inc.

HDFS Under the Hood. Sanjay Radia. Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc. HDFS Under the Hood Sanjay Radia [email protected] Grid Computing, Hadoop Yahoo Inc. 1 Outline Overview of Hadoop, an open source project Design of HDFS On going work 2 Hadoop Hadoop provides a framework

More information

Mirror File System for Cloud Computing

Mirror File System for Cloud Computing Mirror File System for Cloud Computing Twin Peaks Software Abstract The idea of the Mirror File System (MFS) is simple. When a user creates or updates a file, MFS creates or updates it in real time on

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Deploying Hadoop with Manager

Deploying Hadoop with Manager Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer [email protected] Alejandro Bonilla / Sales Engineer [email protected] 2 Hadoop Core Components 3 Typical Hadoop Distribution

More information

HDFS Space Consolidation

HDFS Space Consolidation HDFS Space Consolidation Aastha Mehta*,1,2, Deepti Banka*,1,2, Kartheek Muthyala*,1,2, Priya Sehgal 1, Ajay Bakre 1 *Student Authors 1 Advanced Technology Group, NetApp Inc., Bangalore, India 2 Birla Institute

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

High Availability Guide for Distributed Systems

High Availability Guide for Distributed Systems Tivoli IBM Tivoli Monitoring Version 6.2.2 Fix Pack 2 (Revised May 2010) High Availability Guide for Distributed Systems SC23-9768-01 Tivoli IBM Tivoli Monitoring Version 6.2.2 Fix Pack 2 (Revised May

More information