Distributed Storage Evaluation on a Three-Wide Inter-Data Center Deployment

Yih-Farn Chen, Scott Daniels, Marios Hadjieleftheriou, Pingkai Liu, Chao Tian, Vinay Vaishampayan
AT&T Labs-Research, Shannon Laboratory, 180 Park Ave., Florham Park, NJ 07932
Email: {chen,daniels,marioh,pingkai,tian,vinay}@research.att.com

Abstract—The demand for cloud storage is exploding as an ever-increasing number of enterprises and consumers store and process their data in the cloud. Hence, distributed object storage solutions (e.g., Tahoe-LAFS, Riak, Swift, HDFS) are becoming critical components of any cloud infrastructure. These systems offer good reliability by distributing redundant information across a large number of commodity servers, easily achieving up to 10 nines of reliability. One drawback of these systems is that they are usually designed for deployment within a single data center, where node-to-node latencies are small. Geo-replication (i.e., distributing redundant information across data centers) for most open-source storage systems is, to the best of our knowledge, accomplished by asynchronously mirroring a given deployment. Given that geo-replication is critical for ensuring very high degrees of reliability (e.g., for achieving 16 nines), in this work we evaluate how these storage systems perform when they are deployed directly in a WAN setting. To this end, three popular distributed object stores, namely Quantcast-QFS, Swift and Tahoe-LAFS, are considered and tested in a three-wide data center environment, and our findings are reported.

I. INTRODUCTION

Modern distributed object storage solutions, like HDFS [1], Swift [2], Quantcast-QFS [3], Tahoe-LAFS [4], Riak [5], Azure [6], Colossus [7], and Amazon Dynamo [8], offer very good availability and reliability at a low price point by distributing data across a very large set of inexpensive commodity servers built from unreliable components.
Despite their success, most of these systems, to the best of our knowledge, have been designed to distribute data redundantly across large clusters of servers (either by using replication or erasure coding) within a single data center. Geo-replication, i.e., distribution of redundant information across data centers, is typically handled by asynchronously mirroring a given deployment. There are several factors that limit the performance of distributed object store deployments in a WAN setting (notably TCP slow start and shared bandwidth), as will become clear after an in-depth description of these systems in Section II. In this work, we evaluate three distributed object stores, namely Quantcast-QFS, Swift and Tahoe-LAFS, on a three-wide data center environment, in order to assess the impact of WAN latencies on these systems. We focus on the read/write performance of these systems, and ignore other features such as repair, ease of use, maintainability, recovery performance, and compatibility with other system components. These considerations, though important, are more subjective and application dependent. Moreover, our eventual goal is to understand the weaknesses of these distributed storage systems in a WAN setting, and thus we make the conscious choice to focus on these restricted but most fundamental issues.

Fig. 1. Multi-site Data Center Deployment with Network Bandwidths (sites in IL, NJ, and GA).

II. QUANTCAST-QFS, TAHOE-LAFS AND SWIFT

In this section, we provide some background information on the three open-source storage systems that we chose to evaluate, and briefly describe the characteristics that are relevant to us.

A. Quantcast-QFS

Quantcast-QFS is a high-performance, fault-tolerant, distributed file system developed by Quantcast Corporation, implemented in C++, that underlies its MapReduce [9] infrastructure.¹
File storage in QFS uses the (9, 6) Reed-Solomon (RS) code [10], or simple replication. The system consists of a large number of storage servers (or chunk servers in QFS terminology) and a single meta-data repository residing on a dedicated meta-server. Each chunk server is responsible for storing erasure-coded chunks of files. The meta-server is responsible for balancing the placement of chunks across the set of chunk servers, maintaining the chunk placement information, and issuing repair requests. When a client initiates a write request, it first retrieves the relevant file chunk placement information from the meta-server. Then, the file is divided into equal-sized stripes, and each stripe is erasure coded and written out to the nine chunk servers returned by the meta-server; the process repeats with the next set of nine chunks, and potentially a new set of nine chunk servers.

¹ Note that Quantcast-QFS is not related to SAN-QFS, developed by Sun Microsystems. In what follows we refer to Quantcast-QFS as QFS.
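As a back-of-the-envelope illustration of the trade-off an (n, k) layout such as QFS's (9, 6) RS code offers over plain replication, the sketch below compares bytes stored and chunk-server losses tolerated (a hypothetical helper for illustration only, not part of QFS):

```python
# Compare storage overhead and failure tolerance of an (n, k)
# Reed-Solomon layout against plain m-way replication.

def rs_profile(n, k, object_mb):
    """(n, k) MDS code: k data chunks plus n - k parity chunks."""
    stored = object_mb * n / k      # total bytes actually written
    tolerated = n - k               # chunk-server losses survivable
    return stored, tolerated

def replication_profile(m, object_mb):
    """Plain m-way replication of the whole object."""
    stored = object_mb * m
    tolerated = m - 1
    return stored, tolerated

if __name__ == "__main__":
    obj = 600  # MB, arbitrary example object size
    rs_stored, rs_tol = rs_profile(9, 6, obj)
    rep_stored, rep_tol = replication_profile(3, obj)
    print(f"(9,6) RS : {rs_stored:.0f} MB stored, tolerates {rs_tol} losses")
    print(f"3x repl. : {rep_stored:.0f} MB stored, tolerates {rep_tol} losses")
    # (9,6) RS stores 1.5x the object size yet survives 3 losses,
    # while 3x replication stores 3x and survives only 2.
```

This 1.5x-versus-3x overhead gap is the main reason erasure coding is attractive at scale, at the cost of the reconstruction work described next.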

To read a file, QFS retrieves any six out of the nine chunks (for every chunk group associated with that file) and reconstructs all stripes therein. The meta-server in QFS is a single-point-of-failure (SPOF); however, there exists a checkpointing mechanism such that the whole system can recover fairly quickly, without losing files that had been successfully stored before the last checkpoint. In addition, to achieve high reliability, QFS allows certain placement policies to be specified such that file chunks are placed onto different failure domains (e.g., across nodes, racks, and zones). Finally, given that chunk servers issue frequent keep-alive heartbeats to the meta-server, the meta-server has a very accurate view of the status of the cluster, and is hence responsible for issuing repair requests for lost chunks.

B. Tahoe-LAFS

Tahoe-LAFS (Least-Authority File System) is an open-source, distributed file system, implemented in Python, that can tolerate multiple data server failures or attacks, while preserving the privacy and security of the data. The underlying idea is that users can store data on the Tahoe-LAFS cluster in an encrypted form, using standard cryptographic techniques. Clients are responsible for maintaining the necessary cryptographic keys (embedded in read/write/verify "capability strings") needed to access the data, and without those keys, no entity is able to learn any information about the data, including its placement across the cluster. In addition, data is erasure coded for increased reliability, and users can choose the erasure coding parameters on a per-file basis. A Tahoe-LAFS cluster consists of a set of storage peers and a single coordinator node, called the Introducer, whose primary purpose is to announce the addition of new storage peers to the existing pool of peers following a publish/subscribe paradigm, and to relay relevant node information to clients upon read/write requests.
Tahoe-LAFS exposes a file system interface to clients, similar to QFS, but in Tahoe-LAFS the file system meta-data is erasure coded and distributed within Tahoe-LAFS itself. The client holds a key for retrieving a client-specific meta-data object that associates directories/files with distinct read/write keys. When a client creates a new file, a unique public/private key pair is generated for that file, and the meta-data object is retrieved, updated and replaced. Then, the file is encrypted, erasure coded, and distributed across storage peers. Alternatively, the system can also be used as a simple object storage service, where users are responsible for managing file keys directly. The placement of the erasure coded shares is decided by a server selection algorithm that hashes the private key of the file in order to generate a distinct server permutation. Then, servers without enough storage space are removed from the permutation, and the rest of the servers are contacted in sequence and asked to hold one share of the file. When enough peers are found, the write proceeds. Reading a file involves asking all known servers in the cluster to list the number of shares of that file that they hold, and, in a subsequent round, choosing which shares to request, based on various heuristics that take into account latency and peer load. The Introducer in Tahoe-LAFS is a single point of failure for new clients or new storage peers, since they need to rely on it to join the storage network. However, it is not an SPOF in the traditional sense: in Tahoe-LAFS the file placement information is decided by hashing the keys held by the clients, and the Introducer does not have to maintain any file (or chunk) placement information; therefore, losing the Introducer does not jeopardize the normal operation of existing clients and storage peers.
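The server selection algorithm just described can be sketched as follows. This is a simplified illustration only: the function names and the use of SHA-256 are our assumptions, not Tahoe-LAFS's actual hashing scheme.

```python
import hashlib

def server_permutation(file_key: bytes, servers):
    """Deterministically order servers for one file by hashing the
    file key together with each server id (simplified sketch)."""
    def rank(server_id: str) -> bytes:
        return hashlib.sha256(file_key + server_id.encode()).digest()
    return sorted(servers, key=rank)

def place_shares(file_key, servers, n_shares, has_space):
    """Walk the file-specific permutation, skipping servers without
    free space, until n_shares share holders are found."""
    candidates = [s for s in server_permutation(file_key, servers)
                  if has_space(s)]
    if len(candidates) < n_shares:
        raise RuntimeError("not enough servers with free space")
    return candidates[:n_shares]

if __name__ == "__main__":
    peers = [f"peer{i}" for i in range(10)]
    holders = place_shares(b"file-private-key", peers, 3,
                           has_space=lambda s: True)
    print(holders)  # the same file key yields the same peers every time
```

Because the ordering depends only on the file key and the server ids, any client holding the key can recompute the likely share holders without consulting a central placement table, which is exactly why losing the Introducer is not fatal.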
The downside of this introducer-storage-nodes architecture is that, due to the lack of a centralized meta-server, Tahoe-LAFS is not able to provide complete statistics and details for file storage and placement, though from a security point of view this is a choice by design. Moreover, only a lazy repair policy can be implemented; in other words, the clients are responsible for frequently iterating through their file space and verifying that all files are complete. However, when a storage peer fails, the system is not able to initiate an efficient repair mechanism specifically to replace the given peer.

C. Swift

As part of the OpenStack architecture, Swift was originally designed to provide a reliable, distributed storage facility for maintaining virtual machine images. Despite its roots, it can be used as a generic object store that uses replication for increased reliability. It is implemented in Python. A Swift cluster consists of a large number of storage servers, which are called object servers in Swift terminology. Each object server is only capable of storing, retrieving, and deleting objects. Objects are stored as raw binary files on the underlying file system, with meta-data stored in the file's extended attributes. Swift also uses the concept of proxy servers. Proxy servers are responsible for accepting read/write/delete requests from clients, and for coordinating object servers in order to accomplish those requests. The proxy servers also act as load-balancers, firewalls and a caching layer for the Swift cluster. Finally, the proxy servers are responsible for maintaining user accounts, and a list of all containers and objects within those containers. This meta-data is stored in individual SQLite databases that are themselves treated as objects; hence they are replicated and stored within the object servers. It should be noted that Swift does not maintain any object placement meta-data, unlike QFS and like Tahoe-LAFS.
Instead, Swift uses the concept of a ring (i.e., consistent hashing), borrowed from Amazon Dynamo [8]. A ring represents a mapping between the names of entities and their physical location on object servers. There are separate rings for accounts, containers, and objects. The ring conceptually divides the entity namespace into a pre-specified number of partitions, and each partition is assigned to multiple object servers (hence entities falling within a given partition are replicated multiple times; the replication factor is configurable). Each partition replica is guaranteed to reside in a different failure domain (or zone in Swift terminology), where the zones are statically defined during system configuration. Swift does not currently support erasure coding. In terms of repair, Swift takes a pro-active approach. Each object server periodically checks whether each partition the server is responsible for needs to be repaired. For each partition, the server polls all other object servers that should also be replicating this partition, and if any objects within the partition are missing (or if an object server has failed), rsync is run to repair the partition (or to replicate the partition to a temporary handoff server).

III. THE TEST ENVIRONMENT

For our tests, two physical configurations were used. The main configuration consists of three geographically distant data centers. The baseline configuration consists of a single data center. The three-site layout is depicted in Figure 1. The sites selected for the test were roughly arranged in a geographically equilateral triangle, with several hundred miles separating each. Site 1 is in Lisle, IL, site 2 is in Secaucus, NJ, and site 3 is in Alpharetta, GA. The connectivity between sites varies significantly from site to site, and we observe that network characteristics are not symmetric: maximum throughput between sites varied depending on direction. Furthermore, in one case throughput varied significantly from one measurement to the next, even when measurements were taken within very short time spans. While the network was far from ideal, and prevented us from determining the best performance that could be obtained from each storage system, direct comparison of the three systems is still possible and meaningful. For the three-site configuration, within each data center we used three hosts as storage servers and one host, referred to as the meta host in what follows, for supporting tasks (i.e., the QFS meta-server, the Swift proxy, and the Tahoe-LAFS introducer). The meta host in GA was also used to drive the tests. Each host has an Intel Xeon E5-2690 8-core processor (2.9 GHz), 4GiB of main memory, and nine 7200RPM SATA drives of 2TB each in a JBOD configuration running XFS. The operating system is Ubuntu 12.04 LTS with Linux kernel 3.2.0. For the single-site configuration we used a total of ten hosts, nine as storage servers and one meta host for everything else. All hosts within a data center were connected to the same switch using 1-gigabit Ethernet.
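As an aside before turning to the measurements, the Swift ring described in Section II can be sketched minimally as follows. The partition power, device names, and round-robin assignment below are invented for illustration; real Swift builds and persists a zone-aware ring file rather than computing assignments on the fly.

```python
import hashlib

PART_POWER = 8              # 2**8 = 256 partitions (example value)
NUM_PARTS = 2 ** PART_POWER

def partition_for(name: str) -> int:
    """Hash an entity name into one of NUM_PARTS partitions:
    consistent hashing over a fixed partition space."""
    digest = hashlib.md5(name.encode()).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - PART_POWER)

def build_ring(devices, replicas=3):
    """Statically assign each partition to `replicas` devices,
    round-robin; real Swift also enforces distinct zones."""
    ring = []
    for part in range(NUM_PARTS):
        ring.append([devices[(part + r) % len(devices)]
                     for r in range(replicas)])
    return ring

if __name__ == "__main__":
    ring = build_ring([f"objsrv{i}" for i in range(9)])
    part = partition_for("AUTH_acct/container/photo.jpg")
    print(part, ring[part])  # same name always maps to the same replicas
```

The key property is that lookups need no network traffic: given the (static) ring, any proxy can compute an object's replica set locally, which is why Swift avoids the consensus overhead discussed in Section V.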
All the tests were driven using Cosbench (Cloud Object Store Benchmarking) [11], an open-source tool developed by Intel. Cosbench implements a driver that simulates user-defined workload configurations. Users can choose the characteristics of the workload in terms of the size of files, the type and percentages of create/write/read/delete operations, and the number of concurrent threads of execution. We extended Cosbench by adding plug-ins to provide interfaces to QFS and Tahoe-LAFS; the interface to Swift is provided as part of the Cosbench package. The tests were organized such that data written to the object store was randomly generated, and data read from the object store was discarded; no disk I/O on the test host impacted the throughput measurements. Results presented in this paper with regard to performance are measured in MB/s (powers of 10) and are referred to by Cosbench as bandwidth. In Cosbench, bandwidth is not computed like the traditional throughput measurement (i.e., total bytes over elapsed time), but is a summation over the throughput of each individual thread of execution. This method of calculation can yield larger values than would be observed by the traditional computation, especially for tests involving files of variable sizes, because it does not capture the idle time for threads in-between job scheduling. Clearly, bandwidth is a good measure in practice, because it does not reflect any design choices related to the Cosbench job scheduler itself.

Fig. 2. QFS/Swift/Tahoe-LAFS Multi-Site Read Performance.
Fig. 3. QFS/Swift/Tahoe-LAFS Multi-Site Write Performance.

IV. RESULTS

Measurements were collected for each system either reading or writing fixed-size objects. Tests with concurrent reading and writing were not conducted. In our tests we used workloads consisting of 1MB and 1GB objects. For brevity, we present averages over all workloads. The tests were executed with a varying number of concurrent threads.
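The difference between Cosbench's bandwidth metric and the traditional throughput computation can be made concrete with a small sketch; the numbers and helper names below are invented purely to show the arithmetic, and do not reproduce Cosbench's implementation.

```python
# Each operation is a tuple (bytes_moved, start_time_s, end_time_s).

def traditional_throughput(ops):
    """Total bytes over total elapsed wall-clock time."""
    total_bytes = sum(b for b, _, _ in ops)
    elapsed = (max(end for _, _, end in ops)
               - min(start for _, start, _ in ops))
    return total_bytes / elapsed

def cosbench_bandwidth(ops_by_thread):
    """Sum of per-thread throughput over each thread's busy time only,
    so idle gaps between scheduled jobs are not counted."""
    total = 0.0
    for ops in ops_by_thread:
        busy = sum(end - start for _, start, end in ops)
        moved = sum(b for b, _, _ in ops)
        total += moved / busy
    return total

if __name__ == "__main__":
    # Two threads, each moving 100 MB in 1 s bursts with a 1 s idle gap.
    t1 = [(100e6, 0.0, 1.0), (100e6, 2.0, 3.0)]
    t2 = [(100e6, 0.0, 1.0), (100e6, 2.0, 3.0)]
    print(traditional_throughput(t1 + t2) / 1e6, "MB/s")  # ~133.3 MB/s
    print(cosbench_bandwidth([t1, t2]) / 1e6, "MB/s")     # 200.0 MB/s
```

Here the idle second between jobs deflates the traditional figure but not the per-thread sum, which is the inflation effect described above.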
In most cases the addition of threads, up to a point, increased overall bandwidth, but also degraded the throughput of each individual thread, as expected, due to contention for resources.

A. Multi-site Test Results

The results are shown in Figures 2 and 3. There are a few observations that should be noted. Writing larger objects (1 GB vs 1 MB; not shown in the figures) was observed to be slightly faster for all three storage systems for the same total volume of data written. This is likely due to the reduced number of interactions per MB with the meta/proxy server. The difference, though, was not significant. Read performance from multiple sites was slightly better than write performance for all three storage systems. This is expected in erasure coded systems, since reads in fact transfer less data than writes. Performance increased, up to a point, as more threads were allowed to access the objects.

Overall, we can see that QFS exhibits the best performance and scalability with respect to concurrent requests. Note that the y-axis is in log scale. Also notice that, as more threads are used, the read/write performance flattens to a point where the available bandwidth dictates the maximum throughput that can be obtained, as expected.

B. Single-Site Test Results

In order to understand the impact that the network imposes on a multi-site environment, we established a single site with nine storage hosts and one meta host. Two sets of tests were conducted: in the first, reads/writes are initiated from a node located in the same site, which is referred to as local read/write; in the second, they are initiated from a node located in a different geo-location, which is referred to as remote read/write. For the former, QFS, Tahoe-LAFS and Swift are all tested, while for the latter we focus on QFS. Figures 4 and 5 show the single-site test results, while Figures 6 and 7 show the remote read/write performance for QFS. A few observations are noted below. In terms of local read/write performance, QFS is the clear winner among the three systems; notice the logarithmic scale on the y-axis, once again. Accessing the single-site object stores from a local host results in nearly one order of magnitude increase in performance, as shown in Figures 6 and 7. Even taking into account the bandwidth limitation between data centers, this significant difference is remarkable. We suspect that this is due to the system optimization done in QFS for single-site deployment (further discussion is included in Section V). When accessing the single-site object store from a remote location, the performance drops slightly below the performance observed with the object store distributed across multiple sites.
This small difference might be caused by the fact that, when testing on the multi-site configuration, one set of storage hosts is local to the host running the Cosbench driver and, thus, one third of the data transfers are local. Notice that in any application scenario where we expect the majority of client requests to originate from remote locations, this implies that deploying these systems in a multi-site configuration is in fact preferable to the single-site configuration (both in terms of reliability and performance). Of course, this is not the case, for example, for MapReduce deployments.

Fig. 4. QFS/Swift/Tahoe-LAFS Single-Site Read Performance.
Fig. 5. QFS/Swift/Tahoe-LAFS Single-Site Write Performance.

V. DISCUSSION

There are certain limiting factors in terms of performance when trying to deploy distributed storage systems in a WAN setting, mainly due to the latency introduced by the physical connection, TCP slow start, and of course shared bandwidth. Systems that rely on a meta-server, like QFS (and, for example, HDFS), introduce large latencies when reading/writing objects across the WAN, because every read/write request for each chunk of the file incurs one round-trip delay to the meta-server for the client and each storage server involved (although storage servers will typically aggregate multiple chunk replies into a single message digest for the meta-server). If the meta-server is located in a data center remote from the client, we expect this architecture to add significant overhead to read/write requests. In addition, given that QFS splits files into 64MiB chunks, and chunks are uniformly distributed across all storage servers, we expect QFS to suffer significantly from TCP slow start.
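The slow-start penalty can be approximated with a toy model: the congestion window doubles each RTT from a small initial value until a cap is reached, so every fresh connection pays a ramp-up that is expensive over a high-latency link. All constants below (1460-byte segments, an initial window of 10 segments, a window cap, a 25 ms inter-site RTT, and a 64 MiB chunk-sized transfer) are illustrative assumptions; real TCP behavior is considerably more complex.

```python
def slow_start_rtts(transfer_bytes, mss=1460, init_cwnd=10, max_cwnd=4096):
    """Count RTTs needed to move transfer_bytes if the congestion
    window (in segments) doubles each RTT up to max_cwnd.
    Toy model: ignores losses, delayed ACKs, and pacing."""
    cwnd, sent, rtts = init_cwnd, 0, 0
    while sent < transfer_bytes:
        sent += cwnd * mss
        rtts += 1
        cwnd = min(cwnd * 2, max_cwnd)
    return rtts

if __name__ == "__main__":
    rtt_s = 0.025            # assumed 25 ms inter-site RTT
    chunk = 64 * 2**20       # one chunk-sized transfer (64 MiB assumed)
    n = slow_start_rtts(chunk)
    print(f"{n} RTTs, i.e. about {n * rtt_s * 1e3:.0f} ms of latency "
          f"before the transfer completes")
```

Because each chunk is spread across many servers, a workload of many small per-server transfers repeats this ramp-up, whereas a single long-lived connection would amortize it.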
Systems that use consistent hashing to determine object placement (e.g., Riak and Amazon Dynamo) need to use a distributed consensus protocol (e.g., Paxos [12]) in order to keep the state of the cluster up to date, which requires at least one round-trip delay per server for each cluster update. But read/write requests happen independently of the cluster management protocol; hence read/write requests go directly to the relevant storage servers, without any additional round-trip delays. In this respect, even though Swift uses consistent hashing to determine object placement, the ring is statically allocated during system configuration (and can change only manually). Hence, Swift does not have the overhead of running a distributed consensus protocol to keep the consistent hashing ring up to date. On the other hand, Swift does have to maintain meta-data information (accounts and containers) as native objects within the object store itself, hence incurring at least one round-trip delay for every update of each replica of the meta-data object. In Swift, though, this latency can be hidden completely by the caching layer at the proxy servers. We also observe that under-provisioning the proxy setup in Swift can have a detrimental effect on scalability. Swift does not scale well as the number of concurrent threads increases, resulting in a large number of dropped operations (not shown in our figures, since we plot averages across all workloads for successful operations only). This is because all data transfers have to go through the proxy server, which eventually becomes oversubscribed and starts to shed requests. Clearly, this is an indication that the proxy server is not designed to scale gracefully as the number of clients increases: ostensibly, for all storage systems all data eventually has to go through the sole Cosbench driver running on the meta host, so if the driver were the bottleneck we would expect Cosbench to become oversubscribed and we should observe the same behavior for QFS and Tahoe-LAFS, which is not the case. Nevertheless, a more robust Swift configuration would include several proxy servers to load-balance requests; this is something that we did not test in our configuration, and plan to do as future work, since it would require a multi-driver configuration of Cosbench.

Fig. 6. QFS Single-Site Read Performance (Multi-site vs. Single-loc vs. Single-rmt).
Fig. 7. QFS Single-Site Write Performance (Multi-site vs. Single-loc vs. Single-rmt).

Tahoe-LAFS is similar to Swift in that the meta-data objects are stored within Tahoe-LAFS itself, necessitating at least one round-trip delay for each erasure-coded share of the meta-data object for every write/delete request, and an additional round-trip to all relevant servers to execute the request. On the other hand, reads are accomplished by submitting requests to all known storage peers (given by the introducer) simultaneously; hence the relevant peers are found with one round-trip to every server. In Tahoe-LAFS currently, a second round-trip is incurred after choosing the peers to read the file from (the intention here is to be able to select which peers to read from, after the initial negotiation phase, based on various heuristics). The poor performance of Tahoe-LAFS, in both the multi-site and the single-site environment, can be attributed to several factors, already pointed out by the developers themselves.
First, the default stripe size is optimized for reading/writing small objects (the stripe size determines the granularity at which data is encrypted and erasure coded). Second, there are several implementation issues, such as inefficient message passing and the expensive and frequent hash read/write seeks that are needed in order to reconstruct shares on the storage peers. Third, Tahoe-LAFS has to deal with the overhead of reading/writing the file system meta-data objects (i.e., the mutable directory objects) every time an object is accessed. Fourth, when creating new objects, Tahoe-LAFS has to generate a new public/private key pair, which is an expensive operation. Surprisingly, reads exhibit the same performance as writes, even though reads ideally have to transfer less data than writes. This is probably because both reads and writes of shares happen simultaneously across all relevant storage peers, so the extra data transfers are hidden by parallelism. Moreover, this is an indication that pinging all available peers and requesting shares within two round-trips, as well as the fact that every read request has to read the mutable directory object, dominate the overall cost.

VI. CONCLUSION

We conducted extensive experiments with QFS, Swift and Tahoe-LAFS, three very popular distributed storage systems. Our focus was to deploy these systems in a multi-site environment and measure the impact of WAN characteristics on these systems. In addition, as a baseline, we also measured performance on a single-site configuration. Overall, we observe that WAN characteristics have an even larger than expected impact on the performance of these systems, mainly due to several of their design choices. Ideally, across the WAN, we would like to reduce the number of round-trips to a minimum, something that is not particularly important on a LAN.
In addition, we notice that good system design and extensive optimization can have a significant effect on performance, as seen in the relative differences between QFS, Swift and Tahoe-LAFS. It is important to point out here that QFS is implemented in C++, while Swift and Tahoe-LAFS are implemented in Python. In addition, QFS is heavily optimized for MapReduce-style processing. For future work we are planning to also test Riak and HDFS, as well as our own proprietary solution that is designed, from the ground up, for WAN deployment.

REFERENCES

[1] D. Borthakur, "The Hadoop distributed file system: Architecture and design," http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf, 2007.
[2] Swift, http://docs.openstack.org/developer/swift.
[3] QFS, https://www.quantcast.com/engineering/qfs.
[4] Tahoe-LAFS, https://tahoe-lafs.org/trac/tahoe-lafs.
[5] Riak, http://basho.com/riak.
[6] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci et al., "Windows Azure Storage: A highly available cloud storage service with strong consistency," in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles. ACM, 2011, pp. 143-157.
[7] A. Fikes, "Storage architecture and challenges," Google Faculty Summit, 2010.
[8] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in SOSP, vol. 7, 2007, pp. 205-220.
[9] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[10] S. B. Wicker and V. K. Bhargava, Reed-Solomon Codes and Their Applications. Wiley-IEEE Press, 1999.
[11] J. Duan, "COSBench: A benchmark tool for cloud object storage services," OpenStack Summit Fall 2012, 2012.
[12] L. Lamport, "The part-time parliament," ACM Transactions on Computer Systems (TOCS), vol. 16, no. 2, pp. 133-169, 1998.