I/O intensive applications: what are the main differences in the design of the HPC filesystems vs the MapReduce ones?

Matthieu Dorier, Radu Marius Tudoran
Master 2 Research, ENS Cachan - Brittany extension
December 16, 2010
1 Introduction

In this report, a comparison is drawn between HPC filesystems and those specific to MapReduce (MR) applications. The starting point of the report is the four proposed papers [13, 6, 9, 11]. The first article [13] analyzes the possibility of using a parallel filesystem for MR storage instead of the standard filesystem used with Hadoop [3]. This adaptation is done by implementing a shim layer, and shows performance comparable to HDFS. The second article [9] compares the performance of PVFS, a parallel filesystem, and HDFS when two different workloads, each usually specific to one of the filesystems, are run concurrently. A comparison between the MR programming model and parallel databases is made in the third article [11], which questions the growing interest in MR applications, while the last paper [6] describes a successful installation of Hadoop. The problematics raised in these papers are discussed below, comparing them with other publications, with the goal of highlighting the specificities of each type of filesystem.

2 Enriched API in DFS

MapReduce applications require processing large files in parallel in a write-once-read-many scheme. Thus distributed filesystems (DFS) such as HDFS are designed to run on computation nodes, unlike parallel filesystems that usually run on dedicated nodes, and are part of the application. This deployment scheme lets a DFS expose data localization. In Hadoop, Yahoo!'s implementation of MapReduce, a Java interface called FileSystem and a set of abstract classes handling streams let users implement their own filesystem or select the one they want through configuration files. Thus, several DFSs have been adapted to Hadoop, such as CloudStore (previously KosmosFS) or Amazon S3.
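As an illustration of this pluggable design, the following Python sketch mimics the role of Hadoop's abstract FileSystem class (the real API is in Java; the names `InMemoryFS`, `get_filesystem` and the `fs.scheme` configuration key are invented for this sketch, not Hadoop's actual identifiers):

```python
# Illustrative sketch of a pluggable filesystem abstraction, in the spirit of
# Hadoop's FileSystem class. Names below are hypothetical, not Hadoop's API.
from abc import ABC, abstractmethod

class FileSystem(ABC):
    """Minimal analogue of Hadoop's abstract FileSystem class."""
    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def block_locations(self, path: str) -> list[list[str]]:
        """Hosts holding each chunk of the file: the data-localization
        information that a DFS such as HDFS exposes to the scheduler."""

class InMemoryFS(FileSystem):
    """Toy concrete implementation, standing in for HDFS, S3, etc."""
    def __init__(self):
        self.files = {}       # path -> file contents
        self.placement = {}   # path -> per-chunk host lists

    def read(self, path):
        return self.files[path]

    def block_locations(self, path):
        return self.placement.get(path, [])

# The concrete class is chosen from configuration, mirroring how Hadoop
# selects a filesystem implementation through its configuration files.
REGISTRY = {"mem": InMemoryFS}

def get_filesystem(conf: dict) -> FileSystem:
    return REGISTRY[conf["fs.scheme"]]()
```

Any backend that implements the same small interface (read streams plus chunk locations) can be swapped in without touching the MapReduce framework itself, which is what makes adaptations such as the shim layer of [13] possible.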
Specific parameters such as the chunk size and the replication policy can also be tuned in order to fit the needs of the central component of the MapReduce framework: the scheduler. In this model, the filesystem becomes part of the entire application and is tuned to fit its precise needs.

3 Parallel Filesystems and HPC

The major use of IO in HPC is checkpointing. Scientific applications featuring time-varying datasets, or data-mining programs including iterative optimization solvers, often back up their entire dataset to the filesystem in order to be able to restart in case of failure. In this context, having a computation driven by data locality makes no sense: if a failure occurs on a node that used its own local storage for checkpointing, its data is lost. Thus, clusters usually provide a set-aside storage area where a parallel filesystem, such as PVFS [4], GPFS [12] or Lustre, is deployed. Multiple dedicated IO nodes act as IO servers and expose a POSIX-like mount point to the application. While distributed filesystems are deployed for the purpose of a single application, parallel filesystems for HPC are shared between all the users of the cluster. Thus users can book computation nodes on a cluster, but they cannot control or even predict their IO bandwidth. HPC applications are usually based on a main loop featuring a computation-intensive phase, a communication phase and a checkpointing phase. A classical pattern consists in writing one file per process per backup, leading to a huge number of files written at the same time to the filesystem. As HPC infrastructures grow toward petascale capabilities, IO becomes a bottleneck and a high standard deviation in the time to write a file is observed. Thus, MPI-IO,
the IO part of the Message Passing Interface standard, offers the ability to write big files in a collective manner, with several optimizations with respect to the filesystem. As an example, MPI-IO/GPFS, an optimized version of MPI-IO on top of GPFS, is presented in [1]. While more information is provided by the application regarding the file layout, MPI-IO also adapts the IO patterns with respect to its knowledge of the filesystem. In MPI-IO/GPFS (as in other MPI-IO implementations [14]), such adaptations are conducted through IO agents, namely MPI tasks acting as an IO adaptation layer that provide:
- data shipping: avoids having multiple tasks access a single chunk in the filesystem, by binding IO agents to chunks and making IO agents read/write entire chunks;
- prefetching: when a task tries to access a set of small parts of a file, big chunks are loaded instead in order to avoid multiple expensive accesses;
- data sieving: when the access is sparse with respect to single processes but dense with respect to all of them, the entire file is loaded and the multiple IO operations are replaced by communications between IO agents;
- double buffering: used for large accesses in order to overlap write requests from tasks with effective IO accesses to the filesystem.
Moreover, MPI-IO calls are usually hidden behind high-performance data formats such as HDF5 or NetCDF, which keep a high level of semantics in the files by embedding metadata and by arranging datasets in an efficient layout.

4 Failures and replication patterns

The way failures are considered in HPC systems and in the distributed systems (DS) in which the MR programming model can be applied is one of the key differences between the two. In HPC systems failures are abnormal, even rare, events (even though at scale this tends not to be true), while in the case of MR applications, failures are normal and are taken into account in the design of the application.
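Among the MPI-IO optimizations listed in the previous section, data sieving is the easiest to illustrate. The following Python sketch is illustrative only (not ROMIO's or MPI-IO/GPFS's actual code; all names are invented): it coalesces sparse (offset, length) requests separated by small gaps into fewer, larger contiguous accesses, then slices the requested ranges out of the buffers.

```python
# Hypothetical sketch of the data-sieving idea: trade a little extra data read
# for far fewer filesystem accesses. Not an actual MPI-IO implementation.
def data_sieving_read(read_at, requests, max_gap):
    """read_at(offset, length) performs one filesystem access.
    requests is a list of (offset, length) pairs; requests whose gap is at
    most max_gap bytes are served by a single contiguous read."""
    requests = sorted(requests)
    groups = []                       # coalesced (start, span) extents
    for off, length in requests:
        if groups and off - (groups[-1][0] + groups[-1][1]) <= max_gap:
            start, span = groups[-1]  # extend the previous extent
            groups[-1] = (start, max(span, off + length - start))
        else:
            groups.append((off, length))
    results, accesses = {}, 0
    for start, span in groups:
        buf = read_at(start, span)    # one large access per extent
        accesses += 1
        for off, length in requests:  # slice each request out of the buffer
            if start <= off and off + length <= start + span:
                results[(off, length)] = buf[off - start: off - start + length]
    return results, accesses
```

With three sparse requests and a small gap threshold, two of them fall into one contiguous extent, so only two filesystem accesses are issued instead of three; with denser request sets the saving grows accordingly.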
The main mechanism for HPC to deal with failures is checkpointing and rollback. Powerful algorithms [8] have been designed to optimize these mechanisms, but checkpointing always requires heavy access to the filesystem. In HPC systems, data availability is ensured by RAID and by additional bits used to verify data consistency and, in some cases, to recover damaged bits. One of the main problems nowadays is bit flipping, whose rate increases year by year because of lower operating voltages and new silicon layers, and which has become a big concern for HPC systems. As an open question on how to adapt to this situation, HPC systems can either use more complex data schemes for detecting/recovering bit errors, or consider the replication techniques used by DFSs. We have seen [13] that it is possible to provide replication for parallel filesystems, but there is still work to be done in order to adapt this to HPC needs. On the other hand, the main approach to ensure data reliability and availability in a DFS is replication. Data is stored on more than one data node (in general three, but five is also popular for more sensitive data), hence if one node crashes or is not available, the DFS rapidly retrieves the data from another location. The cost of persistent storage is very low, so having several physical copies is not expensive [7]. DFSs like GFS [7], HDFS [3] and BlobSeerFS [10] all use data replication and have monitoring mechanisms for maintaining the replication factor.

5 Parallel Databases vs. MapReduce

In article [11] a comparison is made between parallel DBMSs (DataBase Management Systems) and the very popular programming model MapReduce (MR) [5], invented by Google. The
authors of the article compare the performance of two parallel DBMSs, Vertica and DBMS-X, with the Hadoop [3] system, an open source implementation of the MR model. Based on the results presented on five different benchmarks, in which the parallel DBMSs outperform Hadoop, they conclude that the SQL programming model based on parallel DBMSs should be considered superior to the MR one. They support this by also arguing that databases in general have a 25-year head start, so the model is greatly tuned. However, as drawbacks of parallel databases, they report that the installation of the system is more difficult than Hadoop's, and that the initial upload of data into the system is not that easy either. A comparison is made between the fixed data schema imposed by databases and the more flexible approach offered by MR systems. Defending the databases, they mention the custom parser that must be provided because of this freedom offered by the MR model. The indexing capabilities of databases are also highlighted, especially in the context of multiple indexes per table used by the query optimizer. Hadoop (together with HDFS) is still young compared to databases, thus it is still possible to considerably improve the performance of such systems, in particular regarding I/O performance, by providing more efficient storage services and even enhancing their capabilities with respect to concurrent access semantics. Recent studies, like the one proposed by Nicolae et al. [10], showed that Hadoop pipelines can be improved, by up to 40% in efficiency, by providing a shim layer similar to the one discussed in article [13] for a different DFS (here BSFS). Although HDFS is write-once-read-many, a filesystem like BSFS can provide more flexible access semantics, since it is fully concurrent. There also exist scientific applications that require a pipeline of MR phases that would not fit the SQL modeling language.
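The notion of a pipeline of MR phases can be sketched in a few lines of Python (a toy single-machine rendition; `map_reduce` and `pipeline` are hypothetical names, not Hadoop's API): each phase emits (key, value) pairs, groups them by key, reduces each group, and feeds its output to the next phase.

```python
# Toy, single-machine rendition of chained MapReduce phases.
# Illustrative only; function names are invented for this sketch.
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """One MR phase: map_fn emits (key, value) pairs from each record,
    values are grouped by key, and reduce_fn folds each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def pipeline(records, phases):
    """Chain several MR phases, feeding each phase's output (as (key, value)
    items) into the next -- the kind of workflow that does not map naturally
    onto a single SQL query."""
    data = records
    for map_fn, reduce_fn in phases:
        data = list(map_reduce(data, map_fn, reduce_fn).items())
    return dict(data)
```

For example, a first phase can count word occurrences in a set of documents and a second phase can regroup words by their count; the output of the first phase is consumed directly as the input records of the second.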
Moreover, recent cloud storage systems, like Microsoft Azure [2], offer extremely interesting properties for the stored data that can be compared to the power of indexes in databases. Tables (not to be confused with tables from relational databases), one of the three storage mechanisms in Azure, offer total freedom in the records stored and provide key identifiers similar to indexes. Hence, MR programs could be optimized to reach very high performance, without being restricted to a specific schema and without developing dedicated parsers.

6 Discussion

In this report, we have compared the main differences in the design of HPC-efficient parallel filesystems and MapReduce filesystems. While distributed filesystems often run on computation nodes, they can provide data localization in order to avoid high bandwidth usage, as well as replication capabilities. In fact, failures are considered part of the system and are taken into account in its very design. Parallel filesystems run on storage areas and are usually well suited for MPI-based applications. Unlike in DFSs, chunk sizes in PFSs are usually small to fit the need for fine-grain access by parallel applications. Finally, let us mention that recent developments in cloud computing, in particular in Amazon EC2 services, allow users to rent resources of both types, HPC-efficient and/or MapReduce-efficient; thus we may converge toward a joint usage of both workloads, leading to a high demand for efficient yet generic distributed filesystems.
References

[1] Jean-Pierre Prost, Richard Treumann, Richard Hedges, Bin Jia, and Alice Koniges. MPI-IO/GPFS, an optimized implementation of MPI-IO on top of GPFS. In Proceedings of Supercomputing 2001, 2001.
[2] Azure: http://www.microsoft.com/windowsazure/.
[3] Hadoop: http://hadoop.apache.org/.
[4] P.H. Carns, W.B. Ligon III, R.B. Ross, and R. Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase & Conference, page 28. USENIX Association, 2000.
[5] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters.
[6] M. Dunn. Parallel I/O Testing for Hadoop.
[7] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System.
[8] Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir, and Franck Cappello. Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications. Technical report, INRIA-Illinois Joint Laboratory on PetaScale Computing.
[9] E. Molina-Estolano, M. Gokhale, C. Maltzahn, J. May, J. Bent, and S. Brandt. Mixing Hadoop and HPC Workloads on Parallel Filesystems.
[10] B. Nicolae, D. Moise, G. Antoniu, L. Bougé, and M. Dorier. BlobSeer: Bringing High Throughput under Heavy Concurrency to Hadoop Map/Reduce Applications. IPDPS, 2010.
[11] A. Pavlo, E. Paulson, A. Rasin, D. Abadi, D. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis.
[12] F. Schmuck and R. Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the First USENIX Conference on File and Storage Technologies, pages 231-244, 2002.
[13] W. Tantisiriroj, S. Patil, and G. Gibson. Data-intensive file systems for Internet services: A rose by any other name.
[14] R. Thakur, W. Gropp, and E. Lusk. Data sieving and collective I/O in ROMIO. Frontiers, page 182, 1999.