I/O intensive applications: what are the main differences in the design of the HPC filesystems vs the MapReduce ones?

Matthieu Dorier, Radu Marius Tudoran
Master 2 Research, ENS Cachan - Brittany extension
December 16, 2010

1 Introduction

In this report, a comparison is drawn between HPC filesystems and those designed for MapReduce (MR) applications. The starting point is the four proposed papers [13, 6, 9, 11]. The first article [13] analyzes the possibility of using a parallel filesystem for MR storage instead of the standard filesystem shipped with Hadoop [3]. This adaptation is done by implementing a shim layer, and it shows performance comparable to HDFS. The second article [9] compares the performance of PVFS, a parallel filesystem, and HDFS when two different workloads, each usually specific to one of the filesystems, are run concurrently. A comparison between the MR programming model and parallel databases is made in the third article [11], which seeks to temper the growing interest in MR applications, while the last paper [6] simply describes a successful installation of Hadoop. The issues raised in these papers are discussed below and compared with other publications, with the goal of highlighting the specificities of each type of filesystem.

2 Enriched API in DFS

MapReduce applications need to process large files in parallel in a write-once-read-many scheme. Thus, distributed filesystems (DFS) such as HDFS are designed to run on the computation nodes and to be part of the application, unlike parallel filesystems, which usually run on dedicated nodes. This deployment scheme lets a DFS expose data locality. In Hadoop, Yahoo!'s implementation of MapReduce, a Java interface called FileSystem and a set of abstract classes handling streams let users implement their own filesystem or select the one they want through configuration files. Several DFSs have thus been adapted to Hadoop, such as CloudStore (previously KosmosFS) or Amazon S3. Specific parameters such as the chunk size and the replication policy can also be tuned to fit the needs of the central component of the MapReduce framework: the scheduler. In this model, the filesystem becomes part of the entire application and is tuned to fit its precise needs, as the sketch below illustrates.
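To make this concrete, here is a minimal, hypothetical sketch of a client using Hadoop's FileSystem API; it is not taken from the cited papers, and the URI, buffer size, block size and replication values are placeholder assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FileSystemConfigSketch {
        public static void main(String[] args) throws Exception {
            // The concrete filesystem is selected through configuration:
            // an hdfs:// URI picks HDFS, an s3n:// URI picks Amazon S3, etc.
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode.example.org:9000"); // placeholder URI

            FileSystem fs = FileSystem.get(conf);

            // Chunk (block) size and replication factor are per-file tunables
            // exposed by the generic FileSystem interface.
            Path path = new Path("/data/input/part-00000");   // placeholder path
            short replication = 3;                             // assumed value
            long blockSize = 64L * 1024 * 1024;                // assumed 64 MB chunks
            FSDataOutputStream out = fs.create(path, true, 4096, replication, blockSize);
            out.write("some record\n".getBytes("UTF-8"));
            out.close();
            fs.close();
        }
    }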
3 Parallel Filesystems and HPC

The major use of I/O in HPC is checkpointing. Scientific applications featuring time-varying datasets, or data-mining programs including iterative optimization solvers, often back up their entire dataset in the filesystem in order to be able to restart in case of failure. In this context, having the computation driven by data locality makes no sense: if a failure occurs on a node that used its own local storage for checkpointing, its data is lost. Thus, clusters usually provide a set-aside storage area where a parallel filesystem, such as PVFS [4], GPFS [12] or Lustre, is deployed. Multiple dedicated I/O nodes act as I/O servers and expose a POSIX-like mount point to the application. While distributed filesystems are deployed for the purpose of a single application, parallel filesystems for HPC are shared between all the users of the cluster. Thus, users can book computation nodes on a cluster, but they cannot control or even predict their I/O bandwidth. HPC applications are usually based on a main loop featuring a computation-intensive phase, a communication phase and a checkpointing phase. A classical pattern consists in writing one file per process per backup, leading to a huge number of files written at the same time in the filesystem (sketched below).
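As a rough sketch of that pattern (written in Java only for consistency with the other examples; real HPC codes would typically do this from MPI ranks in C or Fortran), each process dumps its own state into its own file under a shared parallel-filesystem mount; the mount point and file names are assumptions.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class PerProcessCheckpointSketch {
        // One checkpoint file per process per backup: with N processes and K
        // backups, N * K files are created on the shared parallel-FS mount.
        static void checkpoint(int rank, int iteration, byte[] state) throws IOException {
            // "/pfs/app" stands for a PVFS/GPFS/Lustre mount point (placeholder).
            Path dir = Paths.get("/pfs/app/checkpoints/iter-" + iteration);
            Files.createDirectories(dir);
            Files.write(dir.resolve("rank-" + rank + ".ckpt"), state);
        }
    }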

As HPC infrastructures grow toward petascale capabilities, I/O becomes a bottleneck and a high standard deviation in the time to write a file is observed. Thus MPI-IO, the I/O part of the Message Passing Interface standard, offers the ability to write big files in a collective manner, with several optimizations with respect to the filesystem. As an example, MPI-IO/GPFS, an optimized version of MPI-IO on top of GPFS, is presented in [1]. While the application provides more information regarding the file layout, MPI-IO also adapts the I/O patterns according to the knowledge it has of the filesystem. In MPI-IO/GPFS (as in other MPI-IO implementations [14]), such adaptations are conducted through I/O agents, namely MPI tasks acting as an I/O adaptation layer that provides:

- data shipping: avoids having multiple tasks access a single chunk in the filesystem, by binding I/O agents to chunks and making the I/O agents read/write entire chunks;
- prefetching: when a task tries to access a set of small parts of a file, big chunks are loaded instead, in order to avoid multiple expensive accesses;
- data sieving: when the access is sparse with respect to single processes but dense with respect to all of them, the entire file is loaded and the multiple I/O operations are replaced by communications between I/O agents;
- double buffering: used for large accesses in order to overlap write requests from tasks with the effective I/O accesses to the filesystem.

Moreover, MPI-IO calls are usually hidden behind high-performance data formats such as HDF5 or NetCDF, which keep a high level of semantics in the files by embedding metadata and by arranging datasets in an efficient layout.

4 Failures and replication patterns

The way failures are treated in HPC systems and in the distributed systems (DS) to which the MR programming model applies is one of the key differences between the two. In HPC systems, failures are abnormal and rare events (even though at scale this tends not to be true), while for MR applications failures are normal and are taken into account in the design of the application. The main mechanisms for dealing with failures in HPC are checkpointing and rollback. Powerful algorithms [8] have been designed to optimize these mechanisms, but checkpointing always requires heavy access to the filesystem. In HPC systems, data availability is ensured by RAID and by additional bits used to verify data consistency and, in some cases, to recover damaged bits. One of the main problems nowadays is bit flipping, which increases year by year because of decreasing supply voltages and newer silicon technologies, and which is becoming a big concern for HPC systems. As an open question on how to adapt to this situation, HPC systems can either use more complex data schemes for detecting and recovering bit errors, or consider the replication techniques used by DFSs. We have seen [13] that it is possible to provide replication for a parallel filesystem, but work remains to adapt this to HPC needs. On the other hand, the main approach to ensuring data reliability and availability in a DFS is replication. Data is stored on more than one data node (in general three, although five is also popular for more sensitive data), so if one node crashes or is unavailable, the DFS rapidly retrieves the data from another location. The cost of persistent storage is very low, so keeping several physical copies is not expensive [7]. DFSs like GFS [7], HDFS [3] and BlobSeerFS [10] all use data replication and have monitoring mechanisms for maintaining the replication count, as in the sketch below.
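As a hedged illustration of how a DFS exposes replication and locality to clients, the following hypothetical snippet uses Hadoop's FileSystem API to raise the replication factor of a file and to list the datanodes holding each block; the path and the replication value are placeholders, not recommendations from the cited papers.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/data/sensitive/records.dat"); // placeholder path

            // Ask the DFS to keep five copies instead of the usual three; a
            // background monitor re-replicates blocks when a datanode is lost.
            fs.setReplication(path, (short) 5);

            // Data locality: the client can see which nodes hold each chunk,
            // which is what lets the MapReduce scheduler place tasks near data.
            FileStatus status = fs.getFileStatus(path);
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset()
                        + " hosted on " + Arrays.toString(block.getHosts()));
            }
            fs.close();
        }
    }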
5 Parallel Databases vs. MapReduce

In article [11], a comparison is made between parallel DBMSs (database management systems) and the very popular MapReduce (MR) programming model [5], introduced by Google. The authors compare the performance of two parallel DBMSs, Vertica and DBMS-X, with the Hadoop [3] system, an open-source implementation of the MR model. Based on the results presented for five different benchmarks, in which the parallel DBMSs outperform Hadoop, they conclude that the SQL programming model backed by a parallel DBMS should be considered superior to the MR one. A minimal sketch of the MR model under comparison is given below.
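For reference, here is a minimal, hypothetical Hadoop word-count job illustrating the MR model; note that the mapper itself parses raw, schema-less lines, which is exactly the "custom parser" role discussed below. Class names and the tokenization rule are illustrative assumptions, not code from the compared systems.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountSketch {

        // The map phase parses raw, schema-less input lines itself
        // (the role a fixed schema would play in a parallel DBMS).
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // The reduce phase aggregates per key, much like a SQL
        // "SELECT word, COUNT(*) ... GROUP BY word".
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }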

They support this conclusion by arguing that databases in general have a 25-year head start, so the model is highly tuned. However, as drawbacks of parallel databases, they report that installing the system is more difficult than installing Hadoop, and that the initial upload of data into the system is not easy either. The fixed data schema imposed by databases is compared with the more flexible approach offered by MR systems; defending databases, the authors point to the custom parser that must be provided because of this freedom left by the MR model. The indexing capabilities of databases are also highlighted, especially the multiple indexes per table used by the query optimizer. Hadoop (together with HDFS) is still young compared to databases, so it is still possible to considerably improve the performance of such systems, in particular regarding I/O, by providing more efficient storage services and even by enhancing their concurrent access semantics. Recent studies, such as the one proposed by Nicolae et al. [10], showed that Hadoop pipelines can be made up to 40% more efficient by providing a shim layer similar to the one discussed in article [13], for a different DFS (here BSFS). Although HDFS is write-once-read-many, a filesystem like BSFS can provide more flexible access semantics, since it is fully concurrent. There also exist scientific applications requiring a pipeline of MR phases that would not fit the SQL modeling language. Moreover, recent cloud storage systems, like Microsoft Azure [2], offer very interesting properties for stored data that can be compared to the power of indexes in databases. Tables, one of the three storage mechanisms in Azure (not to be confused with tables from relational databases), impose no constraints on the records stored and offer key identifiers similar to indexes. Hence, MR programs could be optimized to reach very high performance without being restricted to a specific schema and without developing dedicated parsers.

6 Discussion

In this report, we have compared the main differences in the design of HPC-efficient parallel filesystems and MapReduce filesystems. While distributed filesystems often run on the computation area, they can expose data locality in order to avoid high bandwidth usage, and they provide replication capabilities; indeed, failures are considered part of the system and are taken into account in its very design. Parallel filesystems run on storage areas and are usually well suited for MPI-based applications. Unlike in a DFS, chunk sizes in a PFS are usually small, to fit the fine-grained access needs of parallel applications. Finally, let us mention that recent developments in cloud computing, in particular Amazon EC2 services, allow users to rent resources of both kinds, HPC-efficient and/or MapReduce-efficient, so we may converge toward a joint usage of both workloads, leading to a high demand for efficient yet generic distributed filesystems.

References

[1] J.-P. Prost, R. Treumann, R. Hedges, B. Jia, and A. Koniges. MPI-IO/GPFS, an optimized implementation of MPI-IO on top of GPFS. In Proceedings of Supercomputing 2001, 2001.
[2] Azure: http://www.microsoft.com/windowsazure/.
[3] Hadoop: http://hadoop.apache.org/.
[4] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase & Conference, page 28. USENIX Association, 2000.
[5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters.
[6] M. Dunn. Parallel I/O testing for Hadoop.
[7] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System.
[8] A. Guermouche, T. Ropars, E. Brunet, M. Snir, and F. Cappello. Uncoordinated checkpointing without domino effect for send-deterministic MPI applications. Technical report, INRIA-Illinois Joint Laboratory on PetaScale Computing.
[9] E. Molina-Estolano, M. Gokhale, C. Maltzahn, J. May, J. Bent, and S. Brandt. Mixing Hadoop and HPC workloads on parallel filesystems.
[10] B. Nicolae, D. Moise, G. Antoniu, L. Bougé, and M. Dorier. BlobSeer: Bringing high throughput under heavy concurrency to Hadoop Map/Reduce applications. IPDPS, 2010.
[11] A. Pavlo, E. Paulson, A. Rasin, D. Abadi, D. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis.
[12] F. Schmuck and R. Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the First USENIX Conference on File and Storage Technologies, pages 231-244, 2002.
[13] W. Tantisiriroj, S. Patil, and G. Gibson. Data-intensive file systems for Internet services: A rose by any other name.
[14] R. Thakur, W. Gropp, and E. Lusk. Data sieving and collective I/O in ROMIO. Frontiers, page 182, 1999.