Hadoop Distributed File System
T-111.5550 Seminar on Multimedia, 2009-11-11
Eero Kurkela
Agenda
- Introduction
- Flesh and bones of HDFS
- Architecture
- Accessing data
- Data replication strategy
- Fault tolerance
- When to choose HDFS?
- HDFS in action
- Future of HDFS
- Alternative approaches
- References / literature
Apache Hadoop HDFS
- Apache Hadoop
  - Project that "develops open-source software for reliable, scalable, distributed computing" [1]
- HDFS (Hadoop Distributed File System)
  - Subproject of Hadoop
  - Target: reliable and rapid computation on large data sets, with emphasis on high throughput
  - Primary storage system for Hadoop applications
  - Designed especially for sending and receiving data sets for MapReduce operations
  - Serves also as a (limited) general-purpose DFS
[1][3][10]
Flesh and bones of HDFS
- User-level file system
- Written in Java
- Typically runs on a GNU/Linux operating system
- Can be deployed on commodity hardware
  - This is a key assumption in the design
- Inter-node and client communication protocols work on top of TCP/IP
- API / shell / browser access
[5]
Architecture 1/4: Overview
- Based on GFS; master/slave architecture
[5][10]
Architecture 2/4: Namespace and files
- Common hierarchical namespace structure
  - Directories that contain directories and files
- WORM (write-once-read-many) access model
  - Files and directories can be created, deleted, moved, renamed, opened, and closed, but NOT modified
  - Simplifies replication
- Files are split into blocks
  - Default block size 64 MB
[5][10]
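The fixed-size block split described above is simple arithmetic. A minimal sketch (the values are illustrative; in HDFS the split is actually performed at write time and recorded by the NameNode):

```java
// Sketch: splitting a file into fixed-size blocks, HDFS-style (default 64 MB).
public class BlockSplit {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default

    // Number of blocks needed for a file of the given length (ceiling division).
    static long blockCount(long fileLength) {
        if (fileLength == 0) return 0;
        return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Length of the last (possibly partial) block.
    static long lastBlockLength(long fileLength) {
        long rem = fileLength % BLOCK_SIZE;
        if (fileLength == 0) return 0;
        return rem == 0 ? BLOCK_SIZE : rem;
    }
}
```

For example, a 200 MB file occupies four blocks: three full 64 MB blocks and one 8 MB tail block.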
Architecture 3/4: NameNode
- NameNode = master
  - Provides instructions to DataNodes
  - Point of access for clients
- One NameNode per cluster
  - Typically a dedicated machine
  - Achilles' heel: single point of failure
- Namespace and metadata management
  - Keeps metadata in RAM (→ scalability bottleneck)
  - Decides how blocks are placed on DataNodes
[5][10]
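A back-of-envelope estimate shows why keeping all metadata in RAM is a scalability bottleneck. The figure of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block) is a commonly cited rule of thumb, not an exact number:

```java
// Sketch: rough NameNode heap needed for a given namespace size.
// The ~150 bytes/object figure is a widely quoted rule of thumb (assumption).
public class NameNodeRam {
    static final long BYTES_PER_OBJECT = 150;

    // Heap estimate for `files` files averaging `blocksPerFile` blocks each:
    // one namespace object per file plus one per block.
    static long estimateHeapBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile;
        return objects * BYTES_PER_OBJECT;
    }
}
```

Under this estimate, 100 million files of two blocks each would already require on the order of 45 GB of heap on a single machine, which is why the slide calls the in-RAM design a bottleneck.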
Architecture 4/4: DataNode
- DataNode = slave
- Serves block creation/deletion/replication requests from the NameNode
- Serves read/write requests from clients
- Typically several DataNodes per cluster, on dedicated machines
- Stores blocks as files in the local file system; knows nothing about HDFS files
- Provides Blockreports to the NameNode
- Blocks are always transferred directly between DataNodes and clients
[5]
Accessing data
- FileSystem Java API, plus a wrapper for C
- Command-line interface: FS Shell
  - Practical for scripts
  - Commands resemble Unix utilities, e.g.
    bin/hadoop dfs -mkdir /tempdir
    bin/hadoop dfs -cat /tempdir/tempfile.txt
  - dfsadmin for administrative tasks, e.g.
    bin/hadoop dfsadmin -refreshNodes
- Web-browser-based interface for browsing the namespace
[5]
Data replication strategy 1/2: Overview
- Basis for fault tolerance
- Replica placement significantly affects performance
- NameNode is responsible for replica placement
- Number of replicas and block size can be configured separately for each file
- Concept of rack awareness
  - NameNode determines which DataNodes belong to the same rack
  - Idea is to minimize network traffic between racks
[5]
Data replication strategy 2/2: Default strategy
- Replication factor = 3
  - One replica on a node in the local rack
  - One replica on another node in the same rack
  - One replica on a node in another rack
- Balances write performance and fault tolerance
- Replication pipelining
  - A DataNode forwards the data to the next DataNode according to a list generated by the NameNode
[5]
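The default placement described above can be sketched as a simple selection over rack-labelled nodes. The "rack:node" naming scheme is hypothetical and the real logic lives inside the NameNode; this only illustrates the two-local, one-remote pattern:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the default 3-replica placement: two replicas on distinct nodes
// in the writer's rack, one replica in another rack. Node names are assumed
// to be "rackId:nodeId" strings (an illustrative convention, not HDFS's).
public class ReplicaPlacement {
    static List<String> place(String writerRack, List<String> nodes) {
        List<String> replicas = new ArrayList<>();
        for (String n : nodes)          // 1st and 2nd replica: writer's rack
            if (n.startsWith(writerRack + ":") && replicas.size() < 2)
                replicas.add(n);
        for (String n : nodes)          // 3rd replica: any other rack
            if (!n.startsWith(writerRack + ":") && replicas.size() < 3)
                replicas.add(n);
        return replicas;
    }
}
```

With the chosen list in hand, the pipelining step is straightforward: the client sends the block to the first DataNode on the list, which forwards it to the second, and so on.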
Fault tolerance
- Failure at a DataNode
  - Heartbeat missing → stop I/O, re-replicate
- Network failure
- Data integrity failure
  - Detected via checksums
- Failure at the NameNode
  - Backing up data highly recommended
  - No built-in method for automatic recovery available
[5]
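The checksum-based integrity check can be illustrated with java.util.zip.CRC32. This is a sketch: HDFS stores checksums per chunk of each block and verifies them on read, and the exact chunking and checksum storage are omitted here. A mismatch marks the replica corrupt so the NameNode can re-replicate it from a healthy copy:

```java
import java.util.zip.CRC32;

// Sketch: checksum-based detection of a corrupted block replica.
public class BlockIntegrity {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // True if the stored checksum still matches the block's current contents.
    static boolean verify(byte[] data, long storedChecksum) {
        return checksum(data) == storedChecksum;
    }
}
```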
When to choose HDFS? 1/2: Applications & data
- Think of HDFS as a data-set system rather than a file system
- Ideal for batch processing, not interactive tasks
- Intended for streaming large amounts of data, though seeking to an arbitrary point is also supported
- Throughput is optimized at the cost of latency
- Typically millions of files, average file size 1 GB ... 1 TB
- E.g. web crawlers, GIS data management, archival, statistical analysis, and naturally Hadoop apps
- The WORM access model must be acceptable
[5][10]
When to choose HDFS? 2/2: Points regarding environment
- Works wherever Java works
  - Highly portable; good support for Java applications
- Supports mechanisms for bringing computation physically closer to the data
  - Saves bandwidth compared to moving the data
- Designed for thousands of nodes, several of which are always broken
[5]
HDFS in action
- Yahoo!: The Yahoo! Search Webmap
  - 10 000 cores and 5 PB of storage capacity
  - Produces data for all Yahoo! web search queries
  - HDFS caused a 34% drop in processing time [7]
- Adobe [2]
  - 30 nodes in clusters of 5-14 nodes (dev + prod)
  - Social services, data storage, internal use
- AOL
  - 50-node cluster with 37 TB of HDFS capacity
  - Behavioral analysis, targeting, statistics generation
HDFS in action (continued)
- Facebook [2]
  - 600-node cluster with 2 PB of storage capacity
  - Logs, reporting, analysis, machine learning
  - FUSE implementation over HDFS
- Iterend
  - Blog search engine
  - 10-node HDFS cluster
- Spadac
  - Storing and processing geospatial imagery and vector data
Future of HDFS 1/2: Confirmed plans
- Moving on from WORM: support for appending data to files
- Improvements in namespace maintenance (invisible to clients)
- Access via the WebDAV protocol
  - Extends HTTP for file management & modification
- Support for snapshots, for returning to a functional state in case of file-system corruption
- Tuning the replica placement policy
[5]
Future of HDFS 2/2: Possible improvements
- User quotas
- Hard and soft links
- Data rebalancing mechanisms
  - Move blocks to other nodes if disk space on a certain DataNode drops too low
  - Create additional replicas if demand for a certain file rises significantly
- Automatic recovery from NameNode failure
[5]
Alternative approaches
- DFSs generally come in two flavors:
- 1) Designed for running Internet services
  - Often developed by companies like Google and Amazon
  - GoogleFS, Amazon S3, HDFS
- 2) Designed for high-performance computing
  - Parallel file systems: IBM GPFS, Sun Lustre FS
  - PVFS (Parallel Virtual File System)
    - Open-source, user-level file system like HDFS; has some high-level design similarities
    - In use at Argonne National Laboratory, Ohio Supercomputer Center, ...
[8][9]
HDFS vs. PVFS 1/2: Design [9]
HDFS vs. PVFS 2/2: Performance [9]
- Benchmarks executed in the Hadoop Internet-services stack; note that PVFS is sending writes to three servers
References / Literature
[1] What is Hadoop?, The Apache Software Foundation, referenced on 2009-10-29, available at http://hadoop.apache.org/
[2] PoweredBy, The Apache Software Foundation, referenced on 2009-10-29, available at http://wiki.apache.org/hadoop/poweredby
[4] HDFS User Guide, The Apache Software Foundation, referenced on 2009-10-29, available at http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html
[5] HDFS Architecture, The Apache Software Foundation, referenced on 2009-10-29, available at http://hadoop.apache.org/common/docs/current/hdfs_design.html
[7] Yahoo! Launches World's Largest Hadoop Production Application, Eric Baldeschwieler (Senior Director, Grid Computing, Yahoo! Inc.), referenced on 2009-10-29, available at http://developer.yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-productionhadoop.html
[8] The Google File System; Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung; 2003; available at http://labs.google.com/papers/gfs-sosp2003.pdf
[9] Data-intensive File Systems for Internet Services: A Rose by Any Other Name...; Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson; Carnegie Mellon University / Parallel Data Laboratory; 10/2008; available at http://www.pdl.cs.cmu.edu/pdl-ftp/pdsi/cmu-pdl-08-114.pdf
[10] MapReduce and HDFS; Cloudera, Inc.; referenced on 2009-11-07, slides and video available at http://www.cloudera.com/hadoop-training-mapreduce-hdfs