Implementing the Hadoop Distributed File System Protocol on OneFS Jeff Hughes, EMC Isilon
Outline Hadoop Overview OneFS Overview MapReduce + OneFS Details of isi_hdfs_d Wrap-up & Questions 2
Hadoop Overview
Apache Hadoop Project The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. http://hadoop.apache.org/ Two main components: MapReduce and a Distributed File System (DFS) 4
Hadoop: MapReduce MapReduce is a distributed computation framework optimized for batch processing Typical I/O profile: DFS Read → Map Task → Map Output → Shuffle → Reduce Task → DFS Write 5
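The stages above can be sketched as a tiny word-count job in plain Python. This is only illustrative; it is not Hadoop's API, and the stage boundaries are marked with the slide's terms.

```python
# Minimal word-count sketch mirroring the MapReduce I/O profile.
# Plain Python for illustration, not Hadoop's actual framework API.
from collections import defaultdict

def run_job(lines):
    # "DFS Read" -> "Map Task": emit (word, 1) pairs (the "Map Output")
    map_output = [(word, 1) for line in lines for word in line.split()]

    # "Shuffle": group the intermediate pairs by key
    groups = defaultdict(list)
    for key, value in map_output:
        groups[key].append(value)

    # "Reduce Task" -> "DFS Write": sum each group's values
    return {key: sum(values) for key, values in groups.items()}
```

For example, `run_job(["the quick fox", "the fox"])` counts each word across both "splits", with the shuffle stage collecting every occurrence of a word before a single reduce sums it.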
Hadoop: HDFS Architecture http://hadoop.apache.org/common/docs/r0.20.2/hdfs_design.html 6
Hadoop: HDFS Semantics Metadata/data server cluster architecture Cluster coherent namespace and data Write-once-read-many access Single writer only Can append to existing files Data mirrored 3x for resiliency Client exposed to data topology Block locations as part of file metadata http://www.snia.org/sites/default/files2/sdc_archives/2010_presentations/wednesday/dhrubaborthakur-hadoop_file_systems.pdf 7
Hadoop: Why HDFS? Portability All user space and OS independent Purpose-built Primary workflow is MapReduce Limited set of operations to implement Single software package Fluid client/server protocol development Exposure of data topology Enables client to control data path locality 8
OneFS Overview
OneFS: Architecture [Diagram: client/application servers connect through an Ethernet layer to the Isilon IQ storage layer; intracluster communication runs over InfiniBand] 10
OneFS: OS Built from the ground up on FreeBSD File system is a loadable kernel module with a VFS interface Supports POSIX syscalls locally; protocol servers access /ifs paths FS built for mixed namespace access Supports SMB, NFS, HTTP, etc. 11
OneFS: Semantics Symmetric cluster architecture Metadata distributed across all nodes Tightly coupled group semantics Globally coherent file system access Distributed lock manager Two-phase commit for all write operations Reed-Solomon FEC used for data protection 12
Running MapReduce against OneFS
MapReduce + OneFS: Architecture OneFS runs a daemon that speaks NameNode and DataNode natively [Diagram: 1) a DFSClient on a Hadoop node sends Request(/file); 2) it receives a Response with block locations; 3) it issues GetBlock(block) to a DataNode. Every node of the OneFS clustered file system presents both a NameNode and a DataNode.] 14
MapReduce + OneFS: Benefits Easier integration with existing workflows First-class multi-protocol access Reduce ETL stages Increased disk efficiency HDFS: 30% usable, OneFS: 80% usable Reduced data center footprint More data management options Snapshots, site replication, etc. 15
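The efficiency figures follow from the protection schemes: HDFS's 3x mirroring stores one usable copy out of three, while Reed-Solomon N+M protection only spends M parity units per N data units. The 16+2 stripe below is a hypothetical layout chosen for illustration, not a claim about OneFS's actual default.

```python
def usable_fraction(data_units, parity_units):
    """Fraction of raw capacity available for user data under N+M protection."""
    return data_units / (data_units + parity_units)

# HDFS 3x mirroring: one usable copy out of three stored
hdfs = usable_fraction(1, 2)      # ~0.33, matching the slide's ~30%

# Hypothetical 16+2 Reed-Solomon stripe (illustrative layout only)
onefs = usable_fraction(16, 2)    # ~0.89, in the ballpark of the slide's 80%
```

Mirroring's cost is fixed at 1/3 regardless of cluster size, whereas FEC efficiency improves with wider stripes, which is why the gap widens on dense clusters.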
MapReduce + OneFS: Challenges Typical data path locality changes: MapReduce + HDFS acts like DAS; MapReduce + OneFS goes over the network Client/server compatibility and maintenance: OneFS and MapReduce clusters run different software versions Hidden benefit: access across multiple HDFS versions 16
MapReduce + OneFS: Mitigations 1GbE < SATA controller < 10GbE Hadoop designed for 1GbE 10GbE prices dropping Denser storage == fewer nodes and less networking Rack locality limits cross-switch contention [Diagram: racks A, B, and C] 17
MapReduce + OneFS: Performance DFS Read → Map Task → Map Output → Shuffle → Reduce Task → DFS Write Typically ~100Mbit per task from HDFS I/Os against temp vary considerably per job More variable, but still ~100Mbit per task to HDFS Performance bottleneck likely to be temp space Terasort example: 75% of I/Os against temp Latency has little impact on HDFS's large block read/write operations http://cto.vmware.com/analyzing-hadoops-internals-with-analytics/ 18
Details of isi_hdfs_d
HDFS Protocols Two TCP-based protocols NameNode metadata operations DataNode data transfers About 26 NameNode RPCs Mostly use fully qualified paths POSIX-like file attrs (mode bits, user/group) Only 2 DataNode client operations Simple read/write with a block identifier More in Apache HDFS for administration 20
NameNode Request Example Example getfileinfo( /testfile ) request (think stat): [Packet capture annotated with the method name, the parameter type (string), and the path /testfile] 21
NameNode Response Example And the reply: [Packet capture annotated with the owner, group, and object type fields] 22
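Conceptually, servicing a getfileinfo request is a stat() call plus repackaging. The sketch below shows that translation; the dict shape and field names are illustrative assumptions, not Hadoop's actual wire format (which uses Hadoop's own serialization), but the owner/group/type fields mirror the reply above.

```python
# Illustrative translation of a getfileinfo-style RPC to a local stat().
# The response dict is a stand-in for the real serialized HdfsFileStatus.
import os
import stat

def getfileinfo(path):
    st = os.stat(path)
    return {
        "path": path,
        "isdir": stat.S_ISDIR(st.st_mode),          # "object type" in the reply
        "length": st.st_size,
        "permission": stat.S_IMODE(st.st_mode),
        "owner": str(st.st_uid),                    # HDFS identities are strings
        "group": str(st.st_gid),                    # (see the quirks slide later)
        "modification_time": int(st.st_mtime * 1000),  # HDFS uses milliseconds
    }
```

Note the owner and group are stringified here; the real daemon has to resolve numeric UIDs/GIDs to names, one of the quirks called out near the end of the talk.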
isi_hdfs_d Multi-threaded daemon runs on all nodes Services both NN and DN protocols Translates RPCs to POSIX system calls Stateless, underlying FS handles coherency [Diagram: a request enters isi_hdfs_d on a OneFS node, a worker thread translates it into a syscall through the VFS, and the response is returned] 23
Example NameNode RPCs Most NameNode RPCs are straightforward: setpermission() → chmod(), settimes() → utimes(), create() → open(, O_CREAT, ) Other RPCs need some creative interpretation: recoverlease()/renewlease() abandonblock() setreplication() 24
HDFS Data Path NN RPC: getblocklocations(file)/addblock(file) Returns list of LocatedBlocks DFSClient connects to DN Chooses which DatanodeInfo based on locality Only the Block structure is passed to the DN in read/write operations LocatedBlock { long offset; Block { long blkid; long numbytes; long genstamp }; DatanodeInfo[] ... } 25
LocatedBlocks Translation LocatedBlock.offset: logical byte offset into the file Block: opaque to the client, used by the DN; its fields carry meanings specific to OneFS and isi_hdfs_d: blkid = inode number, numbytes = size of extent, genstamp = absolute byte offset DatanodeInfo[]: <IP:port> and rack info for different paths to the same block 26
Read Path Example [Sequence: DFSClient → NameNode: getblocklocations(); NameNode → DFSClient: LocatedBlocks; DFSClient → DataNode: DN_OP_READ(Block); DataNode → DFSClient: data stream] 27
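From the client's side, that sequence is a simple loop: fetch the block list from the NameNode, then stream each block from a DataNode. The function below is a hypothetical sketch of that flow; the namenode/connection objects and their method names are assumptions standing in for the real DFSClient internals.

```python
# Hypothetical client-side read loop following the sequence diagram.
# `namenode.getblocklocations(path)` yields objects with .block and
# .locations (each location having an .address); `connect_to_datanode`
# returns a connection whose .op_read(block) performs DN_OP_READ.
def read_file(namenode, connect_to_datanode, path):
    data = bytearray()
    for located in namenode.getblocklocations(path):
        # A real client chooses among located.locations by locality;
        # here we just take the first reference.
        dn = connect_to_datanode(located.locations[0].address)
        data += dn.op_read(located.block)   # DN_OP_READ(Block) -> data stream
    return bytes(data)
```

The key point the slide makes is that only the opaque Block structure crosses the DataNode connection; everything else the client learned stays client-side.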
NameNode Connection Routing NameNode is configured as a single URL Easy configuration: hdfs://log-server.isilon.com:8020/ DNS round-robin distributes connections across nodes Metadata IOPs get spread out OneFS maintains cross-node consistency IP failover plus client retries for resiliency Hadoop retries ops 5 times at many levels 28
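For Hadoop of this era, the single NameNode URL corresponds to the client's default file system property in core-site.xml; the snippet below is a plausible configuration using the slide's example hostname (the property name `fs.default.name` is the classic pre-2.x key, shown here as an assumption about the deployment).

```xml
<!-- core-site.xml: point all DFSClients at the DNS round-robin name -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://log-server.isilon.com:8020/</value>
  </property>
</configuration>
```

With a round-robin DNS record behind that hostname, each DFSClient resolves to a different OneFS node, spreading metadata load without any client-side changes.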
DataNode References DataNode references are returned by the NameNode All OneFS DataNodes can access the same data Each LocatedBlock for reads carries 3 DN refs Round-robin across available nodes Multiple refs let the client try other nodes before coming back to the NameNode again Write path: only 1 reference per logical block No need for the client to replicate writes 29
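The value of handing out multiple references is client-side failover: the client can walk the list before falling back to the NameNode. A minimal sketch of that behavior, with hypothetical helper names:

```python
# Sketch of read failover across the DN references in one LocatedBlock.
# `located.locations` is the (up to three) references from the NameNode;
# `connect_to_datanode(address)` returns a connection with .op_read(block).
def read_block_with_failover(located, connect_to_datanode):
    last_error = None
    for dn_info in located.locations:         # round-robin list from the NN
        try:
            return connect_to_datanode(dn_info.address).op_read(located.block)
        except ConnectionError as exc:
            last_error = exc                  # this node failed; try the next ref
    # All references failed: surface the error so the client can re-ask the NN.
    raise last_error or ConnectionError("no DataNode references available")
```

Since every OneFS node can serve the block, each retry targets identical data rather than a replica, so no reference is "better" except by network locality.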
A Few Quirks User/group identities are strings OneFS natively stores UIDs or SIDs only, requiring name resolution on access Locking! HDFS uses leases to restrict files to a single writer Leases are implemented, but without cross-protocol contention Hadoop apps don't expect files to move Caveat emptor when mixing protocols 30
It Works! [Screenshot: a directory listing seen over NFS, and the same directory seen over HDFS] 31
Conclusions The HDFS protocol maps to POSIX fairly easily Not all traditional shared storage is bad for Hadoop workflows Locality features are worth preserving even when node locality isn't possible Interoperability can unlock novel workflows 32
Questions? jeff.hughes@isilon.com Special thanks to Conrad Meyer! 33