Distributed File Systems

Alemnew Sheferaw Asrese
University of Trento - Italy
December 12, 2012

Acknowledgement: Mauro Fruet

Alemnew S. Asrese (UniTN), Distributed File Systems, 2012/12/12
Outline

1. Distributed File Systems
   - Introduction
   - Most Important DFS
2. The Google File System (GFS)
   - Introduction
   - Design Overview
   - System Interactions
   - Master Operation
   - Fault Tolerance and Diagnosis
3. The Hadoop Distributed File System (HDFS)
   - Introduction
   - Architecture
   - File I/O Operations and Replica Management
   - Writing and Reading Data
   - Replica Placement
4. Conclusions
5. Bibliography
Distributed File Systems: Introduction

Definition of DFS
- A file system that allows access to files from multiple hosts via a computer network

Features
- Files and storage resources are shared across multiple machines
- Clients have no direct access to the underlying data storage
- Clients interact with servers using a protocol
- Access to the file system can be restricted
DFS Goals
- Access transparency
- Location transparency
- Concurrent file updates
- File replication
- Hardware and software heterogeneity
- Fault tolerance
- Consistency
- Security
- Efficiency
Most Important DFS

NFS (Network File System), developed by Sun Microsystems in 1984
- Small number of clients, very small number of servers
- Any machine can be a client and/or a server
- Stateless file server, with few exceptions (e.g. file locking)
- High performance

AFS (Andrew File System), developed at Carnegie Mellon University
- Supports sharing on a large scale (up to 10,000 users)
- Client machines have disks and cache files for long periods
- Most files are small
- Reads are more common than writes
- Most files are read/written by a single user
Need for Large Streams of Data

Standard DFS are not adequate for managing large streams of data:
- Google: 24 PB/day
- Facebook: 130 TB/day of logs, 500+ TB/day of new data, 300 M photos uploaded per day, 5,040 TB scanned in Hive per day
- LHC-CERN: 25 PB/year

Such workloads need different assumptions and goals. Proposed solutions:
- The Google File System (GFS)
- The Hadoop Distributed File System (HDFS), by the Apache Software Foundation
The Google File System (GFS) [1]
- Proprietary DFS; only the key ideas are publicly available
- Developed and implemented by Google, described in a 2003 paper
- Shares many goals with standard DFS (performance, scalability, reliability, availability, ...)
- Designed to manage data-intensive applications
- Provides fault tolerance and high performance
GFS - Motivations
- Component failures are the norm: a storage cluster is built from hundreds or thousands of inexpensive commodity servers
- Constant monitoring, error detection, fault tolerance, and automatic recovery are needed
- Huge files (multi-GB) containing many application objects
- New data is appended rather than overwriting existing data
- Co-designing the applications and the file system API increases flexibility
Design Overview: Assumptions
- The system is composed of inexpensive commodity components that often fail
- A modest number of huge files (multi-GB)
- Large streaming reads and small random reads
- Sequential writes that append to files rather than overwrite
- Files are seldom modified again
- Well-defined semantics for concurrent appends to the same file by multiple clients
- High sustained bandwidth is more important than low latency
GFS Interface
- Familiar file system interface, but does not implement a standard API such as POSIX
- Supports the common operations: create, delete, open, close, read, and write files
- Two additional operations: snapshot and record append
Architecture

Figure: GFS Architecture
Architecture
- The master holds the file system metadata
- The master controls:
  - Chunk lease management
  - Garbage collection of orphaned chunks
  - Chunk migration
- Periodic HeartBeat messages are exchanged between the master and chunkservers to check each chunkserver's state
Single Master
- Enables sophisticated chunk placement and replication decisions
- Must minimize its involvement in reads and writes
- Clients never read or write file data through the master
- A client asks the master which chunkservers it should contact
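A minimal sketch (not Google's code) of the read path described above: the client turns a byte offset into a chunk index, asks the master for that chunk's location, then reads directly from a chunkserver. Names (`Master`, `client_read`, the table layout) are illustrative assumptions.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

def locate(offset):
    """Chunk index and offset within that chunk for a file byte offset."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

class Master:
    """Metadata only: (filename, chunk index) -> (chunk handle, replica locations)."""
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table

    def lookup(self, filename, chunk_index):
        return self.chunk_table[(filename, chunk_index)]

def client_read(master, filename, offset):
    chunk_index, chunk_offset = locate(offset)
    handle, replicas = master.lookup(filename, chunk_index)
    # The client now contacts a replica directly for the data;
    # file data never flows through the master.
    return handle, replicas[0], chunk_offset
```

The master answers one small metadata query per chunk; all bulk data moves between client and chunkservers, which is what keeps the single master from becoming a data-path bottleneck.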
Single Master (cont.)

Problems
- Bandwidth bottleneck
- A single point of failure

Opportunities
- A central place to control replication, garbage collection, and many other activities

Solution
- Around 2009, a distributed master was proposed and implemented to replace the single-master scheme; it reportedly relies on BigTable to cope with the growing file count. Details are not publicly available.
ChunkServer
- Large chunk size: 64 MB, much larger than typical file system block sizes

Pros
- Reduces client-master interaction
- Reduces network overhead by keeping a persistent TCP connection to the chunkserver
- Reduces the size of the metadata stored on the master

Cons
- Internal fragmentation (solution: lazy space allocation)
- Large chunks can become hot spots (solutions: higher replication factor, staggering application start times, client-to-client communication)
Metadata

All metadata is kept in the master's main memory:
- Namespace
- File-to-chunk mapping
- Locations of chunk replicas
- Access control information

Persistence
- Namespace and file-to-chunk mapping are stored persistently: mutations are logged to an operation log
- There is no persistent record of replica locations

Operation Log
- Contains critical metadata changes
- Defines the order of concurrent operations
- Replicated on multiple remote machines
- Used to recover the master

Checkpointing, when the log grows beyond a certain size:
- The master switches to a new log file
- A checkpoint is created containing all mutations before the switch
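A toy model of the recovery path implied above: the master rebuilds its in-memory namespace by loading the last checkpoint and replaying the operation log in logged order. The record format (`create` / `append_chunk` / `delete` tuples) is invented for illustration.

```python
def recover(checkpoint, log):
    """checkpoint: {filename: [chunk handles]}; log: list of (op, filename, arg)."""
    namespace = dict(checkpoint)          # start from the persisted snapshot
    for op, fname, arg in log:            # replay mutations in logged order
        if op == "create":
            namespace[fname] = []
        elif op == "append_chunk":
            namespace[fname].append(arg)
        elif op == "delete":
            namespace.pop(fname, None)
    return namespace
```

Because the log defines a total order of metadata mutations, replaying it on any checkpoint yields the same final state, which is why the log must be flushed to multiple remote machines before a mutation is acknowledged.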
Consistency Model
- File namespace mutations are atomic: namespace locking guarantees atomicity and correctness
- Mutations are applied in the same order on all replicas
- Chunk version numbers detect stale replicas (those that missed mutations while their chunkserver was down)
- Stale replicas are garbage collected
- Checksums detect data corruption
- A file region is consistent if all clients always see the same data, regardless of which replicas they read from
System Interactions: Leases and Mutation Order
- The master uses leases to maintain a consistent mutation order across replicas
- The master grants a chunk lease to one replica, called the primary
- The primary chooses a serial order for all mutations to the chunk
- Global order: first by lease grant order (defined by the master), then by the serial numbers assigned by the primary within a lease
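A sketch of the ordering rule above, with invented class and function names: the lease-holding primary assigns serial numbers to concurrent mutations, and every replica applies them in serial-number order, so all replicas converge to the same state even if the mutations arrive out of order.

```python
class Primary:
    """The replica holding the chunk lease; it alone picks the mutation order."""
    def __init__(self):
        self.next_serial = 1

    def assign_order(self, mutations):
        """Assign consecutive serial numbers to a batch of concurrent mutations."""
        ordered = [(self.next_serial + i, m) for i, m in enumerate(mutations)]
        self.next_serial += len(mutations)
        return ordered

def apply_mutations(replica_state, ordered_mutations):
    """Replicas may receive mutations in any order but apply by serial number."""
    for _, data in sorted(ordered_mutations):
        replica_state.append(data)
    return replica_state
```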
Control Flow of a Write

Figure: Write Control and Data Flow
Data Flow
- The flow of data is decoupled from the flow of control

Goals:
1. Fully use each machine's outgoing bandwidth: data is pushed along a chain of chunkservers
2. Avoid network bottlenecks and high-latency links: each machine forwards the data to the closest machine
3. Minimize latency: once a chunkserver receives data, it starts forwarding immediately (pipelining)
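The pipelining argument above can be put as arithmetic (the B/T + R*L estimate is from the GFS paper): with immediate forwarding, pushing B bytes through R chunkservers costs roughly one full transfer plus a small per-hop delay, instead of R full transfers under store-and-forward.

```python
def pipelined_time(b, r, throughput, hop_latency):
    """Ideal elapsed time B/T + R*L when each hop forwards as bytes arrive."""
    return b / throughput + r * hop_latency

def store_and_forward_time(b, r, throughput, hop_latency):
    """Each hop waits for the whole payload before forwarding it on."""
    return r * (b / throughput + hop_latency)
```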
Atomic Record Appends

Traditional write
- The client specifies the offset
- Concurrent writes to the same region are not serializable

Record append
- The client specifies only the data
- GFS appends it to the file at least once atomically, at an offset of GFS's choosing, and returns that offset to the client
- Used by distributed applications in which many clients on different machines append to the same file concurrently
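A sketch of the at-least-once append semantics: the server, not the client, picks the offset, and a record that would cross a chunk boundary is rejected so the client retries on the next chunk (which is what can leave padding or duplicates for readers to tolerate). The chunk size is shrunk here for illustration.

```python
CHUNK_SIZE = 16  # real GFS chunks are 64 MB

def record_append(chunk, record):
    """Append `record` at a server-chosen offset; return that offset, or
    None when the record does not fit (client retries on the next chunk)."""
    if len(chunk) + len(record) > CHUNK_SIZE:
        return None  # this chunk is padded/sealed; retry elsewhere
    offset = len(chunk)
    chunk.extend(record)
    return offset
```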
Snapshot

Used to:
1. Quickly create copies of huge data sets
2. Easily checkpoint the current state

When the master receives a snapshot request, it:
1. Revokes outstanding leases on the chunks of the files it is about to snapshot
2. Logs the operation to disk
3. Duplicates the corresponding metadata

Copy-on-write: the new snapshot file initially points to the same chunk C as the source file; a new chunk C' is created on the first write after the snapshot
Master Operation: Namespace Management and Locking
- No per-directory data structure that lists all files in the directory
- No aliases for the same file (no hard or symbolic links)
- The namespace maps full pathnames to metadata
- Each master operation acquires a set of locks before it runs: read locks on the internal nodes (path prefixes) and a read or write lock on the leaf
- Locking allows concurrent mutations in the same directory
- A read lock on a directory prevents its deletion, renaming, or snapshot
- Write locks on file names serialize attempts to create a file with the same name twice
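A sketch of the locking scheme just described: for an operation on /d1/d2/leaf, the master takes read locks on /d1 and /d1/d2 and a read or write lock on the full pathname. The function name is ours.

```python
def locks_for(path, write_leaf=True):
    """Lock set for a master operation on `path`: read locks on every
    ancestor prefix, read/write lock on the leaf pathname."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]
    locks = [(p, "read") for p in prefixes[:-1]]
    locks.append((prefixes[-1], "write" if write_leaf else "read"))
    return locks
```

Two concurrent file creations in the same directory only take read locks on the directory prefix, so they proceed in parallel; they would conflict only on an identical leaf name.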
Replica Placement
- Hundreds of chunkservers spread across many racks, accessed from hundreds of clients

Goals:
- Maximize data reliability and availability
- Maximize network bandwidth utilization
- Replicas are therefore spread across machines and across racks
Creation, Re-replication and Rebalancing

Creation
- Place new replicas on chunkservers with below-average disk usage
- Limit the number of recent creations on each chunkserver

Re-replication
- Triggered when the number of available replicas falls below a user-specified goal
- Prioritization: based on how far a chunk is from its replication goal, live files before recently deleted files, ...

Rebalancing
- Performed periodically, for better disk utilization and load balancing
- The distribution of replicas is analyzed and replicas are moved accordingly
Garbage Collection
- File deletion is logged by the master
- The file is renamed to a hidden name with a deletion timestamp instead of reclaiming its resources immediately
- The master regularly scans and removes hidden files older than 3 days (configurable)
- Until then, a hidden file can still be read and undeleted
- When a hidden file is removed, its in-memory metadata is erased
- Orphaned chunks are identified and removed during a regular scan of the chunk namespace
- Provides safety against accidental, irreversible deletion
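A toy version of the lazy deletion above: rename to a hidden name carrying the deletion time, then reclaim on a later scan. The `.deleted.` naming convention is invented; only the 3-day default matches the text.

```python
GRACE_SECONDS = 3 * 24 * 3600  # 3 days, configurable

def hide(namespace, fname, now):
    """Deletion step: rename to a hidden name stamped with the deletion time."""
    namespace[f".deleted.{fname}.{int(now)}"] = namespace.pop(fname)

def scan(namespace, now):
    """Regular scan: reclaim hidden files past the grace period."""
    for name in list(namespace):
        if name.startswith(".deleted."):
            deleted_at = int(name.rsplit(".", 1)[1])
            if now - deleted_at > GRACE_SECONDS:
                del namespace[name]  # metadata gone; chunks become orphans
```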
Stale Replica Detection
- Need to distinguish between up-to-date and stale replicas
- Chunk version number:
  - Increased whenever the master grants a new lease on the chunk
  - Not increased on a replica that is unavailable at that time
- Stale replicas are deleted by the master during regular garbage collection
Fault Tolerance and Diagnosis: High Availability

Fast recovery
- The master and chunkservers restore their state and start in seconds, no matter how they terminated

Chunk replication
- Each chunk is replicated on multiple chunkservers on different racks

Master replication
- The master state is replicated on multiple machines for reliability
- When the master fails: it can restart almost instantly, or a new master process is started elsewhere
- A shadow (not mirror) master provides read-only access to the file system while the primary master is down
Data Integrity
- Checksumming detects corruption of stored data
- Integrity is verified by each chunkserver independently
- Corruption is therefore not propagated to other chunkservers
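A sketch of independent per-block verification; CRC32 over 64 KB blocks is our illustrative choice of checksum, not necessarily the exact scheme GFS uses.

```python
import zlib

BLOCK = 64 * 1024  # checksum granularity

def checksums(data):
    """One CRC32 per 64 KB block of the chunk."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data, stored):
    """Each chunkserver checks its own copy before serving it to a reader,
    so a corrupt replica is never propagated to clients or other servers."""
    return checksums(data) == stored
```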
Diagnostic Tools
- Extensive and detailed diagnostic logging
- Diagnostic logs record many significant events (chunkservers going up and down, RPC requests and replies, ...)
- Logs can be deleted without affecting the system

Reference: "GFS: Evolution on Fast-forward", a discussion between Kirk McKusick and Sean Quinlan; ACM Queue, 2009
What is Hadoop?
- An Apache project
- Yahoo! developed and contributed about 80% of the core of Hadoop (HDFS and MapReduce)
- Components:
  - HBase - originally by Powerset, then Microsoft
  - Hive - Facebook
  - Pig, ZooKeeper, Chukwa - Yahoo!
  - Avro - originally Yahoo!, co-developed with Cloudera
- Uses the MapReduce distributed computation framework
The Hadoop Distributed File System (HDFS) [2]
- Sub-project of the Apache Hadoop project
- Designed to store very large data sets, stored separately:
  - Metadata, on a dedicated server called the NameNode
  - Application data, on DataNodes
- Creates multiple replicas of data blocks and distributes them across nodes
- Highly fault-tolerant; designed to run on commodity hardware
- Suitable for applications with large data sets
Architecture

Figure: HDFS Architecture
NameNode
- The namespace is a hierarchy of files and directories
- Files and directories are represented on the NameNode by inodes
- Maintains the namespace tree and the mapping of file blocks to DataNodes
- Handles authorization and authentication
- A multithreaded system that processes requests simultaneously from multiple clients
- Keeps the entire namespace in RAM, allowing fast access to the metadata
DataNodes
- Serve read and write requests from the file system's clients
- Perform block creation, deletion, and replication upon instruction from the NameNode
- Perform a handshake with the NameNode during startup:
  - Verifies the namespace ID and the software version of the DataNode
  - Nodes with a different namespace ID cannot join the cluster
  - Incompatible versions may cause data corruption or loss
- Send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available
HDFS Client

Figure: HDFS client, NameNode, and DataNode interaction when creating a file
Image and Journal

Image
- The file system metadata that describes the organization of application data as directories and files
- A persistent record of the image written to disk is called a checkpoint

Journal
- A write-ahead commit log for changes to the file system that must be persistent
- During startup, the NameNode initializes the namespace image from the checkpoint and then replays changes from the journal
- A new checkpoint and an empty journal are written back to the storage directories before the NameNode starts serving clients
CheckpointNode
- Periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal
- Usually runs on a different host than the NameNode
- Downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint to the NameNode
- Multiple CheckpointNodes may be used simultaneously
BackupNode
- Creates periodic checkpoints, like the CheckpointNode, but applies edits to its own copy of the namespace
  - Without downloading checkpoint and journal files from the active NameNode
- In addition, it maintains an in-memory, up-to-date image of the file system namespace, always synchronized with the state of the NameNode
- Can be viewed as a read-only NameNode
File I/O Operations and Replica Management: File Read and Write, Block Placement

File read and write
- HDFS implements a single-writer, multiple-reader model

Block placement
- Nodes are spread across multiple racks
- Replicas are placed as follows:
  - The first on the node where the writer is located
  - The second and the third on two different nodes in a different rack
  - The others on random nodes, subject to two constraints:
    1. No DataNode holds more than one replica of any block
    2. No rack holds more than two replicas of the same block
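A sketch of the default placement for a replication factor of 3, under the two constraints above. The `racks` mapping and all node names are invented for illustration.

```python
import random

def place_three(writer_node, racks, rng=random):
    """racks: {rack name: [DataNodes]}. Returns three replica locations."""
    node_rack = {n: r for r, nodes in racks.items() for n in nodes}
    first = writer_node                                 # writer's own node
    remote = [r for r in racks if r != node_rack[writer_node]]
    rack2 = rng.choice(remote)                          # one remote rack
    second, third = rng.sample(racks[rack2], 2)         # two distinct nodes there
    return [first, second, third]
```

Putting the second and third replicas in a single remote rack trades a little reliability (two racks instead of three) for much less cross-rack write traffic.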
Replication Management
- The NameNode detects when a block has become under- or over-replicated
- When a block is over-replicated, the NameNode chooses a replica to remove
- When a block is under-replicated, it is put in the replication priority queue:
  - A block with only one replica has the highest priority
  - A block with a number of replicas greater than two thirds of its replication factor has the lowest priority
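The priority rules above, expressed as a comparable key (lower value = more urgent). Only the two extremes come from the text; the middle tier and the numeric values are our assumption.

```python
def replication_priority(live, factor):
    """Priority of an under-replicated block with `live` remaining replicas."""
    if live == 1:
        return 0                       # highest: a single copy left
    if live > 2 * factor / 3:
        return 2                       # lowest: above two thirds of target
    return 1                           # everything in between
```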
Inter-Cluster Data Copy
- A tool called DistCp is used for large inter/intra-cluster parallel copying
- Implemented as a MapReduce job: each map task copies a portion of the source data into the destination file system
Writing Data in an HDFS Cluster

Figure: writing data in an HDFS cluster
Reading Data in an HDFS Cluster

Figure: reading data in an HDFS cluster
Types of Faults and Their Detection

Figure: types of faults and their detection
Handling Reading and Writing Failures

Figure: handling reading and writing failures
Handling DataNode Failures

Figure: handling DataNode failures
Replica Placement Strategy

Figure: replica placement strategy
Cluster Rebalancing
- A balanced cluster has no under-utilized or over-utilized DataNodes
- A common cause of imbalance: the addition of new DataNodes

Requirements
- No block loss
- No change in the number of replicas
- No reduction in the number of racks a block resides in
- No saturation of the network

Operation
- Rebalancing is issued by an administrator
- Rebalancing decisions are made by a Rebalancing Server
- Goal of each iteration: reduce the imbalance of over- and under-utilized DataNodes
- Source and destination selection is based on priorities
Balancer
- A tool that balances disk space usage on an HDFS cluster
- Iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization

Reference: http://hadoop.apache.org/docs/hdfs
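A sketch of one balancing iteration: compare each DataNode's utilization with the cluster average and pick move candidates outside a threshold band. The function name and the 10% default threshold are our assumptions.

```python
def balance_candidates(used, capacity, threshold=0.10):
    """used/capacity: {DataNode: bytes}. Returns (sources, targets)."""
    avg = sum(used.values()) / sum(capacity.values())   # cluster utilization
    util = {n: used[n] / capacity[n] for n in used}
    sources = sorted(n for n in util if util[n] > avg + threshold)
    targets = sorted(n for n in util if util[n] < avg - threshold)
    return sources, targets   # replicas then move from sources to targets
```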
Conclusions

GFS
- Proprietary file system; all publicly known details are in the 2003 paper

HDFS
- Open-source project, inspired by GFS
- Details are available in the 2010 paper and on the Hadoop website
- Used by many big companies
Bibliography

[1] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, pages 29-43, New York, NY, USA, 2003. ACM.

[2] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST '10, pages 1-10, Washington, DC, USA, 2010. IEEE Computer Society.