The Google File System

Similar documents

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

The Google File System

Distributed File Systems

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File Systems

RAID. Tiffany Yu-Han Chen. # The performance of different RAID levels # read/write/reliability (fault-tolerant)/overhead

Google File System. Web and scalability

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Distributed File Systems

Massive Data Storage

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

THE HADOOP DISTRIBUTED FILE SYSTEM

Big Data Processing in the Cloud. Shadi Ibrahim Inria, Rennes - Bretagne Atlantique Research Center

Cloud Computing at Google. Architecture

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

Seminar Presentation for ECE 658 Instructed by: Prof.Anura Jayasumana Distributed File Systems

Parallel Processing of cluster by Map Reduce

Data-Intensive Computing with Map-Reduce and Hadoop

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

Hadoop Architecture. Part 1

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

The Hadoop Distributed File System

Cloud Scale Storage Systems. Sean Ogden October 30, 2013

Hadoop Distributed File System (HDFS) Overview

Hadoop Distributed File System. Dhruba Borthakur June, 2007

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2

Comparative analysis of Google File System and Hadoop Distributed File System

Accelerating and Simplifying Apache

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Distributed Filesystems

Quantcast Petabyte Storage at Half Price with QFS!

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Design and Evolution of the Apache Hadoop File System(HDFS)

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Apache Hadoop. Alexandru Costan

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

The Hadoop Distributed File System

International Journal of Advance Research in Computer Science and Management Studies

MapReduce (in the cloud)

Big Table A Distributed Storage System For Data

- Behind The Cloud -

Research on Reliability of Hadoop Distributed File System

Hadoop & its Usage at Facebook

Architecture for Hadoop Distributed File Systems

A programming model in Cloud: MapReduce

Algorithms and Methods for Distributed Storage Networks 8 Storage Virtualization and DHT Christian Schindelhauer

Sunita Suralkar, Ashwini Mujumdar, Gayatri Masiwal, Manasi Kulkarni Department of Computer Technology, Veermata Jijabai Technological Institute

Hypertable Architecture Overview

Finding a needle in Haystack: Facebook s photo storage IBM Haifa Research Storage Systems

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Hadoop & its Usage at Facebook

Distributed Storage Networks and Computer Forensics

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Implementation Issues of A Cloud Computing Platform

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

CLOUD- SCALE FILE SYSTEMS THANKS TO M. GROSSNIKLAUS

Technical Overview Simple, Scalable, Object Storage Software

CSE-E5430 Scalable Cloud Computing Lecture 2

Christian Schindelhauer Technical Faculty Computer-Networks and Telematics University of Freiburg

HDFS Under the Hood. Sanjay Radia. Grid Computing, Hadoop Yahoo Inc.

Hadoop IST 734 SS CHUNG

HDFS Users Guide. Table of contents

MinCopysets: Derandomizing Replication In Cloud Storage

Apache Hadoop FileSystem and its Usage in Facebook

Reduction of Data at Namenode in HDFS using harballing Technique

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Big Data Analytics. Lucas Rego Drumond

Parallels Cloud Storage

SDFS Overview. By Sam Silverberg

EMC Virtual Infrastructure for Microsoft Applications Data Center Solution

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Configuring Apache Derby for Performance and Durability Olav Sandstå

Scalable Multi-Node Event Logging System for Ba Bar

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Application Programming

HDFS Space Consolidation

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Scala Storage Scale-Out Clustered Storage White Paper

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.

Map Reduce / Hadoop / HDFS

Big Data With Hadoop

Processing of Hadoop using Highly Available NameNode

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Hadoop Scalability at Facebook. Dmytro Molkov YaC, Moscow, September 19, 2011

Distributed Systems (CS236351) Exercise 3

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

CitusDB Architecture for Real-Time Big Data

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

Fast Crash Recovery in RAMCloud

HDFS and Availability Data Retragement

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

CLOUD scale storage Anwitaman DATTA SCE, NTU Singapore CE 7490 ADVANCED TOPICS IN DISTRIBUTED SYSTEMS

Transcription:

The Google File System By Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Presented at SOSP 2003) Introduction Google search engine. Applications process lots of data. Need good file system. Solution: Google File System (GFS).

Motivational Facts More than 15,000 commodity-class PC's. Multiple clusters distributed worldwide. Thousands of queries served per second. One query reads 100's of MB of data. One query consumes 10's of billions of CPU cycles. Google stores dozens of copies of the entire Web! Conclusion: Need large, distributed, highly faulttolerant file system. Topics Design Motivations Architecture Read/Write/Record Append Fault-Tolerance Performance Results

Design Motivations 1. Fault-tolerance and auto-recovery need to be built into the system. 2. Standard I/O assumptions (e.g. block size) have to be re-examined. 3. Record appends are the prevalent form of writing. 4. Google applications and GFS should be codesigned. GFS Architecture (Analogy) On a single-machine FS: An upper layer maintains the metadata. A lower layer (i.e. disk) stores the data in units called blocks. Upper layer store In the GFS: A master process maintains the metadata. A lower layer (i.e. a set of chunkservers) stores the data in units called chunks.

GFS Architecture Client (request for metadata) (metadata reponse) ( read/write request) Master Chunkserver Metadata Chunkserver ( read/write response) Linux FS Linux FS GFS Architecture What is a chunk? Analogous to block, except larger. Size: 64 MB! Stored on chunkserver as file Chunk handle (~ chunk file name) used to reference chunk. Chunk replicated across multiple chunkservers Note: There are hundreds of chunkservers in a GFS cluster distributed over multiple racks.

GFS Architecture What is a master? A single process running on a separate machine. Stores all metadata: File namespace File to chunk mappings Chunk location information Access control information Chunk version numbers Etc. GFS Architecture Master <-> Chunkserver Communication: Master and chunkserver communicate regularly to obtain state: Is chunkserver down? Are there disk failures on chunkserver? Are any replicas corrupted? Which chunk replicas does chunkserver store? Master sends instructions to chunkserver: Delete existing chunk. Create new chunk.

GFS Architecture Serving Requests: Client retrieves metadata for operation from master. Read/Write data flows between client and chunkserver. Single master is not bottleneck, because its involvement with read/write operations is minimized. Overview Design Motivations Architecture Master Chunkservers Clients Read/Write/Record Append Fault-Tolerance Performance Results

And now for the Meat Read Algorithm Application 1 (file name, byte range) GFS Client 2 (file name, chunk index) (chunk handle, replica locations) 3 Master

Read Algorithm Chunk Server Application 4 6 (data from file) (chunk handle, byte range) Chunk Server GFS Client (data from file) 5 Chunk Server Read Algorithm 1. Application originates the read request. 2. GFS client translates the request from (filename, byte range) -> (filename, chunk index), and sends it to master. 3. Master responds with chunk handle and replica locations (i.e. chunkservers where the replicas are stored). 4. Client picks a location and sends the (chunk handle, byte range) request to that location. 5. Chunkserver sends requested data to the client. 6. Client forwards the data to the application.

Read Algorithm (Example) Indexer 1 (crawl_99, 2048 bytes) GFS Client 2 (crawl_99, index: 3) (ch_1003, {chunkservers: 4,7,9}) Master crawl_99 Ch_1001 {3,8,12} Ch_1002 {1,8,14} Ch_1003 {4,7,9} 3 Read Algorithm (Example) Calculating chunk index from byte range: (Assumption: File position is 201,359,161 bytes) Chunk size = 64 MB. 64 MB = 1024 *1024 * 64 bytes = 67,108,864 bytes. 201,359,161 bytes = 67,108,864 * 2 + 32,569 bytes. So, client translates 2048 byte range -> chunk index 3.

Read Algorithm (Example) 6 Application (2048 bytes of data) 4 (ch_1003, {chunkservers: 4,7,9}) Chunk Server #4 Chunk Server #7 GFS Client (2048 bytes of data) Chunk Server #9 5 Write Algorithm Application 1 2 (file name, data) GFS Client (file name, chunk index) (chunk handle, primary and secondary replica locations) 3 Master

Write Algorithm Primary Buffer Chunk Application (Data) (Data) Secondary Buffer Chunk GFS Client (Data) Secondary Buffer Chunk 4 Write Algorithm Application (Write command) 5 Primary Secondary D1 D2 D3 D4 D1 D2 D3 D4 (write command, serial order) 6 7 Chunk Chunk GFS Client Secondary D1 D2 D3 D4 Chunk

Write Algorithm Application 9 (response) Primary (empty) Chunk 8 Secondary (empty) Chunk (response) GFS Client Secondary (empty) Chunk Write Algorithm 1. Application originates write request. 2. GFS client translates request from (filename, data) -> (filename, chunk index), and sends it to master. 3. Master responds with chunk handle and (primary + secondary) replica locations. 4. Client pushes write data to all locations. Data is stored in chunkservers internal buffers. 5. Client sends write command to primary.

Write Algorithm 6. Primary determines serial order for data instances stored in its buffer and writes the instances in that order to the chunk. 7. Primary sends serial order to the secondaries and tells them to perform the write. 8. Secondaries respond to the primary. 9. Primary responds back to client. Note: If write fails at one of chunkservers, client is informed and retries the write. Record Append Algorithm Important operation at Google: Merging results from multiple machines in one file. Using file as producer - consumer queue. 1. Application originates record append request. 2. GFS client translates request and sends it to master. 3. Master responds with chunk handle and (primary + secondary) replica locations. 4. Client pushes write data to all locations.

Record Append Algorithm 5. Primary checks if record fits in specified chunk. 6. If record does not fit, then the primary: pads the chunk, tells secondaries to do the same, and informs the client. Client then retries the append with the next chunk. 7. If record fits, then the primary: appends the record, tells secondaries to do the same, receives responses from secondaries, and sends final response to the client. Observations Clients can read in parallel. Clients can write in parallel. Clients can append records in parallel.

Overview Design Motivations Architecture Algorithms: Read Write Record Append Fault-Tolerance Performance Results Fault Tolerance Fast Recovery: master and chunkservers are designed to restart and restore state in a few seconds. Chunk Replication: across multiple machines, across multiple racks. Master Mechanisms: Log of all changes made to metadata. Periodic checkpoints of the log. Log and checkpoints replicated on multiple machines. Master state is replicated on multiple machines. Shadow masters for reading data if real master is down. Data integrity: Each chunk has an associated checksum.

Performance (Test Cluster) Performance measured on cluster with: 1 master 16 chunkservers 16 clients Server machines connected to central switch by 100 Mbps Ethernet. Same for client machines. Switches connected with 1 Gbps link. Performance (Test Cluster)

Performance (Test Cluster) Performance (Real-world Cluster) Cluster A: Used for research and development. Used by over a hundred engineers. Typical task initiated by user and runs for a few hours. Task reads MB s-tb s of data, transforms/analyzes the data, and writes results back. Cluster B: Used for production data processing. Typical task runs much longer than a Cluster A task. Continuously generates and processes multi-tb data sets. Human users rarely involved. Clusters had been running for about a week when measurements were taken.

Performance (Real-world Cluster) Performance (Real-world Cluster) Many computers at each cluster (227, 342!) On average, cluster B file size is triple cluster A file size. Metadata at chunkservers: Chunk checksums. Chunk Version numbers. Metadata at master is small (48, 60 MB) -> master recovers from crash within seconds.

Performance (Real-world Cluster) Performance (Real-world Cluster) Many more reads than writes. Both clusters were in the middle of heavy read activity. Cluster B was in the middle of a burst of write activity. In both clusters, master was receiving 200-500 operations per second -> master is not a bottleneck.

Performance (Real-world Cluster) Experiment in recovery time: One chunkserver in Cluster B killed. Chunkserver has 15,000 chunks containing 600 GB of data. Limits imposed: Cluster can only perform 91 concurrent clonings. Each clone operation can consume at most 6.25 MB/s. Took 23.2 minutes to restore all the chunks. This is 440 MB/s. Conclusion Design Motivations Architecture Algorithms: Fault-Tolerance Performance Results