Lecture 5: GFS & HDFS. Claudia Hauff (Web Information Systems), ti2736b-ewi@tudelft.nl




Big Data Processing, 2014/15 — Lecture 5: GFS & HDFS. Claudia Hauff (Web Information Systems), ti2736b-ewi@tudelft.nl

Course content:
- Introduction
- Data streams 1 & 2
- The MapReduce paradigm
- Looking behind the scenes of MapReduce: HDFS & Scheduling
- Algorithm design for MapReduce
- A high-level language for MapReduce: Pig 1 & 2
- MapReduce is not a database, but HBase nearly is
- Let's iterate a bit: Graph algorithms & Giraph
- How does all of this work together? ZooKeeper/YARN

Learning objectives:
- Explain the design considerations behind GFS/HDFS
- Explain the basic procedures for data replication, recovery from failure, reading and writing
- Design alternative strategies to handle the issues GFS/HDFS was created for
- Decide whether GFS/HDFS is a good fit given a usage scenario

The Google File System. Hadoop is heavily inspired by it. It is one way (not the way) to design a distributed file system; there are always alternatives.

MapReduce & Hadoop. "MapReduce is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers." (Jimmy Lin) Hadoop is an open-source implementation of the MapReduce framework; HDFS is the open-source counterpart of GFS.

History of MapReduce (and GFS). Developed by researchers at Google around 2003, built on principles in parallel and distributed processing. Seminal papers:
- "The Google File System" by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (2003)
- "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat (2004)
- "The Hadoop Distributed File System" by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (2010)

What is a file system? File systems determine how data is stored and retrieved. Distributed file systems manage the storage across a network of machines, with added complexity due to the network. GFS/HDFS are distributed file systems. Question: which file systems do you know?

GFS Assumptions, based on Google's main use cases:
- Hardware failures are common (commodity hardware)
- Files are large (GB/TB) and their number is limited (millions, not billions)
- Two main types of reads: large streaming reads and small random reads
- Workloads with sequential writes that append data to files
- Once written, files are seldom modified (apart from appends); random modification in files is possible, but not efficient in GFS
- High sustained bandwidth trumps low latency

Question: which of the following scenarios fulfil the brief?
- A global company dealing with the data of its 100 million employees (salary, bonuses, age, performance, etc.)
- A search engine's query log (a record of what kind of search requests people make)
- A hospital's medical imaging data generated from an MRI scan
- Data sent by the Hubble telescope
- A search engine's index (used to serve search results to users)

Disclaimer. GFS/HDFS are not a good fit for:
- Low-latency data access (in the milliseconds range). Solution: use HBase instead [later in the course]
- Many small files. Solution: *.har, *.warc [later in this lecture]
- Constantly changing data
Note that not all details of GFS are public knowledge.

Question: how would you design a distributed file system? How does an application write data to the cluster, and how does it read data from it? (Figure: an application talking to our cluster.) Source: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

GFS architecture (remember: one way, not the way). Several clients, a single master (metadata) and several data servers; all are user-level processes and can run on the same physical machine. Data does not flow across the GFS master. Source: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

GFS: Files

Files on GFS:
- A single file can contain many objects (e.g. Web documents)
- Files are divided into fixed-size chunks (64MB) with unique 64-bit identifiers, assigned by the GFS master at chunk creation time. Question: what does this mean for the maximum allowed file size in a cluster?
- Chunkservers store chunks on local disk as normal Linux files
- Reading & writing of data is specified by the tuple (chunk_handle, byte_range). Question: what is the purpose of byte_range?
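As a back-of-the-envelope answer to the file-size question, the 64-bit handle space can be plugged into a short calculation. This is illustrative arithmetic, not a figure from the GFS paper; in practice the master's memory limits the chunk count long before the handle space runs out.

```python
# Theoretical capacity implied by 64-bit chunk handles and 64 MB chunks.
CHUNK_SIZE = 64 * 1024**2          # 64 MB per chunk
HANDLE_BITS = 64                   # unique 64-bit chunk identifiers

max_chunks = 2**HANDLE_BITS        # distinct chunk handles
max_bytes = max_chunks * CHUNK_SIZE

# The handle space is astronomically large; the practical limit is the
# master's RAM, which must hold metadata for every chunk in the cluster.
print(f"theoretical capacity: 2^{HANDLE_BITS} chunks x 64 MB = {max_bytes:.3e} bytes")
```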


Master. Files are replicated (by default 3 times) across the chunkservers. The master maintains all file system metadata: namespaces, access control information, the mapping from files to chunks, chunk locations, garbage collection of orphaned chunks, chunk migration — distributed systems are complex! Heartbeat messages flow between master and chunkservers: is the chunkserver still alive? What chunks are stored at the chunkserver? To read/write data, a client communicates with the master (metadata operations) and with chunkservers (data).

Files on GFS (continued):
- Clients cache metadata
- Clients do not cache file data. Question: why don't clients store file data?
- Chunkservers do not cache file data either (that is the responsibility of the underlying file system: Linux's buffer cache)
Advantages of (large) fixed-size chunks:
- Disk seek time is small compared to transfer time. Example: with a seek time of 10ms and a transfer rate of 100MB/s, what is the chunk size that makes the seek time 1% of the transfer time? Answer: 100 MB
- A single file can be larger than a node's disk space
- Fixed size makes allocation computations easy
Question: why not increase the chunk size to more than 128MB? (Hint: Map tasks operate on one chunk at a time.)

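The seek-time question above can be answered with two lines of arithmetic; a minimal worked version:

```python
# How large must a chunk be so that the 10 ms seek time is only 1% of the
# transfer time, at a transfer rate of 100 MB/s?
seek_time = 0.010          # seconds
transfer_rate = 100e6      # bytes/second (100 MB/s)

transfer_time = seek_time / 0.01        # seek = 1% of transfer -> 1 second
chunk_size = transfer_rate * transfer_time

print(f"chunk size: {chunk_size / 1e6:.0f} MB")   # -> 100 MB
```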

GFS: Master

One master. A single master simplifies the design tremendously: chunk placement and replication decisions are made with global knowledge. However, a single master in a large cluster can become a bottleneck; the goal is therefore to minimize the number of reads and writes that go through it (hence the split between metadata and data). Question: as the cluster grows, can the master become a bottleneck?

A read operation (in detail):
1. The client translates the file name and byte offset specified by the application into a chunk index within the file, and sends the request to the master.
2. The master replies with the chunk handle and the replica locations.
3. The client caches this metadata.
4. The client sends a data request to one of the replicas (the closest one); the byte range indicates the wanted part of the chunk. More than one chunk can be included in a single request. Question: how can the cluster topology be discovered automatically?
5. The contacted chunkserver replies with the requested data.
Source: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
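The five steps can be sketched as a toy single-process program. All class and method names here are hypothetical, and real GFS clients use RPC, leases, and replica selection that this sketch omits; it only shows the offset-to-chunk-index translation and the metadata cache.

```python
# Toy sketch of the GFS read path (steps 1-5 above). Not a real client.
CHUNK_SIZE = 64 * 1024**2

class Master:
    def __init__(self):
        # filename -> list of (chunk_handle, [replica locations])
        self.files = {"/logs/day1": [("h0", ["cs1", "cs2", "cs3"]),
                                     ("h1", ["cs2", "cs4", "cs5"])]}
    def lookup(self, filename, chunk_index):
        # step 2: reply with chunk handle and replica locations
        return self.files[filename][chunk_index]

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}                        # step 3: metadata cache
    def read(self, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE     # step 1: offset -> chunk index
        key = (filename, chunk_index)
        if key not in self.cache:              # ask the master only on a miss
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, replicas = self.cache[key]
        start = offset % CHUNK_SIZE
        # steps 4-5: request (chunk_handle, byte_range) from the "closest" replica
        return handle, replicas[0], (start, start + length)

client = Client(Master())
print(client.read("/logs/day1", 64 * 1024**2 + 10, 100))
# reads from the file's second chunk, bytes 10..110 within that chunk
```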

Metadata on the master. Three types of metadata:
- File and chunk namespaces
- Mapping from files to chunks
- Locations of each chunk's replicas
All metadata is kept in the master's memory (fast random access), which sets limits on the entire system's capacity. Question: what happens to the cluster if the master crashes? An operation log is kept on the master's local disk: in case of a crash, the master state can be recovered. Namespaces and mappings are logged; chunk locations are not logged.
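To see why in-memory metadata caps capacity, a rough estimate helps. The GFS paper reports under 64 bytes of metadata per 64 MB chunk; the RAM figure below is an assumed, illustrative number, not from the paper.

```python
# Rough estimate: how much file data can a master address with a given
# amount of RAM reserved for chunk metadata?
bytes_per_chunk_meta = 64              # per-chunk metadata (GFS paper: < 64 B)
chunk_size = 64 * 1024**2              # 64 MB chunks
master_ram_for_meta = 16 * 1024**3     # assume 16 GB of RAM for chunk metadata

chunks = master_ram_for_meta // bytes_per_chunk_meta
capacity_pb = chunks * chunk_size / 1024**5

print(f"~{chunks:.2e} chunks, ~{capacity_pb:.0f} PB of file data")
```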

GFS: Chunks

Chunks. 1 chunk = 64MB or 128MB (can be changed); a chunk is stored as a plain Linux file on a chunkserver. Advantages of a large (but not too large) chunk size:
- Reduced need for client/master interaction; 1 request per chunk suits the target workloads
- A client can cache all the chunk locations for a multi-TB working set
- Reduced size of metadata on the master (kept in memory)
Disadvantage: a chunkserver can become a hotspot for popular file(s). Question: how could the hotspot issue be solved?

Chunk locations. The master does not keep a persistent record of chunk replica locations: it polls chunkservers about their chunks at startup and keeps up to date through periodic HeartBeat messages. Master and chunkservers are thereby easily kept in sync when chunkservers leave/join/fail/restart [a regular event]. The chunkserver has the final word over what chunks it has.

Operation log. A persistent record of critical metadata changes, critical to the recovery of the system. Changes to metadata are only made visible to clients after they have been written to the operation log. (Figure: client C1 issues "get file_name1", C2 issues "mv file_name1 file_name2", the master crashes, and C3 then issues "get file_name1".) The operation log is replicated on multiple remote machines; before responding to a client operation, the log record must have been flushed locally and remotely. The master recovers its file system from a checkpoint plus the operation log. Question: when does the master relay the new information to the clients? Before or after having written it to the op. log?

Operation log (continued; figure): the master writes every metadata mutation to the operation log; when the log grows too large, a checkpoint is created and the log file is reset; recovery replays the latest checkpoint plus the operation log.
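The ordering rule above — log first, apply and acknowledge afterwards — is a classic write-ahead log. A toy in-memory sketch (hypothetical API; the real master batches log records, flushes them to disk, and replicates them to remote machines, all of which this sketch skips):

```python
# Minimal write-ahead-log sketch of the master's metadata mutations.
class Master:
    def __init__(self):
        self.log = []          # stand-in for the on-disk operation log
        self.namespace = {}    # in-memory metadata (file -> chunk handles)
    def mutate(self, op, *args):
        self.log.append((op, args))       # 1. append the record to the log
        # 2. (real system: flush locally AND on remote log replicas here)
        if op == "create":
            self.namespace[args[0]] = []  # 3. only now apply the change ...
        elif op == "rename":
            self.namespace[args[1]] = self.namespace.pop(args[0])
        return "ok"                       # 4. ... and acknowledge the client
    def recover(self):
        # rebuild in-memory state by replaying the log (from a checkpoint,
        # in the real system; from the start, in this toy)
        replayed = Master()
        for op, args in self.log:
            replayed.mutate(op, *args)
        return replayed.namespace

m = Master()
m.mutate("create", "/f1")
m.mutate("rename", "/f1", "/f2")
print(m.recover())    # the replayed state matches the live namespace
```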

Chunk replica placement.
- Creation of (initially empty) chunks: use under-utilised chunkservers, spread across racks; limit the number of recent creations on each chunkserver.
- Re-replication: started once the number of available replicas falls below a setting; the master instructs a chunkserver to copy chunk data directly from an existing valid replica; the number of active clone operations and their bandwidth is limited.
- Re-balancing: changes in replica distribution for better load balancing; gradual filling of new chunkservers.

GFS: Data integrity

Garbage collection. Question: how can a file be deleted from the cluster?
- The deletion is logged by the master; the file is renamed to a hidden file and the deletion timestamp is kept.
- A periodic scan of the master's file system namespace deletes hidden files older than 3 days from the master's memory (severing any further connection between the file and its chunks).
- A periodic scan of the master's chunk namespace identifies orphaned chunks (not reachable from any file) and deletes their metadata.
- HeartBeat messages are used to synchronise deletion between master and chunkserver.
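The lazy-deletion scheme can be condensed into a small sketch. Structures and names are hypothetical; the real master also reclaims the orphaned chunks themselves via HeartBeat replies, which is omitted here.

```python
# Toy version of GFS lazy deletion: rename-to-hidden, then periodic purge.
HIDDEN_TTL = 3 * 24 * 3600    # hidden files older than 3 days are purged

class Namespace:
    def __init__(self):
        self.files = {}        # path -> chunk handles
        self.hidden = {}       # hidden path -> (chunk handles, deletion time)
    def delete(self, path, now):
        # "deletion" just renames the file and records a timestamp
        self.hidden[".deleted" + path] = (self.files.pop(path), now)
    def scan(self, now):
        # periodic namespace scan: drop hidden files past the TTL;
        # their chunks become orphaned and are collected later
        for hpath, (_, ts) in list(self.hidden.items()):
            if now - ts > HIDDEN_TTL:
                del self.hidden[hpath]

ns = Namespace()
ns.files["/a"] = ["h1"]
ns.delete("/a", now=0)         # file disappears from the visible namespace
ns.scan(now=4 * 24 * 3600)     # a scan four "days" later purges it
print(ns.hidden)               # {}
```

Until the TTL expires, the hidden entry still exists, so an accidental deletion can in principle be undone by renaming the file back.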

Stale replica detection. Scenario: a chunkserver misses a change ("mutation") applied to a chunk, e.g. an append. The master maintains a chunk version number to distinguish up-to-date and stale replicas. Before an operation on a chunk, the master ensures that the version number is advanced. Stale replicas are removed in the regular garbage collection cycle.
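The version-number rule in miniature (names hypothetical): a replica whose version lags the master's current one is treated as stale.

```python
# Stale-replica detection via chunk version numbers.
master_version = {"h1": 7}    # advanced by the master before each mutation
replica_versions = {"cs1": 7, "cs2": 7, "cs3": 6}   # cs3 missed a mutation

stale = [cs for cs, v in replica_versions.items()
         if v < master_version["h1"]]
print(stale)    # ['cs3'] -> removed in the next garbage-collection cycle
```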

Data corruption. Data corruption or loss can occur at both the read and the write stage. Chunkservers use checksums to detect corruption of stored data (the alternative, comparing replicas across chunkservers, would be costly). Each chunk is broken into 64KB blocks, each with a 32-bit checksum; checksums are kept in memory and stored persistently. On read requests, the chunkserver verifies the checksums of the data blocks that overlap the read range (i.e. corrupted data is not sent to clients).
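A sketch of that per-block scheme: 64 KB blocks, each with a 32-bit checksum. CRC32 is used here as the checksum function, which is an assumption — the slide only specifies the block and checksum sizes.

```python
import zlib

BLOCK = 64 * 1024    # 64 KB checksum blocks

def checksum_blocks(chunk: bytes):
    # one 32-bit checksum per 64 KB block of the chunk
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify_range(chunk: bytes, checksums, start: int, end: int) -> bool:
    # verify only the blocks that overlap the requested byte range
    for b in range(start // BLOCK, (end - 1) // BLOCK + 1):
        if zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            return False     # corruption detected: do not serve these bytes
    return True

chunk = b"x" * (4 * BLOCK)
sums = checksum_blocks(chunk)
print(verify_range(chunk, sums, 70_000, 150_000))        # clean read: True

corrupted = chunk[:100_000] + b"?" + chunk[100_001:]     # flip one byte
print(verify_range(corrupted, sums, 70_000, 150_000))    # detected: False
```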

HDFS: Hadoop Distributed File System. HDFS is still under active development (many changes between Hadoop 0.X/1.X and 2.X).

GFS vs. HDFS:
- Master (GFS) = NameNode (HDFS)
- chunkserver = DataNode
- operation log = journal / edit log
- chunk = block
- GFS: random file writes possible; HDFS: only append is possible
- GFS: multiple-writer, multiple-reader model; HDFS: single-writer, multiple-reader model
- GFS: chunks are checksummed in 64KB pieces with 32-bit checksums; HDFS: per block, two files are created on a DataNode — a data file and a metadata file (checksums, timestamp)
- default chunk/block size: 64MB (GFS) vs. 128MB (HDFS)

Hadoop's architecture (0.X and 1.X). NameNode: the master of HDFS; it directs the slave DataNode daemons to perform low-level I/O tasks and keeps track of file splitting into blocks, replication, block locations, etc. Secondary NameNode: takes snapshots of the NameNode. DataNode: each slave machine hosts a DataNode daemon.

NameNodes and DataNodes. (Figure: the NameNode holds the file metadata, e.g. /user/ch/dat1 -> blocks 1,2,3 and /user/ch/dat2 -> blocks 4,5; with a replication factor of 3, each block is stored on three of the DataNodes, which the NameNode polls.)

Hadoop cluster topology. The NameNode (together with the JobTracker) controls the cluster; each slave machine runs a DataNode and a TaskTracker daemon. The Secondary NameNode merges the namespace image with the edit log and keeps a copy of it. HDFS Federation (2.X): several NameNodes share control, each managing a partition of the filesystem.

Hadoop in practice: your system's health.
[cloudera@localhost ~]$ hdfs fsck /user/cloudera -files -blocks
Web interface: http://localhost.localdomain:50070

Hadoop in practice: Yahoo! (2010). An HDFS cluster with 3500 nodes:
- 40 nodes/rack sharing one IP switch
- 16GB RAM per cluster node, 1-gigabit Ethernet
- 70% of disk space allocated to HDFS; the remainder holds the operating system and data emitted by Mappers (which is not stored in HDFS)
- NameNode: up to 64GB RAM
- Total storage: 9.8PB -> 3.3PB net storage (replication: 3)
- 60 million files, 63 million blocks; 54,000 blocks hosted per DataNode
- 1-2 nodes lost per day; time for the cluster to re-replicate the lost blocks: 2 minutes
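The raw-versus-net storage figures above follow directly from the replication factor, as a quick sanity check shows:

```python
# 3x replication means net storage is one third of raw storage.
raw_pb = 9.8
replication = 3

net_pb = raw_pb / replication
print(f"{net_pb:.1f} PB net")   # ~3.3 PB, matching the slide
```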

Summary. Ideas behind the Google File System; basic procedures for replication, recovery, reading and writing of data; differences between GFS and HDFS.

Recommended reading: Chapter 3. A warning: coding takes time — more time than usual. MapReduce is not difficult to understand, but there are different templates and different advice on different sites (of widely different quality), and small errors are disastrous.

THE END