Hadoop Distributed File System
T-111.5550 Seminar on Multimedia, 2009-11-11
Eero Kurkela
Agenda
- Introduction
- Flesh and bones of HDFS
- Architecture
- Accessing data
- Data replication strategy
- Fault tolerance
- When to choose HDFS?
- HDFS in action
- Future of HDFS
- Alternative approaches
- References / literature
Apache Hadoop HDFS
- Apache Hadoop
  - Project that "develops open-source software for reliable, scalable, distributed computing" [1]
- HDFS (Hadoop Distributed File System)
  - Subproject of Hadoop
  - Target: reliable and rapid computation on large data sets, with emphasis on high throughput
  - Primary storage system for Hadoop applications
  - Designed especially for sending and receiving data sets for MapReduce operations
  - Serves also as a (limited) general-purpose DFS
[1][3][10]
Flesh and bones of HDFS
- User-level file system
- Written in Java
- Typically runs on a GNU/Linux operating system
- Can be deployed on commodity hardware
  - This is a key assumption in the design
- Inter-node and client communication protocols work on top of TCP/IP
- API / shell / browser access
[5]
Architecture 1/4: Overview
- Based on GFS; master/slave architecture
[5][10]
Architecture 2/4: Namespace and files
- Common hierarchical namespace structure
  - Directories that contain directories and files
- WORM (write-once-read-many) access model
  - Files and directories can be created, deleted, moved, renamed, opened, and closed, but NOT modified
  - Simplifies replication
- Files are split into blocks
  - Default block size 64 MB
[5][10]
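The fixed-size block split described above is simple arithmetic. A minimal sketch (the values are illustrative; in HDFS the split is actually performed at write time and recorded by the NameNode):

```java
// Sketch: splitting a file into fixed-size blocks, HDFS-style (default 64 MB).
public class BlockSplit {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default

    // Number of blocks needed for a file of the given length (ceiling division).
    static long blockCount(long fileLength) {
        if (fileLength == 0) return 0;
        return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Length of the last (possibly partial) block.
    static long lastBlockLength(long fileLength) {
        long rem = fileLength % BLOCK_SIZE;
        if (fileLength == 0) return 0;
        return rem == 0 ? BLOCK_SIZE : rem;
    }
}
```

For example, a 200 MB file occupies four blocks: three full 64 MB blocks and one 8 MB tail block.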
Architecture 3/4: NameNode
- NameNode = master
  - Provides instructions to DataNodes
  - Point of access for clients
- One NameNode per cluster
  - Typically a dedicated machine
  - Achilles' heel: single point of failure
- Namespace and metadata management
  - Keeps metadata in RAM (→ scalability bottleneck)
  - Decides how blocks are placed on DataNodes
[5][10]
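A back-of-envelope estimate shows why keeping all metadata in RAM is a scalability bottleneck. The figure of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block) is a commonly cited rule of thumb, not an exact number:

```java
// Sketch: rough NameNode heap needed for a given namespace size.
// The ~150 bytes/object figure is a widely quoted rule of thumb (assumption).
public class NameNodeRam {
    static final long BYTES_PER_OBJECT = 150;

    // Heap estimate for `files` files averaging `blocksPerFile` blocks each:
    // one namespace object per file plus one per block.
    static long estimateHeapBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile;
        return objects * BYTES_PER_OBJECT;
    }
}
```

Under this estimate, 100 million files of two blocks each would already require on the order of 45 GB of heap on a single machine, which is why the slide calls the in-RAM design a bottleneck.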
Architecture 4/4: DataNode
- DataNode = slave
- Serves block creation/deletion/replication requests from the NameNode
- Serves read/write requests from clients
- Typically several DataNodes per cluster, on dedicated machines
- Stores blocks as files in the local file system; knows nothing about HDFS files
- Provides Blockreports to the NameNode
- Blocks are always transferred directly between DataNodes and clients
[5]
Accessing data
- FileSystem Java API, plus a wrapper for C
- Command-line interface: FS Shell
  - Practical for scripts
  - Commands resemble Unix utilities, e.g.
    bin/hadoop dfs -mkdir /tempdir
    bin/hadoop dfs -cat /tempdir/tempfile.txt
  - dfsadmin for administrative tasks, e.g.
    bin/hadoop dfsadmin -refreshNodes
- Web-browser-based interface for browsing the namespace
[5]
Data replication strategy 1/2: Overview
- Basis for fault tolerance
- Replica placement significantly affects performance
- NameNode is responsible for replica placement
- Number of replicas and block size can be configured separately for each file
- Concept of rack awareness
  - NameNode determines which DataNodes belong to the same rack
  - Idea is to minimize network traffic between racks
[5]
Data replication strategy 2/2: Default strategy
- Replication factor = 3
  - One replica on a node in the local rack
  - One replica on another node in the same rack
  - One replica on a node in another rack
- Balances write performance and fault tolerance
- Replication pipelining
  - A DataNode forwards the data to the next DataNode according to a list generated by the NameNode
[5]
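The default placement described above can be sketched as a simple selection over rack-labelled nodes. The "rack:node" naming scheme is hypothetical and the real logic lives inside the NameNode; this only illustrates the two-local, one-remote pattern:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the default 3-replica placement: two replicas on distinct nodes
// in the writer's rack, one replica in another rack. Node names are assumed
// to be "rackId:nodeId" strings (an illustrative convention, not HDFS's).
public class ReplicaPlacement {
    static List<String> place(String writerRack, List<String> nodes) {
        List<String> replicas = new ArrayList<>();
        for (String n : nodes)          // 1st and 2nd replica: writer's rack
            if (n.startsWith(writerRack + ":") && replicas.size() < 2)
                replicas.add(n);
        for (String n : nodes)          // 3rd replica: any other rack
            if (!n.startsWith(writerRack + ":") && replicas.size() < 3)
                replicas.add(n);
        return replicas;
    }
}
```

With the chosen list in hand, the pipelining step is straightforward: the client sends the block to the first DataNode on the list, which forwards it to the second, and so on.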
Fault tolerance
- Failure at a DataNode
  - Heartbeat missing → stop I/O, re-replicate
- Network failure
- Data integrity failure
  - Detected via checksums
- Failure at the NameNode
  - Backing up data highly recommended
  - No built-in method for automatic recovery available
[5]
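The checksum-based integrity check can be illustrated with java.util.zip.CRC32. This is a sketch: HDFS stores checksums per chunk of each block and verifies them on read, and the exact chunking and checksum storage are omitted here. A mismatch marks the replica corrupt so the NameNode can re-replicate it from a healthy copy:

```java
import java.util.zip.CRC32;

// Sketch: checksum-based detection of a corrupted block replica.
public class BlockIntegrity {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // True if the stored checksum still matches the block's current contents.
    static boolean verify(byte[] data, long storedChecksum) {
        return checksum(data) == storedChecksum;
    }
}
```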
When to choose HDFS? 1/2: Applications & data
- Think of HDFS as a data-set system rather than a file system
- Ideal for batch processing, not interactive tasks
- Intended for streaming large amounts of data, though seeking to an arbitrary point is also supported
- Throughput is optimized at the cost of latency
- Typically millions of files, average file size 1 GB ... 1 TB
- E.g. web crawlers, GIS data management, archival, statistical analysis, and naturally Hadoop apps
- The WORM access model must be acceptable
[5][10]
When to choose HDFS? 2/2: Points regarding environment
- Works wherever Java works
  - Highly portable; good support for Java applications
- Supports mechanisms for bringing computation physically closer to the data
  - Saves bandwidth compared to moving the data
- Designed for thousands of nodes, several of which are always broken
[5]
HDFS in action
- Yahoo!: The Yahoo! Search Webmap
  - 10 000 cores and 5 PB of storage capacity
  - Produces data for all Yahoo! web search queries
  - HDFS caused a 34% drop in processing time [7]
- Adobe [2]
  - 30 nodes in clusters of 5-14 nodes (dev + prod)
  - Social services, data storage, internal use
- AOL
  - 50-node cluster with 37 TB of HDFS capacity
  - Behavioral analysis, targeting, statistics generation
HDFS in action (continued)
- Facebook [2]
  - 600-node cluster with 2 PB of storage capacity
  - Logs, reporting, analysis, machine learning
  - FUSE implementation over HDFS
- Iterend
  - Blog search engine
  - 10-node HDFS cluster
- Spadac
  - Storing and processing geospatial imagery and vector data
Future of HDFS 1/2: Confirmed plans
- Moving on from WORM: support for appending data to files
- Improvements in namespace maintenance (invisible to clients)
- Access via the WebDAV protocol
  - Extends HTTP for file management & modification
- Support for snapshots, for returning to a functional state in case of file-system corruption
- Tuning the replica placement policy
[5]
Future of HDFS 2/2: Possible improvements
- User quotas
- Hard and soft links
- Data rebalancing mechanisms
  - Move blocks to other nodes if disk space on a certain DataNode drops too low
  - Create additional replicas if demand for a certain file rises significantly
- Automatic recovery from NameNode failure
[5]
Alternative approaches
- DFSs generally come in two flavors:
- 1) Designed for running Internet services
  - Often developed by companies like Google and Amazon
  - GoogleFS, Amazon S3, HDFS
- 2) Designed for high-performance computing
  - Parallel file systems: IBM GPFS, Sun Lustre FS
  - PVFS (Parallel Virtual File System)
    - Open-source, user-level file system like HDFS; has some high-level design similarities
    - In use at Argonne National Laboratory, Ohio Supercomputer Center, ...
[8][9]
HDFS vs. PVFS 1/2: Design [9]
HDFS vs. PVFS 2/2: Performance [9]
- Benchmarks executed in the Hadoop Internet-services stack; note that PVFS is sending writes to three servers
References / Literature
[1] What is Hadoop?, The Apache Software Foundation, referenced on 2009-10-29, available at http://hadoop.apache.org/
[2] PoweredBy, The Apache Software Foundation, referenced on 2009-10-29, available at http://wiki.apache.org/hadoop/poweredby
[4] HDFS User Guide, The Apache Software Foundation, referenced on 2009-10-29, available at http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html
[5] HDFS Architecture, The Apache Software Foundation, referenced on 2009-10-29, available at http://hadoop.apache.org/common/docs/current/hdfs_design.html
[7] Yahoo! Launches World's Largest Hadoop Production Application, Eric Baldeschwieler (Senior Director, Grid Computing, Yahoo! Inc.), referenced on 2009-10-29, available at http://developer.yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-productionhadoop.html
[8] The Google File System; Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung; 2003; available at http://labs.google.com/papers/gfs-sosp2003.pdf
[9] Data-intensive File Systems for Internet Services: A Rose by Any Other Name...; Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson; Carnegie Mellon University / Parallel Data Laboratory; 10/2008; available at http://www.pdl.cs.cmu.edu/pdl-ftp/pdsi/cmu-pdl-08-114.pdf
[10] MapReduce and HDFS; Cloudera, Inc.; referenced on 2009-11-07, slides and video available at http://www.cloudera.com/hadoop-training-mapreduce-hdfs