Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com




What's Hadoop
- A framework for running applications on large clusters of commodity hardware
- Scale: petabytes of data on thousands of nodes
- Includes
  - Storage: HDFS
  - Processing: MapReduce, supporting the Map/Reduce programming model
- Requirements
  - Economy: use clusters of commodity computers
  - Easy to use: users need not deal with the complexity of distributed computing
  - Reliable: handles node failures automatically

Open Source Apache Project
- Implemented in Java
- Apache Top Level Project: http://hadoop.apache.org/core/
- Core (15 committers): HDFS and MapReduce
- The community of contributors is growing, though mostly Yahoo! for HDFS and MapReduce
- You can contribute too!

Hadoop Characteristics
- Commodity hardware
  - Add inexpensive servers
  - Storage servers and their disks are not assumed to be highly reliable and available
  - Use replication across servers to deal with unreliable storage/servers
- Metadata/data separation: simple design
  - Namenode maintains metadata
  - Datanodes manage storage
- Slightly restricted file semantics
  - Focus is mostly sequential access
  - Single writers
  - No file-locking features
- Support for moving computation close to data
  - Servers have two purposes: data storage and computation
  - Single storage + compute cluster vs. separate clusters

Hadoop Architecture

[Diagram: input data is split into DFS blocks replicated across the Hadoop cluster; Map tasks run on the block replicas, and a Reduce stage combines their output into the results.]

HDFS Data Model
- Data is organized into files and directories
- Files are divided into uniform-sized blocks and distributed across cluster nodes
- Blocks are replicated to handle hardware failure
- The filesystem keeps checksums of data for corruption detection and recovery
- HDFS exposes block placement so that computation can be migrated to data
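The block model above can be sketched in a few lines. This is a toy illustration, not HDFS code: HDFS actually stores CRC32 checksums per 512-byte chunk alongside each block, whereas this sketch keeps one digest per block for brevity.

```python
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # HDFS's default block size in this era: 64 MB

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Divide a file's bytes into uniform-sized blocks, each paired with a
    checksum that can later be used for corruption detection."""
    blocks = []
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        blocks.append((chunk, hashlib.md5(chunk).hexdigest()))
    return blocks
```

On read, the client recomputes each block's checksum and, on a mismatch, fetches another replica of that block.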

HDFS Data Model (example)

NameNode metadata (filename, replication factor, block ids, ...):
  /users/user1/data/part-0, r:2, blocks {1,3}
  /users/user1/data/part-1, r:3, blocks {2,4,5}

[Diagram: the blocks are spread across the datanodes so that each block appears on as many datanodes as its file's replication factor.]

HDFS Architecture
- Master-slave architecture
- DFS master: Namenode
  - Manages the filesystem namespace
  - Maintains the mapping from file name to list of blocks and their locations
  - Manages block allocation and replication
  - Checkpoints the namespace and journals namespace changes for reliability
  - Controls access to the namespace
- DFS slaves: Datanodes
  - Handle block storage
  - Store blocks using the underlying OS's files
  - Clients access blocks directly from datanodes
  - Periodically send block reports to the Namenode
  - Periodically check block integrity
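The metadata/data split above can be made concrete with a toy, in-memory sketch of the Namenode's two core mappings. The class and method names are hypothetical; the real Namenode also journals every namespace change to disk and serves thousands of datanodes.

```python
class ToyNamenode:
    """Toy sketch of the Namenode's bookkeeping: the namespace is the
    durable source of truth, while block locations are rebuilt from the
    datanodes' periodic block reports."""

    def __init__(self):
        # filename -> (replication factor, ordered list of block ids)
        self.namespace = {}
        # block id -> set of datanodes currently holding a replica
        self.block_locations = {}

    def create(self, filename, replication, block_ids):
        self.namespace[filename] = (replication, list(block_ids))

    def report_block(self, datanode, block_id):
        """Process one entry of a datanode's periodic block report."""
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def get_block_locations(self, filename):
        """What a client asks for before reading directly from datanodes."""
        _, block_ids = self.namespace[filename]
        return [(b, sorted(self.block_locations.get(b, set())))
                for b in block_ids]
```

Note that the Namenode never touches file data: after the lookup, clients read and write blocks directly against the datanodes.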

HDFS Architecture

[Diagram: clients issue metadata ops to the Namenode, whose metadata records entries such as (/users/foo/data, 3 replicas, ...); block reads and writes go directly to datanodes spread across Rack 1 and Rack 2, with replication pipelined between datanodes.]

Block Placement and Replication
- A file's replication factor can be set per file (default 3)
- Block placement is rack-aware
  - Guarantees placement on two racks
  - 1st replica is on the local node; 2nd/3rd replicas are on a remote rack
  - Avoids hot spots: balances I/O traffic
- Writes are pipelined to the block replicas
  - Minimizes bandwidth usage
  - Overlaps disk writes and network writes
- Reads are served from the nearest replica
- Block under-replication and over-replication are detected by the Namenode
- A balancer application rebalances blocks to even out datanode utilization
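The default rack-aware policy described above can be sketched as follows. The topology structure is a hypothetical stand-in for Hadoop's rack-awareness configuration; the real placement code also accounts for datanode load and free space.

```python
import random

def place_replicas(local_node, topology, replication=3):
    """Pick datanodes for a new block: 1st replica on the writer's local
    node, remaining replicas on a single remote rack, which guarantees the
    block lives on two racks.
    topology: dict mapping rack name -> list of node names."""
    local_rack = next(r for r, nodes in topology.items()
                      if local_node in nodes)
    # Choose one remote rack for the remaining replicas.
    remote_rack = random.choice([r for r in topology if r != local_rack])
    remote_nodes = random.sample(topology[remote_rack], replication - 1)
    return [local_node] + remote_nodes
```

Losing a whole rack (e.g. a switch failure) then costs at most replication minus one replicas of any block, never all of them.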

HDFS Future Work: Scalability
- Scale cluster size
- Scale number of clients
- Scale namespace size (total number of files, amount of data)
- Possible solutions
  - Multiple namenodes
  - Read-only secondary namenode
  - Separate cluster management from namespace management
  - Dynamically partition the namespace
  - Mounting

Map/Reduce
- Map/Reduce is a programming model for efficient distributed computing
- It works like a Unix pipeline:
  cat input | grep | sort | uniq -c | cat > output
  Input -> Map -> Shuffle & Sort -> Reduce -> Output
- A simple model, but good for a lot of applications
  - Log processing
  - Web index building

Word Count Dataflow

[Diagram: the word-count dataflow through the Map, Shuffle & Sort, and Reduce stages.]

Word Count Example
- Mapper
  - Input: value: lines of input text
  - Output: key: word, value: 1
- Reducer
  - Input: key: word, value: set of counts
  - Output: key: word, value: sum
- Launching program
  - Defines the job
  - Submits the job to the cluster
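The mapper and reducer above can be sketched as a single-process simulation. A real job would be written against Hadoop's Java MapReduce API and run across many nodes; here the framework's shuffle & sort step is modeled with an in-memory dictionary.

```python
from collections import defaultdict

def mapper(line):
    # Input: one line of text. Output: (word, 1) pairs.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Input: a word and all counts emitted for it. Output: (word, sum).
    return word, sum(counts)

def run_job(lines):
    """Simulate the map -> shuffle & sort -> reduce pipeline locally."""
    shuffle = defaultdict(list)          # groups map output by key
    for line in lines:
        for word, count in mapper(line):
            shuffle[word].append(count)
    return dict(reducer(w, c) for w, c in sorted(shuffle.items()))
```

For example, run_job(["the cat sat", "the cat"]) yields {"cat": 2, "sat": 1, "the": 2}.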

Map/Reduce Features
- Fine-grained Map and Reduce tasks
  - Improved load balancing
  - Faster recovery from failed tasks
- Automatic re-execution on failure
  - In a large cluster, some nodes are always slow or flaky
  - The framework re-executes failed tasks
- Locality optimizations
  - With large data, bandwidth to data is a problem
  - Map/Reduce + HDFS is a very effective solution
  - Map/Reduce queries HDFS for the locations of the input data
  - Map tasks are scheduled close to the inputs when possible
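The locality optimization above amounts to a simple preference when handing out work. This is a hypothetical sketch of the idea, not Hadoop's actual scheduler: when a node asks for a task, prefer one whose input block has a replica on that node.

```python
def pick_task(node, pending_tasks, block_locations):
    """Pick a map task for `node`, preferring data-local work.
    pending_tasks: list of (task_id, input_block_id) pairs;
    block_locations: block id -> set of nodes holding a replica."""
    for task_id, block in pending_tasks:
        if node in block_locations.get(block, set()):
            return task_id           # data-local: read input from local disk
    # No local block available: fall back to any pending task and pull
    # the input over the network.
    return pending_tasks[0][0] if pending_tasks else None
```

Because each block has several replicas, most map tasks end up running on a node that already holds their input, so the bulk of reads never cross the network.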

Documentation
- Hadoop Wiki
  - Introduction: http://hadoop.apache.org/core/
  - Getting Started: http://wiki.apache.org/hadoop/gettingstartedwithhadoop
  - Map/Reduce Overview: http://wiki.apache.org/hadoop/hadoopmapreduce
  - DFS: http://hadoop.apache.org/core/docs/current/hdfs_design.html
- Javadoc: http://hadoop.apache.org/core/docs/current/api/index.html

Questions? Thank you!