Apache Hadoop Alexandru Costan 1
Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce. No de facto standard, except Hadoop. 2
Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open Issues Examples 3
4
What is Hadoop? Hadoop is a top-level Apache project Open source implementation of MapReduce Developed in Java Platform for data storage and processing Scalable Fault tolerant Distributed Any type of complex data 5
Why? 6
What for? Advocated by the industry's premier Web players (Google, Yahoo!, Microsoft, Facebook) as the engine to power the cloud. Used for batch data processing, not real-time / user-facing workloads: web search Log processing Document analysis and indexing Web graphs and crawling Highly parallel, data-intensive distributed applications 7
Who uses Hadoop? 8
Components: the Hadoop stack 9
HDFS Distributed storage system Files are divided into large blocks (128 MB) Blocks are distributed across the cluster Blocks are replicated to help against hardware failure Data placement is exposed so that computation can be migrated to data Master / Slave architecture Notable differences from mainstream DFS work Single storage + compute cluster vs. separate clusters Simple I/O centric API 10
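The fixed-size block scheme above is easy to sketch. A toy helper (names are ours, not part of the Hadoop API) showing how a file maps onto 128 MB blocks:

```java
// Hypothetical helper illustrating how HDFS splits a file into
// fixed-size blocks; not part of the real Hadoop API.
public class HdfsBlocks {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB default

    // Number of blocks needed to store a file of the given size.
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024; // a 300 MB file
        System.out.println(blockCount(fileSize, BLOCK_SIZE) + " blocks"); // 3 blocks
    }
}
```

Each of those blocks is then replicated and spread across DataNodes, which is what lets the scheduler move computation to the data.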
HDFS Architecture HDFS Master: NameNode Manages all file system metadata in memory: List of files For each file name: a set of blocks For each block: a set of DataNodes File attributes (creation time, replication factor) Controls read/write access to files Manages block replication Transaction log: register file creation, deletion, etc. 11
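The two in-memory maps the slide describes (file name → blocks, block → DataNodes) can be modeled directly. A simplified, illustrative sketch — the real NameNode data structures are more involved:

```java
import java.util.*;

// Toy model of the NameNode's in-memory metadata (illustrative only):
// file name -> ordered block ids, block id -> DataNodes holding a replica.
public class NameNodeModel {
    final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    final Map<Long, Set<String>> blockToDataNodes = new HashMap<>();

    void addFile(String name, List<Long> blocks) {
        fileToBlocks.put(name, blocks);
    }

    void addReplica(long blockId, String dataNode) {
        blockToDataNodes.computeIfAbsent(blockId, k -> new HashSet<>()).add(dataNode);
    }

    // DataNodes a client must contact to read the whole file.
    Set<String> dataNodesForFile(String name) {
        Set<String> nodes = new HashSet<>();
        for (long b : fileToBlocks.getOrDefault(name, List.of()))
            nodes.addAll(blockToDataNodes.getOrDefault(b, Set.of()));
        return nodes;
    }

    public static void main(String[] args) {
        NameNodeModel nn = new NameNodeModel();
        nn.addFile("/foodir/myfile.txt", List.of(1L, 2L));
        nn.addReplica(1L, "dn1");
        nn.addReplica(1L, "dn2");
        nn.addReplica(2L, "dn3");
        System.out.println(nn.dataNodesForFile("/foodir/myfile.txt"));
    }
}
```

Because all of this lives in one master's memory, metadata volume is a scaling limit — an issue the deck returns to under Open Issues.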
HDFS Architecture HDFS Slaves: DataNodes A DataNode is a block server Stores data in the local file system (e.g. ext3) Stores the metadata of a block (e.g. CRC) Serves data and metadata to clients Block report Periodically sends a report of all existing blocks to the NameNode Pipelining of data Forwards data to other specified DataNodes Performs replication tasks upon instruction from the NameNode Rack-aware 12
HDFS Architecture 13
Fault tolerance in HDFS NameNode uses heartbeats to detect DataNode failures: Once every 3 seconds Chooses new DataNodes for new replicas Balances disk usage Balances communication traffic to DataNodes Multiple copies of a block are stored: Default replication: 3 Copy #1 on another node on the same rack Copy #2 on another node on a different rack 14
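The rack-aware placement rule above can be sketched as follows. This is a simplification of Hadoop's actual placement policy, with made-up node and rack names: the second replica goes to another node on the same rack as the first, the third to a node on a different rack:

```java
import java.util.*;

// Sketch of the rack-aware replica placement described above
// (simplified; not Hadoop's real BlockPlacementPolicy).
public class ReplicaPlacement {
    // rackOf: node name -> rack id. Returns three replica locations.
    static List<String> place(String writer, Map<String, String> rackOf) {
        String rack = rackOf.get(writer);
        String sameRack = null, otherRack = null;
        for (Map.Entry<String, String> e : rackOf.entrySet()) {
            if (e.getKey().equals(writer)) continue;
            if (e.getValue().equals(rack) && sameRack == null) sameRack = e.getKey();
            if (!e.getValue().equals(rack) && otherRack == null) otherRack = e.getKey();
        }
        return Arrays.asList(writer, sameRack, otherRack);
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new LinkedHashMap<>();
        rackOf.put("dn1", "rackA");
        rackOf.put("dn2", "rackA");
        rackOf.put("dn3", "rackB");
        System.out.println(place("dn1", rackOf)); // [dn1, dn2, dn3]
    }
}
```

Placing one replica off-rack trades a little write bandwidth for survival of a whole-rack failure (e.g. a switch outage).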
Data Correctness Use checksums to validate data CRC32 File creation Client computes a checksum per 512-byte chunk DataNode stores the checksum File access Client retrieves the data and checksum from DataNode If validation fails, the client tries other replicas 15
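The per-chunk CRC32 scheme can be sketched with the standard library's `java.util.zip.CRC32` (the 512-byte chunk size matches the slide; it is configurable in real HDFS):

```java
import java.util.zip.CRC32;

// Sketch of per-chunk checksumming: one CRC32 per 512-byte chunk.
public class ChunkChecksums {
    static final int CHUNK = 512;

    static long[] checksums(byte[] data) {
        int n = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            CRC32 crc = new CRC32();
            crc.update(data, i * CHUNK, Math.min(CHUNK, data.length - i * CHUNK));
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300]; // 3 chunks: 512 + 512 + 276 bytes
        System.out.println(checksums(data).length + " checksums"); // 3 checksums
        // On read, the client recomputes each chunk's CRC and compares;
        // a mismatch makes it fall back to another replica.
    }
}
```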
NameNode failures 16 A single point of failure Transaction Log stored in multiple directories A directory on the local file system A directory on a remote file system (NFS/CIFS) The Secondary NameNode holds a backup of the NameNode data On the same machine Need to develop a truly highly available solution!
Data pipelining Client retrieves a list of DataNodes on which to place replicas of a block Client writes the block to the first DataNode The first DataNode forwards the data to the second The second DataNode forwards the data to the third DataNode in the pipeline When all replicas are written, the client moves on to write the next block in the file 17
User interface Commands for HDFS User: hadoop dfs -mkdir /foodir hadoop dfs -cat /foodir/myfile.txt hadoop dfs -rm /foodir/myfile.txt Commands for HDFS Administrator hadoop dfsadmin -report hadoop dfsadmin -decommission datanodename Web Interface http://host:port/dfshealth.jsp 18
Hadoop MapReduce Parallel processing for large datasets Relies on HDFS Master-Slave architecture: Job Tracker Task Trackers 19
Hadoop MapReduce Architecture Map-Reduce Master: JobTracker Accepts MapReduce jobs submitted by users Assigns Map and Reduce tasks to TaskTrackers Monitors task and TaskTracker status Re-executes tasks upon failure 20
Hadoop MapReduce Architecture Map-Reduce Slaves: TaskTrackers Run Map and Reduce tasks upon instruction from the JobTracker Manage storage and transmission of intermediate output Generic Reusable Framework supporting pluggable user code (file system, I/O format) 21
Putting everything together: HDFS and MapReduce deployment 22
Hadoop MapReduce Client Define Mapper and Reducer classes and a launching program Language support Java C++ Python Special case: Maps only 23
Zoom on the Map phase 24
Zoom on the Reduce Phase 25
Data locality Data locality is exposed in the map task scheduling put tasks where data is Data are replicated: Fault tolerance Performance: divide the work among nodes JobTracker schedules map tasks considering: Node-aware Rack-aware Non-local map tasks 26
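The node-aware / rack-aware / non-local preference order can be sketched as a toy scheduler (names and structure are illustrative, not the JobTracker's real code):

```java
import java.util.*;

// Toy version of the locality preference listed above:
// node-local first, then rack-local, then any remaining (non-local) task.
public class LocalityScheduler {
    // splitNode: task -> node holding its input split; nodeRack: node -> rack.
    static String pickTask(String node, String rack,
                           Map<String, String> splitNode,
                           Map<String, String> nodeRack) {
        for (Map.Entry<String, String> e : splitNode.entrySet())
            if (e.getValue().equals(node)) return e.getKey();               // node-local
        for (Map.Entry<String, String> e : splitNode.entrySet())
            if (rack.equals(nodeRack.get(e.getValue()))) return e.getKey(); // rack-local
        return splitNode.isEmpty() ? null
                : splitNode.keySet().iterator().next();                     // non-local
    }

    public static void main(String[] args) {
        Map<String, String> splitNode = new LinkedHashMap<>();
        splitNode.put("t1", "dn3"); // split stored on dn3 (rackB)
        splitNode.put("t2", "dn2"); // split stored on dn2 (rackA)
        Map<String, String> nodeRack =
                Map.of("dn1", "rackA", "dn2", "rackA", "dn3", "rackB");
        // dn1 holds no split, so it gets the rack-local task t2.
        System.out.println(pickTask("dn1", "rackA", splitNode, nodeRack)); // t2
    }
}
```

Replication helps here twice: it tolerates failures, and it gives the scheduler more candidate nodes on which a task can still run node-local.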
Fault tolerance TaskTrackers send heartbeats to the JobTracker Once every 3 seconds A node is labeled as failed if no heartbeat is received for a defined expiry time (default: 10 minutes) Re-executes all the ongoing and completed tasks of the failed node Need to develop a more efficient policy to prevent re-executing completed tasks (e.g. storing this data in HDFS)! 27
Speculation in Hadoop Slow nodes (stragglers) → run backup tasks 28
Life cycle of Map/Reduce tasks 29
Open Issues - 1 All the metadata is handled through one single Master in HDFS (the NameNode) Performs badly when: Handling many files Heavy concurrency 30
Open Issues - 2 Data locality is crucial for Hadoop's performance How can we expose the data locality of Hadoop in the Cloud efficiently? Hadoop in the Cloud: Unaware of network topology Node-aware or non-local map tasks 31
Open Issues - 2 32 Data locality in the Cloud (figure: example block placement across six nodes, three of them empty) The simplicity of map task scheduling leads to non-local map execution (25%)
Open Issues - 3 Data Skew The current Hadoop hash partitioning works well when the keys are equally frequent and uniformly stored across the data nodes In the presence of partitioning skew: Variation in intermediate key frequencies Variation in intermediate key distribution among different data nodes The native, blind hash partitioning is inadequate and will lead to: Network congestion Unfairness in reducers' inputs → reduce computation skew Performance degradation 33
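The default partitioning is just the key's hash code modulo the reducer count, which is why a hot key overloads one reducer. A small, self-contained illustration (the key names and counts are made up):

```java
import java.util.*;

// Illustrates default-style hash partitioning (hashCode mod #reducers)
// and how a skewed key frequency overloads a single reducer.
public class HashPartitionSkew {
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers; // non-negative
    }

    public static void main(String[] args) {
        int numReducers = 3;
        // Ten records of hot key "K1", one record each of two rare keys.
        List<String> keys = new ArrayList<>(Collections.nCopies(10, "K1"));
        keys.add("K2");
        keys.add("K3");
        int[] load = new int[numReducers];
        for (String k : keys) load[partition(k, numReducers)]++;
        // One reducer receives at least the 10 "K1" records.
        System.out.println(Arrays.toString(load));
    }
}
```

All "K1" records land on the same reducer regardless of how many reducers exist, so hash partitioning alone cannot fix frequency skew.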
Open Issues - 3 34 (figure: skewed intermediate key distribution K1..K6 across Data Nodes 1-3, partitioned by hash(intermediate key) modulo reducer count) Outgoing data transfer per node: 11 / 15 / 18 (44 of 54 records moved) Reduce input sizes: 29 / 17 / 8
Job scheduling in Hadoop Considerations Job priority: deadline Capacity: cluster resources available, resources needed for the job FIFO The first job to arrive at the JobTracker is processed first Capacity Scheduling Organizes jobs into queues Queue shares as percentages of the cluster Fair Scheduler Groups jobs into pools Divides excess capacity evenly between pools Delay scheduling to expose data locality 35
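A minimal Fair Scheduler allocations file in the JobTracker-era (Hadoop 1.x) format gives a feel for the pool model; the pool names and numbers below are made-up values, and element names can differ across Hadoop versions:

```xml
<?xml version="1.0"?>
<!-- Hypothetical fair-scheduler.xml: jobs are grouped into pools and
     excess capacity is divided between pools according to their weights. -->
<allocations>
  <pool name="production">      <!-- hypothetical pool name -->
    <minMaps>10</minMaps>       <!-- guaranteed map slots -->
    <minReduces>5</minReduces>  <!-- guaranteed reduce slots -->
    <weight>2.0</weight>        <!-- 2x share of any excess capacity -->
  </pool>
  <pool name="research">
    <weight>1.0</weight>
  </pool>
</allocations>
```

Minimum shares protect production deadlines, while weights split whatever capacity is left over between the pools.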
Hadoop at work! Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!) 36
Running Hadoop Multiple options: On your local machine (standalone or pseudo-distributed) Locally with a Virtual Machine On the cloud (e.g. Amazon EC2) In your own datacenter (e.g. Grid5000) 37
Word Count Example In Hadoop 38
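A framework-free sketch of word count, mimicking the map → shuffle → reduce flow described earlier (a real Hadoop job would instead subclass the framework's Mapper and Reducer classes and be launched through a driver program):

```java
import java.util.*;

// Framework-free word count mimicking MapReduce:
// "map" emits (word, 1), "shuffle" groups by key, "reduce" sums the values.
public class WordCountSim {
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map + shuffle: group the 1s emitted for each word.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        // Reduce: sum the grouped values per key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            counts.put(e.getKey(),
                       e.getValue().stream().mapToInt(Integer::intValue).sum());
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hadoop stores data", "Hadoop processes data");
        System.out.println(wordCount(lines)); // {data=2, hadoop=2, processes=1, stores=1}
    }
}
```

In the real framework the grouping step happens across the network (the shuffle), and each reducer receives only the keys its partition owns.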
Useful links HDFS Design: http://hadoop.apache.org/core/docs/current/hdfs_design.html Hadoop API: http://hadoop.apache.org/core/docs/current/api/ 39