Hadoop: A Framework for Data-Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh




2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets across large clusters of computers. Hadoop is an open-source implementation of Google's MapReduce. It is based on a simple programming model, called MapReduce, and a simple data model: any data will fit. The Hadoop framework consists of two main layers: the distributed file system (HDFS) and the execution engine (MapReduce).

3 Hadoop Infrastructure. Hadoop is a distributed system, like distributed databases. However, there are several key differences between the two infrastructures: the data model, the computing model, the cost model, and the design objectives.

4 How Is the Data Model Different? Distributed databases deal with tables and relations, must have a schema for the data, and rely on data fragmentation & partitioning. Hadoop deals with flat files in any format, requires no schema for the data, and files are divided automatically into blocks.

5 How Is the Computing Model Different? Distributed databases have the notion of a transaction, with the ACID transaction properties and distributed transactions. Hadoop has the notion of a job divided into tasks, following the Map-Reduce computing model, where every task is either a map or a reduce.

6 Hadoop: Big Picture. The stack consists of high-level languages, an execution engine, a distributed light-weight DB, a centralized tool for coordination, and the distributed file system. HDFS + MapReduce are enough to have things working.

7 What is Next? Hadoop Distributed File System (HDFS) MapReduce Layer Examples Word Count Join Fault Tolerance in Hadoop

8 HDFS: Hadoop Distributed File System. HDFS is a master-slave architecture: the master is a single namenode; the slaves are many datanodes (100s or 1000s of nodes). The namenode maintains the file system metadata. Files are split into fixed-size blocks (default 64MB) and stored on the datanodes. Data blocks are replicated for fault tolerance and fast access (the default replication factor is 3). Datanodes periodically send heartbeats to the namenode.

9 HDFS: Data Placement and Replication. Datanodes can be organized into racks. Default placement policy (where to put a given block): the first copy is written to the node creating the file (write affinity); the second copy is written to a datanode within the same rack; the third copy is written to a datanode in a different rack. Objectives: load balancing, fast access, fault tolerance.
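
The placement policy above can be sketched in a few lines of Python. This is a toy model, not HDFS code; the rack and datanode names in the example are hypothetical.

```python
import random

def place_replicas(writer_node, racks):
    """Pick 3 datanodes for a block, following the slide's default
    placement policy. `racks` maps rack name -> list of datanodes;
    `writer_node` is the node creating the file."""
    # Find the rack that contains the writing node.
    local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    # First copy: the node creating the file (write affinity).
    first = writer_node
    # Second copy: another datanode within the same rack.
    second = random.choice([n for n in racks[local_rack] if n != writer_node])
    # Third copy: a datanode in a different rack (survives a rack failure).
    remote_rack = random.choice([r for r in racks if r != local_rack])
    third = random.choice(racks[remote_rack])
    return [first, second, third]
```

Keeping two copies in one rack makes the second write cheap (no cross-rack traffic), while the third copy guards against the loss of an entire rack.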

10 What is Next? Hadoop Distributed File System (HDFS) MapReduce Layer Examples Word Count Join Fault Tolerance in Hadoop

11 MapReduce: Hadoop Execution Layer. MapReduce is a master-slave architecture: the master is the JobTracker; the slaves are the TaskTrackers (100s or 1000s of tasktrackers), and every datanode runs a tasktracker. The jobtracker knows everything about submitted jobs; it divides jobs into tasks, decides where to run each task, and continuously communicates with the tasktrackers. The tasktrackers execute tasks (multiple per node), monitor the execution of each task, and continuously send feedback to the jobtracker.

12 Hadoop Computing Model. Two main phases: Map and Reduce. Any job is converted into map and reduce tasks. Developers need ONLY to implement the Map and Reduce classes. Data flow: blocks of the input file in HDFS feed the map tasks (one for each block); a shuffling and sorting phase follows; then the reduce tasks run, and the output is written to HDFS.

13 Hadoop Computing Model (Cont'd). Mappers and reducers consume and produce (Key, Value) pairs. Users define the data types of the Key and the Value. Shuffling & sorting phase: the map output is shuffled such that all same-key records go to the same reducer. Each reducer may receive multiple keys. Each reducer sorts its records to group similar keys, then processes each group.
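
The shuffle-and-sort behavior described above can be simulated in Python. This is a conceptual sketch (the reducer count of 2 and the hash partitioner mirror Hadoop's defaults only loosely), not Hadoop's implementation.

```python
from collections import defaultdict

def shuffle_and_sort(map_output, num_reducers=2):
    """Group (key, value) pairs emitted by the mappers so every record
    with the same key lands at the same reducer, then sort the keys
    within each reducer's partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        # Partitioner: hash the key modulo the reducer count, so
        # equal keys always map to the same reducer.
        partitions[hash(key) % num_reducers][key].append(value)
    # Each reducer then processes its keys in sorted order.
    return [sorted(p.items()) for p in partitions]
```

Each element of the result models one reducer's input: a sorted list of (key, list-of-values) groups.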

14 What is Next? Hadoop Distributed File System (HDFS) MapReduce Layer Examples Word Count Join Fault Tolerance in Hadoop

15 Word Count Job: count the occurrences of each word in a data set, using map tasks and reduce tasks. The reduce phase is optional: jobs can be map-only.
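
The word count job can be sketched end to end in Python. This toy version collapses the shuffle into a dictionary; in Hadoop the user writes only the map and reduce functions, and the framework does the rest.

```python
from collections import defaultdict

def map_phase(block):
    # Mapper: emit (word, 1) for every word in its input split.
    return [(word, 1) for word in block.split()]

def reduce_phase(key, values):
    # Reducer: sum the counts for one word.
    return (key, sum(values))

def word_count(blocks):
    groups = defaultdict(list)
    for block in blocks:                  # one map task per block
        for k, v in map_phase(block):
            groups[k].append(v)           # shuffle: same word -> same reducer
    return dict(reduce_phase(k, vs) for k, vs in sorted(groups.items()))
```

For example, `word_count(["a b a", "b c"])` returns `{"a": 2, "b": 2, "c": 1}`.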

16 Joining Two Large Datasets. HDFS stores the data blocks of dataset A and dataset B (replicas are not shown). Each mapper (Mapper 1, Mapper 2, ..., Mapper M+N) processes one block (split) and produces the join key and the record pairs. The shuffling and sorting phase moves records over the network so that records with the same join key reach the same reducer, while different join keys go to different reducers (Reducer 1, Reducer 2, ..., Reducer N). The reducers perform the actual join.
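
The reduce-side join above can be modeled in Python: mappers tag each record with its source dataset, the shuffle brings equal join keys together, and the reducer pairs the A-records with the B-records. A toy sketch, with the shuffle collapsed into a dictionary:

```python
from collections import defaultdict

def reduce_side_join(dataset_a, dataset_b):
    """Join two lists of (join_key, record) pairs the way the
    reducers do it after shuffling."""
    groups = defaultdict(lambda: ([], []))
    for key, rec in dataset_a:            # mappers over A emit (key, record)
        groups[key][0].append(rec)
    for key, rec in dataset_b:            # mappers over B emit (key, record)
        groups[key][1].append(rec)
    joined = []
    # Reducer: for each join key, pair every A-record with every B-record.
    for key, (a_recs, b_recs) in groups.items():
        for a in a_recs:
            for b in b_recs:
                joined.append((key, a, b))
    return joined
```

The cost of this pattern is that both datasets cross the network in full during shuffling, which motivates the map-only variant on the next slide.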

17 Joining a Large Dataset (A) with a Small Dataset (B). Every map task processes one block from A and the entire dataset B, and performs the join itself (a map-only job), avoiding the expensive shuffling and reduce phases. HDFS stores the data blocks (replicas are not shown).
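
The map-only join can be sketched as one map task's work: load all of the small dataset B into a hash table, then stream the task's block of A through it. A toy Python model, not Hadoop code:

```python
def map_side_join(block_of_a, small_b):
    """One map task of a map-only (broadcast) join: `block_of_a` is the
    mapper's split of the large dataset; `small_b` is the entire small
    dataset, assumed to fit in the mapper's memory."""
    b_index = {}
    for key, rec in small_b:              # build a hash table over all of B
        b_index.setdefault(key, []).append(rec)
    # Stream the block of A through the table; no shuffle, no reducers.
    return [(key, a_rec, b_rec)
            for key, a_rec in block_of_a
            for b_rec in b_index.get(key, [])]
```

Running this function once per block of A (in parallel) yields the full join without any data crossing the network between tasks.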

18 What is Next? Hadoop Distributed File System (HDFS) MapReduce Layer Examples Word Count Join Fault Tolerance in Hadoop

19 Hadoop Fault Tolerance. The intermediate data between mappers and reducers are materialized (written to disk), which enables simple & straightforward fault tolerance. What if a task fails (map or reduce)? The tasktracker detects the failure and sends a message to the jobtracker, and the jobtracker re-schedules the task. What if a datanode fails? Both the namenode and the jobtracker detect the failure; all tasks on the failed node are re-scheduled, and the namenode replicates the user's data to another node. What if the namenode or the jobtracker fails? The entire cluster is down.
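
The task-level recovery loop above can be sketched in Python. This is a conceptual model of the jobtracker's behavior; the limit of 4 attempts is an assumption here, not taken from the slides.

```python
def run_with_retries(task, max_attempts=4):
    """Run a task, re-scheduling it on failure, the way the jobtracker
    re-schedules a failed map or reduce task."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()                 # tasktracker executes the task
        except Exception:
            # Tasktracker detects the failure and reports it to the
            # jobtracker, which re-schedules the task (possibly on
            # another node). Because intermediate map output is
            # materialized, a restarted reducer can re-read it rather
            # than forcing the maps to re-run.
            if attempt == max_attempts:   # too many failures: the job fails
                raise
```

Note that this retry logic protects only against task and node failures; as the slide says, losing the namenode or the jobtracker still takes the whole cluster down.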

20 Reading/Writing Files. Recall: any data will fit in Hadoop, so how does Hadoop understand/read the data? Through a user-pluggable class: the Input Format. Input formats know how to parse and read the data (convert a byte stream to records). Each record is then passed to the mapper for processing. Hadoop provides built-in input formats for reading text & sequence files. Input formats can do a lot of magic to change the job behavior. The same holds for writing: Output Formats.
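
The byte-stream-to-records idea can be illustrated with a toy input format in Python. This mimics what a text input format does for the mapper (one record per line, keyed by byte offset); the class name and interface are hypothetical, not Hadoop's API.

```python
class TextInputFormat:
    """Toy input format: converts a raw byte stream into
    (byte offset, line) records, one record per line of text."""
    def records(self, raw_bytes):
        offset = 0
        for line in raw_bytes.decode("utf-8").splitlines(keepends=True):
            # Each record is handed to the mapper as (key, value):
            # the byte offset of the line, and the line's text.
            yield (offset, line.rstrip("\n"))
            offset += len(line.encode("utf-8"))
```

Swapping in a different input format (e.g., one that reads two splits at once, as in the map-only join) changes the job's behavior without touching the map or reduce code.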

21 Back to Joining Large & Small Datasets. Every map task processes one block from A and the entire dataset B. How does a single mapper read multiple splits (from different datasets)? Through customized input formats. HDFS stores the data blocks (replicas are not shown).

22 Using Hadoop. Either the Java language directly, or high-level languages: Hive (Facebook), Pig (Yahoo), Jaql (IBM).

23 Java Code Example. (The slide shows a full Java listing: the Hadoop library imports, the Map class, the Reduce class, and the job configuration.)

24 Hive Language. A high-level language on top of Hadoop, like SQL on top of DBMSs. It supports structured data, e.g., creating tables, as well as extensibility for unstructured data. Create Table user (userid int, age int, gender char) Row Format Delimited Fields; Load Data Local Inpath '/user/local/users.txt' Into Table user;

25 From Hive To MapReduce. (The slide shows how a Hive query is compiled into a MapReduce job.)

26 Hive: Group By. (The slide shows how a GROUP BY query maps onto the map, shuffle, and reduce phases.)
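
The idea of the Group By slide can be sketched in Python: a query like SELECT gender, COUNT(*) FROM user GROUP BY gender compiles down to one MapReduce job, where the map emits (group key, 1), the shuffle groups equal keys, and the reduce sums the counts. A toy model with the shuffle collapsed into a dictionary; the column names are hypothetical.

```python
from collections import defaultdict

def group_by_count(rows, key_col):
    """Model of a Hive GROUP BY ... COUNT(*) as a single MapReduce job:
    map emits (row[key_col], 1); shuffle groups equal keys; reduce sums."""
    groups = defaultdict(int)
    for row in rows:                      # map + shuffle, collapsed
        groups[row[key_col]] += 1         # reduce: COUNT(*) per group
    return dict(groups)
```

This is exactly the aggregation pattern the slides call out as MapReduce's strong suit: one pass over the data, with all per-key work done in parallel at the reducers.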

27 Summary. Hadoop is a distributed system for processing large-scale datasets; it scales to thousands of nodes and petabytes of data. Two main layers: HDFS, the distributed file system (the namenode is centralized), and MapReduce, the execution engine (the JobTracker is centralized). Simple data model: any format will fit; at query time, you specify how to read (write) the data using input (output) formats. Simple computation model based on the Map-Reduce phases; very efficient for aggregation and joins. Higher-level languages on top of Hadoop: Hive, Jaql, Pig.

28 Summary: Hadoop vs. Other Systems.

Computing model: distributed databases have the notion of transactions (the transaction is the unit of work), with ACID properties and concurrency control; Hadoop has the notion of jobs (the job is the unit of work), with no concurrency control.

Data model: distributed databases handle structured data with a known schema, in read/write mode; Hadoop accepts any data in any format, (un)(semi)structured, in read-only mode.

Cost model: distributed databases run on expensive servers; Hadoop runs on cheap commodity machines.

Fault tolerance: in distributed databases failures are rare and handled by recovery mechanisms; in Hadoop failures are common over thousands of machines, handled by simple yet efficient fault tolerance.

Key characteristics: distributed databases emphasize efficiency, optimizations, and fine-tuning; Hadoop emphasizes scalability, flexibility, and fault tolerance.

Cloud computing: a computing model where any computing infrastructure can run on the cloud. Hardware & software are provided as remote services. Elastic: it grows and shrinks based on the user's demand. Example: Amazon EC2.