How MapReduce Works. Presented by 戴睿宸, first-year master's student in Information Science.




MapReduce Entities A MapReduce job run involves four independent entities: the client, which submits the job; the jobtracker, which coordinates the job run; the tasktrackers, which run the tasks; and the distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.

Steps The JobClient:
1. asks the jobtracker for a new job ID
2. checks the output specification of the job
3. computes the input splits for the job
4. copies the resources needed to run the job (the job JAR, the configuration, and the computed splits) to the shared filesystem
5. tells the jobtracker that the job is ready for execution
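As a rough illustration, steps 2 and 3 can be sketched in Python. The 128 MB block size and all helper names here are assumptions made for the sketch, not Hadoop's actual code:

```python
import os

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size

def check_output_spec(output_dir):
    # The job fails early if the output directory already exists,
    # so one job cannot silently clobber another's results.
    if os.path.exists(output_dir):
        raise RuntimeError("output directory %r already exists" % output_dir)

def compute_splits(file_size, split_size=BLOCK_SIZE):
    # One split per block-sized chunk of input; the last split may be smaller.
    splits, offset = [], 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A 300 MB file yields three splits: 128 MB, 128 MB, and 44 MB.
print(compute_splits(300 * 1024 * 1024))
```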

Job Submission runJob() polls the job's progress once per second and reports the progress to the console if it has changed since the last report. When the job completes successfully, the job counters are displayed.

Job Initialization The job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem. It then creates one map task for each split. The number of reduce tasks is determined by the mapred.reduce.tasks property. Tasks are given IDs at this point.
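The initialization logic amounts to counting: one map task per split, and as many reduce tasks as the property requests. A toy sketch (the ID format is invented for illustration):

```python
def plan_tasks(splits, num_reduces):
    # One map task per input split; the reduce count comes straight from
    # the job configuration (mapred.reduce.tasks in classic MapReduce).
    map_ids = ["task_m_%06d" % i for i in range(len(splits))]
    reduce_ids = ["task_r_%06d" % i for i in range(num_reduces)]
    return map_ids, reduce_ids
```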

Task Assignment Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. As part of the heartbeat, a tasktracker indicates whether it is ready to run a new task; if it is, the jobtracker allocates it a task.

Task Assignment In the optimal case, the task is data-local (running on the same node as its input split). Alternatively, the task may be rack-local (on the same rack as the split, but not the same node). Some tasks are neither data-local nor rack-local and must retrieve their data from a different rack than the one they are running on.
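The locality preference can be sketched as a three-tier search. The tuple layout and function name are assumptions of this sketch, not the jobtracker's real data structures:

```python
def pick_task(tracker_host, tracker_rack, pending):
    # pending: non-empty list of (task_id, host, rack) tuples describing
    # where each pending task's input split is stored.
    for task_id, host, _rack in pending:
        if host == tracker_host:          # best: data-local
            return task_id, "data-local"
    for task_id, _host, rack in pending:
        if rack == tracker_rack:          # next best: rack-local
            return task_id, "rack-local"
    return pending[0][0], "off-rack"      # last resort: cross-rack copy
```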

Task Execution First, the tasktracker localizes the job JAR by copying it from the shared filesystem. Second, it creates a local working directory for the task. Third, it creates an instance of TaskRunner to run the task. TaskRunner launches a new Java Virtual Machine to run each task in.

Task Execution Progress When a task is running, it keeps track of its progress, that is, the proportion of the task completed. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it's the estimated proportion of the reduce input that has been processed.

Task Execution When a task reports progress, it sets a flag to indicate that the status change should be sent to the tasktracker. Meanwhile, the tasktracker sends heartbeats to the jobtracker every five seconds, carrying the task statuses with them.
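The flag-and-heartbeat interplay can be mimicked with a tiny class; this is a simulation of the idea, not Hadoop's implementation:

```python
class TaskStatus:
    # Progress updates set a dirty flag; the periodic heartbeat transmits
    # the status only when something has changed since the last heartbeat.
    def __init__(self):
        self.progress = 0.0
        self.dirty = False

    def report_progress(self, proportion):
        self.progress = proportion
        self.dirty = True

    def heartbeat(self):
        # Returns the status to send, or None when nothing has changed.
        if not self.dirty:
            return None
        self.dirty = False
        return self.progress
```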

Job Completion When the jobtracker receives a notification that the last task for a job is complete, it changes the job's status to successful. When the JobClient polls for status and learns that the job has completed successfully, it prints a message to tell the user and then returns from the runJob() method.

Task Failure When the child JVM reports an error back to its parent tasktracker, the tasktracker marks the task attempt as failed, freeing up a slot to run another task. For a hanging task, the tasktracker notices that it hasn't received a progress update for a while (normally 10 minutes) and marks the task as failed.

Task Failure When the jobtracker is notified of a failed task attempt, it reschedules execution of the task. The jobtracker tries to avoid rescheduling the task on a tasktracker where it has previously failed. By default, if any task fails four times, the whole job fails (the limit is configurable).
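A sketch of this rescheduling policy; the four-attempt constant mirrors the default described above, while the tracker-selection detail is deliberately simplified:

```python
MAX_ATTEMPTS = 4  # default limit; configurable in Hadoop

def reschedule(task_id, failed_on, trackers, attempts):
    # Fail the whole job once a task has used up its attempts; otherwise
    # prefer a tracker on which this task has not failed before.
    if attempts >= MAX_ATTEMPTS:
        raise RuntimeError("job failed: task %s exceeded %d attempts"
                           % (task_id, MAX_ATTEMPTS))
    fresh = [t for t in trackers if t not in failed_on]
    return fresh[0] if fresh else trackers[0]
```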

Tasktracker Failure If a tasktracker fails by crashing or by running very slowly, it stops sending heartbeats to the jobtracker. The jobtracker then arranges for map tasks that ran and completed successfully on that tasktracker to be rerun if they belong to incomplete jobs, since their intermediate output on the failed node's local disk is no longer reachable.

Jobtracker Failure Hadoop has no mechanism for dealing with failure of the jobtracker. It is possible that a future release of Hadoop will remove this limitation by running multiple jobtrackers, only one of which is the primary jobtracker at any time.

Job Scheduling By default, scheduling is a simple FIFO queue. A job's priority can be set, taking one of the values VERY_HIGH, HIGH, NORMAL, LOW, or VERY_LOW. Priorities do not support preemption, so a high-priority job can still be held up by a long-running low-priority job that started first.
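FIFO-with-priorities reduces to an ordering rule: higher priority first, submission order within a priority level, and no preemption of whatever is already running. A sketch:

```python
PRIORITIES = ["VERY_HIGH", "HIGH", "NORMAL", "LOW", "VERY_LOW"]

def next_job(queue):
    # queue: list of (submit_order, priority, job_id). The scheduler picks
    # the highest-priority job, FIFO within a priority level; it never
    # preempts a job that is already running.
    order, priority, job_id = min(
        queue, key=lambda j: (PRIORITIES.index(j[1]), j[0]))
    return job_id
```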

The Fair Scheduler Designed for multiuser clusters, the Fair Scheduler aims to give every user a fair share of the cluster capacity over time. If a single job is running, it gets all of the cluster. It is also possible to define custom pools with guaranteed minimum capacities and to set a weight for each pool. The Fair Scheduler supports preemption.
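Setting aside minimum guarantees and preemption, the core of fair sharing is weight-proportional division of cluster capacity. A deliberately simplified sketch:

```python
def fair_shares(capacity, pool_weights):
    # pool_weights: {pool_name: weight}. Each pool's share of the task
    # slots is proportional to its weight; guaranteed minimum capacities
    # and preemption are omitted from this simplified sketch.
    total = sum(pool_weights.values())
    return {name: capacity * weight / total
            for name, weight in pool_weights.items()}
```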

Shuffle and Sort MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is known as the shuffle.

The Map Side Before the map output is written to disk, a background thread first divides the data into partitions corresponding to the reducers it will ultimately be sent to, then performs an in-memory sort by key within each partition. Spills are written in round-robin fashion to the directories specified by the mapred.local.dir property, in a job-specific subdirectory.
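The partition-then-sort step can be sketched as follows. Hadoop's default partitioner takes the key's hashCode() modulo the reducer count; Python's built-in hash() stands in for it here, and the function names are invented for the sketch:

```python
def partition(key, num_reducers):
    # Stand-in for Hadoop's default HashPartitioner
    # (key.hashCode() % numReduceTasks).
    return hash(key) % num_reducers

def spill(buffer, num_reducers):
    # Divide the buffered (key, value) records into per-reducer partitions,
    # then sort each partition by key, as the background spill thread does
    # before writing the spill file to disk.
    parts = [[] for _ in range(num_reducers)]
    for key, value in buffer:
        parts[partition(key, num_reducers)].append((key, value))
    for part in parts:
        part.sort(key=lambda kv: kv[0])
    return parts
```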

The Map Side Before the task finishes, the spill files are merged into a single partitioned and sorted output file. It is often a good idea to compress the map output as it is written to disk, since doing so saves disk space and reduces the amount of data transferred to the reducers.
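Merging the sorted spill files is a k-way merge, which Python's standard library expresses directly; this stands in for the map task's final merge:

```python
import heapq

def merge_spills(spill_files):
    # Each spill file is a list of (key, value) pairs already sorted by
    # key; a k-way merge yields one sorted output stream.
    return list(heapq.merge(*spill_files, key=lambda kv: kv[0]))
```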

The Reduce Side The map outputs are copied to the reduce tasktracker's memory if they are small enough; otherwise, they are copied to disk. When all the map outputs have been copied, the reduce task moves into the sort phase (more properly the merge phase, since the sorting was done on the map side), which merges the map outputs while maintaining their sort order. The merge saves a trip to disk by directly feeding the reduce function in the last phase, the reduce phase. This final merge can come from a mixture of in-memory and on-disk segments.
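Feeding the reduce function from the merged stream amounts to grouping a key-sorted sequence; a sketch, with an invented word-count-style reduce function as the example:

```python
from itertools import groupby

def run_reduce(sorted_pairs, reduce_fn):
    # The merged, key-sorted stream is grouped by key, and each group of
    # values is handed to the user's reduce function in turn.
    results = []
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        results.append((key, reduce_fn(key, [v for _, v in group])))
    return results

# e.g. summing counts per word:
print(run_reduce([("a", 1), ("a", 2), ("b", 5)], lambda k, vals: sum(vals)))
# → [('a', 3), ('b', 5)]
```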