6. How MapReduce Works. Jari-Pekka Voutilainen

MapReduce Implementations
Apache Hadoop has two implementations of MapReduce:
- Classic MapReduce (MapReduce 1)
- YARN (MapReduce 2)

Classic MapReduce
- The client
- JobTracker
- TaskTrackers
- Distributed filesystem (usually HDFS)

Job Submission
- Ask the JobTracker for a new job ID.
- Validate the output specification of the job.
- Compute the input splits for the job.
- Copy the resources needed to run the job (job JAR file, configuration and input splits).
- Tell the JobTracker that the job is ready to run.
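The split computation step can be pictured with a minimal sketch. It assumes a single input file described only by its length and a fixed split size; the function name `compute_splits` is hypothetical, and real Hadoop input formats apply additional rules that are left out here.

```python
def compute_splits(file_length, split_size):
    """Return (offset, length) pairs that cover a file of file_length bytes.

    Simplified illustration: every split has split_size bytes except
    possibly the last one, which holds the remainder.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


# A 300-byte file with 128-byte splits yields three splits.
example_splits = compute_splits(300, 128)
```

Each resulting split later becomes the input of exactly one map task.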

Job Initialization
- A job object encapsulates the job's tasks and the bookkeeping of their progress.
- The task list consists of one map task for every input split, the number of reduce tasks set by the job, plus a job setup task and a job cleanup task.
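As a rough sketch of how the task list is put together, the following builds it from the number of input splits and the configured number of reduce tasks. The function name `build_task_list` and the tuple representation of a task are assumptions made for illustration only.

```python
def build_task_list(num_splits, num_reduces):
    """Build the list of tasks for a job as (kind, index) tuples.

    One map task per input split, num_reduces reduce tasks,
    plus one job setup task and one job cleanup task.
    """
    tasks = [("setup", 0)]
    tasks += [("map", i) for i in range(num_splits)]
    tasks += [("reduce", i) for i in range(num_reduces)]
    tasks.append(("cleanup", 0))
    return tasks


# Three input splits and two reducers give 3 + 2 + 2 = 7 tasks.
example_tasks = build_task_list(3, 2)
```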

Task Assignment
- TaskTrackers send heartbeat messages to the JobTracker. A heartbeat tells the JobTracker that the TaskTracker is still alive, and its payload carries information used for task assignment.
- The JobTracker chooses a job according to the scheduling algorithm.

Task Assignment
- TaskTrackers have a fixed number of map slots and reduce slots.
- If there is a free map slot, a map task is chosen; otherwise a reduce task is chosen.
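The slot rule can be sketched as a small function that reacts to one heartbeat. The `Heartbeat` class, the `assign_task` function and the plain lists of pending tasks are hypothetical names for this illustration; the sketch additionally assumes that a slot is only filled when a pending task of that kind exists.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    tracker: str
    free_map_slots: int
    free_reduce_slots: int


def assign_task(heartbeat, pending_maps, pending_reduces):
    """Pick a task for the TaskTracker that sent this heartbeat.

    Map slots are filled before reduce slots. Returns (kind, task_id)
    or None when nothing can be assigned.
    """
    if heartbeat.free_map_slots > 0 and pending_maps:
        return ("map", pending_maps.pop(0))
    if heartbeat.free_reduce_slots > 0 and pending_reduces:
        return ("reduce", pending_reduces.pop(0))
    return None


maps, reduces = ["m1", "m2"], ["r1"]
first = assign_task(Heartbeat("tt1", 1, 1), maps, reduces)
second = assign_task(Heartbeat("tt2", 0, 1), maps, reduces)
```

In this usage the first TaskTracker receives map task `m1` because it has a free map slot, while the second one has no free map slot and receives reduce task `r1`.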

Task Execution
- TaskRunner launches a new JVM for each task (it is possible to reuse JVMs for later tasks).
- Task progress is reported every few seconds.

Job Completion
- When the JobTracker is notified by a TaskTracker that the last task of the job is complete, it changes the status of the job to successful.
- The client polls the job and eventually notices that it has finished.
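The client side of job completion is a plain polling loop. The sketch below assumes a caller-supplied `get_status` function and the status strings `"RUNNING"`, `"SUCCEEDED"` and `"FAILED"`; these names are illustrative and are not Hadoop's client API.

```python
import time


def wait_for_completion(get_status, poll_interval_s=1.0, sleep=time.sleep):
    """Poll get_status() until the job reaches a terminal state.

    Returns (final_status, number_of_polls). The sleep function is
    injectable so the loop can be exercised without real waiting.
    """
    polls = 0
    while True:
        status = get_status()
        polls += 1
        if status in ("SUCCEEDED", "FAILED"):
            return status, polls
        sleep(poll_interval_s)


# Simulated job that is still running for the first two polls.
_statuses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
result = wait_for_completion(lambda: next(_statuses), sleep=lambda s: None)
```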

YARN
- Classic MapReduce runs into scalability issues at around 4000 nodes and above.
- In Classic MapReduce the JobTracker takes care of both job scheduling and task progress monitoring.
- YARN splits these responsibilities of the JobTracker into separate entities.

YARN
- The ResourceManager manages resources across the cluster.
- An Application Master manages the lifecycle of an application.
- Each MapReduce job has a dedicated Application Master, which runs for the duration of the application.

YARN
- YARN is more general than Classic MapReduce.
- MapReduce is just one type of YARN application.
- The same cluster can run different YARN applications.

YARN Entities
- The client
- ResourceManager
- Node managers, which launch and monitor containers
- Application Master, which coordinates the tasks of the job
- Distributed filesystem

Job Submission
- Similar to Classic MapReduce.
- The job ID is retrieved from the ResourceManager (in YARN terms it is an application ID).
- The job is submitted to the ResourceManager.

Job Initialization
- The ResourceManager allocates a container and launches the Application Master inside it.
- The Application Master initializes the job as in Classic MapReduce.

Job Initialization
- The AppMaster decides how to run the tasks of the job.
- If the job is small, the AppMaster may choose to run the tasks in the same JVM as itself.
- The tasks of larger jobs are executed in their own containers and JVMs.
- The choice is made by weighing the overhead of creating new JVMs against the benefit of running the tasks in parallel.
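A minimal sketch of such a decision is shown below. The thresholds and the function name `runs_in_appmaster_jvm` are assumptions chosen purely for illustration; they are not presented as Hadoop's actual defaults.

```python
def runs_in_appmaster_jvm(num_maps, num_reduces, input_bytes,
                          max_maps=9, max_reduces=1,
                          max_input_bytes=128 * 1024 * 1024):
    """Decide whether a job is small enough to run inside the AppMaster's JVM.

    All three limits are illustrative assumptions: a job qualifies only
    if it stays within every one of them.
    """
    return (num_maps <= max_maps
            and num_reduces <= max_reduces
            and input_bytes <= max_input_bytes)


small_job = runs_in_appmaster_jvm(num_maps=2, num_reduces=1, input_bytes=10_000)
large_job = runs_in_appmaster_jvm(num_maps=500, num_reduces=20, input_bytes=10**12)
```

For a job that fails the check, the AppMaster would instead request one container per task.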

Task Execution
- Once a task has been assigned a container by the ResourceManager's scheduler, the AppMaster starts the container.
- Progress is reported to the AppMaster, which is polled by the client.

Failures
Failures in Classic MapReduce:
- Failure of a task
- Failure of a TaskTracker
- Failure of the JobTracker

Task Failure
- If a map or reduce task throws an exception, the TaskTracker marks the task as failed.
- If the JVM suddenly exits, the TaskTracker marks the task as failed.
- Hanging tasks stop sending progress updates; the TaskTracker kills the JVM and the task is marked as failed.

Task Failure
- The JobTracker reschedules the failed task on a different TaskTracker if possible.
- Once a task has failed 4 times, it is not retried again.
- If a task fails 4 times, the whole job fails.
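The retry rule can be sketched as follows. The task is modelled as a callable that receives a TaskTracker name and raises an exception on failure; `run_with_retries` and `JobFailed` are hypothetical names, and the limit of 4 attempts is taken from the slide.

```python
MAX_ATTEMPTS = 4


class JobFailed(Exception):
    """Raised when a task has used up all of its attempts."""


def run_with_retries(task, trackers, max_attempts=MAX_ATTEMPTS):
    """Run task on the given trackers, retrying after each failure.

    Returns (result, failed_on) where failed_on lists the trackers on
    which an attempt failed. Raises JobFailed after max_attempts failures.
    """
    failed_on = []
    for _ in range(max_attempts):
        # Prefer a tracker where this task has not failed yet, if one exists.
        candidates = [t for t in trackers if t not in failed_on] or list(trackers)
        tracker = candidates[0]
        try:
            return task(tracker), failed_on
        except Exception:
            failed_on.append(tracker)
    raise JobFailed(f"task failed {max_attempts} times")


def flaky_task(tracker):
    # Simulated task that only fails on TaskTracker "tt1".
    if tracker == "tt1":
        raise RuntimeError("simulated task failure")
    return "ok"


outcome = run_with_retries(flaky_task, ["tt1", "tt2"])
```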

TaskTracker Failure
- If a TaskTracker crashes or runs very slowly, the JobTracker notices this from the missing heartbeats.
- Successful map tasks are rescheduled on a different TaskTracker if they belong to an incomplete job, because their intermediate output was stored on the failed TaskTracker's local filesystem.
- All tasks in progress are also rescheduled.
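Both steps, detecting a lost TaskTracker and choosing which tasks to reschedule, can be sketched in a few lines. The dictionary-based task records, the function names and the 600-second timeout used in the example are assumptions made for illustration.

```python
def find_failed_trackers(last_heartbeat, now, timeout_s):
    """Return trackers whose last heartbeat is older than timeout_s seconds."""
    return sorted(t for t, ts in last_heartbeat.items() if now - ts > timeout_s)


def tasks_to_reschedule(tasks, failed_tracker, incomplete_jobs):
    """Select the task ids that must run again after a tracker is lost.

    Running tasks are always rescheduled. Successful map tasks are
    rescheduled only when their job is still incomplete.
    """
    selected = []
    for task in tasks:
        if task["tracker"] != failed_tracker:
            continue
        if task["state"] == "running":
            selected.append(task["id"])
        elif (task["state"] == "succeeded" and task["kind"] == "map"
              and task["job"] in incomplete_jobs):
            selected.append(task["id"])
    return selected


heartbeats = {"tt1": 0, "tt2": 950}
lost = find_failed_trackers(heartbeats, now=1000, timeout_s=600)
tasks = [
    {"id": "m1", "job": "j1", "kind": "map", "state": "succeeded", "tracker": "tt1"},
    {"id": "m2", "job": "j1", "kind": "map", "state": "running", "tracker": "tt1"},
    {"id": "r1", "job": "j1", "kind": "reduce", "state": "succeeded", "tracker": "tt1"},
    {"id": "m3", "job": "j2", "kind": "map", "state": "succeeded", "tracker": "tt1"},
    {"id": "m4", "job": "j1", "kind": "map", "state": "running", "tracker": "tt2"},
]
rerun = tasks_to_reschedule(tasks, "tt1", incomplete_jobs={"j1"})
```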

TaskTracker Failure
- A TaskTracker may be blacklisted by the JobTracker.
- If 4 or more tasks from the same job have failed on a particular TaskTracker, the JobTracker records this as a fault.
- When a minimum threshold of faults is exceeded, the TaskTracker is blacklisted.
- Faults expire over time (one per day), so blacklisted TaskTrackers get a chance to run jobs again.
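The fault bookkeeping can be sketched as a small class. `FaultTracker`, its method names and the configurable `blacklist_threshold` are hypothetical; only the "4 failed tasks of one job make a fault" rule and the one-fault-per-day expiry come from the slide.

```python
class FaultTracker:
    """Illustrative bookkeeping of TaskTracker faults and blacklisting."""

    def __init__(self, task_failures_per_fault=4, blacklist_threshold=4):
        self.task_failures_per_fault = task_failures_per_fault
        self.blacklist_threshold = blacklist_threshold  # assumed value
        self.job_failures = {}  # (tracker, job) -> failed task count
        self.faults = {}        # tracker -> fault count

    def record_task_failure(self, tracker, job):
        key = (tracker, job)
        self.job_failures[key] = self.job_failures.get(key, 0) + 1
        # The fault is recorded once, when the per-job limit is reached.
        if self.job_failures[key] == self.task_failures_per_fault:
            self.faults[tracker] = self.faults.get(tracker, 0) + 1

    def is_blacklisted(self, tracker):
        # Blacklisted only when the threshold is exceeded, not merely reached.
        return self.faults.get(tracker, 0) > self.blacklist_threshold

    def expire_faults(self, days=1):
        # One fault expires per day for every tracker.
        for tracker in self.faults:
            self.faults[tracker] = max(0, self.faults[tracker] - days)


ft = FaultTracker(blacklist_threshold=1)
for job in ("j1", "j2"):
    for _ in range(4):
        ft.record_task_failure("tt1", job)
blacklisted_before = ft.is_blacklisted("tt1")
ft.expire_faults(days=1)
blacklisted_after = ft.is_blacklisted("tt1")
```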

JobTracker Failure
- The JobTracker is a single point of failure in Classic MapReduce.
- All running jobs fail.
- After a restart, all jobs must be resubmitted.

Failures in YARN
- Task failure (same as in Classic MapReduce)
- Application Master failure
- ResourceManager failure

Application Master Failure
- The ResourceManager notices the failed AppMaster.
- The ResourceManager starts a new instance of the AppMaster in a new container.
- The client experiences a timeout and gets the new address of the AppMaster from the ResourceManager.
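The client-side recovery can be sketched as a single retry after looking up the new address. `query_am` and `lookup_am_address` are caller-supplied stand-ins for the real network calls, and `AppMasterUnreachable` is a hypothetical exception used to model the timeout.

```python
class AppMasterUnreachable(Exception):
    """Models a client timeout when contacting the AppMaster."""


def get_progress(app_id, cached_address, query_am, lookup_am_address):
    """Ask the AppMaster for progress, refreshing its address on timeout.

    Returns (progress, address_used) so the caller can update its cache.
    """
    try:
        return query_am(cached_address), cached_address
    except AppMasterUnreachable:
        # The old AppMaster is gone: ask the ResourceManager where the new one runs.
        new_address = lookup_am_address(app_id)
        return query_am(new_address), new_address


def fake_query(address):
    # Simulated AppMaster: only the new instance answers.
    if address == "node1:4000":
        raise AppMasterUnreachable(address)
    return 0.5


progress, address = get_progress(
    "app_1", "node1:4000", fake_query, lambda app_id: "node7:4000")
```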

Resource Manager Failure
- The ResourceManager has a checkpointing mechanism that saves its state to persistent storage.
- After a crash, an administrator brings up a new ResourceManager, which recovers from the saved state.
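The checkpoint-and-recover idea can be illustrated with a small sketch that writes state to a local JSON file. The file format, the function names and the shape of the state dictionary are all assumptions; the slides do not describe how the ResourceManager actually stores its state.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state to path atomically: write a temp file, then rename it."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def recover(path):
    """Load the most recently saved state."""
    with open(path) as f:
        return json.load(f)


# Usage: checkpoint before a simulated crash, then recover in a "new" instance.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "rm-state.json")
state = {"applications": {"app_1": "RUNNING", "app_2": "ACCEPTED"}}
save_checkpoint(state, checkpoint_path)
restored = recover(checkpoint_path)
```

Writing to a temporary file and renaming it keeps the previous checkpoint intact if the process dies in the middle of a write.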

Speculative Execution
- If Hadoop detects that a task is running slower than normal, it launches another, equivalent backup task.
- Whichever of the two completes first is kept; the other is killed immediately.
- Speculative execution is an optimization, not a feature to make jobs run more reliably.
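A minimal sketch of the two decisions involved is given below. It assumes that "slower than normal" means trailing the average progress of all tasks by more than a fixed margin; that criterion, the margin of 0.2 and both function names are illustrative assumptions, not Hadoop's actual heuristic.

```python
def pick_speculative_candidates(progress, margin=0.2):
    """Return tasks whose progress trails the mean by more than margin.

    progress maps task id to a completion fraction between 0.0 and 1.0.
    """
    if not progress:
        return []
    mean = sum(progress.values()) / len(progress)
    return sorted(t for t, p in progress.items() if p < mean - margin)


def resolve_race(original_finish_s, backup_finish_s):
    """Return (winner, killed) given the finish times of both attempts."""
    if original_finish_s <= backup_finish_s:
        return ("original", "backup")
    return ("backup", "original")


slow_tasks = pick_speculative_candidates({"m1": 0.9, "m2": 0.95, "m3": 0.3})
race = resolve_race(original_finish_s=120, backup_finish_s=90)
```

Here `m3` is selected for a backup attempt, and since the backup would finish first the original attempt is the one that gets killed.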