YARN and how MapReduce works in Hadoop
By Alex Holmes




YARN was created so that Hadoop clusters could run any type of work. This meant MapReduce had to become a YARN application, which required the Hadoop developers to rewrite key parts of MapReduce. Given that MapReduce had to go through some open-heart surgery to get it working as a YARN application, the goal of this article is to demystify how MapReduce works in Hadoop 2.

Dissecting a YARN MapReduce application

Architectural changes had to be made to MapReduce to port it to YARN. Figure 1 shows the processes involved in MRv2 and some of the interactions between them. Each MapReduce job is executed as a separate YARN application. When you launch a new MapReduce job, the client calculates the input splits and writes them, along with other job resources, into HDFS (step 1). The client then communicates with the ResourceManager to create the ApplicationMaster for the MapReduce job (step 2). The ApplicationMaster is actually a container, so the ResourceManager will allocate the container when resources become available on the cluster and then communicate with a NodeManager to create the ApplicationMaster container (steps 3-4).[1]

[1] If there aren't any available resources for creating the container, the ResourceManager may choose to kill one or more existing containers to free up space.
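To make steps 1 and 2 concrete, here is a minimal driver sketch (not taken from the article) showing the client-side code that triggers them. It submits a job through the standard MapReduce API using the built-in identity mapper and reducer; substitute your own classes and paths.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MinimalDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "minimal-mrv2-job");
        job.setJarByClass(MinimalDriver.class);
        job.setMapperClass(Mapper.class);     // identity mapper; replace with your own
        job.setReducerClass(Reducer.class);   // identity reducer; replace with your own
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion() is where steps 1 and 2 happen: the client computes the
        // input splits, copies the job resources (JAR, configuration, split metadata)
        // into HDFS, and asks the ResourceManager to launch the ApplicationMaster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }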

The MapReduce ApplicationMaster (MRAM) is responsible for creating map and reduce containers and monitoring their status. The MRAM pulls the input splits from HDFS (step 5) so that when it communicates with the ResourceManager (step 6), it can request that map containers be launched on nodes local to their input data. Container allocation requests to the ResourceManager are piggybacked on the regular heartbeat messages that flow between the ApplicationMaster and the ResourceManager, and the heartbeat responses may contain details on containers that have been allocated to the application. Data locality remains an important part of the architecture: when it requests map containers, the MapReduce ApplicationMaster uses each input split's location details to ask that the container be assigned to one of the nodes holding that split, and the ResourceManager makes a best attempt to allocate containers on those input-split nodes.
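This locality-aware negotiation isn't specific to MapReduce: any ApplicationMaster can express node preferences through YARN's AMRMClient API. The following is an illustrative sketch rather than code from the article; the hostnames are hypothetical placeholders for the hosts that store an input split.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    // Runs inside an ApplicationMaster container on the cluster.
    public class LocalityAwareRequest {
      public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new Configuration());
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, "");

        // Ask for a 1 GB, 1-vcore container, preferring the nodes that hold the split.
        Resource capability = Resource.newInstance(1024, 1);
        String[] splitHosts = {"slave1.example.com", "slave2.example.com"};
        rmClient.addContainerRequest(
            new ContainerRequest(capability, splitHosts, null, Priority.newInstance(0)));

        // The allocation arrives on a later heartbeat; the ResourceManager makes a
        // best effort to honor the node preference but may fall back to other nodes.
        rmClient.allocate(0.0f);

        rmClient.stop();
      }
    }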

[Figure 1: The interactions of a MapReduce 2 YARN application. The diagram shows the client host, the Hadoop master running the ResourceManager and JobHistoryServer, and Hadoop slaves running NodeManagers, the MR AppMaster, YarnChild task containers, and ShuffleHandlers, with job resources stored in HDFS (steps 1-8).]

Once the MapReduce ApplicationMaster is allocated a container, it talks to the NodeManager to launch the map or reduce task (steps 7-8). At this point, the map/reduce process acts very similarly to the way it worked in MRv1.

THE SHUFFLE

The shuffle phase in MapReduce, which is responsible for sorting mapper outputs and distributing them to the reducers, didn't fundamentally change in MapReduce 2. The main difference is that the map outputs are fetched via ShuffleHandlers, which are auxiliary YARN services that run on each slave node.[2] Some minor memory-management tweaks were made to the shuffle implementation; for example, io.sort.record.percent is no longer used.

[2] The ShuffleHandler must be configured in your yarn-site.xml; the property name is yarn.nodemanager.aux-services and the value is mapreduce_shuffle.
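For reference, the corresponding yarn-site.xml entries typically look something like the snippet below. This is an illustration based on the standard Hadoop 2 shuffle setup, not an excerpt from the article; the class property is often set explicitly alongside the service name, although recent releases supply it as a default.

    <!-- yarn-site.xml on each NodeManager: register the MapReduce ShuffleHandler
         as an auxiliary service so reducers can fetch map output from this node. -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>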

WHERE'S THE JOBTRACKER?

You'll note that the JobTracker no longer exists in this architecture. The scheduling part of the JobTracker was moved into the YARN ResourceManager as a general-purpose resource scheduler. The remaining part of the JobTracker, which is primarily the metadata about running and completed jobs, was split in two: each MapReduce ApplicationMaster hosts a UI that renders details on the current job, and once jobs are completed, their details are pushed to the JobHistoryServer, which aggregates and renders details on all completed jobs.

[Figure: The JobHistory UI, showing MapReduce applications that have completed]
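Beyond the web UI, the JobHistoryServer also exposes completed-job metadata through a REST API. As an aside not taken from the article, here is a small illustrative probe of that endpoint; it assumes the default history-server web port of 19888, and history.example.com is a hypothetical hostname.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class JobHistoryProbe {
      public static void main(String[] args) throws Exception {
        // List completed MapReduce jobs known to the JobHistoryServer.
        URL url = new URL("http://history.example.com:19888/ws/v1/history/mapreduce/jobs");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in =
            new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line);  // JSON describing each completed job
          }
        }
      }
    }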

Uber jobs

When running small MapReduce jobs, the time taken for resource scheduling and process forking is often a large percentage of the overall runtime. In MapReduce 1 you didn't have any choice about this overhead, but MapReduce 2 has become smarter and can now cater to your need to run lightweight jobs as quickly as possible.

TECHNIQUE 7 Running small MapReduce jobs

This technique looks at how you can run MapReduce jobs within the MapReduce ApplicationMaster. This is useful when you're working with a small amount of data, as you remove the additional time that MapReduce normally spends spinning up and bringing down map and reduce processes.

Problem

You have a MapReduce job that operates on a small dataset, and you want to avoid the overhead of scheduling and creating map and reduce processes.

Solution

Configure your job to enable uber jobs; this will run the mappers and reducers in the same process as the ApplicationMaster.

Discussion

Uber jobs are jobs that are executed within the MapReduce ApplicationMaster. Rather than liaise with the ResourceManager to create the map and reduce containers, the ApplicationMaster runs the map and reduce tasks within its own process, avoiding the overhead of launching and communicating with remote containers. To enable uber jobs, you need to set the following property:

    mapreduce.job.ubertask.enable=true
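In a driver, that property can be set programmatically before the job is submitted. The snippet below is a minimal sketch rather than code from the article; if your driver uses ToolRunner, you can achieve the same thing for a single run with -D mapreduce.job.ubertask.enable=true on the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class UberJobConfig {
      // Returns a Job configured to run "uber" (inside the ApplicationMaster),
      // provided it also satisfies the thresholds listed in table 1.
      public static Job newUberJob() throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true);
        return Job.getInstance(conf, "small-uber-job");
      }
    }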

Table 1 lists some additional properties that control whether a job qualifies for uberization.

Table 1 Properties for customizing uber jobs

    mapreduce.job.ubertask.maxmaps (default: 9)
        The number of mappers for a job must be less than or equal to this value
        for the job to be uberized.

    mapreduce.job.ubertask.maxreduces (default: 1)
        The number of reducers for a job must be less than or equal to this value
        for the job to be uberized.

    mapreduce.job.ubertask.maxbytes (default: the HDFS block size)
        The total input size of a job must be less than or equal to this value
        for the job to be uberized.

When running uber jobs, MapReduce disables speculative execution and also sets the maximum number of attempts for tasks to 1.

Reducer restrictions

Currently only map-only jobs and jobs with one reducer are supported for uberization.

Uber jobs are a handy new addition to the MapReduce capabilities, and they only work on YARN. This concludes our look at MapReduce on YARN. To read more about YARN, MapReduce, and Hadoop in action, check out Alex Holmes's book Hadoop in Practice, 2nd edition (for source code, sample chapters, the Online Author Forum, and other resources, go to http://manning.com/holmes2/).