Istanbul Şehir University Big Data Camp 14. Hadoop MapReduce. Aslan Bakirov, Kevser Nur Çoğalmış




Agenda
- Map Reduce Concepts
- System Overview: Hadoop MR
- Hadoop MR Internals
- Job Execution Workflow
- Map Side Details
- Reduce Side Details
- Future Concepts
- Demo
- Q&A

Map Reduce Concepts: Basic Idea. In the schema: input data is split into partitions and processed in parallel by map tasks; the output of the map tasks is collected by reduce tasks, where the final computation is done and the output is produced.
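The split/map/shuffle/reduce flow above can be sketched in plain Java, without Hadoop. This is a minimal in-memory simulation, not the Hadoop API: each input string stands in for one split, the framework's shuffle is modeled as grouping by key, and the reducer is a sum over each key's values.

```java
import java.util.*;

public class MiniWordCount {
    // Map phase: emit (word, 1) for every token in every split.
    // Shuffle: group the pairs by key. Reduce: sum each key's values.
    public static Map<String, Integer> wordCount(String[] splits) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String split : splits)                      // one "mapper" per split
            for (String word : split.split("\\s+"))
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);

        Map<String, Integer> counts = new TreeMap<>();   // one "reducer" call per key
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two input "splits", each handled by its own mapper in real Hadoop.
        System.out.println(wordCount(new String[]{"big data big", "data camp"}));
        // {big=2, camp=1, data=2}
    }
}
```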

Map Reduce Concepts
- JobClient: the client-side agent (shipped in hadoop-client.jar) that opens communication with the JobTracker and submits the job.
- JobTracker: the service within Hadoop that assigns MapReduce tasks to specific nodes in the cluster, ideally the nodes that hold the data, or at least nodes in the same rack. It also keeps track of tasks and input data.
- TaskTracker: a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.

Map Reduce Concepts http://aws.typepad.com/files/mapreduce.gif

Map Reduce Internal: Mapper Class

public static class Map extends MapReduceBase
    implements Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector&lt;Text, IntWritable&gt; output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
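The pairs this mapper emits are determined by StringTokenizer, which splits on whitespace. This standalone snippet (plain Java, no Hadoop dependency) shows the tokenization the mapper relies on:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    // Mirrors the mapper's loop: one token per whitespace-delimited word,
    // for each of which the mapper would emit a (word, 1) pair.
    public static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) out.add(tokenizer.nextToken());
        return out;
    }

    public static void main(String[] args) {
        // Runs of whitespace collapse; the mapper would emit
        // ("Hello",1), ("Hadoop",1), ("Hello",1) for this line.
        System.out.println(tokens("Hello Hadoop   Hello"));
    }
}
```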

Map Reduce Internal: Reducer Class

public static class Reduce extends MapReduceBase
    implements Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {

  public void reduce(Text key, Iterator&lt;IntWritable&gt; values,
                     OutputCollector&lt;Text, IntWritable&gt; output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Map Reduce Internal: Main Class

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}

Job Execution Workflow. Image source: Hadoop: The Definitive Guide, p. 168

Job Execution Workflow
1. JobClient.runJob(conf); the JobClient submits the job and the MapReduce program starts.
2. The JobClient asks the JobTracker for a new job ID.
3. The JobClient checks that the output directory does not already exist (new Path(args[1]) in our case).
4. The JobClient computes the input splits. (The input directory check also happens here, for new Path(args[0]).)

Job Execution Workflow
5. The JobClient copies the resources needed to run the job (the job JAR file, the configuration file and the computed input splits) to the JobTracker's file system, in a directory named after the job ID. The job JAR is copied with a high replication factor (mapred.submit.replication) across the cluster. (Good question: to which worker machines?)
6. The JobClient tells the JobTracker that the job is ready to run.
7. The JobTracker puts the job into an internal queue, where the job scheduler picks it up and initializes it.

Job Execution Workflow
8. The job scheduler retrieves the input splits from the shared file system (HDFS).
9. The job scheduler creates one map task for each split (here a split corresponds to a block of data).
10. The job scheduler creates as many reduce tasks as specified by the mapred.reduce.tasks property (the default is used if this property is not set).
11. The job scheduler gives IDs to all tasks at this point.
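The number of reduce tasks matters because each map output key is routed to exactly one of them. Hadoop's default HashPartitioner decides the routing; its logic is equivalent to this pure-Java sketch (numReduceTasks is just an illustrative parameter here):

```java
public class PartitionDemo {
    // Equivalent of Hadoop's default HashPartitioner: mask off the sign bit,
    // then take the remainder modulo the number of reduce tasks.
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every occurrence of the same word lands on the same reduce task,
        // so that reducer sees all counts for the word together.
        System.out.println(partition("hadoop", 4) == partition("hadoop", 4)); // true
    }
}
```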

Job Execution Workflow
12. Task assignment starts. How does the JobTracker know which TaskTrackers are ready to run tasks? TaskTrackers send heartbeats to the JobTracker periodically; as part of the heartbeat, a TaskTracker indicates whether it is ready to run a new task.
13. Assignments are done with priority given to map tasks.
14. The JobTracker assigns map tasks to TaskTrackers that are as close as possible to the related data. There are three options: data-local, rack-local and remote. Built-in counters hold these statistics.
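The three locality levels in step 14 can be sketched as a simple classification: compare the candidate TaskTracker against the hosts (and racks) holding the split's block. This is an illustrative model, not Hadoop's scheduler code, and the host/rack names are made up:

```java
import java.util.Set;

public class LocalityDemo {
    // Classify a candidate TaskTracker relative to where the split's block lives:
    // same host -> data-local, same rack -> rack-local, otherwise remote.
    public static String locality(String trackerHost, String trackerRack,
                                  Set<String> blockHosts, Set<String> blockRacks) {
        if (blockHosts.contains(trackerHost)) return "data-local";
        if (blockRacks.contains(trackerRack)) return "rack-local";
        return "remote";
    }

    public static void main(String[] args) {
        Set<String> hosts = Set.of("node1", "node2"); // hypothetical replica locations
        Set<String> racks = Set.of("rack-a");
        System.out.println(locality("node1", "rack-a", hosts, racks)); // data-local
        System.out.println(locality("node9", "rack-a", hosts, racks)); // rack-local
        System.out.println(locality("node9", "rack-z", hosts, racks)); // remote
    }
}
```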

Job Execution Workflow
15. Task execution starts in each TaskTracker.
16. The TaskTracker copies the job JAR to its local file system.
17. The TaskTracker creates a local folder for the task.
18. The TaskTracker creates an instance of TaskRunner.
19. The TaskRunner launches a new JVM for each task.

Job Execution Workflow
20. Each TaskTracker updates the JobTracker about the progress of its tasks.
21. The job succeeds when the tasks on every TaskTracker finish successfully.
22. The JobTracker sends a notification about the job status to the JobClient (over HTTP, etc.).

Map Side Details
- Each map task has a memory buffer that it writes its output to. The buffer is 100 MB by default (io.sort.mb).
- When the buffer fills to the spill threshold, io.sort.spill.percent (80% by default), a background thread starts to spill the contents to disk, into the directory specified by mapred.local.dir.
- Each time the memory buffer reaches the spill threshold, a new spill file is created.
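The buffer arithmetic above can be made concrete with a toy model. This is a deliberate simplification (real spills happen in a background thread while the mapper keeps writing; here spilling is modeled as synchronous), but it shows how the number of spill files follows from io.sort.mb and io.sort.spill.percent:

```java
public class SpillDemo {
    // How many spill files result from writing totalMb of map output through a
    // buffer of bufferMb (io.sort.mb) with spill threshold spillPercent
    // (io.sort.spill.percent)? Simplified: each threshold crossing spills the
    // whole buffer, and any remainder is flushed at the end.
    public static int spillCount(int totalMb, int bufferMb, double spillPercent) {
        double threshold = bufferMb * spillPercent; // e.g. 100 MB * 0.80 = 80 MB
        int spills = 0;
        double buffered = 0;
        for (int written = 0; written < totalMb; written++) {
            buffered += 1;
            if (buffered >= threshold) { spills++; buffered = 0; }
        }
        if (buffered > 0) spills++; // final flush of the remainder
        return spills;
    }

    public static void main(String[] args) {
        // 250 MB of map output, 100 MB buffer, 80% threshold -> 4 spill files.
        System.out.println(spillCount(250, 100, 0.80));
    }
}
```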

Reduce Side Details
- The map output files sit on the local disks of the TaskTrackers. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each one completes.
- As map tasks complete, they notify their TaskTracker with a status update, and the TaskTracker notifies the JobTracker through its heartbeat. The JobTracker therefore knows the mapping between map outputs and TaskTrackers.
- In the reduce phase, the reduce function runs once per key and saves its output to HDFS (generally).
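Before the reduce function can run once per key, the copied map outputs (each already sorted) must be merged into a single sorted stream. The merge can be sketched as a k-way merge with a priority queue; this is an illustrative model of the idea, not Hadoop's actual merge code:

```java
import java.util.*;

public class MergeDemo {
    // Merge several already-sorted map-output runs into one sorted stream,
    // as the reduce side does before grouping values by key.
    public static List<String> merge(List<List<String>> runs) {
        // Each queue entry is {runIndex, positionInRun}, ordered by current key.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> runs.get(e[0]).get(e[1])));
        for (int i = 0; i < runs.size(); i++)
            if (!runs.get(i).isEmpty()) pq.add(new int[]{i, 0});

        List<String> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] e = pq.poll();
            out.add(runs.get(e[0]).get(e[1]));
            if (e[1] + 1 < runs.get(e[0]).size())
                pq.add(new int[]{e[0], e[1] + 1}); // advance within that run
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> runs = List.of(
            List.of("apple", "data"), List.of("big", "camp"));
        System.out.println(merge(runs)); // [apple, big, camp, data]
    }
}
```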

Future Concepts
- Resource Management (YARN)
- Security

Demo