INFO5011. Cloud Computing Semester 2, 2011 Lecture 6, MapReduce

INFO5011 Cloud Computing, Semester 2, 2011. Lecture 6: MapReduce

COMMONWEALTH OF AUSTRALIA, Copyright Regulations 1969. WARNING: This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice.

The presentation is based on: Jeff Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. In OSDI'04. Original slides can be found at: http://labs.google.com/papers/mapreduce-osdi04-slides/index.html

Outline
- MapReduce framework introduction
- Word count example in Hadoop: Mapper, Reducer, Job, InputFormat, OutputFormat
- Sort example in Hadoop: custom Partitioner
- Chaining jobs

Motivation: Large-Scale Data Processing
Many tasks: process lots of data to produce other data. We want to use hundreds or thousands of CPUs, but this needs to be easy. MapReduce provides:
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and monitoring

Programming Model
Input and output: each is a set of key/value pairs. The programmer specifies two functions:
- map (in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair and produces a set of intermediate pairs.
- reduce (out_key, list(intermediate_value)) -> list(out_value)
  Combines all intermediate values for a particular key and produces a set of merged output values (usually just one).
Inspired by similar primitives in LISP and other languages.

Example: Count Word Occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Model Is Widely Applicable
Example uses: distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation.
Diagram from the original slides by Jeff Dean and Sanjay Ghemawat

Implementation
Typical cluster:
- 100s/1000s of 2-CPU x86 machines with 2-4 GB of memory
- Limited bisection bandwidth
- Storage on local IDE disks
- GFS: a distributed file system manages the data
Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines.

MapReduce Execution Overview
Data locality: splits 0 and 1 are located on the same worker machine, so both map tasks are assigned to that worker and the input data is read locally. One GFS chunk may correspond to one or more splits.
Master operation: the master stores the state of each map and reduce task. It receives intermediate file locations from the map tasks and pushes them to the reduce tasks incrementally.
Diagram from the CACM version of the original MapReduce paper

Parallel Execution
Map tasks read their input splits from GFS and write intermediate output to local disk; the shuffle transfers that data to the reducers via RPC reads, and the reducers write their final output back to GFS. The partition function puts all map output keys into R regions. In this example R = 2: k2, k4 and k5 are partitioned to region 1, while k1 and k3 are partitioned to region 2. The default partition function is hashing, e.g. hash(key) mod R.
Diagram from the original slides by Jeff Dean and Sanjay Ghemawat
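For reference, hash(key) mod R is essentially what Hadoop's default hashing partitioner does. The class below is a minimal reimplementation for illustration only (the name DefaultHashPartitioner is made up; Hadoop ships its own equivalent):

import org.apache.hadoop.mapreduce.Partitioner;

// Minimal sketch of the default hashing partitioner: key.hashCode() mod R,
// masked with Integer.MAX_VALUE so the result is never negative.
public class DefaultHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}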

Task Granularity and Pipelining
Fine-granularity tasks: many more map tasks than machines.
- Minimizes the time for fault recovery: many workers (instead of one) take over the tasks of a failed worker
- Shuffling can be pipelined with map execution
- Better dynamic load balancing
A typical configuration is 200,000 map tasks and 5,000 reduce tasks on 2,000 machines.
Diagram from the original slides by Jeff Dean and Sanjay Ghemawat

A trace of a MapReduce job (diagram)

A Real Word Count Example
Input data: one record per line with the fields user \t photo_id \t tags \t date \t place_id \n, e.g. user_id xyz@abc, date 2009-06-28 13:07:05, place_id ymji9yyya5u6djf41q, plus the tags.
Output: all tags that occur more than 500 times, together with their frequencies.

Sample Input and Output, and the Map/Reduce Jobs
Sample input lines:
2210682890 13772537@N08 2008-01-21 14:52:38 party music night photoshop disco festa OliN.OabAZ5BL3fBcw
2207053218 13772537@N08 2007-02-23 05:17:51 santa beach night la lanzarote playa playja YOTPODybCJ8C3IUN_g
2202530108 13772537@N08 2008-01-18 12:40:53 photoshop portait ubowh1cebpiaydi
2159607235 13772537@N08 2008-01-02 16:46:02 cap festa dany 2007 BXEugr.bAZ5Srgn39A
2159607231 13772537@N08 2008-01-02 16:46:02 cap festa dany 2007 BXEugr.bAZ5Srgn39A
2159607217 13772537@N08 2008-01-02 16:46:02 cap festa dany 2007 BXEugr.bAZ5Srgn39A
1921457510 13772537@N08 2007-11-08 09:58:49 city hdr ubowh1cebpiaydi
Sample output lines (tag and frequency):
ianbramham 922
iceland 1818
iledefrance 1187
illustration 608
imcomk 3584
installation 4072
interest 742
Map/Reduce jobs:
- Map: line -> (k: tag, v: 1)
- Reduce: (k: tag, v: (v1, v2, ...)) -> (k: tag, v: sum of the values)
The values seen by the reducer are not always the 1s emitted by the map function, because a combiner is normally applied after the map task.
Map/Reduce examples:
- Map: first line -> (party, 1), (music, 1), (night, 1), ..., (festa, 1)
- Reduce: (interest, (5, 10, ...)) -> (interest, 742)

Hadoop Framework
A master node runs the JobTracker, which coordinates MapReduce jobs.
- In a small or medium cluster (< 40 servers) it is fine to put the HDFS NameNode and the MapReduce JobTracker on the same physical node.
- In a large cluster (multiple racks) it is better to have dedicated NameNode and JobTracker machines.
Many slave nodes each run a TaskTracker, which executes the mapper and reducer tasks.
- TaskTrackers run on the DataNodes of HDFS
- Locality: moving computation to the data

The Important Java Classes in Hadoop: the Mapper task and the Reducer task (class diagram)

The Important Java Classes in Hadoop: the MapReduce Job (class diagram)

The Mapper

public static class TagMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] dataArray = value.toString().split("\t");
        if (dataArray.length < 5) {       // not a complete record with all fields
            return;                       // don't emit anything
        }
        String tagString = dataArray[3];
        if (tagString.length() > 0) {
            String[] tagArray = tagString.split(" ");
            for (String tag : tagArray) {
                word.set(tag);
                context.write(word, ONE);
            }
        }
    }
}

This example is based on the WordCount example that comes with the Hadoop download.

The Reducer

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    private final static int MINFREQ = 500;

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        if (sum > MINFREQ) {   // only emit when the sum is bigger than the threshold
            result.set(sum);
            context.write(key, result);
        }
    }
}
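Note that the job configuration on the next slide also registers this class as the combiner. In that role the sum tested against MINFREQ is only a partial, per-map-task sum, so a tag whose occurrences are spread thinly across many map tasks can be filtered out even though its total count exceeds 500. A safer pattern is to use a combiner that only sums, and to apply the threshold in the final reducer alone.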

The Job: Setting and Submitting

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setNumReduceTasks(2);
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TagMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
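Once compiled and packaged into a jar, a job like this is typically launched from the command line with, for example, hadoop jar wordcount.jar WordCount <input path> <output path> (the jar and class names here are illustrative); the two path arguments left over by GenericOptionsParser become otherArgs[0] and otherArgs[1] above.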

InputFormat
How input files are split up and read is defined by the InputFormat. FileInputFormat is the abstract base class for all file-based inputs.
- TextInputFormat (the default format for plain text files; reads lines of text files): key = byte offset of the line, value = the line content
- KeyValueTextInputFormat (parses lines into key/value pairs): key = everything up to the first tab character, value = the remainder of the line
- SequenceFileInputFormat (a Hadoop-specific high-performance binary format): key and value are user defined
InputSplit:
- An InputSplit describes a unit of work that comprises a single map task in a MapReduce job.
- By default, FileInputFormat and its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS).
- This can be modified by setting the split size in the configuration file or in code at run time.
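To illustrate the last point, the sketch below shows how the input format and split size can be set in code. It assumes the new org.apache.hadoop.mapreduce API; the exact static helpers available differ slightly between Hadoop versions, so treat this as a sketch rather than the lecture's code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch: choose an InputFormat explicitly and override the default split size.
public class SplitSizeDemo {
    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "split size demo");
        job.setInputFormatClass(TextInputFormat.class);               // key = byte offset, value = line content
        FileInputFormat.setMinInputSplitSize(job, 16 * 1024 * 1024L); // at least 16 MB per split
        FileInputFormat.setMaxInputSplitSize(job, 32 * 1024 * 1024L); // at most 32 MB per split
        return job;
    }
}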

RecordReader
- The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.
- The RecordReader instance is defined by the InputFormat. The default InputFormat, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value. The key associated with each line is its byte offset in the file.
- Developers can write their own RecordReader.
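As a minimal illustration of this extension point, the hypothetical InputFormat below (not code from the lecture) simply hands the framework Hadoop's own LineRecordReader, so it behaves like TextInputFormat; a real custom format would return its own RecordReader implementation.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical custom InputFormat: splits files like any FileInputFormat and
// returns a LineRecordReader, so key = byte offset and value = line content.
public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new LineRecordReader();
    }
}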

Output
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the process of turning a byte stream back into a series of structured objects. Serialization is important for interprocess communication and for persistent storage in Hadoop:
- The Reducer uses Remote Procedure Calls (RPC) to fetch the intermediate data stored locally on the Mapper nodes.
- The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message.
- The Reducer also writes its final results to HDFS.
Both interprocess communication and persistent storage require serialization to be compact, fast, extensible and interoperable. Hadoop uses its own serialization format, Writable; Text and IntWritable are both Writable types.
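To make the Writable contract concrete, here is a minimal, hypothetical custom Writable (the TagCount type is invented for illustration and is not used elsewhere in the lecture); the essential rule is that write() and readFields() serialize and deserialize the fields in the same order.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value type holding a tag and its count.
public class TagCount implements Writable {
    private String tag;
    private int count;

    public TagCount() { }            // Hadoop needs a no-argument constructor

    public void write(DataOutput out) throws IOException {
        out.writeUTF(tag);           // serialize the fields in a fixed order
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        tag = in.readUTF();          // deserialize them in the same order
        count = in.readInt();
    }
}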

Output
Context is used to collect and write the output into intermediate as well as final files; the method context.write() takes a (key, value) pair.
Partition and shuffle:
- The process of moving map outputs to the reducers is known as shuffling.
- The Partitioner class determines which partition a given (key, value) pair emitted by the mapper will go to. The default partitioner uses hashing.
Sort: each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
OutputFormat: the (key, value) pairs collected via Context are then written to output files. The way they are written is governed by the OutputFormat. The default TextOutputFormat writes lines in key \t value form.

How Hadoop Runs a MapReduce Job
The job resources copied to HDFS include the jar file, which is stored with a default replication factor of 10 (addressing the hotspot issue mentioned in the GFS paper).
Diagram from Tom White, Hadoop: The Definitive Guide, O'Reilly, 2009, page 154

The Communication Between Mapper and Reducer
Diagram from Tom White, Hadoop: The Definitive Guide, O'Reilly, 2009, page 163

Sort in Hadoop
Hadoop has a standard TeraSort example; the source can be found in the Hadoop distribution.
Each reducer sorts its input <key, value> pairs based on the keys:
- Primitive key types such as string and int have built-in comparators
- A custom key type should provide a comparator
Reducers are identified by an id in [0, R-1]:
- Reducer outputs are named part-r-00000, part-r-00001, ...
- The final output is a merge of all individual outputs in that order
The default hash partitioner ensures that map output keys are distributed evenly across all reducers. To implement a total sort, a customized partitioner is required to make sure all keys in reducer k-1 are always smaller than the keys in reducer k.
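To make the comparator point concrete: the sort order of the keys presented to each reducer can be changed by registering a comparator with job.setSortComparatorClass(...). The class below is a hypothetical sketch (not part of the lecture's solution, which instead sorts ascending with a second key-swapping job) that inverts the built-in IntWritable ordering so keys arrive in descending order.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator that reverses the natural (ascending) IntWritable order.
public class DescendingIntComparator extends WritableComparator {
    protected DescendingIntComparator() {
        super(IntWritable.class, true);   // true: instantiate keys for object comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -a.compareTo(b);           // negate the comparison to invert the order
    }
}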

Example: Sort Tags Based on Frequency
The initial output of word count is sorted on the key; we want the list to be sorted on frequency.
Output sorted on key (tag):
automobile 2700
aviation 1957
avion 1518
awesome 626
badajoz 734
bahia 701
barcelona 1305
barros 639
bc 2271
bcn 507
beach 6010
Output sorted on frequency:
beachwedding 51
pomegranate 51
400m 51
digitalrebel 51
grandwesterncanal 51
sa 51
linna 52
jenbrian 52
3steps 52
gazerock 52
creeks 52

A Second Job Is Needed
Map/Reduce jobs:
- Map: (tag, freq) -> (k: freq, v: tag)
- Reduce: (k: freq, v: (t1, t2, ...)) -> (k: t1, v: freq), (k: t2, v: freq), ...
Example data flow (from the slide's diagram): Mapper 1 reads (apple: 5), (pear: 6), (banana: 71) and emits 5:apple, 6:pear, 71:banana; Mapper 2 reads (beijing: 60), (city: 600), (palace: 35); Mapper 3 reads (red: 6), (yellow: 23), (green: 1002). After partitioning, Reducer 0 receives 5:apple, 6:pear, 6:red, 23:yellow, 35:palace and outputs apple: 5, pear: 6, red: 6, yellow: 23, palace: 35; Reducer 1 receives 60:beijing, 71:banana and outputs beijing: 60, banana: 71; Reducer 2 receives 600:city, 1002:green and outputs city: 600, green: 1002.

The Mapper and Reducer

public class TagFreqMapper extends Mapper<Object, Text, IntWritable, Text> {
    private Text word = new Text();
    private IntWritable freq = new IntWritable();

    // just reverse the key and the value
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tagFreqArray = value.toString().split("\t");
        if (tagFreqArray.length == 2) {
            word.set(tagFreqArray[0]);
            try {
                freq.set(Integer.parseInt(tagFreqArray[1]));
                context.write(freq, word);
            } catch (NumberFormatException e) {
                return;   // not doing anything
            }
        }
    }
}

public class TagFreqReducer extends Reducer<IntWritable, Text, Text, IntWritable> {
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text val : values) {
            context.write(val, key);
        }
    }
}

A very simple custom partitioner

package info5011sort;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SortPartitioner extends Partitioner<IntWritable, Text> {
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        // a simple and static partitioner that maps the key into 3 regions:
        // [6, 50) => 0, [50, 500) => 1, [500, -) => 2
        int keyInt = key.get();
        if (keyInt < 50) return 0;
        if (keyInt < 500) return 1;
        else return 2;
    }
}

Set the custom partitioner for this job: job.setPartitionerClass(SortPartitioner.class);

Problem with the Simple Partitioner
The keys are not evenly distributed! In the TeraSort algorithm, the boundaries of each region are determined dynamically, based on the data distribution. The distribution is estimated by taking a small sample of the data.
- For example, the tag frequency distribution has a Zipf-like shape.
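Hadoop ships helper classes that implement this sample-then-partition idea. The fragment below is a hedged sketch: it assumes the new-API classes in org.apache.hadoop.mapreduce.lib.partition, and class locations and signatures vary between Hadoop versions.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Sketch: estimate key boundaries from a random sample of the input (1% of records,
// at most 1000 samples), write them to a partition file, and let TotalOrderPartitioner
// route keys so that every key in reducer k-1 is smaller than every key in reducer k.
public class TotalOrderSortSetup {
    public static void configure(Job job) throws Exception {
        InputSampler.Sampler<IntWritable, Text> sampler =
                new InputSampler.RandomSampler<IntWritable, Text>(0.01, 1000);
        InputSampler.writePartitionFile(job, sampler);
        job.setPartitionerClass(TotalOrderPartitioner.class);
    }
}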

The Distribution of Tag Frequencies
[Histogram of tag frequencies with cumulative percentage, over tag frequency ranges]
Bin   Frequency   Cumulative %
10    7719        31.87%
20    5153        53.15%
30    2524        63.57%
40    1503        69.78%
The bounds 10 and 30 would roughly divide the keys evenly among three reducers.

Chaining Multiple Jobs
If we want to find the tag frequencies and then sort the tags based on frequency, we need to run two jobs. It is easy to chain jobs together in the form Map1 -> Reduce1 -> Map2 -> Reduce2: the first job in the chain writes its output to a path, which is then used as the input path for the second job.

Job countJob = new Job(conf, "count Tag");
countJob.setNumReduceTasks(3);
TextInputFormat.addInputPath(countJob, new Path(otherArgs[0]));
TextOutputFormat.setOutputPath(countJob, new Path("temp"));
countJob.waitForCompletion(true);

Job sortJob = new Job(conf, "sort Tag");
TextInputFormat.addInputPath(sortJob, new Path("temp"));
TextOutputFormat.setOutputPath(sortJob, new Path(args[1]));
System.exit(sortJob.waitForCompletion(true) ? 0 : 1);
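Two practical notes: the intermediate "temp" directory must not already exist when the first job starts, since Hadoop refuses to write into an existing output directory, so it is usually deleted beforehand or after the second job finishes; and for longer chains with more complex dependencies, Hadoop also provides a JobControl/ControlledJob mechanism that submits each job once the jobs it depends on have completed (its package location differs between the old mapred and the new mapreduce APIs).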

Your Turn
Input: photo records in the form user \t photo_id \t tags \t date \t place_id \n
Output: place_id \t freq