INFO5011. Cloud Computing Semester 2, 2011 Lecture 6, MapReduce

INFO5011 Cloud Computing, Semester 2, 2011. Lecture 6: MapReduce

COMMONWEALTH OF AUSTRALIA, Copyright Regulations 1969. WARNING: This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice.

The presentation is based on: Jeff Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. In OSDI'04. Original slides can be found at: http://labs.google.com/papers/mapreduce-osdi04-slides/index.html

Outline
- MapReduce framework introduction
- Word count example in Hadoop: Mapper, Reducer, Job, InputFormat, OutputFormat
- Sort example in Hadoop: custom Partitioner
- Chaining jobs

Motivation: Large-Scale Data Processing
Many tasks: process lots of data to produce other data. We want to use hundreds or thousands of CPUs, but this needs to be easy. MapReduce provides:
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and monitoring

Programming Model
Input and output: each is a set of key/value pairs. The programmer specifies two functions:
- map (in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair and produces a set of intermediate pairs.
- reduce (out_key, list(intermediate_value)) -> list(out_value)
  Combines all intermediate values for a particular key and produces a set of merged output values (usually just one).
Inspired by similar primitives in LISP and other languages.

Example: Count Word Occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Model Is Widely Applicable
Example uses: distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation.
Diagram from the original slides by Jeff Dean and Sanjay Ghemawat

Implementation
Typical cluster:
- 100s/1000s of 2-CPU x86 machines with 2-4 GB of memory
- Limited bisection bandwidth
- Storage on local IDE disks
- GFS: a distributed file system manages the data
Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines.

MapReduce Execution Overview
Data locality: splits 0 and 1 are located on the same worker machine, so both map tasks are assigned to that worker and the input data is read locally. One GFS chunk may correspond to one or more splits.
Master operation: the master stores the state of each map and reduce task. It receives intermediate file locations from the map tasks and pushes them to the reduce tasks incrementally.
Diagram from the CACM version of the original MapReduce paper

Parallel Execution
Map tasks read their input splits from GFS and write intermediate output to local disk; the shuffle transfers that data to the reducers via RPC reads, and the reducers write their final output back to GFS. The partition function puts all map output keys into R regions. In this example R = 2: k2, k4 and k5 are partitioned to region 1, while k1 and k3 are partitioned to region 2. The default partition function is hashing, e.g. hash(key) mod R.
Diagram from the original slides by Jeff Dean and Sanjay Ghemawat
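For reference, hash(key) mod R is essentially what Hadoop's default hashing partitioner does. The class below is a minimal reimplementation for illustration only (the name DefaultHashPartitioner is made up; Hadoop ships its own equivalent):

import org.apache.hadoop.mapreduce.Partitioner;

// Minimal sketch of the default hashing partitioner: key.hashCode() mod R,
// masked with Integer.MAX_VALUE so the result is never negative.
public class DefaultHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}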

Task Granularity and Pipelining
Fine-granularity tasks: many more map tasks than machines.
- Minimizes the time for fault recovery: many workers (instead of one) take over the tasks of a failed worker
- Shuffling can be pipelined with map execution
- Better dynamic load balancing
A typical configuration is 200,000 map tasks and 5,000 reduce tasks on 2,000 machines.
Diagram from the original slides by Jeff Dean and Sanjay Ghemawat

A trace of a MapReduce job (diagram)

A Real Word Count Example
Input data: one record per line with the fields user \t photo_id \t tags \t date \t place_id \n, e.g. user_id xyz@abc, date 2009-06-28 13:07:05, place_id ymji9yyya5u6djf41q, plus the tags.
Output: all tags that occur more than 500 times, together with their frequencies.

Sample Input and Output, and the Map/Reduce Jobs
Sample input lines:
2210682890 13772537@N08 2008-01-21 14:52:38 party music night photoshop disco festa OliN.OabAZ5BL3fBcw
2207053218 13772537@N08 2007-02-23 05:17:51 santa beach night la lanzarote playa playja YOTPODybCJ8C3IUN_g
2202530108 13772537@N08 2008-01-18 12:40:53 photoshop portait ubowh1cebpiaydi
2159607235 13772537@N08 2008-01-02 16:46:02 cap festa dany 2007 BXEugr.bAZ5Srgn39A
2159607231 13772537@N08 2008-01-02 16:46:02 cap festa dany 2007 BXEugr.bAZ5Srgn39A
2159607217 13772537@N08 2008-01-02 16:46:02 cap festa dany 2007 BXEugr.bAZ5Srgn39A
1921457510 13772537@N08 2007-11-08 09:58:49 city hdr ubowh1cebpiaydi
Sample output lines (tag and frequency):
ianbramham 922
iceland 1818
iledefrance 1187
illustration 608
imcomk 3584
installation 4072
interest 742
Map/Reduce jobs:
- Map: line -> (k: tag, v: 1)
- Reduce: (k: tag, v: (v1, v2, ...)) -> (k: tag, v: sum of the values)
The values seen by the reducer are not always the 1s emitted by the map function, because a combiner is normally applied after the map task.
Map/Reduce examples:
- Map: first line -> (party, 1), (music, 1), (night, 1), ..., (festa, 1)
- Reduce: (interest, (5, 10, ...)) -> (interest, 742)

Hadoop Framework
A master node runs the JobTracker, which coordinates MapReduce jobs.
- In a small or medium cluster (< 40 servers) it is fine to put the HDFS NameNode and the MapReduce JobTracker on the same physical node.
- In a large cluster (multiple racks) it is better to have dedicated NameNode and JobTracker machines.
Many slave nodes each run a TaskTracker, which executes the mapper and reducer tasks.
- TaskTrackers run on the DataNodes of HDFS
- Locality: moving computation to the data

The Important Java Classes in Hadoop: the Mapper task and the Reducer task (class diagram)

The Important Java Classes in Hadoop: the MapReduce Job (class diagram)

The Mapper

public static class TagMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] dataArray = value.toString().split("\t");
        if (dataArray.length < 5) {       // not a complete record with all fields
            return;                       // don't emit anything
        }
        String tagString = dataArray[3];
        if (tagString.length() > 0) {
            String[] tagArray = tagString.split(" ");
            for (String tag : tagArray) {
                word.set(tag);
                context.write(word, ONE);
            }
        }
    }
}

This example is based on the WordCount example that comes with the Hadoop download.

The Reducer

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    private final static int MINFREQ = 500;

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        if (sum > MINFREQ) {   // only emit when the sum is bigger than the threshold
            result.set(sum);
            context.write(key, result);
        }
    }
}
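Note that the job configuration on the next slide also registers this class as the combiner. In that role the sum tested against MINFREQ is only a partial, per-map-task sum, so a tag whose occurrences are spread thinly across many map tasks can be filtered out even though its total count exceeds 500. A safer pattern is to use a combiner that only sums, and to apply the threshold in the final reducer alone.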

The Job: Setting and Submitting

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setNumReduceTasks(2);
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TagMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
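Once compiled and packaged into a jar, a job like this is typically launched from the command line with, for example, hadoop jar wordcount.jar WordCount <input path> <output path> (the jar and class names here are illustrative); the two path arguments left over by GenericOptionsParser become otherArgs[0] and otherArgs[1] above.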

InputFormat
How input files are split up and read is defined by the InputFormat. FileInputFormat is the abstract base class for all file-based inputs.
- TextInputFormat (the default format for plain text files; reads lines of text files): key = byte offset of the line, value = the line content
- KeyValueTextInputFormat (parses lines into key/value pairs): key = everything up to the first tab character, value = the remainder of the line
- SequenceFileInputFormat (a Hadoop-specific high-performance binary format): key and value are user defined
InputSplit:
- An InputSplit describes a unit of work that comprises a single map task in a MapReduce job.
- By default, FileInputFormat and its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS).
- This can be modified by setting the split size in the configuration file or in code at run time.
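To illustrate the last point, the sketch below shows how the input format and split size can be set in code. It assumes the new org.apache.hadoop.mapreduce API; the exact static helpers available differ slightly between Hadoop versions, so treat this as a sketch rather than the lecture's code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch: choose an InputFormat explicitly and override the default split size.
public class SplitSizeDemo {
    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "split size demo");
        job.setInputFormatClass(TextInputFormat.class);               // key = byte offset, value = line content
        FileInputFormat.setMinInputSplitSize(job, 16 * 1024 * 1024L); // at least 16 MB per split
        FileInputFormat.setMaxInputSplitSize(job, 32 * 1024 * 1024L); // at most 32 MB per split
        return job;
    }
}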

RecordReader
- The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.
- The RecordReader instance is defined by the InputFormat. The default InputFormat, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value. The key associated with each line is its byte offset in the file.
- Developers can write their own RecordReader.
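As a minimal illustration of this extension point, the hypothetical InputFormat below (not code from the lecture) simply hands the framework Hadoop's own LineRecordReader, so it behaves like TextInputFormat; a real custom format would return its own RecordReader implementation.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical custom InputFormat: splits files like any FileInputFormat and
// returns a LineRecordReader, so key = byte offset and value = line content.
public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new LineRecordReader();
    }
}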

Output
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the process of turning a byte stream back into a series of structured objects. Serialization is important for interprocess communication and for persistent storage in Hadoop:
- The Reducer uses Remote Procedure Calls (RPC) to fetch the intermediate data stored locally on the Mapper nodes.
- The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message.
- The Reducer also writes its final results to HDFS.
Both interprocess communication and persistent storage require serialization to be compact, fast, extensible and interoperable. Hadoop uses its own serialization format, Writable; Text and IntWritable are both Writable types.
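To make the Writable contract concrete, here is a minimal, hypothetical custom Writable (the TagCount type is invented for illustration and is not used elsewhere in the lecture); the essential rule is that write() and readFields() serialize and deserialize the fields in the same order.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value type holding a tag and its count.
public class TagCount implements Writable {
    private String tag;
    private int count;

    public TagCount() { }            // Hadoop needs a no-argument constructor

    public void write(DataOutput out) throws IOException {
        out.writeUTF(tag);           // serialize the fields in a fixed order
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        tag = in.readUTF();          // deserialize them in the same order
        count = in.readInt();
    }
}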

Output
Context is used to collect and write the output into intermediate as well as final files; the method context.write() takes a (key, value) pair.
Partition and shuffle:
- The process of moving map outputs to the reducers is known as shuffling.
- The Partitioner class determines which partition a given (key, value) pair emitted by the mapper will go to. The default partitioner uses hashing.
Sort: each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
OutputFormat: the (key, value) pairs collected via Context are then written to output files. The way they are written is governed by the OutputFormat. The default TextOutputFormat writes lines in key \t value form.

How Hadoop Runs a MapReduce Job
The job resources copied to HDFS include the jar file, which is stored with a default replication factor of 10 (addressing the hotspot issue mentioned in the GFS paper).
Diagram from Tom White, Hadoop: The Definitive Guide, O'Reilly, 2009, page 154

The Communication Between Mapper and Reducer
Diagram from Tom White, Hadoop: The Definitive Guide, O'Reilly, 2009, page 163

Sort in Hadoop
Hadoop has a standard TeraSort example; the source can be found in the Hadoop distribution.
Each reducer sorts its input <key, value> pairs based on the keys:
- Primitive key types such as string and int have built-in comparators
- A custom key type should provide a comparator
Reducers are identified by an id in [0, R-1]:
- Reducer outputs are named part-r-00000, part-r-00001, ...
- The final output is a merge of all individual outputs in that order
The default hash partitioner ensures that map output keys are distributed evenly across all reducers. To implement a total sort, a customized partitioner is required to make sure all keys in reducer k-1 are always smaller than the keys in reducer k.
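To make the comparator point concrete: the sort order of the keys presented to each reducer can be changed by registering a comparator with job.setSortComparatorClass(...). The class below is a hypothetical sketch (not part of the lecture's solution, which instead sorts ascending with a second key-swapping job) that inverts the built-in IntWritable ordering so keys arrive in descending order.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator that reverses the natural (ascending) IntWritable order.
public class DescendingIntComparator extends WritableComparator {
    protected DescendingIntComparator() {
        super(IntWritable.class, true);   // true: instantiate keys for object comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -a.compareTo(b);           // negate the comparison to invert the order
    }
}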

Example: Sort Tags Based on Frequency
The initial output of word count is sorted on the key; we want the list to be sorted on frequency.
Output sorted on key (tag):
automobile 2700
aviation 1957
avion 1518
awesome 626
badajoz 734
bahia 701
barcelona 1305
barros 639
bc 2271
bcn 507
beach 6010
Output sorted on frequency:
beachwedding 51
pomegranate 51
400m 51
digitalrebel 51
grandwesterncanal 51
sa 51
linna 52
jenbrian 52
3steps 52
gazerock 52
creeks 52

A Second Job Is Needed
Map/Reduce jobs:
- Map: (tag, freq) -> (k: freq, v: tag)
- Reduce: (k: freq, v: (t1, t2, ...)) -> (k: t1, v: freq), (k: t2, v: freq), ...
Example data flow (from the slide's diagram): Mapper 1 reads (apple: 5), (pear: 6), (banana: 71) and emits 5:apple, 6:pear, 71:banana; Mapper 2 reads (beijing: 60), (city: 600), (palace: 35); Mapper 3 reads (red: 6), (yellow: 23), (green: 1002). After partitioning, Reducer 0 receives 5:apple, 6:pear, 6:red, 23:yellow, 35:palace and outputs apple: 5, pear: 6, red: 6, yellow: 23, palace: 35; Reducer 1 receives 60:beijing, 71:banana and outputs beijing: 60, banana: 71; Reducer 2 receives 600:city, 1002:green and outputs city: 600, green: 1002.

The Mapper and Reducer

public class TagFreqMapper extends Mapper<Object, Text, IntWritable, Text> {
    private Text word = new Text();
    private IntWritable freq = new IntWritable();

    // just reverse the key and the value
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tagFreqArray = value.toString().split("\t");
        if (tagFreqArray.length == 2) {
            word.set(tagFreqArray[0]);
            try {
                freq.set(Integer.parseInt(tagFreqArray[1]));
                context.write(freq, word);
            } catch (NumberFormatException e) {
                return;   // not doing anything
            }
        }
    }
}

public class TagFreqReducer extends Reducer<IntWritable, Text, Text, IntWritable> {
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text val : values) {
            context.write(val, key);
        }
    }
}

A very simple custom partitioner

package info5011sort;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SortPartitioner extends Partitioner<IntWritable, Text> {
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        // a simple and static partitioner that maps the key into 3 regions:
        // [6, 50) => 0, [50, 500) => 1, [500, -) => 2
        int keyInt = key.get();
        if (keyInt < 50) return 0;
        if (keyInt < 500) return 1;
        else return 2;
    }
}

Set the custom partitioner for this job: job.setPartitionerClass(SortPartitioner.class);

Problem with the Simple Partitioner
The keys are not evenly distributed! In the TeraSort algorithm, the boundaries of each region are determined dynamically, based on the data distribution. The distribution is estimated by taking a small sample of the data.
- For example, the tag frequency distribution has a Zipf-like shape.
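Hadoop ships helper classes that implement this sample-then-partition idea. The fragment below is a hedged sketch: it assumes the new-API classes in org.apache.hadoop.mapreduce.lib.partition, and class locations and signatures vary between Hadoop versions.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Sketch: estimate key boundaries from a random sample of the input (1% of records,
// at most 1000 samples), write them to a partition file, and let TotalOrderPartitioner
// route keys so that every key in reducer k-1 is smaller than every key in reducer k.
public class TotalOrderSortSetup {
    public static void configure(Job job) throws Exception {
        InputSampler.Sampler<IntWritable, Text> sampler =
                new InputSampler.RandomSampler<IntWritable, Text>(0.01, 1000);
        InputSampler.writePartitionFile(job, sampler);
        job.setPartitionerClass(TotalOrderPartitioner.class);
    }
}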

The Distribution of Tag Frequencies
[Histogram of tag frequencies with cumulative percentage, over tag frequency ranges]
Bin   Frequency   Cumulative %
10    7719        31.87%
20    5153        53.15%
30    2524        63.57%
40    1503        69.78%
The bounds 10 and 30 would roughly divide the keys evenly among three reducers.

Chaining Multiple Jobs
If we want to find the tag frequencies and then sort the tags based on frequency, we need to run two jobs. It is easy to chain jobs together in the form Map1 -> Reduce1 -> Map2 -> Reduce2: the first job in the chain writes its output to a path, which is then used as the input path for the second job.

Job countJob = new Job(conf, "count Tag");
countJob.setNumReduceTasks(3);
TextInputFormat.addInputPath(countJob, new Path(otherArgs[0]));
TextOutputFormat.setOutputPath(countJob, new Path("temp"));
countJob.waitForCompletion(true);

Job sortJob = new Job(conf, "sort Tag");
TextInputFormat.addInputPath(sortJob, new Path("temp"));
TextOutputFormat.setOutputPath(sortJob, new Path(args[1]));
System.exit(sortJob.waitForCompletion(true) ? 0 : 1);
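Two practical notes: the intermediate "temp" directory must not already exist when the first job starts, since Hadoop refuses to write into an existing output directory, so it is usually deleted beforehand or after the second job finishes; and for longer chains with more complex dependencies, Hadoop also provides a JobControl/ControlledJob mechanism that submits each job once the jobs it depends on have completed (its package location differs between the old mapred and the new mapreduce APIs).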

Your Turn
Input: photo records in the form user \t photo_id \t tags \t date \t place_id \n
Output: place_id \t freq