Introduction to the MapReduce Paradigm and Apache Hadoop
Sriram Krishnan
sriram@sdsc.edu
Programming Model
- The computation takes a set of input key/value pairs, and produces a set of output key/value pairs.
- The MapReduce library expresses the computation as two functions: Map and Reduce.
- All data resides in files, e.g. in the Google File System (GFS).
Function Prototypes
- map (k1, v1) → list(k2, v2)
  The map function takes a key/value pair and generates a list of new key/value pairs.
- reduce (k2, list(v2)) → list(v2)
  The reduce function takes a key/list pair and generates a list of resulting values.
Example: Word Count

  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));

- The map function emits a word and an associated count.
- The reduce function sums together all counts emitted for a particular word.
Example: Word Count

Input File 1: Hello World Bye World
Input File 2: Hello Hadoop Goodbye Hadoop

1. Map Phase
   First Map:  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
   Second Map: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

2. Combiner Phase
   First Combiner:  <Bye, 1> <Hello, 1> <World, 2>
   Second Combiner: <Goodbye, 1> <Hadoop, 2> <Hello, 1>

3. Reduce Phase
   Reducer: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
Implementation Details
- The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
- Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R), as sketched below.
- The number of partitions (R) and the partitioning function may be specified by the user.
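As an illustration, here is a minimal sketch of such a partitioning function in the classic org.apache.hadoop.mapred API used later in this deck. The class name HashModPartitioner is hypothetical; Hadoop's built-in HashPartitioner already provides this default hash(key) mod R behavior.

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class HashModPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) {
      // No tunable state needed for this simple scheme.
    }

    // numPartitions is R, the number of reduce tasks configured for the job.
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      // Mask off the sign bit so the index is non-negative, then take mod R.
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

A job would select it with conf.setPartitionerClass(HashModPartitioner.class) and fix R with conf.setNumReduceTasks(R).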
Execution Overview
Advantages
- Scalable, and conducive to data-intensive and data-parallel applications.
- Fault-tolerant by design: workers can be restarted on failures.
- Ability to run on non-specialized commodity hardware.
Common Complaints
- Need to write code to get an application to conform to the MapReduce programming model.
- Need to be able to script queries at run time.
- Need a higher-level, SQL-like abstraction: complicated SQL-type queries are hard to write.
- Too simplistic: the onus of optimization falls on the programmer, not the database engine.
Apache Hadoop
- Hadoop provides an open-source implementation of MapReduce.
- Uses the Hadoop Distributed File System (HDFS), which is a GFS clone.
- Has been demonstrated on clusters with 2,000 nodes.
HDFS
- A distributed file system designed to run on commodity hardware.
- Many similarities with existing distributed file systems.
- Highly fault tolerant, and designed to be deployed on low-cost hardware.
- Provides high-throughput access to application data, and is suitable for applications that have large data sets.
- HDFS relaxes a few POSIX requirements to enable streaming access to file system data (see the client-side sketch below).
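As an illustration of that access model, here is a minimal client-side sketch using Hadoop's FileSystem API. The class name HdfsExample and the path /tmp/hello.txt are illustrative assumptions, not part of the deck.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration(); // picks up core-site.xml, etc.
      FileSystem fs = FileSystem.get(conf);     // HDFS if so configured

      // Write a file (write-once, streaming-style access).
      FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"));
      out.writeBytes("Hello HDFS\n");
      out.close();

      // Read it back sequentially.
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path("/tmp/hello.txt"))));
      System.out.println(in.readLine());
      in.close();
    }
  }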
Hadoop: Modes of Operation
- Stand-alone
  - By default, Hadoop is configured to run in a non-distributed mode, as a single Java process.
  - Mostly useful for debugging.
- Pseudo-distributed
  - Hadoop can also be run on a single node in a pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process (see the configuration sketch below).
- Fully distributed
  - Typically involves unpacking the software on all the machines in the cluster.
  - One machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are the masters.
  - The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves.
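As an illustration, a minimal sketch of the classic pseudo-distributed setup (Hadoop 0.20-era file and property names; localhost with ports 9000 and 9001 are the conventional quickstart defaults, not requirements):

  <!-- conf/core-site.xml -->
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
  </configuration>

  <!-- conf/hdfs-site.xml: a single node can hold only one replica -->
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>

  <!-- conf/mapred-site.xml -->
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
  </configuration>

After formatting HDFS with bin/hadoop namenode -format, the daemons are started with bin/start-all.sh.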
Hadoop Code: Word Count

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class WordCount {

    // Map function: emits <word, 1> for every word in the input line
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, one);
        }
      }
    }

    // Reduce function: sums together all counts emitted for each word
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }
Hadoop Code: Contd.

    // Job setup and launch
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }
  }
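Assuming the class has been compiled and packaged into a jar named wordcount.jar (an illustrative name) and the input text has been copied into HDFS, the job can be launched from the Hadoop installation directory along these lines:

  bin/hadoop fs -put local-input input                  # copy input files into HDFS
  bin/hadoop jar wordcount.jar WordCount input output   # run the job
  bin/hadoop fs -cat output/part-00000                  # inspect the reducer's output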
References
- MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. In Proceedings of OSDI 2004.
- The Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. In Proceedings of SOSP 2003.
- Apache Hadoop: http://hadoop.apache.org/
- HDFS: http://hadoop.apache.org/core/docs/current/hdfs_design.html