Recommended Literature


COSC 6397 Big Data Analytics
Introduction to Map Reduce (I)
Edgar Gabriel
Spring 2014

Recommended Literature
- Original MapReduce paper by Google
- Fantastic resource for tutorials, examples etc.:

Map Reduce Programming Model
- Input: key/value pairs; output: a set of key/value pairs
- Map: input pair -> intermediate key/value pairs
    (k1, v1) -> list(k2, v2)
- Reduce: one key and all associated intermediate values
    (k2, list(v2)) -> list(v3)
- Reduce stage starts when the final map task is done

Task Granularity and Pipelining
- Many tasks means
  - Minimal time for fault recovery
  - Better pipelining of shuffling with map execution
  - Better load balancing

Slide based on Dan Weld's class at U. Washington (who in turn made his slides based on those by Jeff Dean, Sanjay Ghemawat, Google, Inc.)
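The two signatures above can be made concrete with a small single-process sketch. It is not Hadoop code: the class `MiniMapReduce`, its `run` method and the city/temperature records are hypothetical names and data chosen only for illustration. The sketch runs every map call, groups the intermediate pairs by key, and only then starts reducing.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Single-process illustration of the programming model (hypothetical helper, not Hadoop).
class MiniMapReduce {
    // map:    (k1, v1)       -> list(k2, v2)
    // reduce: (k2, list(v2)) -> list(v3)
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V3>> reduce) {
        // Map phase: collect intermediate pairs and group them by intermediate key.
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : map.apply(in.getKey(), in.getValue())) {
                groups.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        // Reduce phase: starts only after the last map call has finished.
        Map<K2, List<V3>> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : groups.entrySet()) {
            result.put(e.getKey(), reduce.apply(e.getKey(), e.getValue()));
        }
        return result;
    }

    // Example job: maximum temperature per city from "city temperature" records.
    static Map<String, List<Integer>> maxPerCity(List<String> lines) {
        List<Map.Entry<Integer, String>> input = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            input.add(new SimpleEntry<>(i, lines.get(i)));   // (k1, v1) = (line number, line)
        }
        BiFunction<Integer, String, List<Map.Entry<String, Integer>>> map = (lineNo, line) -> {
            String[] parts = line.trim().split("\\s+");
            Map.Entry<String, Integer> pair =
                    new SimpleEntry<>(parts[0], Integer.parseInt(parts[1]));
            return Collections.singletonList(pair);          // list(k2, v2)
        };
        BiFunction<String, List<Integer>, List<Integer>> reduce =
                (city, temps) -> Collections.singletonList(Collections.max(temps));
        return run(input, map, reduce);
    }

    public static void main(String[] args) {
        System.out.println(maxPerCity(Arrays.asList("houston 31", "austin 35", "houston 38")));
    }
}
```

In a real framework the map calls run in parallel on different nodes and the grouping step is the distributed shuffle; the sequential loops here only show the data flow.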

Map Reduce Framework
- Takes care of distributed processing and coordination
- Scheduling
  - Jobs are broken down into smaller chunks called tasks
  - Scheduler assigns tasks to available resources
- Task localization with data
  - Framework strives to place tasks on the nodes that host the segment of data to be processed by that task
  - Code is moved to data, not data to code

Slide based on lecture

Major Components
- User components:
  - Mapper
  - Reducer
  - Combiner (optional)
  - Partitioner (optional) (Shuffle)
  - Writable(s) (optional)
- System components:
  - Master
  - Input Splitter*
  - Output Committer*
* You can use your own if you really want!

Notes
- Mappers and Reducers are typically single-threaded and deterministic
  - Determinism allows for restarting of failed jobs, or speculative execution
- Mappers/Reducers run entirely independently of each other and can run on an arbitrary number of nodes
  - In Hadoop, they run in separate JVMs

Input Splitter
- Is responsible for splitting the input into multiple chunks
- These chunks are then used as input for your mappers
- Splits on logical boundaries
- Multiple splitters are available in Hadoop
  - Break up data by line, fixed chunk, etc.
- An application can write its own splitter

Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
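To illustrate what "splits on logical boundaries" means, here is a simplified in-memory sketch that cuts text into chunks of roughly a target size but only at line ends. The class `LineSplitterSketch` and its `targetChars` parameter are hypothetical; Hadoop's own input formats work on byte ranges of files and are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only; not Hadoop's InputFormat/InputSplit API.
class LineSplitterSketch {
    // Cuts text into chunks of roughly targetChars characters, but only at line ends,
    // so that no record (line) is torn apart between two mappers.
    static List<String> split(String text, int targetChars) {
        List<String> splits = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\n")) {
            // Close the current chunk if adding this line would exceed the target size.
            if (current.length() > 0 && current.length() + line.length() + 1 > targetChars) {
                splits.add(current.toString());
                current.setLength(0);
            }
            current.append(line).append('\n');
        }
        if (current.length() > 0) {
            splits.add(current.toString());
        }
        return splits;
    }

    public static void main(String[] args) {
        for (String s : split("aaaa\nbbbb\ncc\n", 8)) {
            System.out.println("split: " + s.replace("\n", "\\n"));
        }
    }
}
```

Each returned chunk would be handed to one mapper; a line longer than the target simply becomes its own chunk.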

Mapper
- Reads in an input pair <K, V> (a section as split by the input splitter)
- Outputs a pair <K', V'>
- Example: for Word Count with the following input:
    The teacher went to the store. The store was closed; the store opens in the morning. The store opens at 9am.
- The output would be (ignoring case and punctuation):
    <the, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1> <the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store, 1> <opens, 1> <in, 1> <the, 1> <morning, 1> <the, 1> <store, 1> <opens, 1> <at, 1> <9am, 1>

Reducer
- Accepts the Mapper output, and collects values on the key
  - All inputs with the same key must go to the same reducer!
- Input is typically sorted; output is written exactly as is
- For our example, the reducer input would be the 21 mapper output pairs listed above, grouped by key
- The output would be:
    <the, 6> <teacher, 1> <went, 1> <to, 1> <store, 4> <was, 1> <closed, 1> <opens, 2> <in, 1> <morning, 1> <at, 1> <9am, 1>
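The word count example can be replayed in plain Java without a cluster. This is a sketch, not the Hadoop implementation: the class `WordCountSketch` is a hypothetical name, and lowercasing plus stripping punctuation is an assumption made so that "The" and "the" count as the same word.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java replay of the example; hypothetical helper, not Hadoop code.
class WordCountSketch {
    static final String INPUT = "The teacher went to the store. The store was closed; "
            + "the store opens in the morning. The store opens at 9am.";

    // Mapper: emit <word, 1> for every token (assumption: lowercase, punctuation removed).
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Reducer: sum all values that share a key; TreeMap keeps the keys sorted.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = map(INPUT);
        System.out.println(pairs.size() + " intermediate pairs");
        System.out.println(reduce(pairs));
    }
}
```

For the input sentence this yields 21 intermediate pairs and 12 distinct words, with "the" counted 6 times, "store" 4 times and "opens" twice.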

Combiner
- A combiner is an intermediate reducer
- Optional
- Reduces the output from each mapper
  - Reduces bandwidth and limits the data size for sorting
- Cannot change the type of its input
  - Input types must be the same as output types

Output Committer
- Is responsible for taking the reduce output, and committing it to a file
- Typically, this committer needs a corresponding input splitter (so that another job can read the output as its input)
- Again, the built-in implementations are usually good enough, unless you need to output a special kind of file
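A small sketch of why a combiner saves bandwidth, assuming two mappers that each hold a few words of the word count example. The class `CombinerSketch` and the word lists are hypothetical; the point is that the combiner consumes and produces the same <word, count> type and that the final result does not change.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of local aggregation; not Hadoop code.
class CombinerSketch {
    // Combiner: runs on one mapper's output; <String, Integer> in, <String, Integer> out.
    static Map<String, Integer> combine(List<String> mapperWords) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String w : mapperWords) {
            partial.merge(w, 1, Integer::sum);   // each word arrives as <word, 1>
        }
        return partial;
    }

    // Reducer: sums the partial counts received from all mappers.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> m1 = combine(Arrays.asList("the", "store", "the"));
        Map<String, Integer> m2 = combine(Arrays.asList("the", "store"));
        // Without a combiner 5 pairs would be shuffled; with it only 4.
        System.out.println((m1.size() + m2.size()) + " pairs shuffled");
        System.out.println(reduce(Arrays.asList(m1, m2)));
    }
}
```

Summation works as a combiner because it is associative and commutative; an operation such as a plain average would need a different intermediate representation.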

Partitioner (Shuffler)
- Decides which pairs are sent to which reducer
- Default is simply: key.hashCode() % numOfReducers
- User can override it to:
  - Provide a (more) uniform distribution of load between reducers
  - Make sure that certain values are sent to the same reducer
    - Ex.: to compute the relative frequency of a pair of words <W1, W2>, you would need to make sure that all pairs whose first word is W1 are sent to the same reducer
  - Bin the results

Master
- Responsible for scheduling and managing jobs
- Scheduled computation should be close to the data if possible
  - Bandwidth is expensive! (and slow)
  - This relies on a distributed file system (GFS / HDFS)!
- If a task fails to report progress (such as reading input, writing output, etc.), crashes, or its machine goes down, it is assumed to be stuck and is killed, and the step is re-launched (with the same input)
- The Master is handled by the framework; no user code is necessary
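The two partitioning strategies can be sketched in plain Java. The class `PartitionerSketch`, its method names and the "w1 w2" key encoding are assumptions for illustration; the bit mask is there because `hashCode()` may be negative, and a bare `%` would then return a negative partition number.

```java
// Hypothetical illustration of partitioning logic; not the Hadoop Partitioner class.
class PartitionerSketch {
    // Default strategy: hash of the key modulo the number of reducers.
    // The mask clears the sign bit so the result is always in [0, numReducers).
    static int defaultPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Custom strategy for word-pair keys encoded as "w1 w2" (assumed encoding):
    // partition on w1 only, so all pairs starting with w1 meet at one reducer.
    static int firstWordPartition(String pairKey, int numReducers) {
        String firstWord = pairKey.split(" ", 2)[0];
        return defaultPartition(firstWord, numReducers);
    }

    public static void main(String[] args) {
        int reducers = 4;
        System.out.println(firstWordPartition("store opens", reducers));
        System.out.println(firstWordPartition("store was", reducers));   // same reducer as above
        System.out.println(defaultPartition("store opens", reducers));   // may differ
    }
}
```

With `firstWordPartition`, one reducer sees every pair for a given W1 and can therefore compute the total needed for the relative frequency.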

Master (II)
- HDFS can replicate data to be local if necessary for scheduling (discussed later in the course)
- Because our nodes are (or at least should be) deterministic
  - The Master can restart failed nodes
    - Nodes should have no side effects!
  - If a node is the last step, and is completing slowly, the master can launch a second copy of that node
    - The first one to complete wins, then any other runs are killed

Word Count

Map Reduce word count code sample
- Implement Mapper
  - Input is text
  - Tokenize the text and emit each token with a count of 1: <token, 1>
- Implement Reducer
  - Sum up the counts for each token
  - Write out the result to the output file
- Configure the Job
  - Specify Input, Output, Mapper, Reducer and Combiner
- Run the job

Mapper
- Mapper class
  - Class has 4 Java generics parameters: (1) input key, (2) input value, (3) output key, (4) output value
  - Input and output utilize Hadoop's IO framework: org.apache.hadoop.io
- The map() method is handed a Context object; use it to:
  - Write output
  - Create your own counters

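Mapper code in Hadoop

    // Imports go at the top of the file; the class is nested inside the WordCount class.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // key: byte offset of the line in the file, value: the line itself
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit <token, 1>
            }
        }
    }

Writables
- Types that can be serialized / deserialized to a stream
  - The framework will serialize your data before writing it to disk
- Input/output classes are required to be Writables
- Users can implement this interface, and use their own types for their input/output/intermediate values
- There are defaults for basic values, like
  - Strings: Text
  - Integers: IntWritable
  - Long: LongWritable
  - Float: FloatWritable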
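As a sketch of what a user-defined Writable has to do, the following class mirrors the two methods of Hadoop's Writable interface, `write(DataOutput)` and `readFields(DataInput)`. To stay runnable without Hadoop on the classpath it does not actually declare `implements Writable`; the class name `PointWritableSketch` and the `roundTrip` helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// A 2-D point with the same two methods a Hadoop Writable provides.
// In a real job the class would additionally declare "implements Writable".
class PointWritableSketch {
    double x;
    double y;

    PointWritableSketch() {}                       // no-arg constructor, filled by readFields

    PointWritableSketch(double x, double y) {
        this.x = x;
        this.y = y;
    }

    // Serialize the fields to the stream in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeDouble(x);
        out.writeDouble(y);
    }

    // Deserialize the fields in exactly the order they were written.
    public void readFields(DataInput in) throws IOException {
        x = in.readDouble();
        y = in.readDouble();
    }

    // Hypothetical helper: serialize to bytes and read back, as the framework would.
    static PointWritableSketch roundTrip(PointWritableSketch p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            PointWritableSketch copy = new PointWritableSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PointWritableSketch copy = roundTrip(new PointWritableSketch(1.5, -2.0));
        System.out.println(copy.x + ", " + copy.y);
    }
}
```

A type used as a key would also need to be comparable, since keys are sorted during the shuffle.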

Reducer
- Similarly to the Mapper, a generic class with four types: (1) input key, (2) input value, (3) output key, (4) output value
  - The output types of the map function must match the input types of the reduce function
    - In this case Text and IntWritable
- The Map/Reduce framework groups the key-value pairs produced by the mappers by key
  - For each key there is a set of one or more values
  - Input into a reducer is sorted by key
  - Known as Shuffle and Sort
- The reduce function accepts key -> set of values and outputs key-value pairs

Reducer code in Hadoop

    // Imports go at the top of the file; the class is nested inside the WordCount class.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Called once per key with all values emitted for that key
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);   // emit <token, total count>
        }
    }

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        // Newer Hadoop releases prefer Job.getInstance(conf, "word count")
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        job.setInputFormatClass(TextInputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

Job class
- Encapsulates information about a job
- Controls execution of the job
- A job is packaged within a jar file
  - The Hadoop framework distributes the jar on your behalf
  - Needs to know which jar file to distribute
  - The easiest way to specify the jar that your job resides in is by calling job.setJarByClass:
      job.setJarByClass(getClass());
  - Hadoop will locate the jar file that contains the provided class

    job.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(job, new Path(otherArgs[0]));

- Specify the input data
  - Can be a file or a directory
    - A directory is converted to a list of files as an input
- Input is specified by an implementation of InputFormat, in this case TextInputFormat
  - Responsible for creating splits and a record reader
  - Controls the input types of the key-value pairs, in this case LongWritable and Text
  - The file is broken into lines; the mapper will receive 1 line at a time

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

- OutputFormat defines the specification for outputting data from a Map/Reduce job
- The word count job uses TextOutputFormat, an implementation of OutputFormat
  - Define the output path where the reducer should place its output
    - If the path already exists then the job will fail
  - Each reducer task writes to its own file
    - By default a job is configured to run with a single reducer
  - Writes each key-value pair as plain text

- Specify the output key and value types for both mapper and reducer functions
  - Many times they are the same type
  - If the types differ then use
    - setMapOutputKeyClass();
    - setMapOutputValueClass();

job.waitForCompletion(true)
- Submits the job and waits for completion
- The boolean parameter specifies whether output should be written to the console
- If the job completes successfully true is returned, otherwise false is returned

Combiner
- Combine data per Mapper task to reduce the amount of data transferred to the reduce phase
- The Reducer can very often serve as a combiner
  - Only works if the reducer's output key-value pair types are the same as the mapper's output types
- Combiners are not guaranteed to run
  - Optimization only
  - Not for critical logic
- Add to the main file:
    job.setCombinerClass(IntSumReducer.class);

K-means in MapReduce
- Mapper:
  - K = data point ID, V = data point coordinates
  - Calculate the distance from the data point to each centroid
  - Determine the closest cluster
  - Write K, V with K = cluster ID and V = coordinates of the point
- Reducer:
  - Each reducer gets all entries associated with one cluster
  - Calculate the sum of all point coordinates per cluster
  - Recalculate the cluster mean

K-means in MapReduce
- One job performs one iteration of the k-means algorithm
- Multiple iterations within one job are not supported by MapReduce
  -> Hadoop provides a way to specify a sequence of jobs with dependencies
- How to distribute the cluster centroids to all map tasks?
  -> Hadoop distributed cache
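One k-means iteration in the map/reduce style described above can be sketched in plain Java. This is a single-process illustration under assumptions, not a Hadoop job: the class `KMeansIterationSketch` is hypothetical, squared Euclidean distance is assumed, and a cluster that receives no points simply keeps its old centroid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of one k-means iteration; not a Hadoop job.
class KMeansIterationSketch {
    // Map step: return the cluster ID (index) of the centroid closest to the point.
    static int closestCluster(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;               // squared Euclidean distance
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Reduce step: sum all coordinates of one cluster and divide by the point count.
    static double[] mean(List<double[]> points) {
        double[] m = new double[points.get(0).length];
        for (double[] p : points) {
            for (int d = 0; d < m.length; d++) {
                m[d] += p[d];
            }
        }
        for (int d = 0; d < m.length; d++) {
            m[d] /= points.size();
        }
        return m;
    }

    // One iteration: map every point to <cluster ID, point>, group, then reduce per cluster.
    static double[][] iterate(double[][] points, double[][] centroids) {
        Map<Integer, List<double[]>> byCluster = new TreeMap<>();
        for (double[] p : points) {
            byCluster.computeIfAbsent(closestCluster(p, centroids), k -> new ArrayList<>()).add(p);
        }
        double[][] next = new double[centroids.length][];
        for (int c = 0; c < centroids.length; c++) {
            List<double[]> members = byCluster.get(c);
            // Assumption: an empty cluster keeps its previous centroid.
            next[c] = (members == null) ? centroids[c].clone() : mean(members);
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] centroids = {{1, 1}, {9, 9}};
        System.out.println(Arrays.deepToString(iterate(points, centroids)));
    }
}
```

A driver would call `iterate` repeatedly until the centroids stop moving, which corresponds to the sequence of dependent jobs mentioned above.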

K-means clustering
- The initial cluster centroids used have a huge impact on the number of iterations of the k-means algorithm
- Especially important due to the complexities of handling multiple iterations in MapReduce
- Use a pre-clustering algorithm such as Canopy Clustering to determine the initial cluster centroids

Canopy clustering
- Start with a list of data points and two distance thresholds T1 > T2 for processing
- Select any point (at random) from the list to form a canopy center
- Calculate its distance to all other points in the list
- Put all the points which fall within the distance threshold T1 into a canopy
- Remove from the original list all the points which fall within the threshold T2
  - These points are excluded from being the center of a new canopy
- Repeat these steps until the original list is empty
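The canopy steps can be sketched as follows, assuming the usual convention that T1 is the loose threshold and T2 the tight one (T1 > T2 > 0) and using Euclidean distance. The class `CanopySketch` is hypothetical, and for reproducibility it picks the first remaining point as the center instead of a random one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of canopy clustering; assumes T1 > T2 > 0.
class CanopySketch {
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return Math.sqrt(sum);
    }

    // Returns the canopies; the first element of each canopy is its center.
    static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
        if (t2 <= 0 || t1 <= t2) {
            throw new IllegalArgumentException("expected T1 > T2 > 0");
        }
        List<double[]> remaining = new ArrayList<>(points);
        List<List<double[]>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Assumption: take the first remaining point instead of a random one.
            double[] center = remaining.get(0);
            List<double[]> canopy = new ArrayList<>();
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double dist = distance(center, p);
                if (dist < t1) {
                    canopy.add(p);     // within the loose threshold: member of this canopy
                }
                if (dist < t2) {
                    it.remove();       // within the tight threshold: can no longer be a center
                }
            }
            result.add(canopy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[]{0}, new double[]{1}, new double[]{2.5}, new double[]{10});
        for (List<double[]> canopy : canopies(points, 3.0, 2.0)) {
            System.out.println(canopy.size() + " point(s), center " + canopy.get(0)[0]);
        }
    }
}
```

Points between T2 and T1 of a center join its canopy but stay in the list, so canopies may overlap; the canopy centers can then serve as initial centroids for k-means.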