Map-Reduce and Hadoop
Introduction to Map-Reduce
Map-Reduce operations
- Input data are (key, value) pairs
- Two operations are available: map and reduce
- Map: takes a (key, value) pair and generates other (key, value) pairs
- Reduce: takes a key and all its associated values, and generates (key, value) pairs
- A map-reduce algorithm requires a mapper and a reducer (sketched schematically below)
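Schematically, the two operations have the following shapes (K and V denote placeholder key and value types):

  map:    (K1, V1)   -> list of (K2, V2)
  reduce: (K2, [V2]) -> list of (K3, V3)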
Map-Reduce example
- Compute the average grade of students
- For each course, the professor provides us with a text file
- Text file format: lines of the form "student grade"
- Algorithm (non map-reduce):
  - For each student, collect all grades and compute the average
- Algorithm (map-reduce):
  - Mapper: assume the input file is parsed as (student, grade) pairs, so do nothing!
  - Reducer: compute the average of all values for a given key (see the sketch below)
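A minimal, framework-free sketch of this mapper/reducer pair, assuming the input is already parsed into (student, grade) records (class and method names are illustrative):

import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;

class AverageGrade {
    // Identity mapper: the input is already a (student, grade) pair,
    // so it is emitted unchanged
    static Map.Entry<String, Double> map(String student, double grade) {
        return new SimpleEntry<>(student, grade);
    }

    // Reducer: average all grades collected for one student
    static Map.Entry<String, Double> reduce(String student, List<Double> grades) {
        double sum = 0;
        for (double g : grades) sum += g;
        return new SimpleEntry<>(student, sum / grades.size());
    }
}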
Map-Reduce example

Input files:
  Course 1: Fabrice 20, Brian 10, Paul 15
  Course 2: Fabrice 15, Brian 20, Paul 10
  Course 3: Fabrice 10, Brian 15, Paul 20

Map output:
  (Fabrice, 20) (Brian, 10) (Paul, 15)
  (Fabrice, 15) (Brian, 20) (Paul, 10)
  (Fabrice, 10) (Brian, 15) (Paul, 20)

Grouped by key:
  (Fabrice, [20, 15, 10]) (Brian, [10, 15, 20]) (Paul, [15, 20, 10])

Reduce output:
  (Fabrice, 15) (Brian, 15) (Paul, 15)
Map-Reduce example: too easy
- OK, this was easy because:
  - We didn't care about technical details like reading inputs
  - All grades count equally: no weighted average
- Now, can we do something more complicated? Let's compute a weighted average:
  - Course 1 has weight 5
  - Course 2 has weight 2
  - Course 3 has weight 3
- What is the problem now?
Map-Reduce example (with weights)

Input files:
  Course 1: Fabrice 20, Brian 10, Paul 15
  Course 2: Fabrice 15, Brian 20, Paul 10
  Course 3: Fabrice 10, Brian 15, Paul 20

Map output:
  (Fabrice, 20) (Brian, 10) (Paul, 15)
  (Fabrice, 15) (Brian, 20) (Paul, 10)
  (Fabrice, 10) (Brian, 15) (Paul, 20)

Grouped by key:
  (Fabrice, [20, 15, 10]) (Brian, [10, 15, 20]) (Paul, [15, 20, 10])

Reduce output (still unweighted):
  (Fabrice, 15) (Brian, 15) (Paul, 15)

Problem: the reducer should be able to discriminate between values (which grade comes from which course?), but once grouped they all look alike.
Map-Reduce example: advanced
- How to discriminate between values for a given key?
  - We can't, unless the values look different
- New reducer:
  - Input: (Name, [course1_grade, course2_grade, course3_grade])
  - Strip the course tag from each value and compute the weighted average
- So we need to change the input of the reducer, which comes from the mapper
- New mapper:
  - Input: (Name, Grade)
  - Output: (Name, coursename_grade)
  - The mapper needs to be aware of the input file it is reading
- A sketch of this new mapper/reducer pair is given below
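A framework-free sketch of the new pair, using the weights from the previous slide (C1 = 5, C2 = 2, C3 = 3); class and method names are illustrative:

import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;

class WeightedAverage {
    static final Map<String, Double> WEIGHTS = Map.of("C1", 5.0, "C2", 2.0, "C3", 3.0);

    // Mapper: tag each grade with the course it comes from,
    // e.g. (Fabrice, 20) from course C1 becomes (Fabrice, "C1_20")
    static Map.Entry<String, String> map(String course, String student, String grade) {
        return new SimpleEntry<>(student, course + "_" + grade);
    }

    // Reducer: strip the course tag, look up the weight, and average
    static Map.Entry<String, Double> reduce(String student, List<String> taggedGrades) {
        double sum = 0, totalWeight = 0;
        for (String tagged : taggedGrades) {
            String[] parts = tagged.split("_");      // "C1_20" -> ["C1", "20"]
            double weight = WEIGHTS.get(parts[0]);
            sum += weight * Double.parseDouble(parts[1]);
            totalWeight += weight;
        }
        return new SimpleEntry<>(student, sum / totalWeight);
    }
}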
Map-Reduce example - 2

Input files:
  Course 1: Fabrice 20, Brian 10, Paul 15
  Course 2: Fabrice 15, Brian 20, Paul 10
  Course 3: Fabrice 10, Brian 15, Paul 20

Map output:
  (Fabrice, C1_20) (Brian, C1_10) (Paul, C1_15)
  (Fabrice, C2_15) (Brian, C2_20) (Paul, C2_10)
  (Fabrice, C3_10) (Brian, C3_15) (Paul, C3_20)

Grouped by key:
  (Fabrice, [C1_20, C2_15, C3_10]) (Brian, [C1_10, C2_20, C3_15]) (Paul, [C1_15, C2_10, C3_20])

Reduce output (weights 5, 2, 3):
  (Fabrice, 16) (Brian, 13.5) (Paul, 15.5)
Introduction to Hadoop
F. Huet, Oasis Seminar, 07/07/2010
What is Hadoop?
- A set of software developed by Apache for distributed computing
- Many different projects:
  - MapReduce
  - HDFS: Hadoop Distributed File System
  - HBase: a distributed database
- Written in Java
- Can be deployed easily on any cluster
Hadoop job
- A Hadoop job is composed of a map operation and (possibly) a reduce operation
- Map and reduce operations are implemented in a Mapper subclass and a Reducer subclass
- Hadoop will start many instances of Mapper and Reducer
  - How many is decided at runtime, but it can also be specified
- Each instance will work on a subset of the input, called a split
Map-Reduce workflow
[Figure: map-reduce workflow diagram. Source: Hadoop: The Definitive Guide]
Mapper
- Extend the default class Mapper<K1, V1, K2, V2>
  - K1, V1: types of the input (key, value)
  - K2, V2: types of the output (key, value)
- Implement:
  public void map(K1 key, V1 value, Context context) throws IOException, InterruptedException
- Values are output using context.write (see the example below)
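For instance, a mapper for the earlier grade example might look like this, assuming "student grade" lines read through the default TextInputFormat (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Parses lines of the form "student grade" and emits (student, grade)
public class GradeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+");
        if (fields.length == 2) {
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }
}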
Reducer
- Extend the default class Reducer<K1, V1, K2, V2>
  - K1, V1: types of the input (key, [values])
  - K2, V2: types of the output (key, value)
- Implement:
  public void reduce(K1 key, Iterable<V1> values, Context context) throws IOException, InterruptedException
  - The values for a key are passed as an Iterable<V1>
- Values are output using context.write (see the example below)
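A matching reducer for the grade example could average the values received for each student (again, the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Averages all grades received for one student, emits (student, average)
public class AverageReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text student, Iterable<Text> grades, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        int count = 0;
        for (Text g : grades) {
            sum += Double.parseDouble(g.toString());
            count++;
        }
        context.write(student, new Text(Double.toString(sum / count)));
    }
}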
Input/Output
- Hadoop helps abstract data formats and I/O away from the map/reduce process
- InputFormat
  - Validates the input data format (user specified)
  - Splits up the input file into splits
  - Provides a RecordReader to read records from the splits
  - Default: TextInputFormat to read text files (the key is the line's byte offset, the value is the line itself)
- OutputFormat
  - Validates the output data format
  - Provides a RecordWriter to write records to the file system
  - Default: TextOutputFormat to write plain text files
Hadoop job example

Configuration config = new Configuration();
Job job = new Job(config, "filesplittest");
job.setInputFormatClass(TextInputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(SingleTextOutputFormat.class);
Path outputDir = new Path(output);
Path inputPath = new Path(input);
FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, outputDir);
job.setMapperClass(MapSingleSortedFile.class);
job.setReducerClass(Reducer.class);
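This snippet only configures the job; to actually submit it and wait for completion, one would typically add:

job.waitForCompletion(true); // submit the job and block until it finishes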
HDFS
- Hadoop Distributed File System
- Aggregates the local storage of the cluster nodes
- Used by Hadoop workers to read input, store temporary data and write final output
- Can be accessed using the CLI:
  $> hadoop fs <command>
  - put: copy a local file to HDFS
  - get: copy an HDFS file to a local directory
- Suitable for large files (64 MB blocks)
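For example (the paths are illustrative):

$> hadoop fs -put grades.txt /user/fhuet/input/grades.txt
$> hadoop fs -get /user/fhuet/output/part-r-00000 result.txt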
Demo
Scenario
- Input: a text file made of RDF data (subject, predicate, object triples)
- Output: 3 files containing the input data sorted by subject, predicate or object (a sketch of one such mapper is given below)
- Hadoop cluster: eon2 to eon4, with HDFS
- Only the Hadoop conf files are needed to use this cluster
- Monitor the computation using the web interface on eon2
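A minimal sketch of a mapper for the subject-sorted output, assuming one whitespace-separated triple per line; the class name is hypothetical, and the predicate and object sorts would be analogous jobs keyed on the other fields:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (subject, whole triple) so the shuffle phase groups
// and sorts the triples by subject
public class SubjectSortMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // TextInputFormat gives us (byte offset, line); split the triple
        String[] triple = line.toString().split("\\s+", 3);
        if (triple.length == 3) {
            context.write(new Text(triple[0]), line); // key = subject
        }
    }
}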