Map-Reduce and Hadoop




Introduction to Map-Reduce

Map-Reduce operations
- Input data are (key, value) pairs.
- Two operations are available: map and reduce.
- Map: takes a (key, value) pair and generates other (key, value) pairs.
- Reduce: takes a key and all its associated values, and generates (key, value) pairs.
- A map-reduce algorithm requires a mapper and a reducer.
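The two operations can be illustrated with a toy, single-process model. This is only a sketch and does not use Hadoop: the class name `MiniMapReduce`, the method `run` and the `countPerKey` example are hypothetical names introduced here. The mapper is applied to every input pair, the output is grouped by key, and the reducer is called once per key with all the values for that key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Toy, single-process model of map-reduce; it does not use Hadoop.
final class MiniMapReduce {

    // Applies mapper to every input pair, groups the mapper output by key,
    // then calls reducer once per key with all the values seen for that key.
    static <K1, V1, K2, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapper,
            BiFunction<K2, List<V2>, List<Map.Entry<K3, V3>>> reducer) {

        // Map phase plus grouping: key -> all values (keys kept in first-seen order).
        Map<K2, List<V2>> groups = new LinkedHashMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : mapper.apply(in.getKey(), in.getValue())) {
                groups.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }

        // Reduce phase: one call per distinct key.
        List<Map.Entry<K3, V3>> result = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> g : groups.entrySet()) {
            result.addAll(reducer.apply(g.getKey(), g.getValue()));
        }
        return result;
    }

    // Example job: count how many values each key has.
    static List<Map.Entry<String, Integer>> countPerKey(List<Map.Entry<String, String>> input) {
        // Mapper: replace each value by 1.
        BiFunction<String, String, List<Map.Entry<String, Integer>>> mapper =
                (k, v) -> List.of(Map.entry(k, 1));
        // Reducer: the number of 1s is the count for that key.
        BiFunction<String, List<Integer>, List<Map.Entry<String, Integer>>> reducer =
                (k, ones) -> List.of(Map.entry(k, ones.size()));
        return run(input, mapper, reducer);
    }
}
```

A real framework distributes the map and reduce calls over many machines; the grouping step in the middle is what guarantees that a reducer sees all the values for its key.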


Map-Reduce example
Compute the average grade of students.
- For each course, the professor provides us with a text file.
- Text file format: one "student grade" pair per line.
Algorithm (non map-reduce)
- For each student, collect all grades and compute the average.
Algorithm (map-reduce)
- Mapper: assume the input file is already parsed as (student, grade) pairs, so do nothing (emit each pair unchanged).
- Reducer: compute the average of all values for a given key.


Map-Reduce example
Input files:
  Course 1: Fabrice 20, Brian 10, Paul 15
  Course 2: Fabrice 15, Brian 20, Paul 10
  Course 3: Fabrice 10, Brian 15, Paul 20
Map output:
  (Fabrice, 20) (Brian, 10) (Paul, 15)
  (Fabrice, 15) (Brian, 20) (Paul, 10)
  (Fabrice, 10) (Brian, 15) (Paul, 20)
Grouped by key (reduce input):
  (Fabrice, [20, 15, 10]) (Brian, [10, 20, 15]) (Paul, [15, 10, 20])
Reduce output:
  (Fabrice, 15) (Brian, 15) (Paul, 15)
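The average-grade example can be sketched in plain Java, without Hadoop. The class name `AverageGrades` and its methods are hypothetical, and the grouping loop stands in for what a map-reduce framework would do between the two phases. The mapper is the identity, and the reducer averages the grades collected for one student.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the average-grade example; no Hadoop involved.
final class AverageGrades {

    // Mapper: the input is already a (student, grade) pair, so emit it unchanged.
    static Map.Entry<String, Integer> map(String student, int grade) {
        return Map.entry(student, grade);
    }

    // Reducer: average of all grades collected for one student.
    static double reduce(String student, List<Integer> grades) {
        int sum = 0;
        for (int g : grades) {
            sum += g;
        }
        return (double) sum / grades.size();
    }

    // Runs map, groups the mapped pairs by student, then runs reduce per student.
    static Map<String, Double> run(List<Map.Entry<String, Integer>> records) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> r : records) {
            Map.Entry<String, Integer> out = map(r.getKey(), r.getValue());
            grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
        }
        Map<String, Double> averages = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> g : grouped.entrySet()) {
            averages.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return averages;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("Fabrice", 20), Map.entry("Brian", 10), Map.entry("Paul", 15),  // course 1
                Map.entry("Fabrice", 15), Map.entry("Brian", 20), Map.entry("Paul", 10),  // course 2
                Map.entry("Fabrice", 10), Map.entry("Brian", 15), Map.entry("Paul", 20)); // course 3
        System.out.println(run(records)); // prints {Fabrice=15.0, Brian=15.0, Paul=15.0}
    }
}
```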


Map-Reduce example: too easy
OK, this was easy because:
- We didn't care about technical details like reading inputs.
- All grades counted equally: no weighted average.
Now, can we do something more complicated? Let's compute a weighted average:
- Course 1 has weight 5
- Course 2 has weight 2
- Course 3 has weight 3
What is the problem now?


Map-Reduce example
Input files:
  Course 1: Fabrice 20, Brian 10, Paul 15
  Course 2: Fabrice 15, Brian 20, Paul 10
  Course 3: Fabrice 10, Brian 15, Paul 20
Map output:
  (Fabrice, 20) (Brian, 10) (Paul, 15)
  (Fabrice, 15) (Brian, 20) (Paul, 10)
  (Fabrice, 10) (Brian, 15) (Paul, 20)
Grouped by key (reduce input):
  (Fabrice, [20, 15, 10]) (Brian, [10, 20, 15]) (Paul, [15, 10, 20])
Reduce output:
  (Fabrice, 15) (Brian, 15) (Paul, 15)
Problem: the reducer should be able to discriminate between the values in each list, because it cannot tell which course a grade came from.


Map-Reduce example: advanced
How do we discriminate between values for a given key? We can't, unless the values look different.
New reducer
- Input: (Name, [course1_grade1, course2_grade2, course3_grade3])
- Use the course indication to select the weight, strip it from each value, and compute the weighted average.
So we need to change the input of the reducer, which comes from the mapper.
New mapper
- Input: (Name, Grade)
- Output: (Name, coursename_grade)
- The mapper needs to be aware of which input file it is reading.


Map-Reduce example (2)
Input files:
  Course 1: Fabrice 20, Brian 10, Paul 15
  Course 2: Fabrice 15, Brian 20, Paul 10
  Course 3: Fabrice 10, Brian 15, Paul 20
Map output:
  (Fabrice, C1_20) (Brian, C1_10) (Paul, C1_15)
  (Fabrice, C2_15) (Brian, C2_20) (Paul, C2_10)
  (Fabrice, C3_10) (Brian, C3_15) (Paul, C3_20)
Grouped by key (reduce input):
  (Fabrice, [C1_20, C2_15, C3_10]) (Brian, [C1_10, C2_20, C3_15]) (Paul, [C1_15, C2_10, C3_20])
Reduce output (weights 5, 2, 3):
  (Fabrice, 16) (Brian, 13.5) (Paul, 15.5)
For example, Brian's weighted average is (5*10 + 2*20 + 3*15) / (5 + 2 + 3) = 135 / 10 = 13.5.
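The weighted variant can be sketched the same way in plain Java, again without Hadoop. The class name `WeightedAverage`, the `C1`/`C2`/`C3` tags and the row layout {course, student, grade} are assumptions made for this illustration. The mapper tags each grade with its course, and the reducer uses the tag to look up the weight. Computed directly from the three course files with weights 5, 2 and 3, the averages are 16 for Fabrice, 13.5 for Brian and 15.5 for Paul.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the weighted-average example; no Hadoop involved.
final class WeightedAverage {

    // Course weights taken from the example: course 1 -> 5, course 2 -> 2, course 3 -> 3.
    static final Map<String, Integer> WEIGHTS = Map.of("C1", 5, "C2", 2, "C3", 3);

    // Mapper: tag the grade with its course so the reducer can tell values apart.
    static Map.Entry<String, String> map(String course, String student, int grade) {
        return Map.entry(student, course + "_" + grade);
    }

    // Reducer: split each "course_grade" value, look up the weight, and average.
    static double reduce(String student, List<String> taggedGrades) {
        double weightedSum = 0;
        int totalWeight = 0;
        for (String tagged : taggedGrades) {
            int sep = tagged.indexOf('_');
            int weight = WEIGHTS.get(tagged.substring(0, sep));
            int grade = Integer.parseInt(tagged.substring(sep + 1));
            weightedSum += weight * grade;
            totalWeight += weight;
        }
        return weightedSum / totalWeight;
    }

    // Each row is {course, student, grade}; returns student -> weighted average.
    static Map<String, Double> run(List<String[]> rows) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] row : rows) {
            Map.Entry<String, String> out = map(row[0], row[1], Integer.parseInt(row[2]));
            grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> g : grouped.entrySet()) {
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[] {"C1", "Fabrice", "20"}, new String[] {"C1", "Brian", "10"}, new String[] {"C1", "Paul", "15"},
                new String[] {"C2", "Fabrice", "15"}, new String[] {"C2", "Brian", "20"}, new String[] {"C2", "Paul", "10"},
                new String[] {"C3", "Fabrice", "10"}, new String[] {"C3", "Brian", "15"}, new String[] {"C3", "Paul", "20"});
        System.out.println(run(rows)); // prints {Fabrice=16.0, Brian=13.5, Paul=15.5}
    }
}
```

Encoding the course inside the value string keeps the key unchanged, so grouping by student still works. A composite value type would be an alternative to string parsing.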


Introduction to Hadoop
F. Huet, Oasis Seminar, 07/07/2010

What is Hadoop?
- A set of software developed by Apache for distributed computing.
- Many different projects:
  - MapReduce
  - HDFS: Hadoop Distributed File System
  - HBase: distributed database
- Written in Java.
- Can be deployed on any cluster easily.

Hadoop job
- A Hadoop job is composed of a map operation and (possibly) a reduce operation.
- Map and reduce operations are implemented in a Mapper subclass and a Reducer subclass.
- Hadoop will start many instances of Mapper and Reducer.
  - The number of instances is decided at runtime but can be specified.
  - Each instance works on a subset of the keys; for mappers, that portion of the input is called a split.


Map-Reduce workflow
[Figure: map-reduce workflow. Source: Hadoop: The Definitive Guide]

Mapper
- Extend the default class Mapper<K1, V1, K2, V2>
  - K1, V1: types of the input (key, value)
  - K2, V2: types of the output (key, value)
- Implement
    public void map(K1 key, V1 value, Context context) throws IOException, InterruptedException
- Output is written using context.write.


Reducer
- Extend the default class Reducer<K1, V1, K2, V2>
  - K1, V1: types of the input (key, [values])
  - K2, V2: types of the output (key, value)
- Implement
    public void reduce(K1 key, Iterable<V1> values, Context context) throws IOException, InterruptedException
  - The values are passed as an Iterable<V1>.
- Output is written using context.write.


Input/Output
Hadoop helps abstract the data format and I/O away from the map/reduce process.
InputFormat
- Validates the input data format (user specified).
- Splits the input file into splits (InputSplit).
- Provides a RecordReader to read records from the splits.
- Default: TextInputFormat, which reads text files (the key is the offset of the line in the file, the value is the line).
OutputFormat
- Validates the output data format.
- Provides a RecordWriter to write records to the file system.
- Default: TextOutputFormat, which writes plain text files.
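To make the default input records concrete, here is a small plain-Java sketch of the (offset, line) pairs a mapper would receive for a text file. It only imitates the behaviour described above and is not Hadoop's implementation. The class name `TextRecords` is hypothetical, and the sketch assumes "\n" line endings, UTF-8 text and a single split covering the whole file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Imitates the (key, value) records of a line-oriented text input:
// key = byte offset of the line in the file, value = the line itself.
final class TextRecords {

    static List<Map.Entry<Long, String>> records(String text) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            out.add(Map.entry(offset, line));
            // Advance past the line's bytes plus the newline character.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> r : records("Fabrice 20\nBrian 10\nPaul 15\n")) {
            System.out.println("(" + r.getKey() + ", \"" + r.getValue() + "\")");
        }
        // prints:
        // (0, "Fabrice 20")
        // (11, "Brian 10")
        // (20, "Paul 15")
    }
}
```

This is why a mapper reading grade files through the default format has to parse the line itself: the key it receives is a position in the file, not the student name.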


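Hadoop job example

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // input and output are path strings defined elsewhere.
    // SingleTextOutputFormat and MapSingleSortedFile are not Hadoop classes;
    // they are assumed to belong to the demo project.
    Configuration config = new Configuration();
    Job job = new Job(config, "filesplittest");
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(SingleTextOutputFormat.class);
    Path outputDir = new Path(output);
    Path inputPath = new Path(input);
    FileInputFormat.setInputPaths(job, inputPath);
    FileOutputFormat.setOutputPath(job, outputDir);
    job.setMapperClass(MapSingleSortedFile.class);
    job.setReducerClass(Reducer.class);  // base Reducer: writes each (key, value) unchanged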


HDFS: Hadoop Distributed File System
- Aggregates local storage.
- Used by Hadoop workers to read input and to store temporary data and final output.
- Can be accessed using the CLI:
    $> hadoop fs -<command>
  - -put: copy a local file to HDFS
  - -get: copy an HDFS file to a local directory
- Suitable for large files (64 MB blocks).


Demo

Scenario
- Input: a text file made of RDF data (subject, predicate, object).
- Output: 3 files containing the input data sorted by subject, predicate, or object.
- Hadoop cluster: eon 2-4, with HDFS.
  - Only the Hadoop conf files are needed to use this cluster.
- Monitor the computation using the web interface on eon2.
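The expected output of the scenario can be illustrated with a single-process sketch in plain Java. This is not the demo's Hadoop code: the class name `TripleSort` and the representation of a triple as a three-element string array are assumptions made here. Each of the three outputs is the same list of triples, ordered by a different component.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Single-process sketch: order (subject, predicate, object) triples by one component.
final class TripleSort {

    // position: 0 = subject, 1 = predicate, 2 = object.
    // Returns a sorted copy; the input list is left untouched.
    static List<String[]> sortBy(List<String[]> triples, int position) {
        List<String[]> sorted = new ArrayList<>(triples);
        sorted.sort(Comparator.comparing((String[] t) -> t[position]));
        return sorted;
    }

    public static void main(String[] args) {
        List<String[]> triples = List.of(
                new String[] {"s2", "p1", "o3"},
                new String[] {"s1", "p3", "o2"},
                new String[] {"s3", "p2", "o1"});
        String[] names = {"subject", "predicate", "object"};
        for (int pos = 0; pos < 3; pos++) {
            System.out.println("sorted by " + names[pos] + ":");
            for (String[] t : sortBy(triples, pos)) {
                System.out.println("  " + String.join(" ", t));
            }
        }
    }
}
```

In a map-reduce version, one plausible design is for the mapper to emit the chosen component as the key and the whole triple as the value, and to rely on the framework sorting keys before the reduce phase.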

