Cloud Computing. Lectures 7 and 8 Map Reduce

Transcription

1 Cloud Computing Lectures 7 and 8 Map Reduce

2 Up until now
- Introduction
- Definition of Cloud Computing
- Grid Computing
- Content Distribution Networks
- Cycle-Sharing

3 Outline
- MapReduce: What is it?
- Concepts
- An Example: PageRank

4 MapReduce: What is it?
Typical challenges in parallel processing:
- How do we assign tasks to the workers?
- What if we have more tasks than workers?
- What if the workers need to share partial results?
- How do we aggregate partial results?
- How do we know whether the workers have finished?
- What if the workers fail?
MapReduce combines functional programming with a distributed processing platform.

5 Origins: Functional Programming
An old idea, dating from the 1950s.
What is functional programming?
- Computing as the composition/sequencing of a set of functions.
- Its theoretical foundation is the lambda calculus.
How does it differ from imperative programming?
- The concepts of data and instructions are blended.
- The flow of data is implicit.
- Execution order is less relevant to the program design.

6 For example, in Scheme
    (define (foo x y) (sqrt (+ (* x x) (* y y))))
    (foo 3 4)        ; => 5
    (define (bar f x) (f (f x)))
    (define (baz x) (* x x))
    (bar baz 2)      ; => 16

7 So... what does this have to do with MapReduce?
Scheme and Lisp are strongly modelled on list processing. They use two basic concepts from functional programming:
- Map: apply the same operation to all the elements of a list.
- Fold: use an operator to combine all the elements of a list.

8 Map
Map is a second-order (higher-order) function: it receives another function as a parameter.
It works by applying the parameter function to every element of a list, thereby generating a new list.
[Figure: the function f applied to each element of a list]
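The behaviour described on this slide can be sketched in plain Java. This is only an illustration: the class name `MapSketch` and the method signature are assumptions made for this note, not part of the lecture.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper class, introduced only for this illustration.
class MapSketch {
    // Applies f to every element of input and collects the results in a new list.
    // The input list is left untouched.
    static <A, B> List<B> map(Function<A, B> f, List<A> input) {
        List<B> output = new ArrayList<>(input.size());
        for (A element : input) {
            output.add(f.apply(element));
        }
        return output;
    }

    public static void main(String[] args) {
        List<Integer> squares = map(x -> x * x, Arrays.asList(1, 2, 3, 4, 5));
        System.out.println(squares); // prints [1, 4, 9, 16, 25]
    }
}
```

Each call to `f` depends only on its own element, which is what makes map easy to run in parallel.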

9 Reduce
Fold (Reduce) is also a second-order (higher-order) function. It works by:
- Initializing an accumulator.
- Applying the parameter function to the accumulator and the first element of the list.
- Storing the result in the accumulator.
- Repeating the operation for each remaining element of the list.
The result is the final value of the accumulator.
[Figure: f applied along the list, carrying the accumulator from the initial value to the final value]
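The accumulator steps above can be sketched in plain Java as follows. The class name `FoldSketch` and the argument order are assumptions chosen for this note.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical helper class, introduced only for this illustration.
class FoldSketch {
    // Left fold: starts from `initial` and combines it with each element in list order.
    static <A, B> B fold(BiFunction<B, A, B> f, B initial, List<A> input) {
        B accumulator = initial;
        for (A element : input) {
            accumulator = f.apply(accumulator, element);
        }
        return accumulator;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5);
        int sum = fold((acc, x) -> acc + x, 0, values);
        int product = fold((acc, x) -> acc * x, 1, values);
        System.out.println(sum);     // prints 15
        System.out.println(product); // prints 120
    }
}
```

Unlike map, each step depends on the previous accumulator value, so a fold is sequential unless the combining operator is associative.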

10 Map/Fold Example
Simple map example:
    (map (lambda (x) (* x x)) '(1 2 3 4 5))   ; => '(1 4 9 16 25)
Simple fold example:
    (fold + 0 '(1 2 3 4 5))   ; => 15
    (fold * 1 '(1 2 3 4 5))   ; => 120
Sum of squares:
    (define (squares-sum v) (fold + 0 (map (lambda (x) (* x x)) v)))
    (squares-sum '(1 2 3 4 5))   ; => 55
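For comparison, the sum-of-squares composition can be written with the Java standard library's stream API, where `map` and `reduce` play the roles of map and fold. The class and method names are assumptions made for this note.

```java
import java.util.stream.IntStream;

// Hypothetical helper class, introduced only for this illustration.
class SquaresSum {
    // Squares every value (map) and then adds the squares starting from 0 (fold).
    static int squaresSum(int... values) {
        return IntStream.of(values)
                .map(x -> x * x)
                .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(squaresSum(1, 2, 3, 4, 5)); // prints 55
    }
}
```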

11 MapReduce
Map + fold over lists of <key, value> pairs.
- Map: operates on <key1, value1> pairs, producing lists of <key2, value2> pairs.
- Reduce: receives all the <key2, value2> pairs for a specific key2 and generates <key3, value3> pairs.
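A minimal single-process sketch of this model, using word counting as the job. It is not Hadoop code: the class `MiniMapReduce`, its method names, and the in-memory grouping step are assumptions made to show how map output is grouped by key before reduce runs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the map -> group-by-key -> reduce pipeline.
class MiniMapReduce {
    // Map: one input line produces a list of <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: all the values collected for one key are summed into a single count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Grouping step: collect intermediate values per key, sorted by key.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not", "to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

In a real deployment the grouping step is performed by the platform across machines; here a sorted map stands in for it.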

12 MapReduce
[Figure: input data, master, partitioned output]

13 Good and Bad Examples
MapReduce is good for:
- Log indexing.
- Ordering large amounts of data.
- Analysing images.
MapReduce is bad for:
- Calculating digits of π.
- Calculating sequences of Fibonacci numbers.
- Replacing a relational database.

14 Real Examples
- Implementing scalable learning algorithms.
- Graph algorithms, e.g. travelling salesman.
- Gathering and analysing medical information.
- Detecting face similarities in large sets of images.
- Web crawling.

15 Hadoop: Map/Reduce
Hadoop: a FLOSS Apache project that reimplements several of Google's cloud components, for example MapReduce.
Example: word counting, the "Hello World" of distributed processing.

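16 Wordcount: Map
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // Raw (non-generic) types are kept as on the slide; generic Hadoop
    // releases expect Mapper<K1, V1, K2, V2> instead.
    public class Map extends MapReduceBase implements Mapper {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits <word, 1> for every whitespace-separated token of the input line.
        public void map(WritableComparable key, Writable values,
                        OutputCollector output, Reporter reporter) throws IOException {
            String line = values.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }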

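17 Wordcount: Reduce
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // Raw (non-generic) types are kept as on the slide.
    public class Reduce extends MapReduceBase implements Reducer {
        // Sums all the counts received for one word and emits <word, total>.
        public void reduce(WritableComparable _key, Iterator _values,
                           OutputCollector output, Reporter reporter) throws IOException {
            Iterator<IntWritable> values = (Iterator<IntWritable>) _values;
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(_key, new IntWritable(sum));
        }
    }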

18 Datatypes
- Writable: defines a serialization protocol. All datatypes are Writable.
- WritableComparable: defines an ordering criterion. All keys must be of this type, but values need not be.
- IntWritable, LongWritable, Text: concrete classes for the classic datatypes.

19 Basic Datatypes
IntWritable, DoubleWritable, FloatWritable, BooleanWritable, ArrayWritable, BytesWritable, MapWritable, VLongWritable, VIntWritable

20 Complex Datatypes
The easy way:
- Encode them as text, e.g. (a, b) = a:b.
- Use regular expressions to parse and extract the data.
- It works, but it is bad software engineering.
The not-so-easy way:
- Define an implementation of WritableComparable.
- You must implement readFields, write and compareTo.
- Computationally efficient.
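The "easy way" can be sketched as follows, assuming a pair made of an int and a long encoded as a:b. The class `PairCodec`, its methods and the exact pattern are hypothetical, chosen only to illustrate the approach and its cost (every record is formatted and re-parsed as text).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical text codec for a pair (a, b), written as "a:b".
class PairCodec {
    // Two optionally signed integers separated by a colon.
    private static final Pattern PAIR = Pattern.compile("^(-?\\d+):(-?\\d+)$");

    static String encode(int a, long b) {
        return a + ":" + b;
    }

    // Returns {a, b}; rejects any text that does not match the pattern.
    static long[] decode(String text) {
        Matcher matcher = PAIR.matcher(text);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("not a pair: " + text);
        }
        return new long[] { Long.parseLong(matcher.group(1)), Long.parseLong(matcher.group(2)) };
    }

    public static void main(String[] args) {
        long[] pair = decode(encode(3, 42L));
        System.out.println(pair[0] + " " + pair[1]); // prints 3 42
    }
}
```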

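21 Writable
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class MyWritable implements Writable {
        private int counter;
        private long timestamp;

        // Serializes the fields in a fixed order.
        public void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        // Deserializes the fields in the same order they were written.
        public void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }

        // Convenience factory: builds an instance from a stream.
        public static MyWritable read(DataInput in) throws IOException {
            MyWritable w = new MyWritable();
            w.readFields(in);
            return w;
        }
    }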

22 WritableComparable
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
        private int counter;
        private long timestamp;

        public void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        public void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }

        // Orders instances by counter only; timestamp is ignored.
        public int compareTo(MyWritableComparable w) {
            int thisValue = this.counter;
            int thatValue = w.counter;
            return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
        }
    }
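The write/readFields contract can be tried without a Hadoop installation, because `DataOutput` and `DataInput` are plain `java.io` interfaces. The sketch below is an assumption-laden stand-in: `CounterRecord` and `roundTrip` are hypothetical names, and the class implements `Comparable` rather than Hadoop's `WritableComparable`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a WritableComparable, using only the JDK.
class CounterRecord implements Comparable<CounterRecord> {
    int counter;
    long timestamp;

    CounterRecord(int counter, long timestamp) {
        this.counter = counter;
        this.timestamp = timestamp;
    }

    void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    // Fields must be read back in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(CounterRecord other) {
        return Integer.compare(counter, other.counter);
    }

    // Serializes a record to bytes and rebuilds a copy from those bytes.
    static CounterRecord roundTrip(CounterRecord original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            CounterRecord copy = new CounterRecord(0, 0L);
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CounterRecord copy = roundTrip(new CounterRecord(7, 1234567890123L));
        System.out.println(copy.counter + " " + copy.timestamp); // prints 7 1234567890123
    }
}
```

The serialized form is 12 bytes (a 4-byte int plus an 8-byte long), which is the efficiency argument for this approach over text encoding.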

23 Wordcount: Main
    // TokenizerMapper and IntSumReducer are the job's mapper and reducer
    // classes; they are not shown on these slides.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation of map output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

24 PageRank: Random Walks in the Web
PageRank is Google's famous algorithm for ranking web pages.
Rationale: if a user starts on a random web page and keeps clicking links, what is the probability that they will reach a given page? A high probability signals an important page.
This is the basic principle of PageRank: pages with more inbound links get more points.

25 PageRank: Visually
[Figure: PageRank illustration; source: en.wikipedia.org]

26 PageRank: Formula
Given a page A, and other pages $T_1$ up to $T_n$ with links to A, PageRank is defined as:
$$PR(A) = (1-d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
- $C(P)$ is the cardinality of P (its number of outbound links).
- $d$ is the damping factor (normally 0.85): the probability that the surfer follows a link from the current page. With probability $1-d$ the surfer instead chooses a page without coming from a link.
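A small worked instance of the formula, as a sketch. The class `PageRankFormula`, its parameter layout (two parallel arrays) and the sample numbers are assumptions made for illustration only.

```java
// Hypothetical helper that evaluates the PageRank formula for a single page A.
class PageRankFormula {
    // inboundRanks[i] is PR(T_i) and outboundCounts[i] is C(T_i)
    // for each page T_i that links to A.
    static double pageRank(double d, double[] inboundRanks, int[] outboundCounts) {
        double sum = 0.0;
        for (int i = 0; i < inboundRanks.length; i++) {
            sum += inboundRanks[i] / outboundCounts[i];
        }
        return (1 - d) + d * sum;
    }

    public static void main(String[] args) {
        // T_1 has PR 1.0 and 2 outbound links; T_2 has PR 1.0 and 1 outbound link.
        // Sum of fragments = 0.5 + 1.0 = 1.5, so PR(A) = 0.15 + 0.85 * 1.5, about 1.425.
        System.out.println(pageRank(0.85, new double[] {1.0, 1.0}, new int[] {2, 1}));
    }
}
```

A page with no inbound links gets the minimum value $1-d$, here 0.15.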

27 PageRank: Intuition
The calculation is iterative: $PR_{i+1}$ is based on $PR_i$.
- Each page distributes its $PR_i$ among all the pages it links to.
- Each page adds up all the fragments of PR it receives to generate its $PR_{i+1}$.

28 PageRank: Problems
- Will PageRank converge? How fast?
- What is the correct value of d, and how sensitive is the algorithm to this value?
- What is the most efficient algorithm for PageRank?

29 PageRank: First Implementation
- Create two tables (current and next) with the PageRank of each page.
- Fill the current table with the initial values of PR and the adjacent pages.
- Traverse the graph, distributing each page's current PR among the next PR entries of the pages it links to.
- current := next; next := new_table();
- Iterate again or stop.
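The two-table procedure can be sketched sequentially as below. Several things here are assumptions made for this note rather than details from the lecture: the class `PageRankTables`, an initial PR of 1.0 for every page, a fixed iteration count as the stopping rule, and a sample graph in which every page has at least one outbound link and every link target is itself a key of the table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sequential PageRank using a `current` and a `next` table.
class PageRankTables {
    // Sample graph (an assumption): A -> B, C; B -> C; C -> A.
    static Map<String, List<String>> sampleLinks() {
        Map<String, List<String>> links = new TreeMap<>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("C"));
        links.put("C", Arrays.asList("A"));
        return links;
    }

    static Map<String, Double> pageRank(Map<String, List<String>> links, double d, int iterations) {
        // Fill the current table with the assumed initial value 1.0.
        Map<String, Double> current = new TreeMap<>();
        for (String page : links.keySet()) {
            current.put(page, 1.0);
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new TreeMap<>();
            for (String page : links.keySet()) {
                next.put(page, 0.0);
            }
            // Distribute each page's current PR evenly among its link targets.
            for (Map.Entry<String, List<String>> row : links.entrySet()) {
                double fragment = current.get(row.getKey()) / row.getValue().size();
                for (String target : row.getValue()) {
                    next.merge(target, fragment, Double::sum);
                }
            }
            // Apply the damping factor to the summed fragments.
            next.replaceAll((page, sum) -> (1 - d) + d * sum);
            current = next; // current := next
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(pageRank(sampleLinks(), 0.85, 20));
    }
}
```

After one iteration on the sample graph the values are approximately A = 1.0, B = 0.575 and C = 1.425; C leads because it receives fragments from both A and B.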

30 Algorithm Distribution
Parallelization intuition:
- Each row of the next table depends on the current table, but not on other rows of the next table.
- Individual rows of the adjacency tables (the tables of neighbours) can be processed in parallel.
- The rows of sparse matrices are relatively small compared to the number of nodes (pages).

31 Algorithm Distribution
Consequences of this approach:
- We can map each row of the current table, thereby generating PR fragments to be distributed to all the pages the page points to.
- Fragments can be reduced (added) to a single value of PR.

32 Map step: break the PageRank into even fragments to distribute to the link targets.
Reduce step: add the fragments together into the next PageRank.
Iterate for the next step...

33 Phase 1: Reading the HTML
The map task reads (URL, page-content) and transforms it into (URL, (PR_init, url-list)).
- PR_init is the initial PageRank value for the particular URL.
- url-list contains all the pages to which the page points.
In this phase, the reducer is the identity function (pass-through reducer).

34 Phase 2: PageRank Distribution (i)
Map reads (URL, (current_pr, url-list)).
- For each u in url-list, output (u, current_pr / |url-list|).
- Output (URL, url-list) to pass the graph arcs on to the next iteration.
$$PR(A) = (1-d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$

35 Phase 2: PageRank Distribution (ii)
Reduce receives (URL, url-list) and many (URL, PR_frag) pairs:
- Adds up all the PR_frag values and adjusts the sum with d.
- Generates (URL, (new_pr, url-list)).
$$PR(A) = (1-d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
A non-parallel component decides whether the algorithm has converged. (A fixed number of iterations? A comparison of critical values?)
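One Phase 2 iteration can be modelled in a single process as follows. This is a sketch, not Hadoop code: the `Page` class, the use of `Object` to carry both fragments and link lists under the same key, the in-memory grouping map, and the sample graph are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of one PageRank map/reduce iteration.
class PageRankMapReduce {
    // A record (current_pr, url-list).
    static final class Page {
        final double rank;
        final List<String> links;
        Page(double rank, List<String> links) { this.rank = rank; this.links = links; }
    }

    static void emit(Map<String, List<Object>> grouped, String key, Object value) {
        grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Map: one fragment per link target, plus the link list under the page's own URL.
    static void map(String url, Page page, Map<String, List<Object>> grouped) {
        for (String target : page.links) {
            emit(grouped, target, page.rank / page.links.size());
        }
        emit(grouped, url, page.links);
    }

    // Reduce: sum the fragments, apply d, and reattach the link list.
    @SuppressWarnings("unchecked")
    static Page reduce(List<Object> values, double d) {
        double sum = 0.0;
        List<String> links = new ArrayList<>();
        for (Object value : values) {
            if (value instanceof Double) {
                sum += (Double) value;
            } else {
                links = (List<String>) value;
            }
        }
        return new Page((1 - d) + d * sum, links);
    }

    static Map<String, Page> iterate(Map<String, Page> pages, double d) {
        Map<String, List<Object>> grouped = new TreeMap<>();
        for (Map.Entry<String, Page> entry : pages.entrySet()) {
            map(entry.getKey(), entry.getValue(), grouped);
        }
        Map<String, Page> next = new TreeMap<>();
        for (Map.Entry<String, List<Object>> entry : grouped.entrySet()) {
            next.put(entry.getKey(), reduce(entry.getValue(), d));
        }
        return next;
    }

    // Sample graph (an assumption): A -> B, C; B -> C; C -> A, all with PR 1.0.
    static Map<String, Page> sampleGraph() {
        Map<String, Page> pages = new TreeMap<>();
        pages.put("A", new Page(1.0, Arrays.asList("B", "C")));
        pages.put("B", new Page(1.0, Arrays.asList("C")));
        pages.put("C", new Page(1.0, Arrays.asList("A")));
        return pages;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Page> entry : iterate(sampleGraph(), 0.85).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue().rank);
        }
    }
}
```

A driver loop outside the map and reduce steps would call `iterate` repeatedly and decide when to stop, which corresponds to the non-parallel component mentioned on the slide.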

36 The Original Challenges
- Scheduling: assigns map and reduce tasks to the workers.
- Distribution: workers are moved to the data. (Will be important later on.)
- Synchronization: gathering, ordering and distributing intermediate results.
- Fault tolerance: detection and restart of failed tasks (next time).
- All of this relies on a distributed file system (next lecture).

37 Next time
MapReduce: systems perspective & other applications