Big Data Analytics. Lucas Rego Drumond

1 Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33

2 Outline
1. Introduction
2. Learning Algorithms with MapReduce
3. PageRank
4. PageRank and MapReduce
5. Pregel

4 1. Introduction Overview
- Part III: Machine Learning Algorithms
- Part II: Large Scale Computational Models
- Part I: Distributed Database, Distributed File System

5 1. Introduction MapReduce - Review
1. Each mapper transforms a set of input key-value pairs into a list of intermediate (output key, value) pairs.
2. All intermediate values are grouped by their output key.
3. Each reducer receives all the intermediate values associated with a given key.
4. Each reducer associates one final value with each key.
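The four steps can be sketched as a small in-memory function. This is only an illustration: the names `map_reduce`, `count_mapper` and `count_reducer` and the word-count task are assumptions made for the example, not part of the lecture, and a real framework would distribute the work across machines.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Single-process sketch of the map, group and reduce steps."""
    groups = defaultdict(list)
    # Steps 1 and 2: run the mapper and group intermediate values by output key.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Steps 3 and 4: each reducer call sees all values of one key
    # and produces one final value for it.
    return {k: reducer(k, values) for k, values in groups.items()}

# Hypothetical example task: count words over (doc_id, text) records.
def count_mapper(doc_id, text):
    for word in text.split():
        yield word, 1

def count_reducer(word, counts):
    return sum(counts)

counts = map_reduce([(1, "a b a"), (2, "b c")], count_mapper, count_reducer)
```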

7 2. Learning Algorithms with MapReduce Statistical Query and Summation Form
The Statistical Query Model:
- The learning algorithm accesses the problem only through a statistical query oracle.
- The algorithm does not operate on specific data points; it only works on sums over the data.
Two-step process:
1. Compute sufficient statistics by summing functions over the data.
2. Perform calculations on the sums to compute the predictive model.
Reference: Chu, C., Kim, S. K., Lin, Y. A., Yu, Y., Bradski, G., Ng, A. Y., Olukotun, K. (2007). Map-reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 19, 281.

8 2. Learning Algorithms with MapReduce Statistical Query and MapReduce
The Statistical Query Model maps directly onto the MapReduce framework:
1. Map: compute partial sums over parts of the data in parallel.
2. Reduce: combine the partial sums and perform calculations on them to compute the predictive model.
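A minimal sketch of this pattern, using the mean and variance of a list of numbers as a stand-in for a "predictive model". The example task and all function names are assumptions chosen for illustration; the point is only that the final result depends on the data through sums alone.

```python
def map_partial_sums(part):
    # Map: sufficient statistics of one data partition (count, sum, sum of squares).
    return (len(part), sum(part), sum(x * x for x in part))

def reduce_sums(partials):
    # Reduce: add up the per-partition statistics.
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    return n, s, ss

def mean_and_variance(partitions):
    n, s, ss = reduce_sums([map_partial_sums(p) for p in partitions])
    mean = s / n
    # Population variance computed from the sums only: E[x^2] - E[x]^2.
    return mean, ss / n - mean * mean

mean, variance = mean_and_variance([[1.0, 2.0], [3.0, 4.0]])
```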

9 2. Learning Algorithms with MapReduce Example: Linear Regression
Closed form solution:
$$\hat{\beta} := \arg\min_{\beta} \|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2$$
Where:
$$A = X^T X = \sum_{i=1}^{m} x_i x_i^T$$
$$b = X^T y = \sum_{i=1}^{m} x_i y_i$$
$$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y = (A + \lambda I)^{-1} b$$
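For completeness, the closed form follows from setting the gradient of the regularized objective to zero (a standard step that the slide leaves implicit):

$$\nabla_{\beta}\left(\|X\beta - y\|_2^2 + \lambda\|\beta\|_2^2\right) = 2X^T(X\beta - y) + 2\lambda\beta = 0 \;\Rightarrow\; (X^T X + \lambda I)\,\beta = X^T y$$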

10 2. Learning Algorithms with MapReduce Data Partition
[Figure: partitioning of the data X and the targets Y]

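11 2. Learning Algorithms with MapReduce Example: Linear Regression
Assume we have N partitions, each with J datapoints:
$$\hat{\beta} := \arg\min_{\beta} \sum_{p=1}^{N} \|X^{(p)}\beta - y^{(p)}\|_2^2 + \lambda \|\beta\|_2^2$$
Map:
- Input: Key: partition number $p$; Value: $(X^{(p)}, y^{(p)})$
- Output: $\left(a, \sum_{i=1}^{J} x_i^{(p)} x_i^{(p)T}\right)$ and $\left(b, \sum_{i=1}^{J} x_i^{(p)} y_i^{(p)}\right)$
Reduce:
- Input: $\left(a, \sum_{i=1}^{J} x_i^{(p)} x_i^{(p)T}\right)$ and $\left(b, \sum_{i=1}^{J} x_i^{(p)} y_i^{(p)}\right)$
- Output: $\left(A, \sum_{p=1}^{N} \sum_{i=1}^{J} x_i^{(p)} x_i^{(p)T}\right)$ and $\left(b, \sum_{p=1}^{N} \sum_{i=1}^{J} x_i^{(p)} y_i^{(p)}\right)$
where $X \in \mathbb{R}^{m \times n}$, $y \in \mathbb{R}^{m}$, $X^{(p)} \in \mathbb{R}^{J \times n}$ and $y^{(p)} \in \mathbb{R}^{J}$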

12 2. Learning Algorithms with MapReduce Example: Linear Regression
Assume we have N partitions, each with J datapoints:
$$\hat{\beta} := \arg\min_{\beta} \sum_{p=1}^{N} \|X^{(p)}\beta - y^{(p)}\|_2^2 + \lambda \|\beta\|_2^2$$
1. $(A, b) \leftarrow \text{MapReduce}\left(\left(p, (X^{(p)}, y^{(p)})\right)_{p=1,\dots,N}\right)$
2. $\hat{\beta} = (A + \lambda I)^{-1} b$
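The two steps can be sketched in plain Python as follows. This is a single-process illustration, not a distributed implementation: `map_partition`, `reduce_sums`, `solve` and `ridge_fit` are hypothetical names, matrices are lists of rows, and the small Gaussian-elimination solver stands in for whatever linear solver would be used in practice.

```python
def map_partition(X_p, y_p):
    # Map: per-partition sums of x x^T and x y.
    n = len(X_p[0])
    A = [[sum(x[r] * x[c] for x in X_p) for c in range(n)] for r in range(n)]
    b = [sum(x[r] * y for x, y in zip(X_p, y_p)) for r in range(n)]
    return A, b

def reduce_sums(partials):
    # Reduce: add the partial A and b over all partitions.
    n = len(partials[0][1])
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for A_p, b_p in partials:
        for r in range(n):
            b[r] += b_p[r]
            for c in range(n):
                A[r][c] += A_p[r][c]
    return A, b

def solve(M, v):
    # Gaussian elimination with partial pivoting, then back substitution.
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def ridge_fit(partitions, lam):
    # Step 1: (A, b) from the map and reduce passes.
    A, b = reduce_sums([map_partition(X_p, y_p) for X_p, y_p in partitions])
    # Step 2: solve (A + lambda I) beta = b.
    for i in range(len(b)):
        A[i][i] += lam
    return solve(A, b)

# Toy data y = 1 + 2x, features [1, x], split into two partitions.
partitions = [([[1.0, 0.0], [1.0, 1.0]], [1.0, 3.0]),
              ([[1.0, 2.0], [1.0, 3.0]], [5.0, 7.0])]
beta_hat = ridge_fit(partitions, lam=0.0)
```

With `lam=0.0` the toy data is fit exactly, so `beta_hat` should be close to `[1.0, 2.0]`; a positive `lam` shrinks the coefficients.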

14 3. PageRank Google's PageRank
Problem: measure the importance of websites for ranking Google's search engine results.
- Relevant websites are more likely to have more links pointing to them.
- It is important to consider the quality of the websites that link to a given page.

15 3. PageRank Web graph
The Web can be represented as a graph containing:
- a set of webpages $W$
- a set of hyperlinks between webpages $H \subseteq W \times W$
Task: assign a numerical weight $r : W \to \mathbb{R}$ to each element $w \in W$. The weight $r(w)$ of a webpage $w \in W$ should reflect its importance relative to the other webpages.
[Figure: example web graph with pages $w_1$, $w_2$, $w_3$, $w_4$]

16 3. PageRank PageRank - The Random Surfer
Imagine a web surfer that jumps randomly from one web page to another:
- The next link to follow is chosen with uniform probability.
- The surfer follows the chosen link with some probability $\beta$.
- The surfer occasionally jumps to a random page with some small probability $1 - \beta$.
The PageRank $PR(w)$ of a web page $w$ models how likely it is that the random surfer will visit it.

17 3. PageRank PageRank - The Random Surfer
Let:
- $in(w) := \{v \mid (v, w) \in H\}$ be the set of pages that link to $w$
- $out(w) := \{v \mid (w, v) \in H\}$ be the set of pages linked by $w$
- $1 - \beta$ be the probability that the surfer jumps to a random page
Examples: $in(w_1) = \{w_3\}$, $out(w_1) = \{w_2, w_3\}$
[Figure: example web graph with pages $w_1$, $w_2$, $w_3$, $w_4$]
PageRank of a webpage $w_i$:
$$PR(w_i) = (1 - \beta) + \beta \sum_{w_j \in in(w_i)} \frac{PR(w_j)}{|out(w_j)|}$$
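The two set definitions translate almost literally into set comprehensions. In the sketch below the edge set `H` is an assumption: the transcript does not list the edges of the example graph, so `H` was chosen only to be consistent with the two examples given for $w_1$.

```python
# Hypothetical edge set (source, target), chosen to match the examples for w1.
H = {("w1", "w2"), ("w1", "w3"), ("w2", "w3"), ("w3", "w1"), ("w4", "w3")}

def in_links(w, edges):
    # in(w): all pages v with a link (v, w).
    return {v for (v, t) in edges if t == w}

def out_links(w, edges):
    # out(w): all pages v with a link (w, v).
    return {t for (v, t) in edges if v == w}

in_w1 = in_links("w1", H)
out_w1 = out_links("w1", H)
```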

18 3. PageRank PageRank - Algorithm
procedure PageRank
    input: a web graph $(W, H)$, hyperparameter $\beta$
    output: PageRank values $PR \in \mathbb{R}^{|W|}$
    PR ← $\{0\}^{|W|}$
    for $w \in W$ do
        PR(w) ← random value
    end for
    repeat
        for $w \in W$ do
            $PR(w) \leftarrow (1 - \beta) + \beta \sum_{w_j \in in(w)} \frac{PR(w_j)}{|out(w_j)|}$
        end for
    until convergence
    return PR
end procedure
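A possible Python rendering of this procedure is sketched below. Several details are assumptions, since the slide leaves them open: ranks start at 1.0 instead of a random value (to keep the run reproducible), convergence is taken to mean that the total absolute change in one sweep falls below `tol`, and every page is assumed to have at least one outgoing link. The example graph is hypothetical as well.

```python
def pagerank(out_links, beta=0.85, tol=1e-10, max_iter=1000):
    """out_links maps each page to the list of pages it links to."""
    # Derive in(w) from out(w).
    in_links = {w: [] for w in out_links}
    for w, targets in out_links.items():
        for t in targets:
            in_links[t].append(w)
    pr = {w: 1.0 for w in out_links}  # assumption: fixed start instead of random
    for _ in range(max_iter):
        change = 0.0
        for w in out_links:
            # Update in place, as in the pseudocode's inner loop.
            new = (1 - beta) + beta * sum(
                pr[v] / len(out_links[v]) for v in in_links[w])
            change += abs(new - pr[w])
            pr[w] = new
        if change < tol:  # assumed convergence criterion
            break
    return pr

# Hypothetical four-page graph.
graph = {"w1": ["w2", "w3"], "w2": ["w3"], "w3": ["w1"], "w4": ["w3"]}
ranks = pagerank(graph)
```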

19 3. PageRank PageRank - Algorithm
[Figure: two pages $w_1$ and $w_2$]
$$PR(w_1) = (1 - \beta) + \beta \frac{PR(w_2)}{|out(w_2)|}$$
$$PR(w_2) = (1 - \beta) + \beta \frac{PR(w_1)}{|out(w_1)|}$$
$\beta = 0.85$
$PR(w_1) = 0.9$, $PR(w_2) = 1.1$
[Worked iteration values for $PR(w_1)$ and $PR(w_2)$ not preserved]
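As a quick check on where this iteration should settle, and assuming each of the two pages links only to the other (so $|out(w_1)| = |out(w_2)| = 1$, which the two-page example suggests but the text does not state), the fixed point can be worked out by substitution with $\beta = 0.85$:

$$\begin{aligned} PR(w_1) &= 0.15 + 0.85\,PR(w_2) \\ PR(w_2) &= 0.15 + 0.85\,PR(w_1) \\ \Rightarrow\; PR(w_1) &= 0.15 + 0.85\,(0.15 + 0.85\,PR(w_1)) = 0.2775 + 0.7225\,PR(w_1) \\ \Rightarrow\; PR(w_1) &= 1, \quad PR(w_2) = 1 \end{aligned}$$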

20 3. PageRank PageRank Algorithm
[Figure: example web graph with pages $w_1$, $w_2$, $w_3$, $w_4$ and their PageRank values]
$PR(w_1) = 1.49$, $PR(w_2) = 0.78$, $PR(w_3) = 1.58$, $PR(w_4) = 0.15$

22 4. PageRank and MapReduce MapReduce Implementation of PageRank
How can PageRank be implemented on MapReduce?

23 4. PageRank and MapReduce MapReduce Limitations
Iterations:
- A MapReduce job is a fixed pipeline of one Map phase followed by one Reduce phase.
- MapReduce does not naturally support iterative algorithms.
Dependency between data points:
- Data points should be independent of each other.
- MapReduce is not well suited to graph-like data.

25 4. PageRank and MapReduce PageRank with MapReduce
Two phases:
- Initialize: one complete MapReduce iteration that initializes the pages with random PageRank values.
- Compute PageRank: a repeated series of PageRank computations, one MapReduce iteration each.

26 4. PageRank and MapReduce PageRank with MapReduce - Initialize
Map:
- Input: (u, page content)
- Output: (u, (init, out(u)))
where init is the initial PageRank value, u is the URL of a webpage, and out(u) is the set of pages linked by u.
Reduce:
- Input: (u, (init, out(u)))
- Output: (u, (init, out(u)))

27 4. PageRank and MapReduce PageRank with MapReduce - Compute Ranks
Map:
- Input: $(u, (PR(u), out(u)))$
- Output: for each $v \in out(u)$: $\left(v, \frac{PR(u)}{|out(u)|}\right)$, and additionally $(u, out(u))$ so that the link structure is passed on
Reduce:
- Input: $(u, out(u))$ and $\{(u, val)\}$
- Sums up all the values val and computes the new PageRank $PR(u)$ for u
- Output: $(u, (PR(u), out(u)))$
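One way to sketch the compute-ranks phase in a single process is shown below. The tagging of intermediate values with `"links"` and `"rank"` is an assumption made here so that the reducer can tell the two kinds of record apart; the slides do not say how this is done. Initial ranks of 1.0, the fixed iteration count and the example graph are likewise illustrative choices.

```python
from collections import defaultdict

def rank_mapper(u, value):
    pr_u, out_u = value
    yield u, ("links", out_u)                 # pass the link structure along
    for v in out_u:
        yield v, ("rank", pr_u / len(out_u))  # contribution of u to v

def rank_reducer(u, values, beta=0.85):
    out_u, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            out_u = payload
        else:
            total += payload
    return (1 - beta) + beta * total, out_u

def run_iteration(state, beta=0.85):
    # One MapReduce job: map every record, group by key, reduce each group.
    groups = defaultdict(list)
    for u, value in state.items():
        for key, out in rank_mapper(u, value):
            groups[key].append(out)
    return {u: rank_reducer(u, vals, beta) for u, vals in groups.items()}

def pagerank_mapreduce(out_links, iterations=100, beta=0.85):
    # Initialize phase (assumption: 1.0 instead of random values).
    state = {u: (1.0, links) for u, links in out_links.items()}
    for _ in range(iterations):
        state = run_iteration(state, beta)
    return {u: pr for u, (pr, _) in state.items()}

# Hypothetical four-page graph.
graph = {"w1": ["w2", "w3"], "w2": ["w3"], "w3": ["w1"], "w4": ["w3"]}
ranks = pagerank_mapreduce(graph)
```

Note that a page without incoming links, such as `w4` here, still reaches the reducer through its own `"links"` record and therefore receives the rank $1 - \beta$.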

28 4. PageRank and MapReduce Extensions of MapReduce
- Support for repeated MapReduce: HaLoop, Twister
- Graph models: Giraph, Pregel, GraphLab

30 5. Pregel Pregel
- Distributed graph processing programming model from Google
- Bulk Synchronous Parallel computation: synchronous iterations (called supersteps)
- In a superstep, each vertex asynchronously executes some user-defined function in parallel
- Message passing: vertices exchange messages with their neighbors

31 5. Pregel Pregel Programming Model
A data graph is defined where:
1. each vertex performs some computation
2. a vertex sends messages to neighboring vertices
3. each vertex processes incoming messages
The graph is distributed across computing nodes.
[Figure: graph distributed across computing nodes]

32 5. Pregel Pregel Programming Model
We have two types of entities:
Vertex:
1. Has a unique identifier
2. Has a modifiable, user-defined value
Edge:
1. Has source and target vertex identifiers
2. Has a modifiable, user-defined value
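The two entity types could be modelled with plain data classes, for instance as below. The class and field names are assumptions for illustration and do not correspond to the API of Pregel or any specific framework.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Vertex:
    id: str            # unique identifier
    value: Any = None  # modifiable, user-defined value

@dataclass
class Edge:
    source: str        # identifier of the source vertex
    target: str        # identifier of the target vertex
    value: Any = None  # modifiable, user-defined value

v = Vertex("w1", 1.0)
e = Edge("w1", "w2")
v.value = 2.0  # values can be changed during the computation
```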

33 5. Pregel Pregel Program
When the program is loaded, each vertex executes its computation and sends messages to its out-neighbors.
Each vertex:
1. receives messages from its in-neighbors
2. processes the messages
3. decides whether or not to send new messages
4. decides whether or not to halt
The process is repeated until all vertices have halted.

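34 5. Pregel Pregel - State of a vertex
[Figure: vertex state machine with the states Active and Inactive. A vote to halt moves a vertex from Active to Inactive; a received message moves it from Inactive back to Active.]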

36 5. Pregel Pregel - Aggregators
- Pregel supports global communication through aggregators.
- Each vertex can provide a value to the aggregator in each superstep.
- The aggregator produces a single aggregated value.
- Each vertex has access to the aggregated value of the previous superstep.

37 5. Pregel Pregel Example: finding the max value in the graph
procedure VertexUpdate
    input: current value v, messages M
    send v on all outgoing edges
    flag ← true
    for $m \in M$ do
        if m > v then
            v ← m
            flag ← false
        end if
    end for
    if flag then
        halt
    end if
    return v
end procedure
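The following single-process simulation sketches how such a vertex program could run over supersteps. It deliberately uses a variant of the procedure: a vertex sends its value only in the first superstep or after its value has changed, and it votes to halt after every superstep, being reactivated only by incoming messages. Sending unconditionally would keep waking up halted neighbors. The function name, the superstep limit and the example graph are assumptions.

```python
from collections import defaultdict

def max_value_pregel(values, out_edges, max_supersteps=100):
    values = dict(values)
    inbox = {v: [] for v in values}
    active = set(values)  # all vertices start active
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = defaultdict(list)
        for v in active:
            new_value = max([values[v]] + inbox[v])
            changed = new_value > values[v]
            values[v] = new_value
            # Variant: send only in superstep 0 or when the value changed.
            if superstep == 0 or changed:
                for t in out_edges.get(v, []):
                    outbox[t].append(new_value)
            # Every vertex votes to halt here; a message reactivates it.
        inbox = {v: outbox.get(v, []) for v in values}
        active = {v for v in values if inbox[v]}
        superstep += 1
    return values

# Hypothetical ring graph a -> b -> c -> d -> a.
result = max_value_pregel(
    {"a": 3, "b": 6, "c": 2, "d": 1},
    {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]},
)
```

On this ring the largest value travels once around the cycle, after which no vertex changes and no further messages are sent, so the loop ends with every vertex holding 6.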

38 5. Pregel PageRank using Pregel
PageRank iteration:
$$PR(w_i) = (1 - \beta) + \beta \sum_{w_j \in in(w_i)} \frac{PR(w_j)}{|out(w_j)|}$$
Let:
- $m_j$ be the incoming message from node $j$
- $v$ be the current node
Vertex update function:
1. Send $\frac{PR(w_v)}{|out(w_v)|}$ on all outgoing edges.
2. Collect the incoming messages $m_j$.
3. $PR(w_v) = (1 - \beta) + \beta \sum_{m_j \in M} m_j$

39 5. Pregel PageRank using Pregel
- The algorithm can stop when it reaches a predefined maximum number of supersteps.
- An aggregator is used to determine convergence.
Aggregator:
- Gathers from all nodes the difference between the value in the current superstep and the value in the previous one.
- The aggregated value is the sum of all the differences.

40 5. Pregel PageRank using Pregel
[Figure: example web graph with pages $w_1$, $w_2$, $w_3$, $w_4$]
procedure VertexUpdate
    input: current value PR(v), number of outgoing edges |out(v)|, messages M, aggregated value a
    if a > 0 then
        send $\frac{PR(v)}{|out(v)|}$ on all outgoing edges
        old ← PR(v)
        $PR(v) \leftarrow (1 - \beta) + \beta \sum_{m_j \in M} m_j$
        send PR(v) - old to the aggregator
    else
        halt
    end if
end procedure
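A single-process simulation of this vertex program might look as follows. It departs from the pseudocode in a few labeled ways: the aggregator sums absolute differences and is compared against a small tolerance `tol` rather than 0 (signed differences can cancel each other out), the first superstep only sends messages, and later supersteps update first and then send. Initial ranks of 1.0, the function name and the example graph are assumptions, and every vertex is assumed to have at least one outgoing edge.

```python
def pagerank_pregel(out_edges, beta=0.85, tol=1e-10, max_supersteps=1000):
    pr = {v: 1.0 for v in out_edges}  # assumption: fixed initial rank
    inbox = {v: [] for v in out_edges}
    aggregated = float("inf")         # aggregator value of the previous superstep
    superstep = 0
    while aggregated > tol and superstep < max_supersteps:
        outbox = {v: [] for v in out_edges}
        total_change = 0.0
        for v, targets in out_edges.items():
            if superstep > 0:
                old = pr[v]
                pr[v] = (1 - beta) + beta * sum(inbox[v])
                # Assumption: aggregate absolute differences.
                total_change += abs(pr[v] - old)
            for t in targets:
                outbox[t].append(pr[v] / len(targets))
        inbox = outbox  # messages become visible in the next superstep
        if superstep > 0:
            aggregated = total_change
        superstep += 1
    return pr

# Hypothetical four-page graph.
graph = {"w1": ["w2", "w3"], "w2": ["w3"], "w3": ["w1"], "w4": ["w3"]}
ranks = pagerank_pregel(graph)
```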