MapReduce and Distributed Data Analysis
Google Research
Dealing With Massive Data

                     Single Processor                   Multiple Processors
Polynomial Memory    RAM                                PRAM
Sublinear Memory     Sketches, External Memory,         MapReduce, Distributed Sketches
                     Property Testing
What about Data Streams?

              Single Processor    Multiple Processors
Batch Input   RAM                 MapReduce
Online        Data Streams        Distributed Sketches, Active DHTs
Today's Focus: MapReduce
- Multiple processors: 10s to 10,000s of processors
- Sublinear memory: a few GB of memory per machine, even for TB+ datasets
  - Unlike PRAMs: memory is not shared
- Batch processing: analysis of existing data
  - Extensions used for incremental updates and online algorithms
Why MapReduce?
- Practice: used very widely for large data analysis
  - Google, Yahoo!, Amazon, Facebook, Netflix, LinkedIn, New York Times, eHarmony, ...
- Is it a fad? No! (Do fads last 10 years?)
  - Many similar implementations and abstractions on top of MR: Hadoop, Pig, Hive, Flume, Pregel, ...
  - Same computational model underneath
- Why? Practice that needs theory
  - Makes data locality explicit (which sometimes leads to faster sequential algorithms!)
MR Flow
Computation proceeds in rounds. In each round, every machine:
- Receives input: the input must fit into memory
- Computes: a full Turing machine computation
- Produces output: each output is annotated with the address of the machine that receives it in the next round
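To make the round structure concrete, here is a minimal simulation of the model: one round runs an arbitrary sequential computation on each machine's in-memory input and regroups the outputs by destination machine for the next round. The names (`run_round`, `compute`) are illustrative, not part of any real MapReduce API.

```python
from collections import defaultdict

def run_round(inputs, compute):
    """Simulate one MapReduce round.

    inputs:  dict mapping machine id -> list of records (must fit in memory)
    compute: per-machine sequential function; yields (next_machine, record)
             pairs, i.e., output annotated with its destination address
    """
    next_inputs = defaultdict(list)
    for machine_id, records in inputs.items():
        for dest, record in compute(machine_id, records):
            next_inputs[dest].append(record)  # routed for the next round
    return dict(next_inputs)
```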
MR: Multiple Rounds
Many machines, and multiple rounds. The communication topology between rounds is arbitrary: where the Round 2 inputs go depends on the output of Round 1.
[Diagram: Round 1 machines sending their outputs to Round 2 machines]
Data Streams vs. MapReduce
Distributed sum: given a set of n numbers $a_1, a_2, \ldots, a_n \in \mathbb{R}$, find $S = \sum_i a_i$.
- Stream: maintain a partial sum $S_j = \sum_{i \le j} a_i$, updated with every element $a_i$
- MapReduce: in Round 1, compute $M_j = a_{jk} + a_{jk+1} + \cdots + a_{(j+1)k-1}$ for $k = \sqrt{n}$; in Round 2, add the $\sqrt{n}$ partial sums
MR Modeling [Sketch]
[Diagram: Distributed Sum in MR — Round 1 computes the partial sums, Round 2 combines them]
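A minimal simulation of the two-round sum (illustrative only; `mr_distributed_sum` is not a real MapReduce job):

```python
import math

def mr_distributed_sum(numbers):
    """Two-round MapReduce sum: k = sqrt(n) machines each add a block of
    k numbers in Round 1; one machine adds the partial sums in Round 2."""
    n = len(numbers)
    k = max(1, math.isqrt(n))
    # Round 1: machine j receives block j and emits one partial sum M_j.
    partial_sums = [sum(numbers[j:j + k]) for j in range(0, n, k)]
    # Round 2: a single machine receives the ~sqrt(n) partial sums.
    return sum(partial_sums)

assert mr_distributed_sum(list(range(100))) == sum(range(100))
```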
Modeling
For an input of size n:
- Memory: cannot store the data in memory on a single machine. Insist on sublinear memory per machine: $n^{1-\epsilon}$ for some $\epsilon > 0$.
- Machines: machines in a cluster do not share memory. Insist on a sublinear number of machines: $n^{1-\epsilon}$ for some $\epsilon > 0$.
- Synchronization: computation proceeds in rounds. Count the number of rounds; aim for O(1) rounds.
Not Modeling
"Lies, damned lies, statistics" ... and big-O notation, and competitive analysis, and ...
Communication: very important, makes a big difference.
- Order-of-magnitude improvements come from:
  - Moving code to data (and not data to code)
  - Working with graphs: saving the graph structure locally between rounds
  - Job scheduling (same rack / different racks, etc.)
- Bounded by $n^{2-2\epsilon}$, the total memory of the system, in the model
- Minimizing communication is always a goal
How Powerful is this Model?
Different tradeoffs from PRAM:
- PRAM: LOTS of very simple cores, communication every round
- MR: many full TM cores, batch communication
Formally:
- Can simulate PRAM algorithms with MR; in practice, can use the same ideas without the formal simulation
- One round of MR per round of PRAM: O(log n) rounds total
- Hard to get below o(log n) rounds; new ideas are needed!
How Powerful is this Model?
Compared to data streams:
- Solving different problems (batch vs. online)
- But can use similar ideas (e.g., sketching)
Compared to BSP:
- Closest in spirit
- But we do not optimize parameters during the algorithm design phase
Outline
- Introduction
- MapReduce
- MR Algorithmics
  - Connected Components
  - Matchings
  - Greedy Algorithms
- Open Problems
Algorithmics
Find the core of the problem: reduce the problem size in parallel, then solve the smaller instance sequentially.
Roadmap:
1. Identify redundant information
2. Filter out the redundancy to reduce the input size
3. Solve the smaller problem
4. Use the solution as a seed for the larger problem
Connected Components
Given a graph (a code sketch follows below):
1. Partition the edges randomly across machines
2. Summarize: keep $\le n - 1$ edges per partition (a spanning forest)
3. Recombine the summaries on one machine
4. Compute the connected components
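A minimal simulation of this partition-summarize-recombine pipeline, using a simple union-find for the spanning forests; the names (`mr_connected_components`, `UnionFind`) are illustrative:

```python
import random
from collections import defaultdict

class UnionFind:
    """Union-find, used both for spanning forests and final components."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # edge is redundant: endpoints already connected
        self.parent[ra] = rb
        return True

def mr_connected_components(edges, k):
    """Two-round CC sketch: partition edges over k machines, keep a
    spanning forest (<= n-1 edges) per partition, recombine on one machine."""
    # Round 1: random partition; each machine summarizes its edge set.
    partitions = defaultdict(list)
    for e in edges:
        partitions[random.randrange(k)].append(e)
    forest_edges = []
    for part in partitions.values():
        uf = UnionFind()
        forest_edges.extend(e for e in part if uf.union(*e))
    # Round 2: one machine merges the k spanning forests.
    uf = UnionFind()
    for e in forest_edges:
        uf.union(*e)
    return uf

uf = mr_connected_components([(1, 2), (2, 3), (4, 5)], k=2)
assert uf.find(1) == uf.find(3) and uf.find(1) != uf.find(4)
```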
Analysis
Given k machines:
- Total runtime: $O((m/k + nk) \cdot \alpha(n))$, with $\alpha$ the inverse Ackermann function
- Memory per machine: $O(m/k + nk)$; actually, a machine can stream through its edges, so $O(n)$ suffices
- 2 rounds total
Outline
- Introduction
- MapReduce
- MR Algorithmics
  - Connected Components
  - Matchings
  - Greedy Algorithms
- Open Problems
Matchings
Finding matchings: given an undirected graph $G = (V, E)$,
- Find a maximum matching
- Find a maximal matching
Try random partitions: find a matching on each partition, then compute a matching on the union of those matchings. Does not work: may make very limited progress.
Looking for Redundancy
Matching: an edge can be dropped if either of its endpoints is already matched.
Idea:
- Find a seed matching (on a sample)
- Remove all dead edges (edges with an already-matched endpoint)
- Recurse on the remaining edges
Algorithm
Given a graph (a code sketch follows below):
1. Take a random sample of the edges
2. Find a maximal matching on the sample
3. Go back to the original graph and drop the dead edges
4. Find a matching on the remaining edges
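A minimal sketch of one sample-and-prune phase; `greedy_maximal_matching` and the sampling rate `p` are illustrative stand-ins for the sequential step on a single machine. Repeating the phase on `remaining` (and taking the union of the matchings found) yields a maximal matching; the lemma below bounds how quickly `remaining` shrinks.

```python
import random

def greedy_maximal_matching(edges):
    """Sequential greedy maximal matching (runs on one machine)."""
    matched, matching = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            matched.update((u, v))
            matching.append((u, v))
    return matching, matched

def sample_and_prune_matching(edges, p):
    """One phase: match a p-sample of the edges, then prune dead edges."""
    sample = [e for e in edges if random.random() < p]
    matching, matched = greedy_maximal_matching(sample)
    # Dead edges have an already-matched endpoint; only live edges remain.
    remaining = [(u, v) for u, v in edges
                 if u not in matched and v not in matched]
    return matching, remaining

matching, remaining = sample_and_prune_matching(
    [(1, 2), (2, 3), (3, 4), (5, 6)], p=0.5)
```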
Analysis
Key Lemma: suppose the sampling rate is $p = n^{1+c}/m$ for some $c > 0$. Then, with high probability, the number of edges remaining after the prune step is at most $\frac{2n}{p} = \frac{2m}{n^c}$.
Proof [sketch]:
- Suppose $I \subseteq V$ is the set of vertices left unmatched after the prune step
- $I$ must be an independent set in the sampled graph
- If $|E(I)| > O(n/p)$, then $I$ is an independent set in the sample with probability at most $e^{-n}$
- Union bound over all $2^n$ sets $I$
Corollaries:
- Given $n^{1+c}$ memory, the algorithm requires $O(1)$ rounds
- Given $O(n \log n)$ memory, the algorithm requires $O(\log n / \log \log n)$ rounds
- PRAM simulations: $\Omega(\log n)$ rounds
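The high-probability bound follows from a short calculation; here is a sketch of the computation behind the proof (constants chosen for convenience, as an illustration):

```latex
% If I induces more than 2n/p edges, the probability that none of them
% is sampled (i.e., that I is independent in the sample) is tiny:
\Pr[\text{no edge of } E(I) \text{ is sampled}]
  = (1-p)^{|E(I)|} \le e^{-p\,|E(I)|} < e^{-p \cdot 2n/p} = e^{-2n}.
% Union bound over all 2^n candidate sets I (using \ln 2 < 1):
2^n \cdot e^{-2n} \le e^{-n} \to 0.
% Hence, w.h.p., the unmatched vertices induce at most 2n/p = 2m/n^c edges.
```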
Greedy Algorithms
Max k-cover: given a universe $U = \{u_1, u_2, \ldots, u_n\}$ and sets $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$, select $\mathcal{S}' \subseteq \mathcal{S}$ with $|\mathcal{S}'| = k$ that maximizes $\left| \bigcup_{S \in \mathcal{S}'} S \right|$.
Classical algorithm, Greedy (sketched below):
- Iteratively add the largest set
- Remove all of the covered elements
- Guarantees a $(1 - 1/e)$ approximation to the optimal
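A compact rendering of the classical greedy (illustrative Python, not from the talk):

```python
def greedy_max_k_cover(sets, k):
    """Classical greedy for max k-cover: iteratively add the set covering
    the most uncovered elements; a (1 - 1/e)-approximation."""
    covered, chosen = set(), []
    candidates = [set(s) for s in sets]
    for _ in range(k):
        # Pick the set covering the most still-uncovered elements.
        best = max(candidates, key=lambda s: len(s - covered), default=set())
        if not best - covered:
            break  # nothing left to cover
        chosen.append(best)
        covered |= best  # remove covered elements from consideration
    return chosen, covered

chosen, covered = greedy_max_k_cover([{1, 2, 3}, {3, 4}, {4, 5, 6, 7}], k=2)
assert covered == {1, 2, 3, 4, 5, 6, 7}
```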
Outline
- Introduction
- MapReduce
- MR Algorithmics
  - Connected Components
  - Matchings
  - Greedy Algorithms
- Open Problems
Not So Greedy Algorithms
Not-so-greedy algorithm:
- Iteratively add a set within a $(1 + \epsilon)$ factor of the largest
- Remove all of the covered elements
Correctness: each step makes almost as much progress as the regular greedy algorithm, so we get a $1 - 1/e - O(\epsilon)$ approximation.
Streaming Algorithms
Intuition: look for large sets first, then progressively look for smaller sets.
Natural algorithm (sketched below):
- Define thresholds $v_i = n/(1+\epsilon)^i$
- Stream through the elements, adding sets of value at least $v_1$
- Stream through the elements, adding sets of value at least $v_2$
- ... until completion
Analysis: every decision is within a $(1+\epsilon)$ factor of OPT; need $O\!\left(\frac{\log \max_j |S_j|}{\epsilon}\right)$ passes.
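A minimal single-machine rendering of the threshold scheme; `eps` and the exact threshold schedule are assumptions for illustration:

```python
def threshold_greedy_cover(sets, k, eps=0.1):
    """Not-so-greedy max k-cover: sweep thresholds v_i = n / (1+eps)^i and,
    in each pass, take any set whose marginal coverage meets the threshold."""
    n = len({u for s in sets for u in s})
    covered, chosen = set(), []
    v = float(n)
    while v >= 1 and len(chosen) < k:
        for s in map(set, sets):  # one streaming pass over the sets
            if len(chosen) >= k:
                break
            if len(s - covered) >= v:  # set still clears the threshold
                chosen.append(s)
                covered |= s
        v /= 1 + eps  # next pass accepts slightly smaller sets
    return chosen, covered

chosen, covered = threshold_greedy_cover([{1, 2, 3}, {3, 4}, {4, 5, 6, 7}], k=2)
```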
Parallel Algorithms
Simulate the streaming algorithm: given a threshold $v_i$, find a sequence of sets, each covering at least $v_i$ elements.
Similar to maximal matching:
- Run the streaming algorithm on a sample
- Remove the sets whose weight falls below the threshold
- Repeat until no sets meet the threshold $v_i$
Theorem [MKVV 12]: if each set is sampled with probability $p = k/m$, each phase of the streaming algorithm can be simulated in $O(1/\epsilon)$ rounds.
Sample&Prune
High-level primitive (skeleton below):
- Take a sample of the input
- Evaluate the function on the sample
- Prune the input
- Repeat
Much more general:
- Handles a general class of greedy algorithms
- (Sub)modular function maximization subject to sets of constraints: matroid constraints, multiple knapsack constraints
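The primitive abstracts over both the matching and the coverage uses above. A generic skeleton, with `solve` and `prune` supplied by the particular problem (illustrative names):

```python
import random

def sample_and_prune(items, solve, prune, p, max_phases=50):
    """Generic Sample&Prune loop: evaluate on a p-sample of what remains,
    use the partial solution to prune, and repeat until nothing is left."""
    solution, remaining = [], list(items)
    for _ in range(max_phases):
        if not remaining:
            break
        sample = [x for x in remaining if random.random() < p]
        partial = solve(sample)                # sequential step, one machine
        solution.extend(partial)
        remaining = prune(remaining, partial)  # drop now-redundant items
    return solution
```

For matchings, `solve` is a greedy maximal matching and `prune` drops edges with a matched endpoint; for coverage, `solve` is a threshold pass and `prune` discards sets that fall below the threshold.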
Outline
- Introduction
- MapReduce
- MR Algorithmics
  - Connected Components
  - Matchings
  - Greedy Algorithms
- Open Problems
Overview
MapReduce interleaves sequential and parallel computation. Many algorithms are embarrassingly parallel and don't use the full power of the intermediate sequential steps.
What are the limits of the model?
- Can simulate PRAM algorithms, but are there bigger speedups to be had?
- No known lower bounds!
- Even a simpler model is interesting, e.g., assume the data is partitioned and no replication is allowed
Classical Questions
Prefix sum: given an array $A[1 \ldots n]$, compute $B[1 \ldots n]$ where $B[i] = \sum_{j \le i} A[j]$. Relatively simple in MR: 2 rounds suffice (a sketch follows below).
List ranking: given a permutation as a set of pointers to the next element, $P[i] = j$, find the rank of each element (the number of hops needed to reach it from $P[0]$).
- Easy in O(log n) rounds (simulate the PRAM algorithm)
- Can you do it faster?
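A minimal two-round prefix-sum sketch (illustrative simulation): Round 1 machines emit block sums, a single machine takes a prefix over the $\sim\sqrt{n}$ block sums, and Round 2 machines finish their blocks locally.

```python
import math
from itertools import accumulate

def mr_prefix_sum(A):
    """Two-round MapReduce prefix sum sketch with blocks of size ~sqrt(n)."""
    n = len(A)
    k = max(1, math.isqrt(n))
    blocks = [A[i:i + k] for i in range(0, n, k)]
    # Round 1: each machine emits its block's total.
    block_sums = [sum(b) for b in blocks]
    # A prefix over the ~sqrt(n) block totals fits on a single machine.
    offsets = [0] + list(accumulate(block_sums))[:-1]
    # Round 2: each machine computes local prefixes shifted by its offset.
    B = []
    for off, b in zip(offsets, blocks):
        B.extend(off + s for s in accumulate(b))
    return B

assert mr_prefix_sum([1, 2, 3, 4]) == [1, 3, 6, 10]
```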
A Puzzle (Essentially List Ranking)
I give you an undirected graph on n nodes:
- As a list of edges
- With a promise that every vertex has degree exactly 2
You have $n^{2/3}$ machines, each with $n^{2/3}$ memory.
Your mission: tell me if the graph is connected in o(log n) rounds...
...or prove that it cannot be done.
Thank You