MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research




Dealing With Massive Data

Dealing With Massive Data
Classify approaches along two axes: memory (polynomial vs. sublinear) and processors (single vs. multiple):
- Polynomial memory, single processor: RAM
- Polynomial memory, multiple processors: PRAM
- Sublinear memory, single processor: Sketches, External Memory, Property Testing
- Sublinear memory, multiple processors: MapReduce, Distributed Sketches

What about Data Streams?
Classify along input (batch vs. online) and processors (single vs. multiple):
- Batch input, single processor: RAM
- Batch input, multiple processors: MapReduce
- Online input, single processor: Data Streams
- Online input, multiple processors: Distributed Sketches, Active DHTs

Today's Focus: MapReduce
- Multiple processors: 10s to 10,000s of processors
- Sublinear memory: a few GB of memory per machine, even for TB+ datasets; unlike PRAMs, memory is not shared
- Batch processing: analysis of existing data; extensions are used for incremental updates and online algorithms

Why MapReduce?
Practice: used very widely for large data analysis: Google, Yahoo!, Amazon, Facebook, Netflix, LinkedIn, New York Times, eHarmony, ...
Is it a fad? No! (Do fads last 10 years?) There are many similar implementations and abstractions built on top of MR: Hadoop, Pig, Hive, Flume, Pregel, ... with the same computational model underneath.
Why study it? Practice that needs theory; it makes data locality explicit (which sometimes leads to faster sequential algorithms)!

MR Flow
Computation proceeds in rounds. Each round, a machine:
- Receives input (the input must fit into memory)
- Computes (a full Turing-machine computation, i.e. an arbitrary local sequential program)
- Produces output, annotated with the address of a machine in the next round

MR Multiple Rounds
Many machines... and multiple rounds. The communication topology between rounds is arbitrary: which machine receives the output of Round 1 depends on that output, and likewise for Round 2 and later rounds.
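To make the round structure concrete, here is a minimal, purely illustrative simulation of the model in Python (the names `run_rounds` and `compute` are hypothetical, not part of any MapReduce API): each round, every machine receives the records addressed to it, runs an arbitrary local computation on data that fits in memory, and emits records annotated with the address of the machine that should receive them next round.

```python
from collections import defaultdict

def run_rounds(initial, round_fns):
    """Toy simulation of the MR model: state maps machine id -> records.
    Each round, every machine runs a local computation and emits
    (next_machine, record) pairs for the following round."""
    state = initial
    for compute in round_fns:
        nxt = defaultdict(list)
        for machine, records in state.items():
            # local, sequential computation on data that fits in memory;
            # output is annotated with the machine that receives it next round
            for dest, rec in compute(machine, records):
                nxt[dest].append(rec)
        state = dict(nxt)
    return state
```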

Data Streams vs. MapReduce
Distributed Sum: given $n$ numbers $a_1, a_2, \ldots, a_n \in \mathbb{R}$, find $S = \sum_i a_i$.
Stream: maintain a partial sum $S_j = \sum_{i \le j} a_i$, updated with every element $a_i$.
MapReduce: in Round 1, compute $M_j = a_{jk} + a_{jk+1} + \cdots + a_{j(k+1)-1}$ for $k = \sqrt{n}$; in Round 2, add the $\sqrt{n}$ partial sums.

MR Modeling [Sketch]: Distributed Sum in MR (diagram of Round 1 and Round 2)
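A minimal Python sketch of this two-round sum (illustrative only; `mr_distributed_sum` is a hypothetical name, and the per-block sums that would run on separate machines are simulated with a list comprehension):

```python
import math

def mr_distributed_sum(a):
    """Two-round MapReduce-style sum: Round 1 computes ~sqrt(n) partial
    sums in parallel (one block per machine), Round 2 adds them."""
    n = len(a)
    k = max(1, math.isqrt(n))                      # block size ~ sqrt(n)
    blocks = [a[i:i + k] for i in range(0, n, k)]
    partials = [sum(block) for block in blocks]    # Round 1: one machine per block
    return sum(partials)                           # Round 2: single machine
```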

Modeling
For an input of size $n$:
- Memory: cannot store the data in memory; insist on sublinear memory per machine, $n^{1-\epsilon}$ for some $\epsilon > 0$.
- Machines: machines in a cluster do not share memory; insist on a sublinear number of machines, $n^{1-\epsilon}$ for some $\epsilon > 0$.
- Synchronization: computation proceeds in rounds; count the number of rounds and aim for O(1) rounds.

Not Modeling
Lies, damned lies, statistics... and big-O notation, and competitive analysis, and ...
Communication: very important, makes a big difference. Order-of-magnitude improvements come from moving code to data (and not data to code), saving graph structure locally between rounds when working with graphs, and job scheduling (same rack vs. different racks, etc.).
In the model, communication is bounded by $n^{2-2\epsilon}$ (the total memory of the system); minimizing communication is always a goal.

How Powerful is this Model?
Compared to PRAMs: different tradeoffs. A PRAM has LOTS of very simple cores and communicates every round; MR has many full Turing-machine cores and batch communication.
Formally, MR can (formally) simulate PRAM algorithms, using one MR round per PRAM round, so O(log n) rounds total; in practice one can use the same idea without the formal simulation. Breaking below O(log n) rounds is hard and needs new ideas.

How Powerful is this Model?
Compared to Data Streams: solving different problems (batch vs. online), but similar ideas can be used (e.g. sketching).
Compared to BSP: closest in spirit, but the BSP parameters are not optimized during the algorithm design phase.

Outline: Introduction; MapReduce; MR Algorithmics: Connected Components, Matchings, Greedy Algorithms; Open Problems

Algorithmics
Find the core of the problem: reduce the problem size in parallel, then solve the smaller instance sequentially.
Roadmap: identify redundant information; filter out the redundancy to reduce the input size; solve the smaller problem; use the solution as a seed for the larger problem.

Connected Components
Given a graph:
1. Partition the edges randomly across machines
2. Summarize: keep at most $n - 1$ edges per partition (a spanning forest of that partition)
3. Recombine the summaries on one machine
4. Compute the connected components

Analysis
Given $k$ machines:
- Total runtime: $O((m/k + nk) \cdot \alpha(n))$ using union-find ($\alpha$ is the inverse Ackermann function)
- Memory per machine: $O(m/k + nk)$; in fact each machine can stream through its edges, so $O(n)$ memory suffices
- 2 rounds total
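The following Python sketch mirrors the four steps above under simplifying assumptions (edges are partitioned uniformly at random, the per-partition summary is a spanning forest computed with union-find, and the recombine step runs on a single simulated machine); all function names are illustrative.

```python
import random

class UnionFind:
    def __init__(self, nodes):
        self.parent = {v: v for v in nodes}
    def find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v
    def union(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return False
        self.parent[ru] = rv
        return True

def spanning_forest(nodes, edges):
    """Keep at most n-1 edges that preserve the connectivity of this partition."""
    uf, kept = UnionFind(nodes), []
    for u, v in edges:
        if uf.union(u, v):
            kept.append((u, v))
    return kept

def connected_components(nodes, edges, k=4):
    # 1. Partition the edges randomly across k simulated machines
    parts = [[] for _ in range(k)]
    for e in edges:
        parts[random.randrange(k)].append(e)
    # 2. Summarize: each machine keeps a spanning forest (<= n-1 edges)
    summaries = [spanning_forest(nodes, part) for part in parts]
    # 3./4. Recombine the forests and compute components on one machine
    uf = UnionFind(nodes)
    for forest in summaries:
        for u, v in forest:
            uf.union(u, v)
    return {v: uf.find(v) for v in nodes}   # component label per node
```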

Outline: Introduction; MapReduce; MR Algorithmics: Connected Components, Matchings, Greedy Algorithms; Open Problems

Matchings
Given an undirected graph $G = (V, E)$, find a maximum matching; as a first step, find a maximal matching.
Try random partitions: find a matching on each partition, then compute a matching on the matchings. This does not work: it may make very limited progress.

Looking for Redundancy
Matching: an edge can be dropped if one of its endpoints is already matched.
Idea: find a seed matching (on a sample), remove all dead edges, and recurse on the remaining edges.

Algorithm
Given a graph:
1. Take a random sample of the edges
2. Find a maximal matching on the sample
3. Look at the original graph and drop the dead edges
4. Find a matching on the remaining edges
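A hedged Python sketch of this filtering scheme, with one level of filtering rather than full recursion (`filtered_matching` and the sampling rate argument `p` are illustrative):

```python
import random

def greedy_maximal_matching(edges):
    """Sequential greedy maximal matching."""
    matched, matching = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching, matched

def filtered_matching(edges, p):
    """Sample-and-prune matching sketch with an illustrative sampling rate p."""
    # 1./2. Sample edges and find a maximal matching on the sample
    sample = [e for e in edges if random.random() < p]
    matching, matched = greedy_maximal_matching(sample)
    # 3. Drop dead edges: an edge is dead if either endpoint is already matched
    remaining = [(u, v) for u, v in edges
                 if u not in matched and v not in matched]
    # 4. Few edges remain w.h.p.; finish sequentially (or recurse)
    extra, _ = greedy_maximal_matching(remaining)
    return matching + extra
```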

Analysis
Key Lemma: Suppose the sampling rate is $p = n^{1+c}/m$ for some $c > 0$. Then with high probability the number of edges remaining after the prune step is at most $2n/p = 2m/n^c$.
Proof [Sketch]: Suppose $I \subseteq V$ is the set of vertices left unmatched after the prune step. Then $I$ must be an independent set in the sampled graph. If $I$ induces more than $O(n/p)$ edges, it is independent in the sample with probability at most $e^{-n}$; union bound over all $2^n$ sets $I$.
Corollaries: Given $n^{1+c}$ memory, the algorithm requires $O(1)$ rounds. Given $O(n \log n)$ memory, the algorithm requires $O(\log n / \log\log n)$ rounds. PRAM simulations take $\Omega(\log n)$ rounds.

Greedy Algorithms
Max k-cover: Given a universe $U = \{u_1, u_2, \ldots, u_n\}$ and sets $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$, select $\mathcal{S}' \subseteq \mathcal{S}$ with $|\mathcal{S}'| = k$ that maximizes $\left|\bigcup_{S \in \mathcal{S}'} S\right|$.
Classical Algorithm: Greedy. Iteratively add the largest remaining set and remove all of the covered elements. Guarantees a $(1 - 1/e)$-approximation to the optimal.
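For reference, a small Python sketch of the classical greedy (illustrative; `sets` is assumed to be a list of Python sets):

```python
def greedy_max_k_cover(universe, sets, k):
    """Classical greedy for max k-cover: repeatedly pick the set covering
    the most still-uncovered elements (a (1 - 1/e)-approximation)."""
    uncovered = set(universe)
    chosen = []
    for _ in range(k):
        best = max(sets, key=lambda s: len(uncovered & s), default=None)
        if best is None or not (uncovered & best):
            break                    # nothing left to cover
        chosen.append(best)
        uncovered -= best            # remove all newly covered elements
    return chosen
```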

Outline: Introduction; MapReduce; MR Algorithmics: Connected Components, Matchings, Greedy Algorithms; Open Problems

Not so Greedy Algorithms
Not-so-greedy algorithm: iteratively add a set within a factor $(1+\epsilon)$ of the largest, then remove all of the covered elements.
Correctness: each step makes almost as much progress as the regular greedy algorithm, giving a $(1 - 1/e - O(\epsilon))$-approximation.

Streaming Algorithms
Intuition: look for large sets first, then progressively look for smaller sets.
Natural algorithm: define thresholds $v_i = n/(1+\epsilon)^i$. Stream through the elements, adding sets of value at least $v_1$; stream again, adding sets of value at least $v_2$; ... until completion.
Analysis: every decision made is within a factor $(1+\epsilon)$ of OPT; need $O(\log(\max_i |S_i|)/\epsilon)$ passes.
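A hedged Python sketch of this thresholded pass structure (a single-machine simulation of the passes; the function name and the stopping rule `v >= 1` are illustrative):

```python
def threshold_greedy_k_cover(universe, sets, k, eps=0.1):
    """Thresholded streaming greedy sketch: geometric thresholds
    v_i = n / (1+eps)^i; in pass i, add any set still covering >= v_i
    new elements, until k sets are chosen."""
    uncovered, chosen = set(universe), []
    v = float(len(universe))
    while v >= 1 and len(chosen) < k:
        for s in sets:                      # one streaming pass at threshold v
            if len(chosen) >= k:
                break
            if len(uncovered & s) >= v:
                chosen.append(s)
                uncovered -= s
        v /= (1 + eps)                      # lower the threshold for the next pass
    return chosen
```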

Parallel Algorithms
Simulate the streaming algorithm: given a threshold $v_i$, find a sequence of sets each covering at least $v_i$ elements.
Similar to maximal matching: run the streaming algorithm on a sample, remove the sets whose weight falls below the threshold, and repeat until no sets meet the threshold.
Theorem [MKVV 12]: If each set is sampled with a suitable probability $p$ (depending on $k$ and $m$), each phase of the streaming algorithm can be simulated in $O(1/\epsilon)$ rounds.

Sample&Prune
High-level primitive: take a sample of the input; evaluate the function on the sample; prune the input; repeat.
Much more general: covers a general class of greedy algorithms, and (sub)modular function maximization subject to sets of constraints (matroid constraints, multiple knapsack constraints).
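A generic skeleton of the primitive in Python (purely illustrative; `solve`, `prune`, the sampling rate `p`, and `memory_limit` are hypothetical parameters standing in for the problem-specific pieces, and the loop assumes pruning shrinks the input each iteration):

```python
import random

def sample_and_prune(items, solve, prune, p, memory_limit):
    """Generic Sample&Prune skeleton: sample, solve on the sample
    sequentially, prune the input using the partial solution, repeat."""
    solution = None
    while len(items) > memory_limit:
        sample = [x for x in items if random.random() < p]
        solution = solve(sample, solution)                     # sequential step
        items = [x for x in items if not prune(x, solution)]   # parallel prune
    return solve(items, solution)          # finish on the input that remains
```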

Outline: Introduction; MapReduce; MR Algorithmics: Connected Components, Matchings, Greedy Algorithms; Open Problems

Overview
MapReduce interleaves sequential and parallel computation. Many algorithms are embarrassingly parallel and don't use the full power of the intermediate sequential steps.
What are the limits of the model? It can simulate PRAM algorithms, but are there bigger speedups to be had? There are no known lower bounds! Even a simpler model is interesting, e.g. assume the data is partitioned and no replication is allowed.

Classical Questions
Prefix Sum: Given an array $A[1 \ldots n]$, compute $B[1 \ldots n]$ where $B[i] = \sum_{j \le i} A[j]$. Relatively simple in MR (2 rounds suffice).
List Ranking: Given a permutation as a set of pointers to the next element, $P[i] = j$, find the rank of each element (the number of hops to reach it from $P[0]$). Easy in $O(\log n)$ rounds (simulate the PRAM algorithm). Can you do it faster?
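A minimal two-round sketch of the prefix sum in Python (single-machine simulation; `mr_prefix_sum` is a hypothetical name): Round 1 computes per-block prefix sums and block totals, Round 2 shifts each block by the sum of all earlier blocks.

```python
import math
from itertools import accumulate

def mr_prefix_sum(a):
    """Two-round MapReduce-style prefix sums: Round 1 computes per-block
    prefix sums and block totals; Round 2 adds each block's offset
    (the sum of all earlier blocks)."""
    n = len(a)
    k = max(1, math.isqrt(n))                           # block size ~ sqrt(n)
    blocks = [a[i:i + k] for i in range(0, n, k)]
    local = [list(accumulate(b)) for b in blocks]       # Round 1, in parallel
    totals = [b[-1] for b in local]
    offsets = [0] + list(accumulate(totals))[:-1]       # prefix sums of block totals
    return [x + off for b, off in zip(local, offsets) for x in b]   # Round 2
```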

A Puzzle (essentially list ranking)
I give you an undirected graph on $n$ nodes, as a list of edges, with the promise that every vertex has degree exactly 2.
You have $n^{2/3}$ machines, each with $n^{2/3}$ memory.
Your mission: tell me whether the graph is connected in $o(\log n)$ rounds... or prove that it cannot be done.

Thank You