Big Data Analytics
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science
University of Hildesheim, Germany

MapReduce II

Outline
1. Introduction
2. Learning Algorithms with MapReduce
3. PageRank
4. PageRank and MapReduce
5. Pregel

1. Introduction

Overview
Part I: Distributed Database / Distributed File System
Part II: Large Scale Computational Models
Part III: Machine Learning Algorithms

MapReduce - Review
1. Each mapper transforms its input key-value pairs into a list of output keys and intermediate values
2. All intermediate values are grouped according to their output keys
3. Each reducer receives all the intermediate values associated with a given key
4. Each reducer associates one final value with each key
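To make the four steps concrete, here is a minimal single-machine sketch in Python, using word counting as the running example. The names map_fn and reduce_fn and the in-memory sort/group "shuffle" are illustrative stand-ins, not part of any MapReduce framework:

from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # 1. transform one input key-value pair into intermediate pairs
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # 4. associate one final value with each key
    return (key, sum(values))

documents = [("doc1", "big data big graphs"), ("doc2", "big data")]

# 2. group all intermediate values by their output key (the "shuffle")
intermediate = sorted(
    (pair for k, v in documents for pair in map_fn(k, v)),
    key=itemgetter(0),
)

# 3. each reducer receives all intermediate values for a given key
result = [reduce_fn(k, [v for _, v in group])
          for k, group in groupby(intermediate, key=itemgetter(0))]
print(result)  # [('big', 3), ('data', 2), ('graphs', 1)]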

2. Learning Algorithms with MapReduce

Statistical Query and Summation Form

The Statistical Query Model:
- The learning algorithm accesses the problem only through a statistical query oracle
- The algorithm does not operate on specific data points; instead, it works only on sums of functions of the data

Two-step process:
1. Compute sufficient statistics by summing functions over the data
2. Perform a calculation on the sums to compute the predictive model

Chu, C., Kim, S. K., Lin, Y. A., Yu, Y., Bradski, G., Ng, A. Y., Olukotun, K. (2007). Map-Reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 19, 281.

Statistical Query and MapReduce

The Statistical Query Model lends itself perfectly to the MapReduce framework:
1. Map: compute the sums over parts of the data in parallel
2. Reduce: perform the calculation on the sums to compute the predictive model

Example: Linear Regression

Closed form solution:

\hat{\beta} := \arg\min_\beta \|X\beta - y\|_2^2 + \lambda\|\beta\|_2^2

where:

A = X^T X = \sum_{i=1}^{m} x_i x_i^T
b = X^T y = \sum_{i=1}^{m} x_i y_i
\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y = (A + \lambda I)^{-1} b

Data Partition

(Figure: the rows of X and y are split into partitions X^{(1)}, ..., X^{(N)} and y^{(1)}, ..., y^{(N)}, one partition per mapper.)

Example: Linear Regression

Assume we have N partitions, each with J datapoints, where X ∈ R^{m×n}, y ∈ R^m, X^{(p)} ∈ R^{J×n} and y^{(p)} ∈ R^J:

\hat{\beta} := \arg\min_\beta \sum_{p=1}^{N} \|X^{(p)}\beta - y^{(p)}\|_2^2 + \lambda\|\beta\|_2^2

Map:
  Input: key: partition number p; value: (X^{(p)}, y^{(p)})
  Output: (a, \sum_{i=1}^{J} x_i^{(p)} x_i^{(p)T}) and (b, \sum_{i=1}^{J} x_i^{(p)} y_i^{(p)})

Reduce:
  Input: all pairs (a, \sum_{i=1}^{J} x_i^{(p)} x_i^{(p)T}) and (b, \sum_{i=1}^{J} x_i^{(p)} y_i^{(p)})
  Output: (A, \sum_{p=1}^{N} \sum_{i=1}^{J} x_i^{(p)} x_i^{(p)T}) and (b, \sum_{p=1}^{N} \sum_{i=1}^{J} x_i^{(p)} y_i^{(p)})

Example: Linear Regression

Assume we have N partitions, each with J datapoints:

\hat{\beta} := \arg\min_\beta \sum_{p=1}^{N} \|X^{(p)}\beta - y^{(p)}\|_2^2 + \lambda\|\beta\|_2^2

1. (A, b) ← MapReduce((p, (X^{(p)}, y^{(p)})) for p = 1, ..., N)
2. \hat{\beta} = (A + \lambda I)^{-1} b
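As an illustration, here is a minimal single-machine sketch of this job in Python with NumPy: each map_partition call computes the partial sums A_p = X^{(p)T} X^{(p)} and b_p = X^{(p)T} y^{(p)} for one partition, and reduce_and_solve adds them up and solves (A + λI)β = b. The function names, data, and the number of partitions are assumptions for illustration; in a real deployment the map calls would run on separate workers.

import numpy as np

def map_partition(X_p, y_p):
    # partial sufficient statistics for one partition
    return X_p.T @ X_p, X_p.T @ y_p

def reduce_and_solve(partials, lam):
    # sum the partial statistics, then solve the closed form
    A = sum(a for a, _ in partials)
    b = sum(b for _, b in partials)
    return np.linalg.solve(A + lam * np.eye(A.shape[0]), b)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(90, 5)), rng.normal(size=90)
partitions = np.array_split(np.arange(90), 3)   # N = 3 partitions (assumed)
partials = [map_partition(X[idx], y[idx]) for idx in partitions]
beta = reduce_and_solve(partials, lam=0.1)

# sanity check against the non-distributed closed form
beta_direct = np.linalg.solve(X.T @ X + 0.1 * np.eye(5), X.T @ y)
assert np.allclose(beta, beta_direct)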

3. PageRank

Google's Pagerank

Problem: measure the importance of websites for Google's search engine results.
- Relevant websites are likely to have more links pointing to them
- It is important to consider the quality of the websites that link to a given page

Web Graph

The Web can be represented as a graph containing:
- a set of webpages W
- a set of hyperlinks between webpages H ⊆ W × W

Task: assign a numerical weight r: W → R to each element of W. The weight r(w) of a webpage w ∈ W should reflect its importance relative to the other webpages.

(Figure: example graph with pages w_1, ..., w_4.)

Pagerank - The Random Surfer

Imagine a web surfer that jumps from one web page to another at random:
- The next link to follow is chosen with uniform probability
- The surfer browses the chosen link with some probability β
- It occasionally jumps to a random page with some small probability 1 − β

The Pagerank PR(w) of a web page w models how likely it is that the random surfer will visit it.

Pagerank - The Random Surfer

Let:
- in(w) := {v | (v, w) ∈ H} be the set of pages that link to w
- out(w) := {v | (w, v) ∈ H} be the set of pages linked by w
- 1 − β be the probability that the surfer jumps to a random page

Example (for the graph with pages w_1, ..., w_4): in(w_1) = {w_3}, out(w_1) = {w_2, w_3}

Pagerank of a webpage w_i:

PR(w_i) = (1 - \beta) + \beta \sum_{w_j \in in(w_i)} \frac{PR(w_j)}{|out(w_j)|}

Pagerank - Algorithm

1: procedure Pagerank
   input: a web graph (W, H), hyperparameter β
   output: Pagerank values PR ∈ R^{|W|}
2:   PR ← {0}^{|W|}
3:   for w ∈ W do
4:     PR(w) ← random value
5:   end for
6:   repeat
7:     for w ∈ W do
8:       PR(w) ← (1 − β) + β Σ_{w_j ∈ in(w)} PR(w_j) / |out(w_j)|
9:     end for
10:  until convergence
11:  return PR
12: end procedure
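A direct transcription of this procedure into Python might look as follows. The example graph and the fixed initial value 1.0 (used instead of random values) are assumptions for illustration, and dangling pages without out-links get no special treatment, matching the formula above:

def pagerank(out_links, beta=0.85, tol=1e-6, max_iter=100):
    pages = list(out_links)
    # invert the adjacency structure to get in(w) for every page
    in_links = {w: [v for v in pages if w in out_links[v]] for w in pages}
    pr = {w: 1.0 for w in pages}
    for _ in range(max_iter):
        new_pr = {w: (1 - beta) + beta * sum(pr[v] / len(out_links[v])
                                             for v in in_links[w])
                  for w in pages}
        # stop once no value changes by more than the tolerance
        if max(abs(new_pr[w] - pr[w]) for w in pages) < tol:
            return new_pr
        pr = new_pr
    return pr

# example graph (assumed for illustration)
graph = {"w1": {"w2", "w3"}, "w2": {"w3"}, "w3": {"w1", "w4"}, "w4": {"w3"}}
print(pagerank(graph))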

Pagerank - Algorithm

Worked example: two pages w_1 and w_2 linking to each other, with β = 0.85 and starting values PR(w_1) = 0.9, PR(w_2) = 1.1:

PR(w_1) = (1 - \beta) + \beta \frac{PR(w_2)}{|out(w_2)|}
PR(w_2) = (1 - \beta) + \beta \frac{PR(w_1)}{|out(w_1)|}

PR(w_1) = 0.15 + 0.85 · 1.1 / 1 = 1.085
PR(w_2) = 0.15 + 0.85 · 1.085 / 1 = 1.07225
PR(w_1) = 0.15 + 0.85 · 1.07225 / 1 ≈ 1.061412
PR(w_2) = 0.15 + 0.85 · 1.061412 / 1 ≈ 1.0522

Pagerank - Algorithm

For the example graph (pages w_1, ..., w_4), the algorithm converges to:
PR(w_1) = 1.49, PR(w_2) = 0.78, PR(w_3) = 1.58, PR(w_4) = 0.15

4. PageRank and MapReduce

MapReduce Implementation of Pagerank

How can we implement Pagerank on MapReduce?

MapReduce Limitations
- Iterations: MapReduce pipelines a sequence of a Map phase and a Reduce phase; it does not naturally support iterative algorithms
- Dependencies between data points: data points are assumed to be independent of each other, so MapReduce is not well suited for graph-like data

Pagerank with MapReduce

(Overview figure.)

Pagerank with MapReduce

Two phases:
- Initialize: one complete MapReduce iteration that initializes the pages with random pagerank values
- Compute pagerank: a repeated series of MapReduce iterations, each computing one round of pagerank updates

Pagerank with MapReduce - Initialize

Map:
  Input: (u, page content)
  Output: (u, (init, out(u)))
where u is the URL of a webpage, init is its initial pagerank value, and out(u) are the pages linked by u.

Reduce (identity):
  Input: (u, (init, out(u)))
  Output: (u, (init, out(u)))

Pagerank with MapReduce - Compute Ranks

Map:
  Input: (u, (PR(u), out(u)))
  Output: for each v ∈ out(u): (v, PR(u)/|out(u)|); additionally (u, out(u)) to preserve the graph structure

Reduce:
  Input: (u, out(u)) together with all (u, val) pairs
  Sums up all vals and computes the new pagerank PR(u) for u
  Output: (u, (PR(u), out(u)))
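A minimal single-machine sketch in Python of one such iteration. The record layout with "links"/"share" tags and the in-memory shuffle are assumptions for illustration, not a fixed format:

from collections import defaultdict

BETA = 0.85

def mapper(u, pr_u, out_u):
    yield (u, ("links", out_u))            # preserve the adjacency list
    for v in out_u:                        # emit rank shares to linked pages
        yield (v, ("share", pr_u / len(out_u)))

def reducer(u, records):
    out_u, total = [], 0.0
    for kind, val in records:
        if kind == "links":
            out_u = val
        else:
            total += val                   # sum up all incoming shares
    return u, ((1 - BETA) + BETA * total, out_u)

# state after the Initialize phase: url -> (PR, out-links), assumed values
state = {"w1": (1.0, ["w2", "w3"]), "w2": (1.0, ["w3"]), "w3": (1.0, ["w1"])}

grouped = defaultdict(list)                # the shuffle phase
for u, (pr_u, out_u) in state.items():
    for key, rec in mapper(u, pr_u, out_u):
        grouped[key].append(rec)

state = dict(reducer(u, recs) for u, recs in grouped.items())
print(state)  # new (PR, out-links) per page; rerun until convergence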

Extensions of MapReduce
- Support for repeated MapReduce: HaLoop, Twister
- Graph models: Giraph, Pregel, GraphLab

5. Pregel

Pregel
- Distributed graph processing programming model from Google
- Bulk Synchronous Parallel computation: synchronous iterations (called supersteps)
- Within a superstep, each vertex executes a user-defined function; vertices run in parallel
- Message passing: vertices exchange messages with their neighbors

Pregel Programming Model

A data graph is defined where:
1. each vertex performs some computation
2. a vertex sends messages to neighboring vertices
3. each vertex processes incoming messages

The graph is distributed across the computing nodes. (Figure.)

Pregel Programming Model

We have two types of entities:

Vertex:
1. has a unique identifier
2. has a modifiable, user-defined value

Edge:
1. has source and target vertex identifiers
2. has a modifiable, user-defined value

Pregel Program

The program is loaded: each vertex executes its computation and sends messages to its out-neighbors. Each vertex:
1. receives messages from its in-neighbors
2. processes the messages
3. decides whether or not to send new messages
4. decides whether or not to halt

The process is repeated until all vertices have voted to halt.

Pregel - State of a Vertex

(State diagram: an Active vertex becomes Inactive by voting to halt; an Inactive vertex becomes Active again when it receives a message.)

Pregel Model

(Figure.)

Pregel - Aggregators

Pregel supports global communication through aggregators:
- Each vertex can provide a value to the aggregator in each superstep
- The aggregator combines these values into a single aggregated value
- Each vertex has access to the aggregated value of the previous superstep

Pregel Example - Finding the Maximum Value in the Graph

(Figure: starting from the values 3, 6, 2, 1, each superstep propagates the maximum, 6 6 2 6, then 6 6 6 6, until every vertex holds 6.)

1: procedure VertexUpdate
   input: current value v, messages M
2:   send v on all outgoing edges
3:   flag ← true
4:   for m ∈ M do
5:     if m > v then
6:       v ← m
7:       flag ← false
8:     end if
9:   end for
10:  if flag then halt
11:  end if
12:  return v
13: end procedure
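A toy Python version of this example. The miniature superstep loop, the path-shaped example graph, and the rule that a vertex resends its value only when it grew (which keeps the loop finite) are assumptions for illustration; this is not the Pregel API:

def run_max_value(out_edges, values):
    # messages sent in superstep t are delivered in superstep t+1;
    # a vertex whose value did not grow votes to halt and is
    # reactivated only by an incoming message
    active = set(out_edges)
    inbox = {v: [] for v in out_edges}
    superstep = 0
    while active:
        next_inbox = {v: [] for v in out_edges}
        for v in active:
            grew = (superstep == 0)        # every vertex sends once at the start
            for m in inbox[v]:             # process last superstep's messages
                if m > values[v]:
                    values[v] = m
                    grew = True
            if grew:                       # send the new value to out-neighbors
                for u in out_edges[v]:
                    next_inbox[u].append(values[v])
        inbox = next_inbox
        active = {v for v in out_edges if next_inbox[v]}
        superstep += 1
    return values

# path graph with bidirectional edges and the initial values from the figure
edges = {"w1": ["w2"], "w2": ["w1", "w3"], "w3": ["w2", "w4"], "w4": ["w3"]}
print(run_max_value(edges, {"w1": 3, "w2": 6, "w3": 2, "w4": 1}))
# -> {'w1': 6, 'w2': 6, 'w3': 6, 'w4': 6}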

Pagerank using Pregel

Pagerank iteration:

PR(w_i) = (1 - \beta) + \beta \sum_{w_j \in in(w_i)} \frac{PR(w_j)}{|out(w_j)|}

Let m_j be the incoming message from node j, and let v be the current node.

Vertex update function:
1. send PR(w_v) / |out(w_v)| to all outgoing edges
2. collect the incoming messages m_j
3. PR(w_v) = (1 - \beta) + \beta \sum_{m_j \in M} m_j

Pagerank using Pregel
- The algorithm stops once it reaches a predefined maximum number of supersteps
- An aggregator is used to determine convergence earlier:
  - it gathers from every node the difference between the node's value in the current superstep and in the previous one
  - the aggregated value is the sum of all these differences

Pagerank using Pregel

(Example graph with pages w_1, ..., w_4.)

1: procedure VertexUpdate
   input: current value PR(v), number of outgoing edges |out(v)|, messages M, aggregated value a
2:   if a > 0 then
3:     send PR(v) / |out(v)| on all outgoing edges
4:     old ← PR(v)
5:     PR(v) ← (1 − β) + β Σ_{m_j ∈ M} m_j
6:     send the difference |PR(v) − old| to the aggregator
7:   else
8:     halt
9:   end if
10: end procedure
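A toy Python version of this vertex update with the convergence aggregator. It follows the common ordering where a vertex first recomputes its rank from the received messages and then sends the updated share, slightly different from the pseudocode above; the framework loop, the initial values, and the example graph are assumptions for illustration:

BETA, TOL = 0.85, 1e-6

def run_pagerank(out_edges, max_supersteps=100):
    pr = {v: 1.0 for v in out_edges}       # assumed initial values
    inbox = {v: [] for v in out_edges}
    aggregated = float("inf")              # aggregator value from the previous superstep
    for step in range(max_supersteps):
        if aggregated <= TOL:              # total change ~ 0: every vertex halts
            break
        next_inbox = {v: [] for v in out_edges}
        new_pr, total_change = {}, 0.0
        for v in out_edges:
            # recompute the rank from last superstep's messages
            # (superstep 0 has no messages yet, so the initial value is kept)
            new = (1 - BETA) + BETA * sum(inbox[v]) if step > 0 else pr[v]
            total_change += abs(new - pr[v])   # contribution to the aggregator
            new_pr[v] = new
            for u in out_edges[v]:         # send PR(v)/|out(v)| on all out-edges
                next_inbox[u].append(new / len(out_edges[v]))
        if step > 0:
            aggregated = total_change      # aggregator: sum of all differences
        pr, inbox = new_pr, next_inbox
    return pr

graph = {"w1": ["w2", "w3"], "w2": ["w3"], "w3": ["w1", "w4"], "w4": ["w3"]}
print(run_pagerank(graph))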