Big Data Analytics. Lucas Rego Drumond


Big Data Analytics: MapReduce II
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science, University of Hildesheim, Germany

Outline
1. Introduction
2. Learning Algorithms with MapReduce
3. PageRank
4. PageRank and MapReduce
5. Pregel

1. Introduction

Overview
- Part III: Machine Learning Algorithms
- Part II: Large Scale Computational Models
- Part I: Distributed Database, Distributed File System

MapReduce - Review
1. Each mapper transforms a set of key-value pairs into a list of intermediate pairs, each consisting of an output key and an intermediate value.
2. All intermediate values are grouped according to their output keys.
3. Each reducer receives all the intermediate values associated with a given key.
4. Each reducer associates one final value with each key.
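The four steps above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not a distributed implementation; the word-count task and all function names (`run_mapreduce`, `word_count_mapper`, `word_count_reducer`) are assumptions chosen for the example.

```python
from collections import defaultdict

def word_count_mapper(doc_id, text):
    """Step 1: turn one input (key, value) pair into intermediate (key, value) pairs."""
    return [(word, 1) for word in text.lower().split()]

def word_count_reducer(word, counts):
    """Step 4: associate one final value with the key."""
    return sum(counts)

def run_mapreduce(records, mapper, reducer):
    # Step 1: apply the mapper to every input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))
    # Step 2: group the intermediate values by their output key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Steps 3 and 4: each reducer call sees all values of one key.
    return {key: reducer(key, values) for key, values in groups.items()}

documents = [("d1", "big data analytics"), ("d2", "big graph data")]
counts = run_mapreduce(documents, word_count_mapper, word_count_reducer)
```

In a real framework the grouping step is performed by the system between the Map and Reduce phases; here it is an ordinary dictionary.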

2. Learning Algorithms with MapReduce

Statistical Query and Summation Form
The Statistical Query Model:
- The learning algorithm accesses the problem only through a Statistical Query Oracle.
- The algorithm does not operate on specific data points; instead, it works only on sums over the data.
Two-step process:
1. Compute sufficient statistics by summing functions over the data.
2. Perform calculations on the sums to compute the predictive model.
Reference: Chu, C., Kim, S. K., Lin, Y. A., Yu, Y., Bradski, G., Ng, A. Y., Olukotun, K. (2007). Map-reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 19, 281.

Statistical Query and MapReduce
The Statistical Query Model lends itself perfectly to the MapReduce framework:
1. Map: compute the sums over parts of the data in parallel.
2. Reduce: perform calculations on the sums to compute the predictive model.

Example: Linear Regression
Closed form solution:
$$\hat{\beta} := \arg\min_{\beta} \|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2$$
$$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y = (A + \lambda I)^{-1} b$$
Where:
$$A = X^T X = \sum_{i=1}^{m} x_i x_i^T, \qquad b = X^T y = \sum_{i=1}^{m} x_i y_i$$

Data Partition
[Figure: partition of the data X and Y]

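Example: Linear Regression
Assume we have N partitions, each with J datapoints:
$$\hat{\beta} := \arg\min_{\beta} \sum_{p=1}^{N} \|X^{(p)}\beta - y^{(p)}\|_2^2 + \lambda \|\beta\|_2^2$$
where $X \in \mathbb{R}^{m \times n}$, $y \in \mathbb{R}^{m}$, $X^{(p)} \in \mathbb{R}^{J \times n}$ and $y^{(p)} \in \mathbb{R}^{J}$.
Map:
- Input: Key: partition number $p$; Value: $(X^{(p)}, y^{(p)})$
- Output: $\left(a, \sum_{i=1}^{J} x_i^{(p)} {x_i^{(p)}}^T\right)$ and $\left(b, \sum_{i=1}^{J} x_i^{(p)} y_i^{(p)}\right)$
Reduce:
- Input: $\left(a, \sum_{i=1}^{J} x_i^{(p)} {x_i^{(p)}}^T\right)$ and $\left(b, \sum_{i=1}^{J} x_i^{(p)} y_i^{(p)}\right)$
- Output: $\left(A, \sum_{p=1}^{N} \sum_{i=1}^{J} x_i^{(p)} {x_i^{(p)}}^T\right)$ and $\left(b, \sum_{p=1}^{N} \sum_{i=1}^{J} x_i^{(p)} y_i^{(p)}\right)$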

Example: Linear Regression
Assume we have N partitions, each with J datapoints:
$$\hat{\beta} := \arg\min_{\beta} \sum_{p=1}^{N} \|X^{(p)}\beta - y^{(p)}\|_2^2 + \lambda \|\beta\|_2^2$$
1. $(A, b) \leftarrow \text{MapReduce}\left(\left(p, (X^{(p)}, y^{(p)})\right)_{p=1,\dots,N}\right)$
2. $\hat{\beta} = (A + \lambda I)^{-1} b$
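The two-step recipe can be sketched in plain Python without any numerical library. The sketch assumes small dense data held in nested lists; the names `map_partition`, `reduce_sums`, `solve` and `ridge_mapreduce`, as well as the toy data, are hypothetical and only illustrate the summation form.

```python
def map_partition(p, partition):
    """Mapper: emit the partial sums of x x^T (key 'a') and x y (key 'b') for partition p."""
    X, y = partition
    n = len(X[0])
    a = [[sum(x[r] * x[c] for x in X) for c in range(n)] for r in range(n)]
    b = [sum(x[r] * yi for x, yi in zip(X, y)) for r in range(n)]
    return [("a", a), ("b", b)]

def reduce_sums(key, values):
    """Reducer: add the partial sums of all partitions element-wise."""
    n = len(values[0])
    if key == "a":
        return [[sum(v[r][c] for v in values) for c in range(n)] for r in range(n)]
    return [sum(v[r] for v in values) for r in range(n)]

def solve(M, rhs):
    """Solve M z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [u - factor * v for u, v in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ridge_mapreduce(partitions, lam):
    # Step 1: (A, b) <- MapReduce over the partitions.
    grouped = {"a": [], "b": []}
    for p, partition in enumerate(partitions):
        for key, value in map_partition(p, partition):
            grouped[key].append(value)
    A = reduce_sums("a", grouped["a"])
    b = reduce_sums("b", grouped["b"])
    # Step 2: beta = (A + lambda I)^{-1} b, computed on a single machine.
    n = len(b)
    A_reg = [[A[r][c] + (lam if r == c else 0.0) for c in range(n)] for r in range(n)]
    return solve(A_reg, b)

# Toy data from y = 1 + 2 * x, with a constant first feature for the intercept.
partitions = [
    ([[1.0, 0.0], [1.0, 1.0]], [1.0, 3.0]),
    ([[1.0, 2.0], [1.0, 3.0]], [5.0, 7.0]),
]
beta = ridge_mapreduce(partitions, lam=0.0)
```

Only the n-by-n matrix A and the length-n vector b leave the mappers, so the amount of data sent to the reducer depends on the number of features, not on the number of datapoints.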

3. PageRank

Google's PageRank
Problem: measure the importance of websites in Google's search engine results.
- Relevant websites are more likely to have more links pointing to them.
- It is important to consider the quality of the websites that link to a given page.

Web graph
The Web can be represented as a graph containing:
- A set of webpages $W$
- A set of hyperlinks between webpages $H \subseteq W \times W$
Task: assign a numerical weight $r : W \to \mathbb{R}$ to each element of $W$. The weight $r(w)$ of a webpage $w \in W$ should reflect its importance relative to the other webpages.
[Figure: example web graph with nodes $w_1$, $w_2$, $w_3$, $w_4$]

PageRank - The Random Surfer
Imagine a web surfer that jumps from one web page to another at random:
- The next link to follow is chosen with uniform probability.
- The surfer follows the chosen link with some probability $\beta$.
- The surfer occasionally jumps to a random page with some small probability $1 - \beta$.
The PageRank $PR(w)$ of a web page $w$ models how likely it is that the random surfer will visit it.

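PageRank - The Random Surfer
Let:
- $\mathrm{in}(w) := \{v \mid (v, w) \in H\}$ be the set of pages that link to $w$
- $\mathrm{out}(w) := \{v \mid (w, v) \in H\}$ be the set of pages linked by $w$
- $1 - \beta$ be the probability that the surfer jumps to a random page
Example (graph with nodes $w_1$, $w_2$, $w_3$, $w_4$): $\mathrm{in}(w_1) = \{w_3\}$ and $\mathrm{out}(w_1) = \{w_2, w_3\}$
PageRank of a webpage $w_i$:
$$PR(w_i) = (1 - \beta) + \beta \sum_{w_j \in \mathrm{in}(w_i)} \frac{PR(w_j)}{|\mathrm{out}(w_j)|}$$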

PageRank - Algorithm
procedure PageRank
    input: a web graph $(W, H)$, hyperparameter $\beta$
    output: PageRank values $PR \in \mathbb{R}^{|W|}$
    $PR \leftarrow \{0\}^{|W|}$
    for $w \in W$ do
        $PR(w) \leftarrow$ random value
    end for
    repeat
        for $w_i \in W$ do
            $PR(w_i) \leftarrow (1 - \beta) + \beta \sum_{w_j \in \mathrm{in}(w_i)} \frac{PR(w_j)}{|\mathrm{out}(w_j)|}$
        end for
    until convergence
    return $PR$
end procedure

PageRank - Algorithm
Example with two pages $w_1$ and $w_2$ that link to each other, $\beta = 0.85$:
$$PR(w_1) = (1 - \beta) + \beta \frac{PR(w_2)}{|\mathrm{out}(w_2)|}, \qquad PR(w_2) = (1 - \beta) + \beta \frac{PR(w_1)}{|\mathrm{out}(w_1)|}$$
Starting from $PR(w_1) = 0.9$ and $PR(w_2) = 1.1$, the two updates are applied repeatedly.

PageRank - Algorithm
Resulting PageRank values for the example graph with nodes $w_1$, $w_2$, $w_3$, $w_4$:
- $PR(w_1) = 1.49$
- $PR(w_2) = 0.78$
- $PR(w_3) = 1.58$
- $PR(w_4) = 0.15$
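The PageRank procedure can be sketched in Python as follows. Two things are assumptions of this sketch rather than part of the slides: every page starts at 1.0 instead of a random value (for reproducibility), and the edge list of the four-page graph is hypothetical, chosen to agree with $\mathrm{in}(w_1) = \{w_3\}$ and $\mathrm{out}(w_1) = \{w_2, w_3\}$. With these edges the result rounds to the four values listed above.

```python
def pagerank(out_links, beta=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(w) = (1 - beta) + beta * sum(PR(v) / |out(v)| for v in in(w))."""
    pages = list(out_links)
    # Derive in(w) from the out-link lists.
    in_links = {w: [] for w in pages}
    for v, targets in out_links.items():
        for w in targets:
            in_links[w].append(v)
    pr = {w: 1.0 for w in pages}  # assumption: fixed start instead of random values
    for _ in range(max_iter):
        total_change = 0.0
        for w in pages:
            new_rank = (1 - beta) + beta * sum(
                pr[v] / len(out_links[v]) for v in in_links[w]
            )
            total_change += abs(new_rank - pr[w])
            pr[w] = new_rank  # updated in place, as in the pseudocode loop
        if total_change < tol:  # "until convergence"
            break
    return pr

# Hypothetical edge list for the four-page example graph.
web = {"w1": ["w2", "w3"], "w2": ["w3"], "w3": ["w1"], "w4": ["w3"]}
ranks = pagerank(web)
```

Because each update shrinks differences by at most a factor of $\beta$, the loop converges to the same values from any starting point.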

4. PageRank and MapReduce

MapReduce Implementation of PageRank
How can PageRank be implemented on MapReduce?

MapReduce Limitations
Iterations:
- A MapReduce job is a pipeline consisting of one Map phase followed by one Reduce phase.
- MapReduce does not naturally support iterative algorithms.
Dependency between data points:
- Data points should be independent of each other.
- MapReduce is not well suited to graph-like data.


PageRank with MapReduce
Two phases:
- Initialize: one complete MapReduce iteration that initializes the pages with random PageRank values
- Compute PageRank: a repeated series of PageRank computations

PageRank with MapReduce - Initialize
Map:
- Input: (u, page content)
- Output: (u, (init, out(u)))
where init is the initial PageRank value, u is the URL of a webpage, and out(u) is the set of pages linked by u.
Reduce:
- Input: (u, (init, out(u)))
- Output: (u, (init, out(u)))

PageRank with MapReduce - Compute Ranks
Map:
- Input: $(u, (PR(u), \mathrm{out}(u)))$
- Output: for each $v \in \mathrm{out}(u)$: $\left(v, \frac{PR(u)}{|\mathrm{out}(u)|}\right)$, and additionally $(u, \mathrm{out}(u))$
Reduce:
- Input: $(u, \mathrm{out}(u))$ and $\{(u, val)\}$
- Sums up all the vals and computes the new PageRank $PR(u) = (1 - \beta) + \beta \sum val$ for $u$
- Output: $(u, (PR(u), \mathrm{out}(u)))$
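One "Compute Ranks" iteration can be simulated in a single Python process as sketched below. The tags `"links"` and `"val"` that distinguish the two kinds of intermediate values, the function names, and the in-memory grouping are assumptions of this sketch; a real job would run on a MapReduce framework and be resubmitted by a driver for each iteration.

```python
from collections import defaultdict

BETA = 0.85

def pagerank_map(u, value):
    """Map: pass the link structure along and send PR(u)/|out(u)| to every linked page."""
    rank, out_links = value
    yield u, ("links", out_links)
    for v in out_links:
        yield v, ("val", rank / len(out_links))

def pagerank_reduce(u, values):
    """Reduce: sum the received contributions and rebuild the record for page u."""
    out_links, total = [], 0.0
    for tag, payload in values:
        if tag == "links":
            out_links = payload
        else:
            total += payload
    return u, ((1 - BETA) + BETA * total, out_links)

def pagerank_iteration(state):
    """Run one Map phase, group by key, and run one Reduce phase."""
    groups = defaultdict(list)
    for u, value in state.items():
        for key, intermediate in pagerank_map(u, value):
            groups[key].append(intermediate)
    return dict(pagerank_reduce(u, values) for u, values in groups.items())

# Two pages linking to each other, starting from PR(w1) = 0.9 and PR(w2) = 1.1.
state = {"w1": (0.9, ["w2"]), "w2": (1.1, ["w1"])}
after_one = pagerank_iteration(state)

converged = state
for _ in range(200):  # the driver loop that MapReduce itself does not provide
    converged = pagerank_iteration(converged)
```

The explicit loop at the end is exactly the part that plain MapReduce lacks: every iteration is a separate job whose output has to be fed back in as the next input.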

Extensions of MapReduce
- Support for repeated MapReduce: HaLoop, Twister
- Graph models: Giraph, Pregel, GraphLab

5. Pregel

Pregel
- Distributed graph processing programming model from Google
- Bulk Synchronous Parallel computation: synchronous iterations (called supersteps)
- In a superstep, each vertex asynchronously executes some user-defined function in parallel
- Message passing: vertices exchange messages with their neighbors

Pregel Programming Model
A data graph is defined where:
1. each vertex performs some computation
2. a vertex sends messages to neighboring vertices
3. each vertex processes incoming messages
The graph is distributed across computing nodes.

Pregel Programming Model
We have two types of entities:
Vertex:
1. Unique identifier
2. Has a modifiable, user-defined value
Edge:
1. Source and target vertex identifiers
2. Has a modifiable, user-defined value

Pregel Program
When the program is loaded, each vertex executes its computation and sends messages to its out-neighbors.
Each vertex then:
1. receives messages from its in-neighbors
2. processes the messages
3. decides whether or not to send new messages
4. decides whether or not to halt
The process is repeated until all vertices have halted.

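Pregel - State of a vertex
[Figure: vertex state machine with the states Active and Inactive. An active vertex becomes inactive when it votes to halt; an inactive vertex becomes active when a message is received.]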

Pregel Model
[Figure: Pregel model]

Pregel - Aggregators
- Pregel supports global communication through Aggregators.
- Each vertex can provide a value to the Aggregator in each superstep.
- The Aggregator produces a single aggregated value.
- Each vertex has access to the aggregated value of the previous superstep.

Pregel Example: finding the max value in the graph
procedure VertexUpdate
    input: current value $v$, messages $M$
    Send $v$ on all outgoing edges
    flag $\leftarrow$ true
    for $m \in M$ do
        if $m > v$ then
            $v \leftarrow m$
            flag $\leftarrow$ false
        end if
    end for
    if flag then
        halt
    end if
    return $v$
end procedure
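A single-process simulation of the max-value example might look as follows. The superstep loop, the inbox and outbox dictionaries, and the example graph are assumptions of this sketch. It also deviates from the pseudocode in one labeled respect: a vertex sends its value only in the first superstep or after the value changed, since re-sending on every activation would keep waking neighbors on graphs with cycles.

```python
def pregel_max(values, out_edges, max_supersteps=100):
    """Propagate the maximum vertex value through the graph in synchronous supersteps."""
    values = dict(values)
    inbox = {v: [] for v in values}
    active = set(values)  # every vertex starts in the active state
    superstep = 0
    while (active or any(inbox.values())) and superstep < max_supersteps:
        outbox = {v: [] for v in values}
        for v in values:
            messages = inbox[v]
            if v not in active and not messages:
                continue  # a halted vertex without incoming messages stays inactive
            # Assumption of this sketch: send only at the start or after a change.
            changed = superstep == 0
            for m in messages:
                if m > values[v]:
                    values[v] = m
                    changed = True
            if changed:
                for target in out_edges[v]:
                    outbox[target].append(values[v])
                active.add(v)  # an incoming message reactivated the vertex
            else:
                active.discard(v)  # vote to halt
        inbox = outbox  # messages are delivered in the next superstep
        superstep += 1
    return values

# Hypothetical chain a - b - c - d with edges in both directions.
start = {"a": 3, "b": 6, "c": 2, "d": 1}
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = pregel_max(start, edges)
```

The simulation ends when every vertex has voted to halt and no messages are in flight, which mirrors the vertex state machine described earlier.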

PageRank using Pregel
PageRank iteration:
$$PR(w_i) = (1 - \beta) + \beta \sum_{w_j \in \mathrm{in}(w_i)} \frac{PR(w_j)}{|\mathrm{out}(w_j)|}$$
Let $m_j$ be the incoming message from node $j$ and $v$ the current node.
Vertex update function:
1. send $\frac{PR(w_v)}{|\mathrm{out}(w_v)|}$ to all outgoing edges
2. collect the incoming messages $m_j$
3. $PR(w_v) = (1 - \beta) + \beta \sum_{m_j \in M} m_j$

PageRank using Pregel
- The algorithm can stop if it reaches a predefined maximum number of supersteps.
- An aggregator is used to determine convergence.
Aggregator:
- Gathers from all nodes the difference between the value in the current superstep and in the last one
- The aggregated value is the sum of all the differences

PageRank using Pregel
[Figure: example graph with nodes $w_1$, $w_2$, $w_3$, $w_4$]
procedure VertexUpdate
    input: current value $PR(v)$, number of outgoing edges $|\mathrm{out}(v)|$, messages $M$, aggregated value $a$
    if $a > 0$ then
        Send $\frac{PR(v)}{|\mathrm{out}(v)|}$ on all outgoing edges
        old $\leftarrow PR(v)$
        $PR(v) \leftarrow (1 - \beta) + \beta \sum_{m_j \in M} m_j$
        Send $PR(v) -$ old to the aggregator
    else
        halt
    end if
end procedure
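The vertex update with an aggregator can be simulated in one Python process as sketched below. Several details are assumptions of this sketch: the aggregator sums absolute differences and is compared against a small threshold `eps` instead of zero, its initial value is infinity so that the first superstep runs, all ranks start at 1.0, and the edge list of the four-page graph is hypothetical.

```python
def pregel_pagerank(out_edges, beta=0.85, eps=1e-9, max_supersteps=1000):
    """Simulate PageRank supersteps; stop when the aggregated change drops to eps."""
    pr = {v: 1.0 for v in out_edges}
    inbox = {v: [] for v in out_edges}
    aggregated = float("inf")  # assumed initial aggregator value
    supersteps = 0
    while aggregated > eps and supersteps < max_supersteps:
        outbox = {v: [] for v in out_edges}
        next_aggregate = 0.0
        for v, targets in out_edges.items():
            # Send PR(v)/|out(v)| on all outgoing edges.
            for target in targets:
                outbox[target].append(pr[v] / len(targets))
            old = pr[v]
            # Update from the messages received in this superstep.
            pr[v] = (1 - beta) + beta * sum(inbox[v])
            # Contribute the (absolute) change to the aggregator.
            next_aggregate += abs(pr[v] - old)
        # Messages and the aggregated value become visible in the next superstep.
        inbox, aggregated = outbox, next_aggregate
        supersteps += 1
    return pr

# Hypothetical edge list for a four-page graph.
graph = {"w1": ["w2", "w3"], "w2": ["w3"], "w3": ["w1"], "w4": ["w3"]}
pregel_ranks = pregel_pagerank(graph)
```

Summing absolute differences avoids positive and negative changes cancelling each other in the aggregator, and `max_supersteps` plays the role of the predefined maximum number of supersteps.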


More information

Hadoop MapReduce using Cache for Big Data Processing

Hadoop MapReduce using Cache for Big Data Processing Hadoop MapReduce using Cache for Big Data Processing Janani.J Dept. of computer science and engineering Arasu Engineering College Kumbakonam Kalaivani.K Dept. of computer science and engineering Arasu

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

An Empirical Study of Two MIS Algorithms

An Empirical Study of Two MIS Algorithms An Empirical Study of Two MIS Algorithms Email: Tushar Bisht and Kishore Kothapalli International Institute of Information Technology, Hyderabad Hyderabad, Andhra Pradesh, India 32. [email protected],

More information

Why? A central concept in Computer Science. Algorithms are ubiquitous.

Why? A central concept in Computer Science. Algorithms are ubiquitous. Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online

More information

1 o Semestre 2007/2008

1 o Semestre 2007/2008 Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

More information

Towards running complex models on big data

Towards running complex models on big data Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation

More information

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Spark. Fast, Interactive, Language- Integrated Cluster Computing Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC

More information

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,

More information

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme Big Data Analytics Prof. Dr. Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie,

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Network Algorithms for Homeland Security

Network Algorithms for Homeland Security Network Algorithms for Homeland Security Mark Goldberg and Malik Magdon-Ismail Rensselaer Polytechnic Institute September 27, 2004. Collaborators J. Baumes, M. Krishmamoorthy, N. Preston, W. Wallace. Partially

More information

WTF: The Who to Follow Service at Twitter

WTF: The Who to Follow Service at Twitter WTF: The Who to Follow Service at Twitter Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh Twitter, Inc. @pankaj @ashishgoel @lintool @aneeshs @dongwang218 @reza_zadeh ABSTRACT

More information

Big Data looks Tiny from the Stratosphere

Big Data looks Tiny from the Stratosphere Volker Markl http://www.user.tu-berlin.de/marklv [email protected] Big Data looks Tiny from the Stratosphere Data and analyses are becoming increasingly complex! Size Freshness Format/Media Type

More information

Incremental PageRank for Twitter Data Using Hadoop

Incremental PageRank for Twitter Data Using Hadoop Incremental PageRank for Twitter Data Using Hadoop Ibrahim Bin Abdullah E H U N I V E R S I T Y T O H F G R E D I N B U Master of Science Artificial Intelligence School of Informatics University of Edinburgh

More information

Enhancing the Ranking of a Web Page in the Ocean of Data

Enhancing the Ranking of a Web Page in the Ocean of Data Database Systems Journal vol. IV, no. 3/2013 3 Enhancing the Ranking of a Web Page in the Ocean of Data Hitesh KUMAR SHARMA University of Petroleum and Energy Studies, India [email protected] In today

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

The Power of Relationships

The Power of Relationships The Power of Relationships Opportunities and Challenges in Big Data Intel Labs Cluster Computing Architecture Legal Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO

More information

Hadoop WordCount Explained! IT332 Distributed Systems

Hadoop WordCount Explained! IT332 Distributed Systems Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,

More information

CIEL A universal execution engine for distributed data-flow computing

CIEL A universal execution engine for distributed data-flow computing Reviewing: CIEL A universal execution engine for distributed data-flow computing Presented by Niko Stahl for R202 Outline 1. Motivation 2. Goals 3. Design 4. Fault Tolerance 5. Performance 6. Related Work

More information

Data Processing in the Era of Big Data

Data Processing in the Era of Big Data Department of Computer Science and Information Engineering National Taiwan University October 3, 2014 Big Data a New Jargon Importance Importance Big data is a collection of data sets so large and complex

More information

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

More information

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University [email protected]

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University [email protected] 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

More information