Big Data Analytics. Lucas Rego Drumond


Big Data Analytics
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science, University of Hildesheim, Germany

Outline
1. Introduction
2. GraphLab
   2.1 GraphLab Data Model
   2.2 Update Function
   2.3 Sync Mechanism
   2.4 PageRank Example
   2.5 Execution Model and Data Consistency

1. Introduction

Overview
- Part III: Machine Learning Algorithms
- Part II: Large Scale Computational Models
- Part I: Distributed Database, Distributed File System

MapReduce - Review
1. Each mapper transforms a set of key-value pairs into a list of (output key, intermediate value) pairs.
2. All intermediate values are grouped according to their output keys.
3. Each reducer receives all the intermediate values associated with a given key.
4. Each reducer associates one final value with each key.
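The four steps can be sketched as a single-process word count. This is only an illustration of the data flow, not a distributed implementation; the names `map_words`, `group_by_key`, `reduce_sum` and `word_count` are hypothetical and do not come from any MapReduce library.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, int>;

// Step 1: a mapper turns one input record into (output key, intermediate value) pairs.
std::vector<KeyValue> map_words(const std::string& line) {
    std::vector<KeyValue> out;
    std::istringstream words(line);
    std::string word;
    while (words >> word) out.emplace_back(word, 1);
    return out;
}

// Step 2: all intermediate values are grouped by their output key.
std::map<std::string, std::vector<int>> group_by_key(const std::vector<KeyValue>& pairs) {
    std::map<std::string, std::vector<int>> groups;
    for (const auto& kv : pairs) groups[kv.first].push_back(kv.second);
    return groups;
}

// Steps 3 and 4: a reducer receives all values of one key and produces one final value.
int reduce_sum(const std::vector<int>& values) {
    int total = 0;
    for (int v : values) total += v;
    return total;
}

// Sequential driver standing in for the framework (assumption: everything fits in memory).
std::map<std::string, int> word_count(const std::vector<std::string>& lines) {
    std::vector<KeyValue> intermediate;
    for (const auto& line : lines) {
        const std::vector<KeyValue> pairs = map_words(line);
        intermediate.insert(intermediate.end(), pairs.begin(), pairs.end());
    }
    std::map<std::string, int> result;
    for (const auto& group : group_by_key(intermediate))
        result[group.first] = reduce_sum(group.second);
    return result;
}
```

In a real deployment the mappers and reducers run on different machines and the grouping step is performed by the framework's shuffle phase; here all three are ordinary function calls.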

Pregel - Review [figure]

2. GraphLab

GraphLab
- A new framework for parallel machine learning
- Represents data as a directed graph
- Blocks of data stored in vertices and edges
- Globally shared data table
- Shared memory (GraphLab 1.0) or distributed (GraphLab 2.1)
- Available for download at: [link not preserved]

GraphLab Software Stack [figure]

GraphLab System Overview [figure]

GraphLab Abstraction
The GraphLab abstraction consists of the following main components:
- A data graph, which provides a high-level representation for data and complex computational dependencies
- Configurable consistency models for automatically guaranteeing data consistency
- Scheduling primitives able to express iterative parallel algorithms
- User-defined computation consisting of:
  - Update functions: local computation
  - Sync mechanism: global aggregation

2.1 GraphLab Data Model

GraphLab Data Model
Consists of two parts:
1. A directed data graph $G := (V, E)$, where $V$ is the set of vertices and $E \subseteq V \times V$ the set of directed edges
2. A globally shared data table $T[\text{Key}] \to \text{Value}$, which maps keys to arbitrary blocks of data

GraphLab Data Model
Users can associate with each graph vertex and edge:
- Arbitrary data blocks
- Program or model parameters
Some notation:
- Data associated with a vertex $v$: $D_v$
- Data associated with an edge $(u, v)$: $D_{u,v}$
- Set of all outbound edges from $v$: $(v \to)$
- Set of all inbound edges to $v$: $(\to v)$
- Neighboring vertices of $v$: $N_v := \{u \mid (u, v) \in E \lor (v, u) \in E\}$
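One way to make this notation concrete is a small in-memory structure that stores the vertex data, the edge data, and the inbound and outbound edge sets of every vertex. This is a minimal sketch under stated assumptions: `DataGraph` and its members are hypothetical names for illustration and are not GraphLab's `distributed_graph` API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical minimal data graph: VertexData plays the role of D_v, EdgeData of D_{u,v}.
template <typename VertexData, typename EdgeData>
struct DataGraph {
    struct Edge {
        std::size_t source;
        std::size_t target;
        EdgeData data;
    };

    std::vector<VertexData> vertex_data;            // D_v, indexed by vertex id
    std::vector<Edge> edges;                        // E, a subset of V x V
    std::vector<std::vector<std::size_t>> out_ids;  // (v ->): ids of the outbound edges of v
    std::vector<std::vector<std::size_t>> in_ids;   // (-> v): ids of the inbound edges of v

    std::size_t add_vertex(VertexData d) {
        vertex_data.push_back(std::move(d));
        out_ids.emplace_back();
        in_ids.emplace_back();
        return vertex_data.size() - 1;
    }

    void add_edge(std::size_t u, std::size_t v, EdgeData d = EdgeData()) {
        edges.push_back(Edge{u, v, std::move(d)});
        out_ids[u].push_back(edges.size() - 1);
        in_ids[v].push_back(edges.size() - 1);
    }

    // N_v: every u with (u, v) in E or (v, u) in E.
    std::set<std::size_t> neighbors(std::size_t v) const {
        std::set<std::size_t> n;
        for (std::size_t e : in_ids[v]) n.insert(edges[e].source);
        for (std::size_t e : out_ids[v]) n.insert(edges[e].target);
        return n;
    }
};
```

Storing edge ids per vertex keeps both directions reachable from a vertex, which is what an update function needs when it walks its scope.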

GraphLab - User-Defined Computation
There are two types of user-defined computation in GraphLab:
- Update function:
  - local computation
  - executed on small neighborhoods
  - different functions may access and modify overlapping contexts
- Sync mechanism:
  - global aggregation
  - runs concurrently with the update functions

2.2 Update Function

Update Function - Scope of a Vertex
- The update function is a stateless user-defined function
- It operates on the scope $S_v$ of a vertex $v$: $S_v := \{v\} \cup (v \to) \cup (\to v) \cup N_v$
- Once executed on a vertex $v$, it may trigger the execution of update functions on a set of other vertices $\mathcal{T}'$

Update Function
Let $S_v$ be the scope of a vertex $v$ and $T$ the globally shared table. The application of an update function $f$ to a vertex $v$ can be defined as
$(D_{S_v}, \mathcal{T}') \leftarrow f(D_{S_v}, T)$
where:
- $D_{S_v}$ is the data associated with the scope of $v$
- $f$ has read-only access to $T$
- $\mathcal{T}'$ is a set of vertices whose update functions are to be executed

Update Function and Vertex Programs
In the most recent version of the GraphLab abstraction, update functions are called vertex programs. A vertex program implements the Gather-Apply-Scatter (GAS) model and has three phases:
- Gather: executed in parallel on each edge of the current vertex (usually to read data)
- Apply: atomic function that modifies the vertex data
- Scatter: executed in parallel on each edge of the current vertex (usually to update edge data and signal neighboring vertices)
This decomposition allows a single vertex program to be executed on several machines simultaneously and moves computation to the data.

Structure of an Update Function
    procedure UpdateFunction
        input: vertex $v$, scope $S_v$, data $D_{S_v}$, shared table $T$
        GatherType $g$    (stores the results of the gather phase)
        for $(u, v) \in (\to v)$ do in parallel
            $g \leftarrow g + \mathrm{gather}((u, v))$
        end for
        $\mathrm{apply}(g, D_v)$
        for $(v, u) \in (v \to)$ do in parallel
            $\mathrm{scatter}((v, u))$
        end for
    end procedure
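The gather, apply and scatter structure can be sketched as a generic driver that takes the three phases as callables. This is a sequential stand-in under stated assumptions: the parallel loops are replaced by ordinary loops, the gather result is a `double`, and `Graph`, `Edge` and `update_vertex` are hypothetical names rather than GraphLab types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Edge {
    std::size_t source;
    std::size_t target;
};

// Hypothetical graph with one double per vertex (D_v) and no edge data.
struct Graph {
    std::vector<double> data;
    std::vector<Edge> edges;
};

// Runs one update on vertex v: gather over inbound edges, apply, scatter over outbound edges.
template <typename Gather, typename Apply, typename Scatter>
void update_vertex(Graph& graph, std::size_t v, Gather gather, Apply apply, Scatter scatter) {
    double g = 0.0;  // accumulates the results of the gather phase
    for (const Edge& e : graph.edges)  // sequential stand-in for "in parallel"
        if (e.target == v) g += gather(graph, e);

    apply(graph.data[v], g);  // the only phase that writes D_v

    for (const Edge& e : graph.edges)
        if (e.source == v) scatter(graph, e);
}
```

Because gather only reads and its results are combined with `+`, the per-edge calls can be evaluated in any order or on different machines, which is the property the decomposition relies on.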

2.3 Sync Mechanism

Sync Mechanism
- Aggregates data across all vertices in the graph
- Associates the result with a particular entry in $T$
- The user provides:
  - a key $k$
  - an initial value $r_k^{(0)}$
  - a fold function: $r_k^{(i+1)} \leftarrow \mathrm{Fold}_k(D_v, r_k^{(i)})$
  - an optional merge function: $r_k^{(l)} \leftarrow \mathrm{Merge}_k(r_k^{(i)}, r_k^{(j)})$
  - an apply function: $T[k] \leftarrow \mathrm{Apply}_k(r_k^{(|V|)})$

Sync Mechanism
    procedure SyncAlgorithm
        input: key $k$, vertices $V$, data $\{D_v\}_{v \in V}$, initial value $r_k^{(0)}$, shared table $T$
        $t \leftarrow r_k^{(0)}$
        for $v \in V$ do
            $t \leftarrow \mathrm{Fold}_k(D_v, t)$
        end for
        $T[k] \leftarrow \mathrm{Apply}_k(t)$
    end procedure
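The sequential sync algorithm maps directly onto a fold over the vertex data followed by one write to the shared table. The sketch below assumes that both the vertex data and the table values are `double`, and it leaves out the optional merge function; `SharedTable` and `run_sync` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Shared data table T; values are restricted to double in this sketch (an assumption).
using SharedTable = std::map<std::string, double>;

// Folds all vertex data into one value and stores Apply_k of the result under key k.
template <typename Fold, typename Apply>
void run_sync(SharedTable& table, const std::string& key,
              const std::vector<double>& vertex_data, double initial,
              Fold fold, Apply apply) {
    double t = initial;                 // t <- r_k^(0)
    for (double dv : vertex_data)       // for v in V
        t = fold(dv, t);                // t <- Fold_k(D_v, t)
    table[key] = apply(t);              // T[k] <- Apply_k(t)
}
```

For example, a fold that adds each vertex value to the accumulator combined with an apply that divides by the number of vertices stores the mean vertex value in the table. A merge function would additionally let partial folds computed on separate machines be combined before the apply step.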

2.4 PageRank Example

PageRank: Data Model
- $V := \{w_1, w_2, w_3, w_4\}$
- $E := \{(w_1, w_2), (w_1, w_3), (w_3, w_1), (w_2, w_3), (w_4, w_3)\}$
- $D_v := PR(v)$
- $D_{v,u} := \emptyset$
[figure: the directed graph on $w_1, \dots, w_4$]

PageRank: Update Function
    procedure PageRankUpdateFunction
        input: vertex $v$, scope $S_v$, ingoing edges $(\to v)$
        $\mathcal{T}' \leftarrow \emptyset$
        $D_v^{old} \leftarrow D_v$
        $p \leftarrow 0$
        for $u \in \{u \mid (u, v) \in (\to v)\}$ do
            $p \leftarrow p + \frac{D_u}{|(u \to)|}$
        end for
        $D_v \leftarrow (1 - \beta) + \beta p$
        if $|D_v - D_v^{old}| > \epsilon$ then
            $\mathcal{T}' \leftarrow N_v$
        end if
        return $(D_{S_v}, \mathcal{T}')$
    end procedure

PageRank: Vertex Program
    procedure PageRankGather
        input: vertex $v$, scope $S_v$, ingoing edge $(u \to v)$
        return $\frac{D_u}{|(u \to)|}$
    end procedure

    procedure PageRankApply
        input: vertex $v$, scope $S_v$, gather result $p$
        $D_v^{old} \leftarrow D_v$
        $D_v \leftarrow (1 - \beta) + \beta p$
        if $|D_v - D_v^{old}| > \epsilon$ then
            signal to perform scatter
        end if
    end procedure

PageRank: Scatter
    procedure PageRankScatter
        input: outgoing edge $(v \to u)$, set of vertices to be rescheduled $\mathcal{T}'$
        add $u$ to $\mathcal{T}'$
    end procedure

PageRank Implementation: Data Types
    struct web_page {
        std::string pagename;
        double pagerank;

        explicit web_page(std::string name) : pagename(name), pagerank(0.0) { }

        // serialization hooks used by GraphLab
        void save(graphlab::oarchive& oarc) const {
            oarc << pagename << pagerank;
        }
        void load(graphlab::iarchive& iarc) {
            iarc >> pagename >> pagerank;
        }
    };

    // vertex data: web_page, edge data: none
    typedef graphlab::distributed_graph<web_page, graphlab::empty> graph_type;

PageRank Implementation: Vertex Program
    class pagerank_program :
        public graphlab::ivertex_program<graph_type, double>,
        public graphlab::IS_POD_TYPE {
    public:
        bool perform_scatter;

        edge_dir_type gather_edges(icontext_type& context, const vertex_type& vertex) const;
        edge_dir_type scatter_edges(icontext_type& context, const vertex_type& vertex) const;
        double gather(icontext_type& context, const vertex_type& vertex, edge_type& edge) const;
        void apply(icontext_type& context, vertex_type& vertex, const gather_type& total);
        void scatter(icontext_type& context, const vertex_type& vertex, edge_type& edge) const;
    };

PageRank Implementation: Gather
    // we are going to gather on all the in-edges
    edge_dir_type gather_edges(icontext_type& context, const vertex_type& vertex) const {
        return graphlab::IN_EDGES;
    }

    // for each in-edge, gather the weighted sum of the edge
    double gather(icontext_type& context, const vertex_type& vertex, edge_type& edge) const {
        return edge.source().data().pagerank / edge.source().num_out_edges();
    }

PageRank Implementation: Apply
    // Use the total rank of adjacent pages to update this page
    void apply(icontext_type& context, vertex_type& vertex, const gather_type& total) {
        // note: no damping is applied here, unlike (1 - beta) + beta * p in the pseudocode
        double newval = total;
        double oldval = vertex.data().pagerank;
        vertex.data().pagerank = newval;
        // scatter only if the rank changed by more than the tolerance
        perform_scatter = (std::fabs(newval - oldval) > 1E-3);
    }

PageRank Implementation: Scatter
    edge_dir_type scatter_edges(icontext_type& context, const vertex_type& vertex) const {
        if (perform_scatter) return graphlab::OUT_EDGES;
        else return graphlab::NO_EDGES;
    }

    // reschedule the vertex at the other end of each out-edge
    void scatter(icontext_type& context, const vertex_type& vertex, edge_type& edge) const {
        context.signal(edge.target());
    }

2.5 Execution Model and Data Consistency

Execution Model
    procedure GraphLabExec
        input: graph $G := (V, E)$, data $D$, initial vertex set $\mathcal{T} \subseteq V$
        while $|\mathcal{T}| > 0$ do
            $v \leftarrow \mathrm{RemoveNext}(\mathcal{T})$
            $(D_{S_v}, \mathcal{T}') \leftarrow f(D_{S_v}, T)$
            $\mathcal{T} \leftarrow \mathcal{T} \cup \mathcal{T}'$
        end while
    end procedure
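A sequential version of this loop, with the PageRank update function as $f$, can be sketched as follows. Several choices here are assumptions and not part of the slides: the scheduled set is processed first-in first-out, all vertices are scheduled initially, and the caller picks the damping factor `beta` and tolerance `eps`. The names `PageRankGraph`, `pagerank_update` and `run_until_empty` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <deque>
#include <set>
#include <utility>
#include <vector>

struct PageRankGraph {
    std::vector<double> rank;                                // D_v := PR(v)
    std::vector<std::pair<std::size_t, std::size_t>> edges;  // (source, target)

    std::size_t out_degree(std::size_t u) const {
        std::size_t n = 0;
        for (const auto& e : edges)
            if (e.first == u) ++n;
        return n;
    }
};

// Update function f: recomputes PR(v) and returns the vertices to reschedule.
std::set<std::size_t> pagerank_update(PageRankGraph& g, std::size_t v, double beta, double eps) {
    const double old_rank = g.rank[v];
    double p = 0.0;
    for (const auto& e : g.edges)
        if (e.second == v) p += g.rank[e.first] / static_cast<double>(g.out_degree(e.first));
    g.rank[v] = (1.0 - beta) + beta * p;

    std::set<std::size_t> reschedule;
    if (std::fabs(g.rank[v] - old_rank) > eps) {
        for (const auto& e : g.edges) {  // N_v: in-neighbors and out-neighbors of v
            if (e.first == v) reschedule.insert(e.second);
            if (e.second == v) reschedule.insert(e.first);
        }
    }
    return reschedule;
}

// Execution model: apply f to scheduled vertices until the scheduled set is empty.
// Returns the number of update function calls.
std::size_t run_until_empty(PageRankGraph& g, double beta, double eps) {
    std::deque<std::size_t> queue;                  // the scheduled set, in FIFO order
    std::vector<bool> queued(g.rank.size(), true);  // avoids scheduling a vertex twice
    for (std::size_t v = 0; v < g.rank.size(); ++v) queue.push_back(v);

    std::size_t updates = 0;
    while (!queue.empty()) {
        const std::size_t v = queue.front();        // v <- RemoveNext
        queue.pop_front();
        queued[v] = false;
        for (std::size_t u : pagerank_update(g, v, beta, eps)) {
            if (!queued[u]) {                       // union with the returned set
                queued[u] = true;
                queue.push_back(u);
            }
        }
        ++updates;
    }
    return updates;
}
```

On the four-page example graph with `beta = 0.85`, solving the PageRank equations by hand gives a fixed point of roughly 1.49, 0.78, 1.58 and 0.15 for $w_1$ to $w_4$; the loop stops close to these values once no update changes a rank by more than `eps`.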

Serializability
- The GraphLab abstraction provides a rich sequential model
- Parallel execution:
  - Multiple processors execute the same loop on the same graph
  - Simultaneous execution of update functions on different vertices
- We desire a serializable execution
Definition: A GraphLab program is sequentially consistent if, for every parallel execution, there exists a sequential execution of update functions that produces the same values in the data graph.

Data Consistency
To ensure serializability we need consistency models. GraphLab supports three types of consistency models:
- Full consistency: the scopes of two concurrently executing update functions do not overlap
- Edge consistency: each update function has read/write access to its vertex and adjacent edges, but only read access to neighboring vertices
- Vertex consistency: only ensures that no two update functions are executed concurrently on the same vertex
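One possible reading of the three models is as a test of whether update functions on two vertices may run at the same time. The sketch below encodes that reading and is an interpretation, not GraphLab's locking implementation: under vertex consistency only identical vertices exclude each other, under edge consistency adjacent vertices do as well, and under full consistency so do vertices that share a neighbor. `Consistency`, `EdgeList`, `neighbors` and `may_run_concurrently` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

enum class Consistency { Vertex, Edge, Full };

using EdgeList = std::vector<std::pair<std::size_t, std::size_t>>;

// N_v for an edge list: all vertices sharing an edge with v, in either direction.
std::set<std::size_t> neighbors(const EdgeList& edges, std::size_t v) {
    std::set<std::size_t> n;
    for (const auto& e : edges) {
        if (e.first == v) n.insert(e.second);
        if (e.second == v) n.insert(e.first);
    }
    return n;
}

// True if update functions on a and b may execute concurrently under the given model.
bool may_run_concurrently(const EdgeList& edges, std::size_t a, std::size_t b, Consistency model) {
    if (a == b) return false;                       // excluded by every model
    if (model == Consistency::Vertex) return true;

    const std::set<std::size_t> na = neighbors(edges, a);
    if (na.count(b) > 0) return false;              // adjacent: one writes what the other reads
    if (model == Consistency::Edge) return true;

    for (std::size_t u : neighbors(edges, b))       // full: scopes overlap on a shared neighbor
        if (na.count(u) > 0) return false;
    return true;
}
```

The stricter the model, the fewer pairs of vertices can be updated in parallel, which is the trade-off between consistency and available parallelism.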

Data Consistency - Exclusion Sets [figure]

Data Consistency [figure]

Data Consistency
GraphLab guarantees sequential consistency under any of the following three conditions:
- The full consistency model is used
- The edge consistency model is used and update functions do not modify data in adjacent vertices
- The vertex consistency model is used and update functions only access local vertex data

Benchmarks - Horizontal Scaling
[figure: two plots, titled "Dataset 1" and "Dataset 2"]
    Dataset    |V|        |E|       Size
    [...]      [...] K    50.9 M    655 MB
    [...]      [...] M    1.8 B     31 GB
Source: Yong Guo, Marcin Biczak, Ana Lucia Varbanescu, Alexandru Iosup, Claudio Martella, Theodore L. Willke. Towards Benchmarking Graph-Processing Platforms. Super Computing 2013.

Summary
A GraphLab program consists of:
- A data graph
- An update function
  - Gather function
  - Apply function
  - Scatter function
- A sync mechanism
- A consistency model


More information

The Hadoop Implementation. Thomas Zimmermann Philipp Berger

The Hadoop Implementation. Thomas Zimmermann Philipp Berger Link Analysis goes MapReduce The Hadoop Implementation Thomas Zimmermann Philipp Berger Flashback 2 Overview 3 1. Pre- / Postprocessing 2. Our Jobs 3. Evaluation Overview 4 1. Pre- / Postprocessing 2.

More information

Assignment 2: More MapReduce with Hadoop

Assignment 2: More MapReduce with Hadoop Assignment 2: More MapReduce with Hadoop Jean-Pierre Lozi February 5, 2015 Provided files following URL: An archive that contains all files you will need for this assignment can be found at the http://sfu.ca/~jlozi/cmpt732/assignment2.tar.gz

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Distributed File Systems and NoSQL Database Distributed

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

Social Media Mining. Network Measures

Social Media Mining. Network Measures Klout Measures and Metrics 22 Why Do We Need Measures? Who are the central figures (influential individuals) in the network? What interaction patterns are common in friends? Who are the like-minded users

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 21 Outline

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

InfiniteGraph: The Distributed Graph Database

InfiniteGraph: The Distributed Graph Database A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

More information

A1 and FARM scalable graph database on top of a transactional memory layer

A1 and FARM scalable graph database on top of a transactional memory layer A1 and FARM scalable graph database on top of a transactional memory layer Miguel Castro, Aleksandar Dragojević, Dushyanth Narayanan, Ed Nightingale, Alex Shamis Richie Khanna, Matt Renzelmann Chiranjeeb

More information

Systems and Algorithms for Big Data Analytics

Systems and Algorithms for Big Data Analytics Systems and Algorithms for Big Data Analytics YAN, Da Email: yanda@cse.cuhk.edu.hk My Research Graph Data Distributed Graph Processing Spatial Data Spatial Query Processing Uncertain Data Querying & Mining

More information

Presto/Blockus: Towards Scalable R Data Analysis

Presto/Blockus: Towards Scalable R Data Analysis /Blockus: Towards Scalable R Data Analysis Andrew A. Chien University of Chicago and Argonne ational Laboratory IRIA-UIUC-AL Joint Institute Potential Collaboration ovember 19, 2012 ovember 19, 2012 Andrew

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Apache Hama Design Document v0.6

Apache Hama Design Document v0.6 Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache

More information

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012 Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group) 06-08-2012 Graph - Everywhere 1: Friendship Graph 2: Food Graph 3: Internet Graph Most of the relationships

More information

DATA STRUCTURES USING C

DATA STRUCTURES USING C DATA STRUCTURES USING C QUESTION BANK UNIT I 1. Define data. 2. Define Entity. 3. Define information. 4. Define Array. 5. Define data structure. 6. Give any two applications of data structures. 7. Give

More information

FastForward I/O and Storage: ACG 6.6 Demonstration

FastForward I/O and Storage: ACG 6.6 Demonstration FastForward I/O and Storage: ACG 6.6 Demonstration NOTICE: THIS MANUSCRIPT HAS BEEN AUTHORED BY INTEL UNDER ITS SUBCONTRACT WITH LAWRENCE LIVERMORE NATIONAL SECURITY, LLC WHO IS THE OPERATOR AND MANAGER

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Going For Large Scale Going For Large Scale 1

More information

The PageRank Citation Ranking: Bring Order to the Web

The PageRank Citation Ranking: Bring Order to the Web The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized

More information

Why? A central concept in Computer Science. Algorithms are ubiquitous.

Why? A central concept in Computer Science. Algorithms are ubiquitous. Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Big Data looks Tiny from the Stratosphere

Big Data looks Tiny from the Stratosphere Volker Markl http://www.user.tu-berlin.de/marklv volker.markl@tu-berlin.de Big Data looks Tiny from the Stratosphere Data and analyses are becoming increasingly complex! Size Freshness Format/Media Type

More information

Trinity: A Distributed Graph Engine on a Memory Cloud

Trinity: A Distributed Graph Engine on a Memory Cloud Trinity: A Distributed Graph Engine on a Memory Cloud Bin Shao Microsoft Research Asia Beijing, China binshao@microsoft.com Haixun Wang Microsoft Research Asia Beijing, China haixunw@microsoft.com Yatao

More information

Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud

Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud Yucheng Low Carnegie Mellon University ylow@cs.cmu.edu Danny Bickson Carnegie Mellon University bickson@cs.cmu.edu Joseph

More information

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com Common Patterns and Pitfalls for Implementing Algorithms in Spark Hossein Falaki @mhfalaki hossein@databricks.com Challenges of numerical computation over big data When applying any algorithm to big data

More information

Can the Elephants Handle the NoSQL Onslaught?

Can the Elephants Handle the NoSQL Onslaught? Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented

More information

5. A full binary tree with n leaves contains [A] n nodes. [B] log n 2 nodes. [C] 2n 1 nodes. [D] n 2 nodes.

5. A full binary tree with n leaves contains [A] n nodes. [B] log n 2 nodes. [C] 2n 1 nodes. [D] n 2 nodes. 1. The advantage of.. is that they solve the problem if sequential storage representation. But disadvantage in that is they are sequential lists. [A] Lists [B] Linked Lists [A] Trees [A] Queues 2. The

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme Big Data Analytics Prof. Dr. Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie,

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Replication-based Fault-tolerance for Large-scale Graph Processing

Replication-based Fault-tolerance for Large-scale Graph Processing Replication-based Fault-tolerance for Large-scale Graph Processing Peng Wang Kaiyuan Zhang Rong Chen Haibo Chen Haibing Guan Shanghai Key Laboratory of Scalable Computing and Systems Institute of Parallel

More information

BP2SAN From Business Processes to Stochastic Automata Networks

BP2SAN From Business Processes to Stochastic Automata Networks BP2SAN From Business Processes to Stochastic Automata Networks Kelly Rosa Braghetto Department of Computer Science University of São Paulo kellyrb@ime.usp.br March, 2011 Contents 1 Introduction 1 2 Instructions

More information

Relationship collections are well expressed

Relationship collections are well expressed C loud C omputing Graph Twiddling in a MapReduce World The easily distributed sorting primitives that constitute MapReduce jobs have shown great value in processing large data volumes If useful graph operations

More information

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is

More information

How To Create A Graphlab Framework For Parallel Machine Learning

How To Create A Graphlab Framework For Parallel Machine Learning This work is supported by NSF grants CCF-1149252, CCF-1337215, and STARnet, a Semiconductor Research Corporation Program, sponsored by MARCO and DARPA. Datacenter Simulation Methodologies: GraphLab Tamara

More information

Data Structure [Question Bank]

Data Structure [Question Bank] Unit I (Analysis of Algorithms) 1. What are algorithms and how they are useful? 2. Describe the factor on best algorithms depends on? 3. Differentiate: Correct & Incorrect Algorithms? 4. Write short note:

More information

Graph Database Proof of Concept Report

Graph Database Proof of Concept Report Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

Programming Abstractions and Security for Cloud Computing Systems

Programming Abstractions and Security for Cloud Computing Systems Programming Abstractions and Security for Cloud Computing Systems By Sapna Bedi A MASTERS S ESSAY SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit

More information

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59 Introduction Contents 1 Introduction

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Practical Graph Mining with R. 5. Link Analysis

Practical Graph Mining with R. 5. Link Analysis Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

PDQCollections*: A Data-Parallel Programming Model and Library for Associative Containers

PDQCollections*: A Data-Parallel Programming Model and Library for Associative Containers PDQCollections*: A Data-Parallel Programming Model and Library for Associative Containers Maneesh Varshney Vishwa Goudar Christophe Beaumont Jan, 2013 * PDQ could stand for Pretty Darn Quick or Processes

More information

MapReduce Evaluator: User Guide

MapReduce Evaluator: User Guide University of A Coruña Computer Architecture Group MapReduce Evaluator: User Guide Authors: Jorge Veiga, Roberto R. Expósito, Guillermo L. Taboada and Juan Touriño December 9, 2014 Contents 1 Overview

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Reminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new

Reminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new Reminder: Complexity (1) Parallel Complexity Theory Lecture 6 Number of steps or memory units required to compute some result In terms of input size Using a single processor O(1) says that regardless of

More information

Reminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new GAP (2) Graph Accessibility Problem (GAP) (1)

Reminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new GAP (2) Graph Accessibility Problem (GAP) (1) Reminder: Complexity (1) Parallel Complexity Theory Lecture 6 Number of steps or memory units required to compute some result In terms of input size Using a single processor O(1) says that regardless of

More information

Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective 2014-2015

Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective 2014-2015 Cloud Computing Lectures 10 and 11 Map Reduce: System Perspective 2014-2015 1 MapReduce in More Detail 2 Master (i) Execution is controlled by the master process: Input data are split into 64MB blocks.

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

How To Cluster Of Complex Systems

How To Cluster Of Complex Systems Entropy based Graph Clustering: Application to Biological and Social Networks Edward C Kenley Young-Rae Cho Department of Computer Science Baylor University Complex Systems Definition Dynamically evolving

More information

The Stratosphere Big Data Analytics Platform

The Stratosphere Big Data Analytics Platform The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science amir@sics.se June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44 Big Data small data

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

Social Media Mining. Graph Essentials

Social Media Mining. Graph Essentials Graph Essentials Graph Basics Measures Graph and Essentials Metrics 2 2 Nodes and Edges A network is a graph nodes, actors, or vertices (plural of vertex) Connections, edges or ties Edge Node Measures

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

A computational model for MapReduce job flow

A computational model for MapReduce job flow A computational model for MapReduce job flow Tommaso Di Noia, Marina Mongiello, Eugenio Di Sciascio Dipartimento di Ingegneria Elettrica e Dell informazione Politecnico di Bari Via E. Orabona, 4 70125

More information

An Empirical Study of Two MIS Algorithms

An Empirical Study of Two MIS Algorithms An Empirical Study of Two MIS Algorithms Email: Tushar Bisht and Kishore Kothapalli International Institute of Information Technology, Hyderabad Hyderabad, Andhra Pradesh, India 32. tushar.bisht@research.iiit.ac.in,

More information

Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez (200950006) Sameer Kumar (200950031)

Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez (200950006) Sameer Kumar (200950031) using Hadoop MapReduce framework Binoy A Fernandez (200950006) Sameer Kumar (200950031) Objective To demonstrate how the hadoop mapreduce framework can be extended to work with image data for distributed

More information

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Analysis of Web Archives. Vinay Goel Senior Data Engineer Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner

More information

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Distributed Dynamic Load Balancing for Iterative-Stencil Applications Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,

More information

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing. Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,

More information

Graphalytics: A Big Data Benchmark for Graph-Processing Platforms

Graphalytics: A Big Data Benchmark for Graph-Processing Platforms Graphalytics: A Big Data Benchmark for Graph-Processing Platforms Mihai Capotă, Tim Hegeman, Alexandru Iosup, Arnau Prat-Pérez, Orri Erling, Peter Boncz Delft University of Technology Universitat Politècnica

More information

Distributed Computing over Communication Networks: Maximal Independent Set

Distributed Computing over Communication Networks: Maximal Independent Set Distributed Computing over Communication Networks: Maximal Independent Set What is a MIS? MIS An independent set (IS) of an undirected graph is a subset U of nodes such that no two nodes in U are adjacent.

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

Big Data Systems CS 5965/6965 FALL 2015

Big Data Systems CS 5965/6965 FALL 2015 Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Hadoop WordCount Explained! IT332 Distributed Systems

Hadoop WordCount Explained! IT332 Distributed Systems Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,

More information

Big Data Science. Prof. Lise Getoor University of Maryland, College Park. http://www.cs.umd.edu/~getoor. October 17, 2013

Big Data Science. Prof. Lise Getoor University of Maryland, College Park. http://www.cs.umd.edu/~getoor. October 17, 2013 Big Data Science Prof Lise Getoor University of Maryland, College Park October 17, 2013 http://wwwcsumdedu/~getoor BIG Data is not flat 2004-2013 lonnitaylor Data is multi-modal, multi-relational, spatio-temporal,

More information

BOĞAZİÇİ UNIVERSITY COMPUTER ENGINEERING

BOĞAZİÇİ UNIVERSITY COMPUTER ENGINEERING Parallel l Tetrahedral Mesh Refinement Mehmet Balman Computer Engineering, Boğaziçi University Adaptive Mesh Refinement (AMR) A computation ti technique used to improve the efficiency i of numerical systems

More information

GraphChi: Large-Scale Graph Computation on Just a PC

GraphChi: Large-Scale Graph Computation on Just a PC GraphChi: Large-Scale Graph Computation on Just a PC Aapo Kyrola Carnegie Mellon University akyrola@cs.cmu.edu Guy Blelloch Carnegie Mellon University guyb@cs.cmu.edu Carlos Guestrin University of Washington

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

arxiv:1405.1499v3 [cs.db] 30 Sep 2015

arxiv:1405.1499v3 [cs.db] 30 Sep 2015 Noname manuscript No. (will be inserted by the editor) NScale: Neighborhood-centric Large-Scale Graph Analytics in the Cloud Abdul Quamar Amol Deshpande Jimmy Lin arxiv:145.1499v3 [cs.db] 3 Sep 215 the

More information

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.

More information

Project DALKIT (informal working title)

Project DALKIT (informal working title) Project DALKIT (informal working title) Michael Axtmann, Timo Bingmann, Peter Sanders, Sebastian Schlag, and Students 2015-03-27 INSTITUTE OF THEORETICAL INFORMATICS ALGORITHMICS KIT University of the

More information