Big Data Analytics. Lucas Rego Drumond
|
|
- Shannon Walton
- 8 years ago
- Views:
Transcription
1 Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 33
2 Outline 1. Introduction 2. GraphLab 2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example 2.5 Execution Model and Data Consistency Big Data Analytics 1 / 33
3 1. Introduction Outline 1. Introduction 2. GraphLab 2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example 2.5 Execution Model and Data Consistency Big Data Analytics 1 / 33
4 1. Introduction Overview Part III Machine Learning Algorithms Part II Large Scale Computational Models Part I Distributed Database Distributed File System Big Data Analytics 1 / 33
5 1. Introduction MapReduce - Review 1. Each mapper transforms a set key-value pairs into a list of output keys and intermediate value pairs 2. all intermediate values are grouped according to their output keys 3. each reducer receives all the intermediate values associated with a given keys 4. each reducer associates one final value to each key Big Data Analytics 2 / 33
6 1. Introduction Pregel - Review Big Data Analytics 3 / 33
7 2. GraphLab Outline 1. Introduction 2. GraphLab 2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example 2.5 Execution Model and Data Consistency Big Data Analytics 4 / 33
8 2. GraphLab GraphLab A new framework for parallel machine learning Big Data Analytics 4 / 33
9 2. GraphLab GraphLab A new framework for parallel machine learning Represents data as a directed graph Big Data Analytics 4 / 33
10 2. GraphLab GraphLab A new framework for parallel machine learning Represents data as a directed graph Blocks of data stored in vertices and edges Big Data Analytics 4 / 33
11 2. GraphLab GraphLab A new framework for parallel machine learning Represents data as a directed graph Blocks of data stored in vertices and edges Globally shared data table Big Data Analytics 4 / 33
12 2. GraphLab GraphLab A new framework for parallel machine learning Represents data as a directed graph Blocks of data stored in vertices and edges Globally shared data table Shared memory (GraphLab 1.0) distributed (GraphLab 2.1) Big Data Analytics 4 / 33
13 2. GraphLab GraphLab A new framework for parallel machine learning Represents data as a directed graph Blocks of data stored in vertices and edges Globally shared data table Shared memory (GraphLab 1.0) distributed (GraphLab 2.1) Available for download at: Big Data Analytics 4 / 33
14 2. GraphLab GraphLab Software Stack Big Data Analytics 5 / 33
15 2. GraphLab GraphLab System Overview Big Data Analytics 6 / 33
16 2. GraphLab GraphLab Abstraction The GraphLab abstraction consists of the following main components: Big Data Analytics 7 / 33
17 2. GraphLab GraphLab Abstraction The GraphLab abstraction consists of the following main components: A Data Graph provides a high level representation for data and complex computational dependencies Big Data Analytics 7 / 33
18 2. GraphLab GraphLab Abstraction The GraphLab abstraction consists of the following main components: A Data Graph provides a high level representation for data and complex computational dependencies Configurable consistency models for automatically guaranteeing data consistancy Big Data Analytics 7 / 33
19 2. GraphLab GraphLab Abstraction The GraphLab abstraction consists of the following main components: A Data Graph provides a high level representation for data and complex computational dependencies Configurable consistency models for automatically guaranteeing data consistancy Scheduling Primitives able to express iterative parallel algorithms Big Data Analytics 7 / 33
20 2. GraphLab GraphLab Abstraction The GraphLab abstraction consists of the following main components: A Data Graph provides a high level representation for data and complex computational dependencies Configurable consistency models for automatically guaranteeing data consistancy Scheduling Primitives able to express iterative parallel algorithms User defined computation consisting of: Big Data Analytics 7 / 33
21 2. GraphLab GraphLab Abstraction The GraphLab abstraction consists of the following main components: A Data Graph provides a high level representation for data and complex computational dependencies Configurable consistency models for automatically guaranteeing data consistancy Scheduling Primitives able to express iterative parallel algorithms User defined computation consisting of: Update Functions: local computation Big Data Analytics 7 / 33
22 2. GraphLab GraphLab Abstraction The GraphLab abstraction consists of the following main components: A Data Graph provides a high level representation for data and complex computational dependencies Configurable consistency models for automatically guaranteeing data consistancy Scheduling Primitives able to express iterative parallel algorithms User defined computation consisting of: Update Functions: local computation Sync Mechanism: global aggregation Big Data Analytics 7 / 33
23 2.1 GraphLab Data Model Outline 1. Introduction 2. GraphLab 2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example 2.5 Execution Model and Data Consistency Big Data Analytics 8 / 33
24 2.1 GraphLab Data Model Graphlab Data Model Consists of two parts: Big Data Analytics 8 / 33
25 2.1 GraphLab Data Model Graphlab Data Model Consists of two parts: 1. A directed data graph: G := (V, E) where V is the set of vertices and E V V the directed edges Big Data Analytics 8 / 33
26 2.1 GraphLab Data Model Graphlab Data Model Consists of two parts: 1. A directed data graph: G := (V, E) where V is the set of vertices and E V V the directed edges 2. A globally shared data table: T[Key] Value which maps keys to arbitrary blocks of data Big Data Analytics 8 / 33
27 2.1 GraphLab Data Model Graphlab Data Model Users can associate with each graph vertice and edge: Arbitrary data blocks Program or model parameters Some notation: Big Data Analytics 9 / 33
28 2.1 GraphLab Data Model Graphlab Data Model Users can associate with each graph vertice and edge: Arbitrary data blocks Program or model parameters Some notation: Data associated with a vertex v: D v Big Data Analytics 9 / 33
29 2.1 GraphLab Data Model Graphlab Data Model Users can associate with each graph vertice and edge: Arbitrary data blocks Program or model parameters Some notation: Data associated with a vertex v: D v Data associated with an edge (u, v): D u,v Big Data Analytics 9 / 33
30 2.1 GraphLab Data Model Graphlab Data Model Users can associate with each graph vertice and edge: Arbitrary data blocks Program or model parameters Some notation: Data associated with a vertex v: D v Data associated with an edge (u, v): D u,v Set of all outbound edges from v: (v ) Big Data Analytics 9 / 33
31 2.1 GraphLab Data Model Graphlab Data Model Users can associate with each graph vertice and edge: Arbitrary data blocks Program or model parameters Some notation: Data associated with a vertex v: D v Data associated with an edge (u, v): D u,v Set of all outbound edges from v: (v ) Set of all inbound edges from v: ( v) Big Data Analytics 9 / 33
32 2.1 GraphLab Data Model Graphlab Data Model Users can associate with each graph vertice and edge: Arbitrary data blocks Program or model parameters Some notation: Data associated with a vertex v: D v Data associated with an edge (u, v): D u,v Set of all outbound edges from v: (v ) Set of all inbound edges from v: ( v) Neighboring vertices of v: N v := {u (u, v) E (v, u) E} Big Data Analytics 9 / 33
33 2.1 GraphLab Data Model GraphLab - User defined computation There are two types of user defined computation on GraphLab: Big Data Analytics 10 / 33
34 2.1 GraphLab Data Model GraphLab - User defined computation There are two types of user defined computation on GraphLab: Update function: Big Data Analytics 10 / 33
35 2.1 GraphLab Data Model GraphLab - User defined computation There are two types of user defined computation on GraphLab: Update function: local computation Big Data Analytics 10 / 33
36 2.1 GraphLab Data Model GraphLab - User defined computation There are two types of user defined computation on GraphLab: Update function: local computation executed on small neighborhoods Big Data Analytics 10 / 33
37 2.1 GraphLab Data Model GraphLab - User defined computation There are two types of user defined computation on GraphLab: Update function: local computation executed on small neighborhoods different functions may access and modify overlapping contexts Big Data Analytics 10 / 33
38 2.1 GraphLab Data Model GraphLab - User defined computation There are two types of user defined computation on GraphLab: Update function: local computation executed on small neighborhoods different functions may access and modify overlapping contexts Sync Mechanism: global aggregation runs concurrently with the update functions Big Data Analytics 10 / 33
39 2.2 Update Function Outline 1. Introduction 2. GraphLab 2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example 2.5 Execution Model and Data Consistency Big Data Analytics 11 / 33
40 2.2 Update Function Update Function - Scope of a Vertex The Update Function is a stateless user defined function Big Data Analytics 11 / 33
41 2.2 Update Function Update Function - Scope of a Vertex The Update Function is a stateless user defined function It operates on the scope S v of a vertex v: S v := {v} (v ) ( v) N v Big Data Analytics 11 / 33
42 2.2 Update Function Update Function - Scope of a Vertex The Update Function is a stateless user defined function It operates on the scope S v of a vertex v: S v := {v} (v ) ( v) N v Big Data Analytics 11 / 33
43 2.2 Update Function Update Function - Scope of a Vertex The Update Function is a stateless user defined function It operates on the scope S v of a vertex v: S v := {v} (v ) ( v) N v Once executed on a vertex v, it may trigger the execution of the update functions on other vertices T Big Data Analytics 11 / 33
44 2.2 Update Function Update Function Be S v the scope of a vertex v and T the globally shared table. Big Data Analytics 12 / 33
45 2.2 Update Function Update Function Be S v the scope of a vertex v and T the globally shared table. The application of an update function f to a vertex v can be defined as (D Sv, T ) f (D Sv, T) Big Data Analytics 12 / 33
46 2.2 Update Function Update Function Be S v the scope of a vertex v and T the globally shared table. The application of an update function f to a vertex v can be defined as (D Sv, T ) f (D Sv, T) Where: Big Data Analytics 12 / 33
47 2.2 Update Function Update Function Be S v the scope of a vertex v and T the globally shared table. The application of an update function f to a vertex v can be defined as (D Sv, T ) f (D Sv, T) Where: D Sv is the data associated with the scope of v Big Data Analytics 12 / 33
48 2.2 Update Function Update Function Be S v the scope of a vertex v and T the globally shared table. The application of an update function f to a vertex v can be defined as (D Sv, T ) f (D Sv, T) Where: D Sv is the data associated with the scope of v f has read-only access to T Big Data Analytics 12 / 33
49 2.2 Update Function Update Function Be S v the scope of a vertex v and T the globally shared table. The application of an update function f to a vertex v can be defined as (D Sv, T ) f (D Sv, T) Where: D Sv is the data associated with the scope of v f has read-only access to T T is a set of vertices Big Data Analytics 12 / 33
50 2.2 Update Function Update Function and Vertex Programs In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs Big Data Analytics 13 / 33
51 2.2 Update Function Update Function and Vertex Programs In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases: Big Data Analytics 13 / 33
52 2.2 Update Function Update Function and Vertex Programs In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases: Big Data Analytics 13 / 33
53 2.2 Update Function Update Function and Vertex Programs In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases: Gather: executed in parallel on each edge of the current vertex (usually to read data) Big Data Analytics 13 / 33
54 2.2 Update Function Update Function and Vertex Programs In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases: Gather: executed in parallel on each edge of the current vertex (usually to read data) Apply: atomic function that modifies the vertex data Big Data Analytics 13 / 33
55 2.2 Update Function Update Function and Vertex Programs In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases: Gather: executed in parallel on each edge of the current vertex (usually to read data) Apply: atomic function that modifies the vertex data Scatter: Executed in parallel on each edge of the current vertex (usually to update edge data and signaling neighboring nodes) Big Data Analytics 13 / 33
56 2.2 Update Function Update Function and Vertex Programs In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases: Gather: executed in parallel on each edge of the current vertex (usually to read data) Apply: atomic function that modifies the vertex data Scatter: Executed in parallel on each edge of the current vertex (usually to update edge data and signaling neighboring nodes) This decomposition allows us to execute a single vertex program on several machines simultaneously and move computation to the data. Big Data Analytics 13 / 33
57 2.2 Update Function Structure of an Update Function 1: procedure UpdateFunction input: vertex v, scope S v, data D Sv, shared table T 2: GatherType: g stores the results of the gather phase 3: for (u, v) ( v) do In Parallel 4: g g + gather ((u, v)) 5: end for 6: apply (g, D v ) 7: for (v, u) (v ) do In Parallel 8: scatter ((v, u)) 9: end for 10: end procedure Big Data Analytics 14 / 33
58 2.3 Sync Mechanism Outline 1. Introduction 2. GraphLab 2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example 2.5 Execution Model and Data Consistency Big Data Analytics 15 / 33
59 2.3 Sync Mechanism Sync Mechanism Aggregates data accross all vertices in the graph Big Data Analytics 15 / 33
60 2.3 Sync Mechanism Sync Mechanism Aggregates data accross all vertices in the graph Associates the result with a particular entry in T Big Data Analytics 15 / 33
61 2.3 Sync Mechanism Sync Mechanism Aggregates data accross all vertices in the graph Associates the result with a particular entry in T The user provides: Big Data Analytics 15 / 33
62 2.3 Sync Mechanism Sync Mechanism Aggregates data accross all vertices in the graph Associates the result with a particular entry in T The user provides: a key k Big Data Analytics 15 / 33
63 2.3 Sync Mechanism Sync Mechanism Aggregates data accross all vertices in the graph Associates the result with a particular entry in T The user provides: a key k an initial value rk 0 Big Data Analytics 15 / 33
64 2.3 Sync Mechanism Sync Mechanism Aggregates data accross all vertices in the graph Associates the result with a particular entry in T The user provides: a key k an initial value rk 0 a fold function: r i+1 ( k Fold k Dv, rk i ) Big Data Analytics 15 / 33
65 2.3 Sync Mechanism Sync Mechanism Aggregates data accross all vertices in the graph Associates the result with a particular entry in T The user provides: a key k an initial value rk 0 a fold function: r i+1 ( k Fold k Dv, rk i ) an optional merge function: r l k Merge k ( ) rk i, r j k Big Data Analytics 15 / 33
66 2.3 Sync Mechanism Sync Mechanism Aggregates data accross all vertices in the graph Associates the result with a particular entry in T The user provides: a key k an initial value rk 0 a fold function: r i+1 ( k Fold k Dv, rk i ) an optional merge function: rk l Merge k ( ) an apply function: T[k] Apply k r V k ( ) rk i, r j k Big Data Analytics 15 / 33
67 2.3 Sync Mechanism Sync Mechanism 1: procedure Sync Algorithm input: key k, vertices V, data {D v } v V, initial value rk 0, shared table T 2: t r 0 k 3: for v V do 4: t Fold k (D v, t) 5: end for 6: T[k] Apply k (t) 7: end procedure Big Data Analytics 16 / 33
68 2.4 PageRank Example Outline 1. Introduction 2. GraphLab 2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example 2.5 Execution Model and Data Consistency Big Data Analytics 17 / 33
69 2.4 PageRank Example Pagerank: Data Model w 1 w 2 G := {w 1, w 2, w 3, w 4 } w 3 w 4 Big Data Analytics 17 / 33
70 2.4 PageRank Example Pagerank: Data Model w 1 w 2 G := {w 1, w 2, w 3, w 4 } E := {(w 1, w 2 ), (w 1, w 3 ), (w 3, w 1 ), (w 2, w 3 ), (w 4, w 3 )} w 3 w 4 Big Data Analytics 17 / 33
71 2.4 PageRank Example Pagerank: Data Model w 1 w 2 G := {w 1, w 2, w 3, w 4 } E := {(w 1, w 2 ), (w 1, w 3 ), (w 3, w 1 ), (w 2, w 3 ), (w 4, w 3 )} D v := PR(v) w 3 w 4 Big Data Analytics 17 / 33
72 2.4 PageRank Example Pagerank: Data Model G := {w 1, w 2, w 3, w 4 } E := {(w 1, w 2 ), (w 1, w 3 ), (w 3, w 1 ), (w 2, w 3 ), (w 4, w 3 )} D v := PR(v) D v,u := w 1 w 2 w 3 w 4 Big Data Analytics 17 / 33
73 2.4 PageRank Example Pagerank: Update function 1: procedure PageRankUpdateFunction input: vertex v, scope S v, ingoing edges ( v) 2: T 3: Dv old D v 4: p 0 5: for u {u (u, v) ( v)} do 6: p p + Du (u ) 7: end for 8: D v (1 β) + β p 9: if D v D old > ɛ then 10: T N v 11: end if 12: return (D Sv, T ) 13: end procedure v Big Data Analytics 18 / 33
74 2.4 PageRank Example Pagerank: Vertex Program 1: procedure PageRankGather input: vertex v, scope S v, ingoing edge (u v) 2: return D u (u ) 3: end procedure Big Data Analytics 19 / 33
75 2.4 PageRank Example Pagerank: Vertex Program 1: procedure PageRankGather input: vertex v, scope S v, ingoing edge (u v) 2: return D u (u ) 3: end procedure 1: procedure PageRankApply input: vertex v, scope S v, gather result p, D v 3: D v (1 β) + β p 4: if D v Dv old > ɛ then 5: Signal to perform scatter 6: end if 2: Dv old 7: end procedure Big Data Analytics 19 / 33
76 2.4 PageRank Example Pagerank: Scatter 1: procedure PageRankScatter input: outgoing edge (v u), set of vertices to be rescheduled T 2: Add u to T 3: end procedure Big Data Analytics 20 / 33
77 2.4 PageRank Example PageRank Implementation: Data Types s t r u c t web_page { std : : string pagename ; double pagerank ; e x p l i c i t web_page ( std : : string name ) : pagename ( name ), pagerank ( 0. 0 ) { } void save ( graphlab : : oarchive& oarc ) const { oarc << pagename << pagerank ; } void load ( graphlab : : iarchive& iarc ) { iarc >> pagename >> pagerank ; } } ; typedef graphlab : : distributed_graph <web_page, graphlab : : empty> graph_type ; Big Data Analytics 21 / 33
78 2.4 PageRank Example v o i d scatter ( icontext_type& context, const vertex_type& vertex, edge_type& edge ) ; } ; Big Data Analytics 22 / 33 PageRank Implementation: Vertex Program c l a s s pagerank_program : p u b l i c graphlab : : ivertex_program<graph_type, double >, p u b l i c graphlab : : IS_POD_TYPE { p u b l i c : bool perform_scatter ; edge_dir_type gather_edges ( icontext_type& context, const vertex_type& vertex ) ; edge_dir_type scatter_edges ( icontext_type& context, const vertex_type& vertex ) ; double gather ( icontext_type& context, const vertex_type& vertex, edge_type& edge ) ; v o i d apply ( icontext_type& context, vertex_type& vertex, const gather_type& total ) ;
79 2.4 PageRank Example PageRank Implementation: Gather // we a r e g o i n g to g a t h e r on a l l the in edges edge_dir_type gather_edges ( icontext_type& context, const vertex_type& vertex ) const { } r e t u r n graphlab : : IN_EDGES ; // f o r each in edge g a t h e r the w e i g h t e d sum o f the edge. double gather ( icontext_type& context, const vertex_type& vertex, edge_type& edge ) const { r e t u r n edge. source ( ). data ( ). pagerank / edge. source ( ). num_out_edges ( ) ; } Big Data Analytics 23 / 33
80 2.4 PageRank Example PageRank Implementation: Apply // Use the t o t a l rank o f a d j a c e n t pages // to update t h i s page v o i d apply ( icontext_type& context, vertex_type& vertex, const gather_type& total ) { } double newval = total ; double oldval = vertex. data ( ). pagerank ; vertex. data ( ). pagerank = newval ; perform_scatter = ( std : : fabs ( prevval oldval ) > 1E 3); Big Data Analytics 24 / 33
81 2.4 PageRank Example PageRank Implementation: Scatter edge_dir_type scatter_edges ( icontext_type& context, const vertex_type& vertex ) { } i f ( perform_scatter ) r e t u r n graphlab : : OUT_EDGES ; e l s e r e t u r n graphlab : : NO_EDGES ; v o i d scatter ( icontext_type& context, const vertex_type& vertex, edge_type& edge ) const { } context. signal ( edge. target ( ) ) ; Big Data Analytics 25 / 33
82 2.5 Execution Model and Data Consistency Outline 1. Introduction 2. GraphLab 2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example 2.5 Execution Model and Data Consistency Big Data Analytics 26 / 33
83 2.5 Execution Model and Data Consistency Execution Model 1: procedure GraphLabExec input: Graph G := (V, E), data D, initial vertex set T V 2: while T > 0 do 3: v RemoveNext(T ) 4: (T, D Sv ) f (D Sv, T) 5: T T T 6: end while 7: end procedure Big Data Analytics 26 / 33
84 2.5 Execution Model and Data Consistency Serializability The GraphLab abstraction provides a rich sequential model Big Data Analytics 27 / 33
85 2.5 Execution Model and Data Consistency Serializability The GraphLab abstraction provides a rich sequential model Parallel Execution: Big Data Analytics 27 / 33
86 2.5 Execution Model and Data Consistency Serializability The GraphLab abstraction provides a rich sequential model Parallel Execution: Multiple processors to execute the same loop on the same graph Big Data Analytics 27 / 33
87 2.5 Execution Model and Data Consistency Serializability The GraphLab abstraction provides a rich sequential model Parallel Execution: Multiple processors to execute the same loop on the same graph Simutaneous execution of update functions on different vertices Big Data Analytics 27 / 33
88 2.5 Execution Model and Data Consistency Serializability The GraphLab abstraction provides a rich sequential model Parallel Execution: Multiple processors to execute the same loop on the same graph Simutaneous execution of update functions on different vertices We desire a serializable execution: Big Data Analytics 27 / 33
89 2.5 Execution Model and Data Consistency Serializability The GraphLab abstraction provides a rich sequential model Parallel Execution: Multiple processors to execute the same loop on the same graph Simutaneous execution of update functions on different vertices We desire a serializable execution: Definition A GraphLab program is sequentially consistent if for every parallel execution, there exists a sequential execution of update functions that produces the same values in the data graph Big Data Analytics 27 / 33
90 2.5 Execution Model and Data Consistency Data Consistency To ensure Serializability we need consistency models GraphLab supports three types of consistency models: Big Data Analytics 28 / 33
91 2.5 Execution Model and Data Consistency Data Consistency To ensure Serializability we need consistency models GraphLab supports three types of consistency models: Full Consistency: the scope of two concurrently executing update functions do not overlap Big Data Analytics 28 / 33
92 2.5 Execution Model and Data Consistency Data Consistency To ensure Serializability we need consistency models GraphLab supports three types of consistency models: Full Consistency: the scope of two concurrently executing update functions do not overlap Edge Consistency: each update function has R/W access to its vertex and adjacent edges but only read access to neighboring vertices Vertex Consistency: only ensures that no concurrently update functions are executed on the same vertice Big Data Analytics 28 / 33
93 2.5 Execution Model and Data Consistency Data Consistency - Exclusion Sets Big Data Analytics 29 / 33
94 2.5 Execution Model and Data Consistency Data Consistency Big Data Analytics 30 / 33
95 2.5 Execution Model and Data Consistency Data Consistency GraphLab guarantees sequential consistency under the following three conditions: The full consistency model is used Big Data Analytics 31 / 33
96 2.5 Execution Model and Data Consistency Data Consistency GraphLab guarantees sequential consistency under the following three conditions: The full consistency model is used The edge consistency model is used and update functions do not modify data in adjacent vertices Big Data Analytics 31 / 33
97 2.5 Execution Model and Data Consistency Data Consistency GraphLab guarantees sequential consistency under the following three conditions: The full consistency model is used The edge consistency model is used and update functions do not modify data in adjacent vertices The vertex consistency model is used and update functions only access local vertex data Big Data Analytics 31 / 33
98 2.5 Execution Model and Data Consistency Benchmarks - Horizontal Scaling Dataset 1 Dataset 2 Dataset V E Size K 50.9 M 655 MB M 1.8 B 31 GB Yong Guo, Marcin Biczak, Ana Lucia Varbanescu, Alexandru Iosup, Claudio Martella, Theodore L. Willke. Towards Benchmarking Graph-Processing Platforms. Super Computing 2013 Big Data Analytics 32 / 33
99 2.5 Execution Model and Data Consistency Summary A GraphLab program consists of: A data graph Big Data Analytics 33 / 33
100 2.5 Execution Model and Data Consistency Summary A GraphLab program consists of: A data graph An update function Big Data Analytics 33 / 33
101 2.5 Execution Model and Data Consistency Summary A GraphLab program consists of: A data graph An update function Gather Function Apply Function Scatter Function A sync mechanism Big Data Analytics 33 / 33
102 2.5 Execution Model and Data Consistency Summary A GraphLab program consists of: A data graph An update function Gather Function Apply Function Scatter Function A sync mechanism A consistency model Big Data Analytics 33 / 33
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction
More informationBig Graph Processing: Some Background
Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs
More informationMachine Learning over Big Data
Machine Learning over Big Presented by Fuhao Zou fuhao@hust.edu.cn Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed
More informationOutline. Motivation. Motivation. MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing 2/28/2013
MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing Iftekhar Naim Outline Motivation MapReduce Overview Design Issues & Abstractions Examples and Results Pros and Cons
More informationLARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.
LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD Dr. Buğra Gedik, Ph.D. MOTIVATION Graph data is everywhere Relationships between people, systems, and the nature Interactions between people, systems,
More informationAsking Hard Graph Questions. Paul Burkhardt. February 3, 2014
Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Apache Spark Apache Spark 1
More informationMapGraph. A High Level API for Fast Development of High Performance Graphic Analytics on GPUs. http://mapgraph.io
MapGraph A High Level API for Fast Development of High Performance Graphic Analytics on GPUs http://mapgraph.io Zhisong Fu, Michael Personick and Bryan Thompson SYSTAP, LLC Outline Motivations MapGraph
More informationGraph Processing and Social Networks
Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph
More informationFast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction
Human connectome. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011 2 NA = 6.022 1023 mol 1 Paul Burkhardt, Chris Waring An NSA Big Graph experiment Fast Iterative Graph Computation with Resource
More informationMizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
/35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of
More informationIntroduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
More informationGRAPHALYTICS http://bl.ocks.org/mbostock/4062045 A Big Data Benchmark for Graph-Processing Platforms
GRAPHALYTICS http://bl.ocks.org/mbostock/4062045 A Big Data Benchmark for Graph-Processing Platforms Mihai Capotã, Yong Guo, Ana Lucia Varbanescu, Tim Hegeman, Jorai Rijsdijk, Alexandru Iosup, GRAPHALYTICS
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationMapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12
MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:
More informationBig Data and Scripting Systems beyond Hadoop
Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid
More informationSoftware tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team
Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 36 Outline
More informationDATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
More informationReport: Declarative Machine Learning on MapReduce (SystemML)
Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETH-ID 11-947-512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,
More informationThe Hadoop Implementation. Thomas Zimmermann Philipp Berger
Link Analysis goes MapReduce The Hadoop Implementation Thomas Zimmermann Philipp Berger Flashback 2 Overview 3 1. Pre- / Postprocessing 2. Our Jobs 3. Evaluation Overview 4 1. Pre- / Postprocessing 2.
More informationAssignment 2: More MapReduce with Hadoop
Assignment 2: More MapReduce with Hadoop Jean-Pierre Lozi February 5, 2015 Provided files following URL: An archive that contains all files you will need for this assignment can be found at the http://sfu.ca/~jlozi/cmpt732/assignment2.tar.gz
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Distributed File Systems and NoSQL Database Distributed
More informationApache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
More informationSocial Media Mining. Network Measures
Klout Measures and Metrics 22 Why Do We Need Measures? Who are the central figures (influential individuals) in the network? What interaction patterns are common in friends? Who are the like-minded users
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 21 Outline
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationInfiniteGraph: The Distributed Graph Database
A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086
More informationA1 and FARM scalable graph database on top of a transactional memory layer
A1 and FARM scalable graph database on top of a transactional memory layer Miguel Castro, Aleksandar Dragojević, Dushyanth Narayanan, Ed Nightingale, Alex Shamis Richie Khanna, Matt Renzelmann Chiranjeeb
More informationSystems and Algorithms for Big Data Analytics
Systems and Algorithms for Big Data Analytics YAN, Da Email: yanda@cse.cuhk.edu.hk My Research Graph Data Distributed Graph Processing Spatial Data Spatial Query Processing Uncertain Data Querying & Mining
More informationPresto/Blockus: Towards Scalable R Data Analysis
/Blockus: Towards Scalable R Data Analysis Andrew A. Chien University of Chicago and Argonne ational Laboratory IRIA-UIUC-AL Joint Institute Potential Collaboration ovember 19, 2012 ovember 19, 2012 Andrew
More informationFrom GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop
More informationApache Hama Design Document v0.6
Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault
More informationParallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
More informationThe Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
More informationOverview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012
Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group) 06-08-2012 Graph - Everywhere 1: Friendship Graph 2: Food Graph 3: Internet Graph Most of the relationships
More informationDATA STRUCTURES USING C
DATA STRUCTURES USING C QUESTION BANK UNIT I 1. Define data. 2. Define Entity. 3. Define information. 4. Define Array. 5. Define data structure. 6. Give any two applications of data structures. 7. Give
More informationFastForward I/O and Storage: ACG 6.6 Demonstration
FastForward I/O and Storage: ACG 6.6 Demonstration NOTICE: THIS MANUSCRIPT HAS BEEN AUTHORED BY INTEL UNDER ITS SUBCONTRACT WITH LAWRENCE LIVERMORE NATIONAL SECURITY, LLC WHO IS THE OPERATOR AND MANAGER
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Going For Large Scale Going For Large Scale 1
More informationThe PageRank Citation Ranking: Bring Order to the Web
The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized
More informationWhy? A central concept in Computer Science. Algorithms are ubiquitous.
Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online
More informationhttp://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationBig Data looks Tiny from the Stratosphere
Volker Markl http://www.user.tu-berlin.de/marklv volker.markl@tu-berlin.de Big Data looks Tiny from the Stratosphere Data and analyses are becoming increasingly complex! Size Freshness Format/Media Type
More informationTrinity: A Distributed Graph Engine on a Memory Cloud
Trinity: A Distributed Graph Engine on a Memory Cloud Bin Shao Microsoft Research Asia Beijing, China binshao@microsoft.com Haixun Wang Microsoft Research Asia Beijing, China haixunw@microsoft.com Yatao
More informationDistributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud Yucheng Low Carnegie Mellon University ylow@cs.cmu.edu Danny Bickson Carnegie Mellon University bickson@cs.cmu.edu Joseph
More informationCommon Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com
Common Patterns and Pitfalls for Implementing Algorithms in Spark Hossein Falaki @mhfalaki hossein@databricks.com Challenges of numerical computation over big data When applying any algorithm to big data
More informationCan the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
More information5. A full binary tree with n leaves contains [A] n nodes. [B] log n 2 nodes. [C] 2n 1 nodes. [D] n 2 nodes.
1. The advantage of.. is that they solve the problem if sequential storage representation. But disadvantage in that is they are sequential lists. [A] Lists [B] Linked Lists [A] Trees [A] Queues 2. The
More informationIntroduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
More informationBig Data Analytics. Prof. Dr. Lars Schmidt-Thieme
Big Data Analytics Prof. Dr. Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie,
More informationA Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
More informationReplication-based Fault-tolerance for Large-scale Graph Processing
Replication-based Fault-tolerance for Large-scale Graph Processing Peng Wang Kaiyuan Zhang Rong Chen Haibo Chen Haibing Guan Shanghai Key Laboratory of Scalable Computing and Systems Institute of Parallel
More informationBP2SAN From Business Processes to Stochastic Automata Networks
BP2SAN From Business Processes to Stochastic Automata Networks Kelly Rosa Braghetto Department of Computer Science University of São Paulo kellyrb@ime.usp.br March, 2011 Contents 1 Introduction 1 2 Instructions
More informationRelationship collections are well expressed
C loud C omputing Graph Twiddling in a MapReduce World The easily distributed sorting primitives that constitute MapReduce jobs have shown great value in processing large data volumes If useful graph operations
More informationClash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden
More informationBig Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
More informationHow To Create A Graphlab Framework For Parallel Machine Learning
This work is supported by NSF grants CCF-1149252, CCF-1337215, and STARnet, a Semiconductor Research Corporation Program, sponsored by MARCO and DARPA. Datacenter Simulation Methodologies: GraphLab Tamara
More informationData Structure [Question Bank]
Unit I (Analysis of Algorithms) 1. What are algorithms and how they are useful? 2. Describe the factor on best algorithms depends on? 3. Differentiate: Correct & Incorrect Algorithms? 4. Write short note:
More informationGraph Database Proof of Concept Report
Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment
More informationScalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
More informationProgramming Abstractions and Security for Cloud Computing Systems
Programming Abstractions and Security for Cloud Computing Systems By Sapna Bedi A MASTERS S ESSAY SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty
More information16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
More informationThe Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
More informationConvex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics
Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59 Introduction Contents 1 Introduction
More informationTutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
More informationPractical Graph Mining with R. 5. Link Analysis
Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities
More informationProtein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
More informationPDQCollections*: A Data-Parallel Programming Model and Library for Associative Containers
PDQCollections*: A Data-Parallel Programming Model and Library for Associative Containers Maneesh Varshney Vishwa Goudar Christophe Beaumont Jan, 2013 * PDQ could stand for Pretty Darn Quick or Processes
More informationMapReduce Evaluator: User Guide
University of A Coruña Computer Architecture Group MapReduce Evaluator: User Guide Authors: Jorge Veiga, Roberto R. Expósito, Guillermo L. Taboada and Juan Touriño December 9, 2014 Contents 1 Overview
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationKeywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
More informationReminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new
Reminder: Complexity (1) Parallel Complexity Theory Lecture 6 Number of steps or memory units required to compute some result In terms of input size Using a single processor O(1) says that regardless of
More informationReminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new GAP (2) Graph Accessibility Problem (GAP) (1)
Reminder: Complexity (1) Parallel Complexity Theory Lecture 6 Number of steps or memory units required to compute some result In terms of input size Using a single processor O(1) says that regardless of
More informationCloud Computing. Lectures 10 and 11 Map Reduce: System Perspective 2014-2015
Cloud Computing Lectures 10 and 11 Map Reduce: System Perspective 2014-2015 1 MapReduce in More Detail 2 Master (i) Execution is controlled by the master process: Input data are split into 64MB blocks.
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationHow To Cluster Of Complex Systems
Entropy based Graph Clustering: Application to Biological and Social Networks Edward C Kenley Young-Rae Cho Department of Computer Science Baylor University Complex Systems Definition Dynamically evolving
More informationThe Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science amir@sics.se June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44 Big Data small data
More informationLecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
More informationSocial Media Mining. Graph Essentials
Graph Essentials Graph Basics Measures Graph and Essentials Metrics 2 2 Nodes and Edges A network is a graph nodes, actors, or vertices (plural of vertex) Connections, edges or ties Edge Node Measures
More informationMapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
More informationA computational model for MapReduce job flow
A computational model for MapReduce job flow Tommaso Di Noia, Marina Mongiello, Eugenio Di Sciascio Dipartimento di Ingegneria Elettrica e Dell informazione Politecnico di Bari Via E. Orabona, 4 70125
More informationAn Empirical Study of Two MIS Algorithms
An Empirical Study of Two MIS Algorithms Email: Tushar Bisht and Kishore Kothapalli International Institute of Information Technology, Hyderabad Hyderabad, Andhra Pradesh, India 32. tushar.bisht@research.iiit.ac.in,
More informationDistributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez (200950006) Sameer Kumar (200950031)
using Hadoop MapReduce framework Binoy A Fernandez (200950006) Sameer Kumar (200950031) Objective To demonstrate how the hadoop mapreduce framework can be extended to work with image data for distributed
More informationAnalysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
More informationDistributed Dynamic Load Balancing for Iterative-Stencil Applications
Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
More informationThis exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.
Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,
More informationGraphalytics: A Big Data Benchmark for Graph-Processing Platforms
Graphalytics: A Big Data Benchmark for Graph-Processing Platforms Mihai Capotă, Tim Hegeman, Alexandru Iosup, Arnau Prat-Pérez, Orri Erling, Peter Boncz Delft University of Technology Universitat Politècnica
More informationDistributed Computing over Communication Networks: Maximal Independent Set
Distributed Computing over Communication Networks: Maximal Independent Set What is a MIS? MIS An independent set (IS) of an undirected graph is a subset U of nodes such that no two nodes in U are adjacent.
More informationLecture 10 - Functional programming: Hadoop and MapReduce
Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional
More informationBig Data Systems CS 5965/6965 FALL 2015
Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html
More informationBig Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
More informationHadoop WordCount Explained! IT332 Distributed Systems
Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,
More informationBig Data Science. Prof. Lise Getoor University of Maryland, College Park. http://www.cs.umd.edu/~getoor. October 17, 2013
Big Data Science Prof Lise Getoor University of Maryland, College Park October 17, 2013 http://wwwcsumdedu/~getoor BIG Data is not flat 2004-2013 lonnitaylor Data is multi-modal, multi-relational, spatio-temporal,
More informationBOĞAZİÇİ UNIVERSITY COMPUTER ENGINEERING
Parallel l Tetrahedral Mesh Refinement Mehmet Balman Computer Engineering, Boğaziçi University Adaptive Mesh Refinement (AMR) A computation ti technique used to improve the efficiency i of numerical systems
More informationGraphChi: Large-Scale Graph Computation on Just a PC
GraphChi: Large-Scale Graph Computation on Just a PC Aapo Kyrola Carnegie Mellon University akyrola@cs.cmu.edu Guy Blelloch Carnegie Mellon University guyb@cs.cmu.edu Carlos Guestrin University of Washington
More informationParallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationarxiv:1405.1499v3 [cs.db] 30 Sep 2015
Noname manuscript No. (will be inserted by the editor) NScale: Neighborhood-centric Large-Scale Graph Analytics in the Cloud Abdul Quamar Amol Deshpande Jimmy Lin arxiv:145.1499v3 [cs.db] 3 Sep 215 the
More informationStorage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.
More informationProject DALKIT (informal working title)
Project DALKIT (informal working title) Michael Axtmann, Timo Bingmann, Peter Sanders, Sebastian Schlag, and Students 2015-03-27 INSTITUTE OF THEORETICAL INFORMATICS ALGORITHMICS KIT University of the
More information