Systems and Algorithms for Big Data Analytics




YAN, Da
Email: yanda@cse.cuhk.edu.hk

My Research
» Graph Data: Distributed Graph Processing
» Spatial Data: Spatial Query Processing
» Uncertain Data: Querying & Mining Uncertain Data

My Research
Graph Data: Distributed Graph Processing
» Algorithm Design & Analysis
» Computation Model
» Communication Mechanism
» Fault Tolerance
» Out-of-core Support

My Research
Spatial Data: Spatial Query Processing
» Spatial Settings: Road Networks, Terrain Meshes, Euclidean Space (Trajectories)
» Spatial Queries: Optimal Meeting Point, Distance-Preserving Subgraph, Facility Location Problem, Reverse Nearest Neighbors

My Research
Uncertain Data: Querying & Mining Uncertain Data
» Top-k Queries (DASFAA 2011 Best Paper)
» Sequential Pattern Mining
» Spatial Queries

My Research
» Graph Data: Distributed Graph Processing (focus of this presentation)
» Spatial Data: Spatial Query Processing
» Uncertain Data: Querying & Mining Uncertain Data

Google's Pregel
Distributed Framework for Graph Processing
» User-friendly: "think like a vertex"
» Message passing
» Iterative
  - Bulk synchronous parallel
  - Superstep

Google's Pregel
Vertex Partitioning
» Adjacency lists of a 9-vertex example graph, distributed over machines M0, M1, M2:
  0: 1 3
  1: 0 2 3
  2: 1 3 4 7
  3: 0 1 2 7
  4: 2 5 7
  5: 4 6
  6: 5 8
  7: 2 3 4 8
  8: 6 7

Google's Pregel
Programming Interfaces
» u.compute(msgs)
» u.send_msg(v, msg)
» get_superstep_number()
» u.vote_to_halt()
(called inside u.compute(msgs))

Google's Pregel
Vertex state
» Active / inactive
» Reactivated by messages
Stop condition
» All vertices are halted, and
» No pending messages for the next superstep
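The interface and halting rules on the two slides above can be sketched as a toy single-process engine. This is only an illustration under stated assumptions, not the actual Pregel or Pregel+ API: the `Vertex` class, the `run` driver, and the `MaxVertex` example are hypothetical, and the superstep number is passed to `compute` as an argument instead of being read through `get_superstep_number()`.

```python
from collections import defaultdict


class Vertex:
    """Hypothetical vertex with the operations named on the slide."""

    def __init__(self, vid, neighbours, value):
        self.id, self.neighbours, self.value = vid, neighbours, value
        self.active = True
        self.outbox = []                      # (target, msg) pairs sent this superstep

    def send_msg(self, target, msg):
        self.outbox.append((target, msg))

    def vote_to_halt(self):
        self.active = False

    def compute(self, msgs, superstep):
        raise NotImplementedError


def run(vertices):
    """Run supersteps until all vertices are halted and no messages are pending."""
    inbox = defaultdict(list)
    superstep = 0
    while any(v.active for v in vertices.values()) or inbox:
        superstep += 1
        next_inbox = defaultdict(list)
        for v in vertices.values():
            msgs = inbox.get(v.id, [])
            if msgs:
                v.active = True               # a halted vertex is reactivated by messages
            if v.active:
                v.compute(msgs, superstep)
                for target, msg in v.outbox:
                    next_inbox[target].append(msg)
                v.outbox = []
        inbox = next_inbox                    # messages are delivered in the next superstep
    return superstep


class MaxVertex(Vertex):
    """Example program: every vertex learns the largest value in its component."""

    def compute(self, msgs, superstep):
        new_value = max([self.value] + msgs)
        if superstep == 1 or new_value > self.value:
            self.value = new_value
            for u in self.neighbours:
                self.send_msg(u, self.value)
        self.vote_to_halt()


vertices = {0: MaxVertex(0, [1], 3), 1: MaxVertex(1, [0, 2], 6), 2: MaxVertex(2, [1], 2)}
run(vertices)
print({vid: v.value for vid, v in vertices.items()})
```

Every vertex votes to halt at the end of each superstep, so the loop ends exactly when a superstep produces no messages, which mirrors the stop condition on the slide.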

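Google's Pregel
Hash-Min: Connected Components
[Figure: 9-vertex example graph at superstep 1; every vertex holds its own ID as its label.]

Google's Pregel
Hash-Min: Connected Components
[Figure: the same graph at superstep 2; vertices 0, 2, 5, and 6 hold label 0, while vertex 7 holds 5, vertex 8 holds 6, and vertex 4 holds 2.]

Google's Pregel
Illustration of Hash-Min
[Figure: the same graph at superstep 3; every vertex holds label 0.]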
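A minimal sketch of Hash-Min as the slides illustrate it: each vertex starts with its own ID as label, broadcasts it, and afterwards only re-broadcasts when it receives a smaller label. The function name `hash_min` and the dictionary-based graph are assumptions made for this example; the supersteps are simulated sequentially in one process.

```python
from collections import defaultdict


def hash_min(adj):
    """adj maps each vertex to a list of neighbours (undirected graph)."""
    label = {v: v for v in adj}               # superstep 1: label = own ID
    inbox = defaultdict(list)
    for v, nbrs in adj.items():               # superstep 1: broadcast the label
        for u in nbrs:
            inbox[u].append(label[v])
    supersteps = 1
    while inbox:                              # stop when no messages are pending
        supersteps += 1
        outbox = defaultdict(list)
        for v, msgs in inbox.items():         # only vertices with messages wake up
            smallest = min(msgs)
            if smallest < label[v]:
                label[v] = smallest
                for u in adj[v]:
                    outbox[u].append(smallest)
            # otherwise the vertex sends nothing, i.e. it votes to halt
        inbox = outbox
    return label, supersteps


# Toy graph (hypothetical): a path 0-1-2 and a separate edge 3-4.
labels, steps = hash_min({0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]})
print(labels, steps)
```

The number of supersteps grows with the graph diameter, because a label moves only one hop per superstep.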

Outline
» Practical Pregel Algorithms
» Blogel: Block-Centric Computation
» Pregel+: Message Reduction
» Other Improvements to Pregel
» Future Directions


Practical Pregel Algorithms
Practical Pregel Algorithms (PPAs) [PVLDB'14]
» The first cost model for Pregel algorithm design
» PPAs for fundamental graph problems
  - Breadth-first search, list ranking, spanning tree, Euler tour, pre/post-order traversal, connected components, biconnected components, strongly connected components, etc.

Practical Pregel Algorithms
Practical Pregel Algorithms (PPAs) [PVLDB'14]
» Linear cost per superstep
  - O(|V| + |E|) messages
  - O(|V| + |E|) computation time
  - O(|V| + |E|) RAM space
» Logarithmic number of supersteps
  - O(log |V|) supersteps
  - O(log |V|) = O(log |E|)
How about load balancing?

Practical Pregel Algorithms
Balanced Practical Pregel Algorithms (BPPAs)
» d_in(v): in-degree of v
» d_out(v): out-degree of v
» Linear cost per superstep, for each vertex v
  - O(d_in(v) + d_out(v)) messages
  - O(d_in(v) + d_out(v)) computation time
  - O(d_in(v) + d_out(v)) RAM space
» Logarithmic number of supersteps

Practical Pregel Algorithms
Example: List Ranking
» A procedure used in computing biconnected components
» Linked list where each element v has
  - a value val(v)
  - a predecessor pred(v)
» The element at the head has pred(v) = NULL
Toy example: val(v) = 1 for all v

  element:  v1  v2  v3  v4  v5
  val(v):    1   1   1   1   1

Practical Pregel Algorithms
Example: List Ranking
» Compute sum(v) for each element v by summing val(v) and the values of all its predecessors
» Why can't TeraSort be used here?

  element:  v1  v2  v3  v4  v5
  sum(v):    1   2   3   4   5

Practical Pregel Algorithms
Example: List Ranking
» Pointer jumping / path doubling, repeated as long as pred(v) ≠ NULL
  - sum(v) ← sum(v) + sum(pred(v))
  - pred(v) ← pred(pred(v))
» O(log |V|) supersteps

  element:        v1  v2  v3  v4  v5
  initial sum:     1   1   1   1   1
  after round 1:   1   2   2   2   2
  after round 2:   1   2   3   4   4
  after round 3:   1   2   3   4   5
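The pointer-jumping rounds can be sketched in a few lines. This is a sequential simulation under stated assumptions (the function name `list_ranking` and the dictionary representation are made up for the example); in a real Pregel program each round would be realized with message exchanges between v and pred(v).

```python
def list_ranking(pred, val):
    """pred[v] is v's predecessor (None at the head); val[v] is v's value.

    Returns (total, rounds), where total[v] sums val over v and all its
    predecessors, and rounds is the number of pointer-jumping rounds used.
    """
    total, pred = dict(val), dict(pred)
    rounds = 0
    while any(p is not None for p in pred.values()):
        new_total, new_pred = {}, {}
        for v, p in pred.items():             # all updates read the old state
            if p is None:
                new_total[v], new_pred[v] = total[v], None
            else:
                new_total[v] = total[v] + total[p]   # sum(v) <- sum(v) + sum(pred(v))
                new_pred[v] = pred[p]                # pred(v) <- pred(pred(v))
        total, pred = new_total, new_pred
        rounds += 1
    return total, rounds


# The toy example from the slides: five elements, val(v) = 1 for all v.
pred = {"v1": None, "v2": "v1", "v3": "v2", "v4": "v3", "v5": "v4"}
val = {v: 1 for v in pred}
print(list_ranking(pred, val))
```

On the toy list this takes three rounds and reproduces the intermediate rows shown on the slides, since the distance covered by each pointer doubles per round.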

Practical Pregel Algorithms
Example: Connected Components
» Pointer jumping / path doubling
» Each vertex u maintains a pointer D[u]
  - Vertices are organized into a pseudo-forest
  - D[u] is the parent link

Practical Pregel Algorithms
Example: Connected Components
» Repeat two steps for O(log |V|) rounds
» Step 1: tree hooking
[Figure: vertices u, v, w, x; hooking condition D[v] < D[u].]

Practical Pregel Algorithms
Example: Connected Components
» Repeat two steps for O(log |V|) rounds
» Step 2: shortcutting
  - Point each vertex v to the parent of v's parent
[Figure: vertices u, w, x, y before and after shortcutting.]

Practical Pregel Algorithms
Example: Connected Components
» Repeat two steps for O(log |V|) rounds
» Stop condition: D[u] converges for every vertex u
  - Every vertex belongs to a star
  - Every star corresponds to a CC
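The two repeated steps can be sketched as follows. This is a simplified, sequential illustration and not the authors' exact algorithm: it uses only the hooking condition D[v] < D[u] applied to roots, followed by shortcutting, and it makes no claim about achieving the O(log |V|) round bound. The function name `connected_components` is an assumption for the example.

```python
def connected_components(n, edges):
    """Vertices are 0..n-1; edges is a list of undirected (u, v) pairs.

    Returns D, where D[u] is the smallest vertex ID in u's component.
    """
    D = list(range(n))                        # every vertex starts as its own root
    changed = True
    while changed:
        changed = False
        # Step 1: tree hooking. For an edge (a, b) where D[a] is a root and
        # D[b] < D[a], hook the root D[a] onto D[b] (keep the smallest target).
        hook = {}
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                root = D[a]
                if D[root] == root and D[b] < root:
                    hook[root] = min(hook.get(root, root), D[b])
        for root, target in hook.items():
            D[root] = target
            changed = True
        # Step 2: shortcutting. Point every vertex to its parent's parent.
        new_D = [D[D[u]] for u in range(n)]
        if new_D != D:
            changed = True
        D = new_D
    return D                                  # converged: every tree is a star


print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

Pointers only ever move to smaller IDs, so the parent links cannot form a cycle and the loop terminates with one star per connected component.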


Block-Centric Computation
Blogel: Block-Centric Model [PVLDB'14]
» Orders-of-magnitude performance improvement
  - e.g., one hour → 10 seconds

Block-Centric Computation
Motivation
» Graph characteristics adverse to Pregel
  - Large graph diameter
  - Skewed vertex degree distribution
  - High average vertex degree

  Data         Type        |V|          |E|            Avg Deg  Max Deg
  WebUK        directed    133,633,040  5,507,679,822  41.21    22,429
  LiveJournal  directed    10,690,276   224,614,770    21.01    1,053,676
  Twitter      directed    52,579,682   1,963,263,821  37.34    779,958
  BTC          undirected  164,732,473  772,822,094    4.69     1,637,619

Block-Centric Computation
Idea of Block-Centric Computation
» A block is a connected subgraph of the graph
» Message exchanges occur only among blocks
» A serial in-memory algorithm is run within each block

Block-Centric Computation
Benefits of Block-Centric Computation
» High-degree vertices inside a block send no messages
» Far fewer supersteps
» Far fewer blocks than vertices

Block-Centric Computation
Example: Hash-Min
» Condense each block into a supervertex to obtain the block-level graph
  - i.e., construct an adjacency list for each block
» Run Hash-Min over the block-level graph
  - Propagates the min block ID instead of the min vertex ID
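A small sketch of the block-level idea, assuming an undirected graph and a precomputed vertex-to-block assignment in which every block is connected. All names (`block_level_graph`, `min_label_propagation`, `block_centric_hash_min`) are hypothetical, and the label propagation is a sequential stand-in for running Hash-Min over the block-level graph, not Blogel's implementation.

```python
def block_level_graph(adj, block_of):
    """Condense each block into a supervertex: block -> set of adjacent blocks."""
    block_adj = {b: set() for b in block_of.values()}
    for v, nbrs in adj.items():
        for u in nbrs:
            if block_of[u] != block_of[v]:    # only cross-block edges survive
                block_adj[block_of[v]].add(block_of[u])
    return block_adj


def min_label_propagation(graph):
    """Propagate the smallest ID through the graph until nothing changes."""
    label = {x: x for x in graph}
    changed = True
    while changed:
        changed = False
        for x, nbrs in graph.items():
            best = min([label[x]] + [label[y] for y in nbrs])
            if best < label[x]:
                label[x], changed = best, True
    return label


def block_centric_hash_min(adj, block_of):
    """Label every vertex with the min block ID of its connected component."""
    block_label = min_label_propagation(block_level_graph(adj, block_of))
    return {v: block_label[block_of[v]] for v in adj}


# Toy input (hypothetical): path 0-1-2-3 split into blocks 0 and 1, edge 4-5 in block 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
block_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
print(block_centric_hash_min(adj, block_of))
```

The propagation runs over three supervertices instead of six vertices, which is the source of the savings in messages and supersteps claimed on the slides.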

Block-Centric Computation
Effectiveness

  Graph       Model      Computing Time  Total Msg #    Superstep #
  BTC         V-Centric  28.48 s         1,188,832,712  30
  BTC         B-Centric  0.94 s          1,747,653      6
  Friendster  V-Centric  120.24 s        7,226,963,186  22
  Friendster  B-Centric  2.52 s          19,410,865     5
  USA Road    V-Centric  510.98 s        8,353,044,435  6,262
  USA Road    B-Centric  1.94 s          270,257        26

Block-Centric Computation
Example: Single-Source Shortest Paths
» Source s ∈ V
» Each edge has a length
» Goal: compute the distance from s to each v ∈ V

Block-Centric Computation
Example: Single-Source Shortest Paths
» Vertices receive messages from remote neighbors to update their distances
» Each block runs Dijkstra's algorithm from the updated vertices
» Remote neighbors are sent messages rather than being enqueued
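The three bullets above can be sketched as a sequential simulation. The function `block_sssp`, its argument layout, and the message format `(vertex, distance)` are assumptions for this example rather than Blogel's API; edge lengths are assumed to be non-negative.

```python
import heapq
from collections import defaultdict


def block_sssp(adj, block_of, source):
    """adj: vertex -> list of (neighbour, length); block_of: vertex -> block ID."""
    dist = {v: float("inf") for v in adj}
    inbox = {block_of[source]: [(source, 0)]}  # pending messages, grouped by block
    supersteps = 0
    while inbox:
        supersteps += 1
        outbox = defaultdict(list)
        for b, msgs in inbox.items():
            heap = []
            for v, d in msgs:                 # apply distance updates from messages
                if d < dist[v]:
                    dist[v] = d
                    heapq.heappush(heap, (d, v))
            while heap:                       # Dijkstra restricted to block b
                d, v = heapq.heappop(heap)
                if d > dist[v]:
                    continue                  # stale heap entry
                for u, length in adj[v]:
                    nd = d + length
                    if block_of[u] == b:
                        if nd < dist[u]:
                            dist[u] = nd
                            heapq.heappush(heap, (nd, u))
                    else:                     # remote neighbour: message, not enqueue
                        outbox[block_of[u]].append((u, nd))
        inbox = outbox
    return dist, supersteps


# Toy input (hypothetical): unit-length path 0-1-2-3 split into two blocks.
adj = {0: [(1, 1)], 1: [(0, 1), (2, 1)], 2: [(1, 1), (3, 1)], 3: [(2, 1)]}
print(block_sssp(adj, {0: 0, 1: 0, 2: 1, 3: 1}, 0))
```

A distance travels through a whole block in one superstep, so the superstep count depends on how many block boundaries a shortest path crosses rather than on its hop count.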

Block-Centric Computation
Effectiveness

  Graph      Model      Time       Step #
  Euro Road  V-Centric  1767.69 s  6210
  Euro Road  B-Centric  11.10 s    60
  USA Road   V-Centric  9788.08 s  10789
  USA Road   B-Centric  12.48 s    58

Block-Centric Computation
Graph Partitioning
» Graph Voronoi Diagram (GVD) partitioning
[Figure: three seeds; vertex v is 2 hops from the red seed, 3 hops from the green seed, and 5 hops from the blue seed.]

Block-Centric Computation
GVD Partitioning
» Sample seed vertices with probability p
» Compute GVD grouping
  - Vertex-centric multi-source BFS

Block-Centric Computation
Vertex-Centric Multi-Source BFS
[Figure sequence: state after seed sampling, then supersteps 1, 2, and 3.]

Block-Centric Computation
GVD Partitioning
» Sample seed vertices with probability p
» Compute GVD grouping
» Repeat GVD computation
  - Erase colors of large blocks
  - Increase p and resample seeds
  - Compute GVD over unassigned vertices
» Run Hash-Min over unassigned vertices
  - Why is this step necessary? Consider a graph with many small components
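One round of the sampling and grouping steps can be sketched as below. The helper names `sample_seeds` and `gvd_grouping` are hypothetical, and the BFS is an ordinary queue-based multi-source BFS run in one process rather than the vertex-centric version used in the system; ties between equally close seeds are broken by queue order.

```python
import random
from collections import deque


def sample_seeds(vertices, p, rng=random):
    """Pick each vertex as a seed independently with probability p."""
    return [v for v in vertices if rng.random() < p]


def gvd_grouping(adj, seeds):
    """Assign every reachable vertex to its closest seed (in hops)."""
    block = {s: s for s in seeds}             # a seed belongs to its own block
    frontier = deque(seeds)
    while frontier:
        v = frontier.popleft()
        for u in adj[v]:
            if u not in block:                # the first seed to reach u claims it
                block[u] = block[v]
                frontier.append(u)
    return block                              # vertices no seed can reach stay unassigned


# Toy input (hypothetical): path 0-1-2-3-4 plus an isolated vertex 5.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: []}
block = gvd_grouping(adj, seeds=[0, 4])
print(block, [v for v in adj if v not in block])
```

Vertex 5 stays unassigned because its component contains no seed, which illustrates why the slides add a final Hash-Min pass for graphs with many small components.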

Block-Centric Computation
GVD Partitioning Performance
[Figure: stacked bar chart (loading, partitioning, dumping) for WebUK, Friendster, BTC, LiveJournal, USA Road, and Euro Road; bar labels as extracted, in chart order: 2026.65, 505.85, 186.89, 105.48, 75.88, 70.68.]


Message Reduction
Message Reduction in Pregel+ [WWW'15]
» Two techniques to reduce the number of messages transmitted
  - Vertex Mirroring
  - Request-Respond Paradigm

Message Reduction
Vertex Mirroring
» Motivation: high-degree vertices send a lot of messages
  - A vertex sends the same message to all of its neighbors
  - Hash-Min: min(v)
  - PageRank: PageRank(v) / out-degree(v)

Message Reduction
Vertex Mirroring
[Figure: vertices v_1, ..., v_j on machine M2, u_1, ..., u_i on M1, and w_1, ..., w_k on M3.]

Message Reduction
Vertex Mirroring
[Figure: the same layout with mirrors of u_i added.]
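The saving from mirroring can be illustrated by counting messages for one broadcast. This is a back-of-the-envelope sketch under stated assumptions: the function `broadcast_message_counts` is hypothetical, message combining is ignored, and a mirrored vertex is assumed to send one message per remote machine that hosts at least one of its neighbors.

```python
def broadcast_message_counts(neighbours, machine_of, home):
    """Count network messages when a vertex on machine `home` broadcasts one value.

    Returns (without_mirroring, with_mirroring).
    """
    remote = [v for v in neighbours if machine_of[v] != home]
    without_mirroring = len(remote)                        # one message per remote neighbour
    with_mirroring = len({machine_of[v] for v in remote})  # one message per remote machine
    return without_mirroring, with_mirroring


# Toy placement (hypothetical): vertex 0 lives on machine 0 and has five neighbours.
machine_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
print(broadcast_message_counts([1, 2, 3, 4, 5], machine_of, home=0))
```

For a vertex whose degree is far larger than the number of machines, the second count is bounded by the machine count, which is why the technique targets high-degree vertices.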

Message Reduction
Vertex Mirroring vs. Message Combining
» Create a mirror for u_4? Consider the messages to v_2
  - Message combining without mirroring u_4: v_2 receives one combined message a(u_1) + a(u_2) + a(u_3) + a(u_4)
  - Message combining with u_4 mirrored: v_2 receives the combined message a(u_1) + a(u_2) + a(u_3), while a(u_4) is sent separately to the mirror of u_4
[Figure: u_1, ..., u_4 on machine M1 with adjacency lists u_1: v_1 v_2; u_2: v_1 v_2; u_3: v_1 v_2; u_4: v_1 v_2 v_3 v_4. Vertices v_1, ..., v_4 are on machine M2.]

Message Reduction
Vertex Mirroring vs. Message Combining
» Only mirror high-degree vertices
» Choice of degree threshold τ
  - M machines, n vertices, m edges
  - Average degree: deg_avg = m / n
  - Optimal τ is M · exp{deg_avg / M}
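As a worked example of the threshold formula on this slide, the snippet below evaluates τ = M · exp{deg_avg / M}. The function name and the input numbers are made up for illustration and do not come from the talk.

```python
import math


def mirroring_threshold(num_machines, num_vertices, num_edges):
    """Degree threshold tau = M * exp(deg_avg / M), with deg_avg = m / n."""
    deg_avg = num_edges / num_vertices
    return num_machines * math.exp(deg_avg / num_machines)


# Hypothetical cluster and graph: M = 10 machines, n = 1,000 vertices, m = 5,000 edges.
tau = mirroring_threshold(10, 1_000, 5_000)
print(round(tau, 2))   # deg_avg = 5, so tau = 10 * exp(0.5)
```

With these numbers only vertices of degree above roughly 16 would be mirrored; when the average degree is small relative to M, the threshold stays close to M.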

Message Reduction
Effectiveness of Message Reduction
[Figure: number of messages sent by each worker in Pregel+; blue bars without mirroring, red bars with mirroring.]

Message Reduction
Request-Respond Paradigm
» Motivation
  - As a pointer-jumping algorithm proceeds, fewer and fewer delegates communicate with more and more vertices
  - E.g., the PPA for computing connected components merges small trees into large trees
  - A vertex is the delegate of its children

Message Reduction
Request-Respond Paradigm
» Request-Respond API
  - Retains all basic Pregel operations
  - A vertex v can request attribute a(u) in superstep i, and a(u) will be available in superstep (i + 1)
  - Here, u can be a delegate, and a(u) may be requested by many vertices v

Message Reduction
Request-Respond Paradigm
» Benefits
  - Without Request-Respond: v_1, ..., v_4 each send their own request <v_1>, ..., <v_4> to u on machine M2, and u sends a(u) back to each of them
  - Using Request-Respond: machine M1, which hosts v_1, ..., v_4, sends a single request for u, and M2 returns a[u] once
[Figures: message flow between v_1, ..., v_4 and u, without and with Request-Respond.]
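The benefit can be illustrated by counting messages for one round of requests. This is a sketch under stated assumptions: `request_respond_counts` is a hypothetical helper, requesters co-located with the target are assumed to need no network message, and a machine is assumed to merge all requests for the same target into one.

```python
def request_respond_counts(requesters, machine_of, target):
    """Count network messages needed for all requesters to obtain a(target).

    Returns (plain, merged): per-vertex request/reply pairs versus
    one request/reply pair per requesting machine.
    """
    remote = [v for v in requesters if machine_of[v] != machine_of[target]]
    plain = 2 * len(remote)                   # one request and one reply per requester
    machines = {machine_of[v] for v in remote}
    merged = 2 * len(machines)                # one request and one reply per machine
    return plain, merged


# Toy placement (hypothetical): v_1..v_4 (IDs 1-4) on machine 1, delegate u (ID 0) on machine 2.
machine_of = {0: 2, 1: 1, 2: 1, 3: 1, 4: 1}
print(request_respond_counts([1, 2, 3, 4], machine_of, target=0))
```

As pointer jumping concentrates requests on fewer delegates, the gap between the two counts widens, which matches the motivation given for the paradigm.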

Message Reduction
Effectiveness of Request-Respond Paradigm
[Figure: number of messages sent by each worker using Pregel+; blue bars without request-respond, red bars with request-respond.]


Other Improvements
Fault Tolerance
» Checkpointing time: 60 seconds → 2 seconds
Querying Workload
» Over 100 seconds per query → 3 queries per second
Out-of-core Execution
» Performance comparable to the fastest in-memory Pregel-like system
Survey on Big Graph Systems

Open-Source Systems
High ranking in Google, well indexed
Used by industrial partners
An ITF project funded with HK$ 1.4M

Open-Source Systems
Many times faster than CMU's GraphLab
» GraphLab is sold for US$ 6.7M
10x faster than Giraph used by Facebook
» Facebook researchers closely follow our work
Taobao replaces Spark with our system
» Faster with 4 machines than Spark with 100 machines

Future Directions
Beyond Pregel
» Graph problems not suitable for Pregel
  - Output size beyond linear
  - Non-iterative
» Examples
  - Graph matching
  - Motif mining
  - Frequent subgraph mining

Future Directions
Other Big Data Systems
» Urban Computing
  - Taxi trajectories
  - Octopus card records (bus, MTR, ferry, ...)
» Machine Learning
  - Improving recommendation by Semantic Web
  - Systems for deep learning

Thanks
YAN, Da
Contact Info
Email: yanda@cse.cuhk.edu.hk
Webpage: www.cse.cuhk.edu.hk/~yanda