GraphX: Unifying Data-Parallel and Graph-Parallel Analytics



Similar documents
Making Sense at Scale with the Berkeley Data Analytics Stack

Machine Learning over Big Data

Spark and the Big Data Library

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Spark: Making Big Data Interactive & Real-Time

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

Systems and Algorithms for Big Data Analytics

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Big Graph Processing: Some Background

Big Data Analytics. Lucas Rego Drumond

Unified Big Data Processing with Apache Spark. Matei

The Power of Relationships

Conquering Big Data with Apache Spark

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Architectures for massive data management

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Introduc8on to Apache Spark

LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.

Large-Scale Data Processing

HIGH PERFORMANCE BIG DATA ANALYTICS

MLlib: Scalable Machine Learning on Spark

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

DATA ANALYSIS II. Matrix Algorithms

Beyond Hadoop with Apache Spark and BDAS

Spark: Cluster Computing with Working Sets

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Ali Ghodsi Head of PM and Engineering Databricks

What s next for the Berkeley Data Analytics Stack?

Graph Database Proof of Concept Report

Distributed R for Big Data

Spark Application Carousel. Spark Summit East 2015

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Practical Graph Mining with R. 5. Link Analysis

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader

Big Data and Scripting Systems beyond Hadoop

Unified Big Data Analytics Pipeline. 连 城

Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis

Large Scale Graph Processing with Apache Giraph

Big Data looks Tiny from the Stratosphere

Big Data Analytics with Cassandra, Spark & MLLib

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

A System for Distributed Graph-Parallel Machine Learning

Part 1: Link Analysis & Page Rank

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Graph Processing and Social Networks

How Location Intelligence drives value in next-generation customer loyalty programs in Retail business

Research Statement Joseph E. Gonzalez

Scaling Out With Apache Spark. DTL Meeting Slides based on

Collaborative Filtering Scalable Data Analysis Algorithms Claudia Lehmann, Andrina Mascher

Part 2: Community Detection

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Big Data and Scripting Systems build on top of Hadoop

Presto/Blockus: Towards Scalable R Data Analysis

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Big Data Analytics Hadoop and Spark

From GWS to MapReduce: Google s Cloud Technology in the Early Days

DduP Towards a Deduplication Framework utilising Apache Spark

Apache Flink Next-gen data analysis. Kostas

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

The Need for Training in Big Data: Experiences and Case Studies

Apache Flink. Fast and Reliable Large-Scale Data Processing

Big Data Processing. Patrick Wendell Databricks

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Similarity Search in a Very Large Scale Using Hadoop and HBase

Beyond Hadoop MapReduce Apache Tez and Apache Spark

MapReduce Approach to Collective Classification for Networks

Collaborative Filtering. Radek Pelánek

Using Apache Spark Pat McDonough - Databricks

Data Science in the Wild

Spark and Shark: High-speed In-memory Analytics over Hadoop Data

SCALABLE GRAPH ANALYTICS WITH GRADOOP AND BIIIG

Conquering Big Data with BDAS (Berkeley Data Analytics)

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)

Social Media Mining. Network Measures

Hadoop Ecosystem B Y R A H I M A.

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Estimating PageRank Values of Wikipedia Articles using MapReduce

Transcription:

GraphX: Unifying Data-Parallel and Graph-Parallel nalytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel rankshaw, nkur Dave, Michael Franklin, and Ion Stoica Strata 2014 *These slides are best viewed in PowerPoint with animation.

Graphs are entral to nalytics Hyperlinks PageRank Top 20 Pages Raw Wikipedia Text Table Title PR < / >! < / >! < / >! XML! Title Body Term-Doc Graph Topic Model (LD) Word Topics Word Topic Discussion Table Editor Graph ommunity Detection User ommunity ommunity Topic User Disc. User om. Topic om.

PageRank: Identifying Leaders R[i] =0.15 + Rank of user i X j2nbrs(i) w ji R[j] Weighted sum of neighbors ranks Update ranks in parallel Iterate until convergence 3

The Graph-Parallel Pattern Model / lg. State omputation depends only on the neighbors 4

Many Graph-Parallel lgorithms ollaborative Filtering lternating Least Squares Stochastic Gradient Descent Tensor Factorization Structured Prediction Loopy Belief Propagation Max-Product Linear Programs Gibbs Sampling Semi-supervised ML Graph SSL oem ommunity Detection Triangle-ounting K-core Decomposition K-Truss Graph nalytics PageRank Personalized PageRank Shortest Path Graph oloring lassification Neural Networks 5

Graph-Parallel Systems oogle Expose specialized PIs to simplify graph programming. Exploit graph structure to achieve orders-ofmagnitude performance gains over more general data-parallel systems. 6

PageRank on the Live-Journal Graph Mahout/Hadoop 1340 Naïve Spark 354 GraphLab 22 0 200 400 600 800 1000 1200 1400 1600 Runtime (in seconds, PageRank for 10 iterations) GraphLab is 60x faster than Hadoop GraphLab is 16x faster than Spark

Graphs are entral to nalytics Hyperlinks PageRank Top 20 Pages Raw Wikipedia Text Table Title PR < / >! < / >! < / >! XML! Title Body Term-Doc Graph Topic Model (LD) Word Topics Word Topic Discussion Table Editor Graph ommunity Detection User ommunity ommunity Topic User Disc. User om. Topic om.

Separate Systems to Support Each View 6. Before Table View Graph View 7. fter 8. fter Table Row Dependency Graph Row Row Result Row

Having separate systems for each view is difficult to use and inefficient 10

Difficult to Program and Use Users must Learn, Deploy, and Manage multiple systems Leads to brittle and often complex interfaces 11

Inefficient Extensive data movement and duplication across the network and file system < / >! < / >! < / >! XML! HDFS HDFS HDFS HDFS Limited reuse internal data-structures across stages 12

Solution: The GraphX Unified pproach New PI Blurs the distinction between Tables and Graphs New System ombines Data-Parallel Graph-Parallel Systems Enabling users to easily and efficiently express the entire graph analytics pipeline

Tables and Graphs are composable views of the same physical data Table View GraphX Unified Representation Graph View Each view has its own operators that exploit the semantics of the view to achieve efficient execution

View a Graph as a Table Property Graph R F Vertex Property Table Id Property (V) Rxin (Stu., Berk.) Jegonzal (PstDoc, Berk.) Franklin (Prof., Berk) Istoica (Prof., Berk) J I Edge Property Table SrcId DstId Property (E) rxin jegonzal Friend franklin rxin dvisor istoica franklin oworker franklin jegonzal PI

Table Operators Table (RDD) operators are inherited from Spark: map filter groupby sort union join leftouterjoin rightouterjoin reduce count fold reducebykey groupbykey cogroup cross zip sample take first partitionby mapwith pipe save... 16

Graph Operators class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ]) // Table Views ----------------- def vertices: Table[ (Id, V) ] def edges: Table[ (Id, Id, E) ] def triplets: Table [ ((Id, V), (Id, V), E) ] // Transformations ------------------------------ def reverse: Graph[V, E] def subgraph(pv: (Id, V) => Boolean, pe: Edge[V,E] => Boolean): Graph[V,E] def mapv(m: (Id, V) => T ): Graph[T,E] def mape(m: Edge[V,E] => T ): Graph[V,T] // Joins ---------------------------------------- def joinv(tbl: Table [(Id, T)]): Graph[(V, T), E ] def joine(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)] // omputation ---------------------------------- def mrtriplets(mapf: (Edge[V,E]) => List[(Id, T)], reducef: (T, T) => T): Graph[T, E] } 17

Triplets Join Vertices and Edges The triplets operator joins vertices and edges: Vertices Triplets Edges B B D B D B B D The mrtriplets operator sums adjacent triplets. SELET t.dstid, reduceudf( mapudf(t) ) S sum FROM triplets S t GROUPBY t.dstid

Map Reduce Triplets Map-Reduce for each vertex B mapf( B ) 1 mapf( ) 2 D E reducef( 1, 2 ) F 19

Example: Oldest Follower What is the age of the oldest follower for each user val oldestfollowerge = graph.mrtriplets( e=> (e.dst.id, e.src.age),//map (a,b)=> max(a, b) //Reduce ).vertices 23 42 B D 30 E 19 75 F 16 20

We express the Pregel and GraphLab abstractions using the GraphX operators in less than 50 lines of code! By composing these operators we can construct entire graph-analytics pipelines. 21

DIY Demo this fternoon

GraphX System Design

Distributed Graphs as Tables (RDDs) Property Graph B Part. 1 Vertex Table (RDD) Routing Table (RDD) 1 2 Edge Table (RDD) B B B 1 B D 2D Vertex ut Heuristic D D 1 1 2 D E F E E E 2 E F D Part. 2 F F 2 E F

aching for Iterative mrtriplets Vertex Table (RDD) B D E F Mirror ache B D Mirror ache D E F Edge Table (RDD) B E E B D E F D F

Incremental Updates for Iterative mrtriplets hange Vertex Table (RDD) B Mirror ache B Edge Table (RDD) B B D D hange D E Mirror ache D E Scan E E F D F F E F

ggregation for Iterative mrtriplets hange hange Vertex Table (RDD) B Local ggregate Mirror ache B Edge Table (RDD) B B hange D D hange hange D E Local ggregate Mirror ache D E Scan E E F D hange F F E F

Reduction in ommunication Due to ached Updates Network omm. (MB) 10000 1000 100 10 1 0.1 onnected omponents on Twitter Graph Most vertices are within 8 hops of all vertices in their comp. 0 2 4 6 8 10 12 14 16 Iteration

Benefit of Indexing ctive Edges Runtime (Seconds) 30 25 20 15 10 5 0 onnected omponents on Twitter Graph Scan Indexed Scan ll Edges Index of ctive Edges 0 2 4 6 8 10 12 14 16 Iteration

Join Elimination Identify and bypass joins for unused triplets fields» Example: PageRank only accesses source attribute ommunication (MB) 14000 12000 10000 8000 6000 4000 2000 0 PageRank on Twitter Three Way Join Join Elimination Factor of 2 reduction in communication 0 5 10 15 20 Iteration 30

dditional Query Optimizations Indexing and Bitmaps:» To accelerate joins across graphs» To efficiently construct sub-graphs Substantial Index and Data Reuse:» Reuse routing tables across graphs and sub-graphs» Reuse edge adjacency information and indices 31

Performance omparisons Live-Journal: 69 Million Edges Mahout/Hadoop Naïve Spark Giraph GraphX GraphLab 22 68 207 354 1340 0 200 400 600 800 1000 1200 1400 1600 Runtime (in seconds, PageRank for 10 iterations) GraphX is roughly 3x slower than GraphLab

GraphX scales to larger graphs Twitter Graph: 1.5 Billion Edges Giraph 749 GraphX 451 GraphLab 203 0 200 400 600 800 Runtime (in seconds, PageRank for 10 iterations) GraphX is roughly 2x slower than GraphLab» Scala + Java overhead: Lambdas, G time,» No shared memory parallelism: 2x increase in comm.

PageRank is just one stage. What about a pipeline

Small Pipeline in GraphX Raw Wikipedia Hyperlinks PageRank Top 20 Pages < / >! < / >! < / >! XML! HDFS HDFS Spark Preprocess Spark Giraph + Spark GraphX GraphLab + Spark 342 375 605 ompute Spark Post. 1492 0 200 400 600 800 1000 1200 1400 1600 Total Runtime (in Seconds) Timed end-to-end GraphX is faster than GraphLab

The GraphX Stack (Lines of ode) PageRank (5) onnected Shortest SVD LS K-core omp. (10) Path (10) (40) (40) (51) Triangle LD ount (120) (45) Pregel (28) + GraphLab (50) GraphX (3575) Spark

Status lpha release as part of Spark 0.9 Seeking collaborators and feedback

onclusion and Observations Domain specific views: Tables and Graphs» tables and graphs are first-class composable objects» specialized operators which exploit view semantics Single system that efficiently spans the pipeline» minimize data movement and duplication» eliminates need to learn and manage multiple systems Graphs through the lens of database systems» Graph-Parallel Pattern à Triplet joins in relational alg.» Graph Systems à Distributed join optimizations 38

ctive Research Static Data à Dynamic Data» pply GraphX unified approach to time evolving data» Model and analyze relationships over time Serving Graph Structured Data» llow external systems to interact with GraphX» Unify distributed graph databases with relational database technology 39

Thanks! http://amplab.github.io/graphx/ ankurd@eecs.berkeley.edu crankshaw@eecs.berkeley.edu rxin@eecs.berkeley.edu jegonzal@eecs.berkeley.edu

Graph Property 1 Real-World Graphs Number of Vertices count 10 10 10 8 10 6 10 4 10 2 Power-Law Degree Distribution ltavista WebGraph1.4B Vertices, 6.6B Edges More than 10 8 vertices have one neighbor. Top 1% of vertices are adjacent to 50% of the edges! 10 0 10 0 10 2 10 4 10 6 10 8 degree Degree Ratio of Edges to Vertices Edges >> Vertices Facebook 200 180 160 140 120 100 80 60 40 20 0 2008 2009 2010 2011 2012 Year 41

Graph Property 2 Num-Vertices 100000000 10000000 1000000 100000 10000 1000 100 10 1 ctive Vertices PageRank on Web Graph 51% updated only once! 0 10 20 30 40 50 60 70 Number of Updates

Graphs are Essential to Data Mining and Machine Learning Identify influential people and information Find communities Understand people s shared interests Model complex data dependencies

Recommending Products Users Ratings Items

Recommending Products Low-Rank Matrix Factorization: f(3) Users Netflix Movies Users x f(i) Movies f(j) User Factors (U) f(1) f(2) r 13 r 14 r 24 r 25 f(4) f(5) Movie Factors (M) Iterate: f[i] = arg min w2r d X j2nbrs(i) r ij w T f[j] 2 + w 2 2 45

Predicting User Behavior Liberal onservative Post Post Post Post Post Post Post Post Post Post Post Post Post Post Post onditional Random Field! Belief Propagation! Post 46

Finding ommunities ount triangles passing through each vertex: " 1 2 3 4 Measures cohesiveness of local community Fewer Triangles Weaker ommunity More Triangles Stronger ommunity

Example Graph nalytics Pipeline Preprocessing ompute Post Proc. < / >! < / >! < / >! XML! Raw Data ETL Slice ompute Initial Graph Subgraph PageRank nalyze Top Users Repeat 48