GraphX: Unifying Data-Parallel and Graph-Parallel nalytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel rankshaw, nkur Dave, Michael Franklin, and Ion Stoica Strata 2014 *These slides are best viewed in PowerPoint with animation.
Graphs are entral to nalytics Hyperlinks PageRank Top 20 Pages Raw Wikipedia Text Table Title PR < / >! < / >! < / >! XML! Title Body Term-Doc Graph Topic Model (LD) Word Topics Word Topic Discussion Table Editor Graph ommunity Detection User ommunity ommunity Topic User Disc. User om. Topic om.
PageRank: Identifying Leaders R[i] =0.15 + Rank of user i X j2nbrs(i) w ji R[j] Weighted sum of neighbors ranks Update ranks in parallel Iterate until convergence 3
The Graph-Parallel Pattern Model / lg. State omputation depends only on the neighbors 4
Many Graph-Parallel lgorithms ollaborative Filtering lternating Least Squares Stochastic Gradient Descent Tensor Factorization Structured Prediction Loopy Belief Propagation Max-Product Linear Programs Gibbs Sampling Semi-supervised ML Graph SSL oem ommunity Detection Triangle-ounting K-core Decomposition K-Truss Graph nalytics PageRank Personalized PageRank Shortest Path Graph oloring lassification Neural Networks 5
Graph-Parallel Systems oogle Expose specialized PIs to simplify graph programming. Exploit graph structure to achieve orders-ofmagnitude performance gains over more general data-parallel systems. 6
PageRank on the Live-Journal Graph Mahout/Hadoop 1340 Naïve Spark 354 GraphLab 22 0 200 400 600 800 1000 1200 1400 1600 Runtime (in seconds, PageRank for 10 iterations) GraphLab is 60x faster than Hadoop GraphLab is 16x faster than Spark
Graphs are entral to nalytics Hyperlinks PageRank Top 20 Pages Raw Wikipedia Text Table Title PR < / >! < / >! < / >! XML! Title Body Term-Doc Graph Topic Model (LD) Word Topics Word Topic Discussion Table Editor Graph ommunity Detection User ommunity ommunity Topic User Disc. User om. Topic om.
Separate Systems to Support Each View 6. Before Table View Graph View 7. fter 8. fter Table Row Dependency Graph Row Row Result Row
Having separate systems for each view is difficult to use and inefficient 10
Difficult to Program and Use Users must Learn, Deploy, and Manage multiple systems Leads to brittle and often complex interfaces 11
Inefficient Extensive data movement and duplication across the network and file system < / >! < / >! < / >! XML! HDFS HDFS HDFS HDFS Limited reuse internal data-structures across stages 12
Solution: The GraphX Unified pproach New PI Blurs the distinction between Tables and Graphs New System ombines Data-Parallel Graph-Parallel Systems Enabling users to easily and efficiently express the entire graph analytics pipeline
Tables and Graphs are composable views of the same physical data Table View GraphX Unified Representation Graph View Each view has its own operators that exploit the semantics of the view to achieve efficient execution
View a Graph as a Table Property Graph R F Vertex Property Table Id Property (V) Rxin (Stu., Berk.) Jegonzal (PstDoc, Berk.) Franklin (Prof., Berk) Istoica (Prof., Berk) J I Edge Property Table SrcId DstId Property (E) rxin jegonzal Friend franklin rxin dvisor istoica franklin oworker franklin jegonzal PI
Table Operators Table (RDD) operators are inherited from Spark: map filter groupby sort union join leftouterjoin rightouterjoin reduce count fold reducebykey groupbykey cogroup cross zip sample take first partitionby mapwith pipe save... 16
Graph Operators class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ]) // Table Views ----------------- def vertices: Table[ (Id, V) ] def edges: Table[ (Id, Id, E) ] def triplets: Table [ ((Id, V), (Id, V), E) ] // Transformations ------------------------------ def reverse: Graph[V, E] def subgraph(pv: (Id, V) => Boolean, pe: Edge[V,E] => Boolean): Graph[V,E] def mapv(m: (Id, V) => T ): Graph[T,E] def mape(m: Edge[V,E] => T ): Graph[V,T] // Joins ---------------------------------------- def joinv(tbl: Table [(Id, T)]): Graph[(V, T), E ] def joine(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)] // omputation ---------------------------------- def mrtriplets(mapf: (Edge[V,E]) => List[(Id, T)], reducef: (T, T) => T): Graph[T, E] } 17
Triplets Join Vertices and Edges The triplets operator joins vertices and edges: Vertices Triplets Edges B B D B D B B D The mrtriplets operator sums adjacent triplets. SELET t.dstid, reduceudf( mapudf(t) ) S sum FROM triplets S t GROUPBY t.dstid
Map Reduce Triplets Map-Reduce for each vertex B mapf( B ) 1 mapf( ) 2 D E reducef( 1, 2 ) F 19
Example: Oldest Follower What is the age of the oldest follower for each user val oldestfollowerge = graph.mrtriplets( e=> (e.dst.id, e.src.age),//map (a,b)=> max(a, b) //Reduce ).vertices 23 42 B D 30 E 19 75 F 16 20
We express the Pregel and GraphLab abstractions using the GraphX operators in less than 50 lines of code! By composing these operators we can construct entire graph-analytics pipelines. 21
DIY Demo this fternoon
GraphX System Design
Distributed Graphs as Tables (RDDs) Property Graph B Part. 1 Vertex Table (RDD) Routing Table (RDD) 1 2 Edge Table (RDD) B B B 1 B D 2D Vertex ut Heuristic D D 1 1 2 D E F E E E 2 E F D Part. 2 F F 2 E F
aching for Iterative mrtriplets Vertex Table (RDD) B D E F Mirror ache B D Mirror ache D E F Edge Table (RDD) B E E B D E F D F
Incremental Updates for Iterative mrtriplets hange Vertex Table (RDD) B Mirror ache B Edge Table (RDD) B B D D hange D E Mirror ache D E Scan E E F D F F E F
ggregation for Iterative mrtriplets hange hange Vertex Table (RDD) B Local ggregate Mirror ache B Edge Table (RDD) B B hange D D hange hange D E Local ggregate Mirror ache D E Scan E E F D hange F F E F
Reduction in ommunication Due to ached Updates Network omm. (MB) 10000 1000 100 10 1 0.1 onnected omponents on Twitter Graph Most vertices are within 8 hops of all vertices in their comp. 0 2 4 6 8 10 12 14 16 Iteration
Benefit of Indexing ctive Edges Runtime (Seconds) 30 25 20 15 10 5 0 onnected omponents on Twitter Graph Scan Indexed Scan ll Edges Index of ctive Edges 0 2 4 6 8 10 12 14 16 Iteration
Join Elimination Identify and bypass joins for unused triplets fields» Example: PageRank only accesses source attribute ommunication (MB) 14000 12000 10000 8000 6000 4000 2000 0 PageRank on Twitter Three Way Join Join Elimination Factor of 2 reduction in communication 0 5 10 15 20 Iteration 30
dditional Query Optimizations Indexing and Bitmaps:» To accelerate joins across graphs» To efficiently construct sub-graphs Substantial Index and Data Reuse:» Reuse routing tables across graphs and sub-graphs» Reuse edge adjacency information and indices 31
Performance omparisons Live-Journal: 69 Million Edges Mahout/Hadoop Naïve Spark Giraph GraphX GraphLab 22 68 207 354 1340 0 200 400 600 800 1000 1200 1400 1600 Runtime (in seconds, PageRank for 10 iterations) GraphX is roughly 3x slower than GraphLab
GraphX scales to larger graphs Twitter Graph: 1.5 Billion Edges Giraph 749 GraphX 451 GraphLab 203 0 200 400 600 800 Runtime (in seconds, PageRank for 10 iterations) GraphX is roughly 2x slower than GraphLab» Scala + Java overhead: Lambdas, G time,» No shared memory parallelism: 2x increase in comm.
PageRank is just one stage. What about a pipeline
Small Pipeline in GraphX Raw Wikipedia Hyperlinks PageRank Top 20 Pages < / >! < / >! < / >! XML! HDFS HDFS Spark Preprocess Spark Giraph + Spark GraphX GraphLab + Spark 342 375 605 ompute Spark Post. 1492 0 200 400 600 800 1000 1200 1400 1600 Total Runtime (in Seconds) Timed end-to-end GraphX is faster than GraphLab
The GraphX Stack (Lines of ode) PageRank (5) onnected Shortest SVD LS K-core omp. (10) Path (10) (40) (40) (51) Triangle LD ount (120) (45) Pregel (28) + GraphLab (50) GraphX (3575) Spark
Status lpha release as part of Spark 0.9 Seeking collaborators and feedback
onclusion and Observations Domain specific views: Tables and Graphs» tables and graphs are first-class composable objects» specialized operators which exploit view semantics Single system that efficiently spans the pipeline» minimize data movement and duplication» eliminates need to learn and manage multiple systems Graphs through the lens of database systems» Graph-Parallel Pattern à Triplet joins in relational alg.» Graph Systems à Distributed join optimizations 38
ctive Research Static Data à Dynamic Data» pply GraphX unified approach to time evolving data» Model and analyze relationships over time Serving Graph Structured Data» llow external systems to interact with GraphX» Unify distributed graph databases with relational database technology 39
Thanks! http://amplab.github.io/graphx/ ankurd@eecs.berkeley.edu crankshaw@eecs.berkeley.edu rxin@eecs.berkeley.edu jegonzal@eecs.berkeley.edu
Graph Property 1 Real-World Graphs Number of Vertices count 10 10 10 8 10 6 10 4 10 2 Power-Law Degree Distribution ltavista WebGraph1.4B Vertices, 6.6B Edges More than 10 8 vertices have one neighbor. Top 1% of vertices are adjacent to 50% of the edges! 10 0 10 0 10 2 10 4 10 6 10 8 degree Degree Ratio of Edges to Vertices Edges >> Vertices Facebook 200 180 160 140 120 100 80 60 40 20 0 2008 2009 2010 2011 2012 Year 41
Graph Property 2 Num-Vertices 100000000 10000000 1000000 100000 10000 1000 100 10 1 ctive Vertices PageRank on Web Graph 51% updated only once! 0 10 20 30 40 50 60 70 Number of Updates
Graphs are Essential to Data Mining and Machine Learning Identify influential people and information Find communities Understand people s shared interests Model complex data dependencies
Recommending Products Users Ratings Items
Recommending Products Low-Rank Matrix Factorization: f(3) Users Netflix Movies Users x f(i) Movies f(j) User Factors (U) f(1) f(2) r 13 r 14 r 24 r 25 f(4) f(5) Movie Factors (M) Iterate: f[i] = arg min w2r d X j2nbrs(i) r ij w T f[j] 2 + w 2 2 45
Predicting User Behavior Liberal onservative Post Post Post Post Post Post Post Post Post Post Post Post Post Post Post onditional Random Field! Belief Propagation! Post 46
Finding ommunities ount triangles passing through each vertex: " 1 2 3 4 Measures cohesiveness of local community Fewer Triangles Weaker ommunity More Triangles Stronger ommunity
Example Graph nalytics Pipeline Preprocessing ompute Post Proc. < / >! < / >! < / >! XML! Raw Data ETL Slice ompute Initial Graph Subgraph PageRank nalyze Top Users Repeat 48