GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

Transcription

1 GraphX: Unifying Data-Parallel and Graph-Parallel nalytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel rankshaw, nkur Dave, Michael Franklin, and Ion Stoica Strata 2014 *These slides are best viewed in PowerPoint with animation.

2 Graphs are entral to nalytics Hyperlinks PageRank Top 20 Pages Raw Wikipedia Text Table Title PR < / >! < / >! < / >! XML! Title Body Term-Doc Graph Topic Model (LD) Word Topics Word Topic Discussion Table Editor Graph ommunity Detection User ommunity ommunity Topic User Disc. User om. Topic om.

3 PageRank: Identifying Leaders R[i] = Rank of user i X j2nbrs(i) w ji R[j] Weighted sum of neighbors ranks Update ranks in parallel Iterate until convergence 3

4 The Graph-Parallel Pattern Model / lg. State omputation depends only on the neighbors 4

5 Many Graph-Parallel lgorithms ollaborative Filtering lternating Least Squares Stochastic Gradient Descent Tensor Factorization Structured Prediction Loopy Belief Propagation Max-Product Linear Programs Gibbs Sampling Semi-supervised ML Graph SSL oem ommunity Detection Triangle-ounting K-core Decomposition K-Truss Graph nalytics PageRank Personalized PageRank Shortest Path Graph oloring lassification Neural Networks 5

6 Graph-Parallel Systems oogle Expose specialized PIs to simplify graph programming. Exploit graph structure to achieve orders-ofmagnitude performance gains over more general data-parallel systems. 6

7 PageRank on the Live-Journal Graph Mahout/Hadoop 1340 Naïve Spark 354 GraphLab Runtime (in seconds, PageRank for 10 iterations) GraphLab is 60x faster than Hadoop GraphLab is 16x faster than Spark

8 Graphs are entral to nalytics Hyperlinks PageRank Top 20 Pages Raw Wikipedia Text Table Title PR < / >! < / >! < / >! XML! Title Body Term-Doc Graph Topic Model (LD) Word Topics Word Topic Discussion Table Editor Graph ommunity Detection User ommunity ommunity Topic User Disc. User om. Topic om.

9 Separate Systems to Support Each View 6. Before Table View Graph View 7. fter 8. fter Table Row Dependency Graph Row Row Result Row

10 Having separate systems for each view is difficult to use and inefficient 10

11 Difficult to Program and Use Users must Learn, Deploy, and Manage multiple systems Leads to brittle and often complex interfaces 11

12 Inefficient Extensive data movement and duplication across the network and file system < / >! < / >! < / >! XML! HDFS HDFS HDFS HDFS Limited reuse internal data-structures across stages 12

13 Solution: The GraphX Unified pproach New PI Blurs the distinction between Tables and Graphs New System ombines Data-Parallel Graph-Parallel Systems Enabling users to easily and efficiently express the entire graph analytics pipeline

14 Tables and Graphs are composable views of the same physical data Table View GraphX Unified Representation Graph View Each view has its own operators that exploit the semantics of the view to achieve efficient execution

15 View a Graph as a Table Property Graph R F Vertex Property Table Id Property (V) Rxin (Stu., Berk.) Jegonzal (PstDoc, Berk.) Franklin (Prof., Berk) Istoica (Prof., Berk) J I Edge Property Table SrcId DstId Property (E) rxin jegonzal Friend franklin rxin dvisor istoica franklin oworker franklin jegonzal PI

16 Table Operators Table (RDD) operators are inherited from Spark: map filter groupby sort union join leftouterjoin rightouterjoin reduce count fold reducebykey groupbykey cogroup cross zip sample take first partitionby mapwith pipe save... 16

17 Graph Operators class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ]) // Table Views def vertices: Table[ (Id, V) ] def edges: Table[ (Id, Id, E) ] def triplets: Table [ ((Id, V), (Id, V), E) ] // Transformations def reverse: Graph[V, E] def subgraph(pv: (Id, V) => Boolean, pe: Edge[V,E] => Boolean): Graph[V,E] def mapv(m: (Id, V) => T ): Graph[T,E] def mape(m: Edge[V,E] => T ): Graph[V,T] // Joins def joinv(tbl: Table [(Id, T)]): Graph[(V, T), E ] def joine(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)] // omputation def mrtriplets(mapf: (Edge[V,E]) => List[(Id, T)], reducef: (T, T) => T): Graph[T, E] } 17

18 Triplets Join Vertices and Edges The triplets operator joins vertices and edges: Vertices Triplets Edges B B D B D B B D The mrtriplets operator sums adjacent triplets. SELET t.dstid, reduceudf( mapudf(t) ) S sum FROM triplets S t GROUPBY t.dstid

19 Map Reduce Triplets Map-Reduce for each vertex B mapf( B ) 1 mapf( ) 2 D E reducef( 1, 2 ) F 19

20 Example: Oldest Follower What is the age of the oldest follower for each user val oldestfollowerge = graph.mrtriplets( e=> (e.dst.id, e.src.age),//map (a,b)=> max(a, b) //Reduce ).vertices B D 30 E F 16 20

21 We express the Pregel and GraphLab abstractions using the GraphX operators in less than 50 lines of code! By composing these operators we can construct entire graph-analytics pipelines. 21

22 DIY Demo this fternoon

23 GraphX System Design

24 Distributed Graphs as Tables (RDDs) Property Graph B Part. 1 Vertex Table (RDD) Routing Table (RDD) 1 2 Edge Table (RDD) B B B 1 B D 2D Vertex ut Heuristic D D D E F E E E 2 E F D Part. 2 F F 2 E F

25 aching for Iterative mrtriplets Vertex Table (RDD) B D E F Mirror ache B D Mirror ache D E F Edge Table (RDD) B E E B D E F D F

26 Incremental Updates for Iterative mrtriplets hange Vertex Table (RDD) B Mirror ache B Edge Table (RDD) B B D D hange D E Mirror ache D E Scan E E F D F F E F

27 ggregation for Iterative mrtriplets hange hange Vertex Table (RDD) B Local ggregate Mirror ache B Edge Table (RDD) B B hange D D hange hange D E Local ggregate Mirror ache D E Scan E E F D hange F F E F

28 Reduction in ommunication Due to ached Updates Network omm. (MB) onnected omponents on Twitter Graph Most vertices are within 8 hops of all vertices in their comp Iteration

29 Benefit of Indexing ctive Edges Runtime (Seconds) onnected omponents on Twitter Graph Scan Indexed Scan ll Edges Index of ctive Edges Iteration

30 Join Elimination Identify and bypass joins for unused triplets fields» Example: PageRank only accesses source attribute ommunication (MB) PageRank on Twitter Three Way Join Join Elimination Factor of 2 reduction in communication Iteration 30

31 dditional Query Optimizations Indexing and Bitmaps:» To accelerate joins across graphs» To efficiently construct sub-graphs Substantial Index and Data Reuse:» Reuse routing tables across graphs and sub-graphs» Reuse edge adjacency information and indices 31

32 Performance omparisons Live-Journal: 69 Million Edges Mahout/Hadoop Naïve Spark Giraph GraphX GraphLab Runtime (in seconds, PageRank for 10 iterations) GraphX is roughly 3x slower than GraphLab

33 GraphX scales to larger graphs Twitter Graph: 1.5 Billion Edges Giraph 749 GraphX 451 GraphLab Runtime (in seconds, PageRank for 10 iterations) GraphX is roughly 2x slower than GraphLab» Scala + Java overhead: Lambdas, G time,» No shared memory parallelism: 2x increase in comm.

34 PageRank is just one stage. What about a pipeline

35 Small Pipeline in GraphX Raw Wikipedia Hyperlinks PageRank Top 20 Pages < / >! < / >! < / >! XML! HDFS HDFS Spark Preprocess Spark Giraph + Spark GraphX GraphLab + Spark ompute Spark Post Total Runtime (in Seconds) Timed end-to-end GraphX is faster than GraphLab

36 The GraphX Stack (Lines of ode) PageRank (5) onnected Shortest SVD LS K-core omp. (10) Path (10) (40) (40) (51) Triangle LD ount (120) (45) Pregel (28) + GraphLab (50) GraphX (3575) Spark

37 Status lpha release as part of Spark 0.9 Seeking collaborators and feedback

38 onclusion and Observations Domain specific views: Tables and Graphs» tables and graphs are first-class composable objects» specialized operators which exploit view semantics Single system that efficiently spans the pipeline» minimize data movement and duplication» eliminates need to learn and manage multiple systems Graphs through the lens of database systems» Graph-Parallel Pattern à Triplet joins in relational alg.» Graph Systems à Distributed join optimizations 38

39 ctive Research Static Data à Dynamic Data» pply GraphX unified approach to time evolving data» Model and analyze relationships over time Serving Graph Structured Data» llow external systems to interact with GraphX» Unify distributed graph databases with relational database technology 39

40 Thanks!

41 Graph Property 1 Real-World Graphs Number of Vertices count Power-Law Degree Distribution ltavista WebGraph1.4B Vertices, 6.6B Edges More than 10 8 vertices have one neighbor. Top 1% of vertices are adjacent to 50% of the edges! degree Degree Ratio of Edges to Vertices Edges >> Vertices Facebook Year 41

42 Graph Property 2 Num-Vertices ctive Vertices PageRank on Web Graph 51% updated only once! Number of Updates

43 Graphs are Essential to Data Mining and Machine Learning Identify influential people and information Find communities Understand people s shared interests Model complex data dependencies

44 Recommending Products Users Ratings Items

45 Recommending Products Low-Rank Matrix Factorization: f(3) Users Netflix Movies Users x f(i) Movies f(j) User Factors (U) f(1) f(2) r 13 r 14 r 24 r 25 f(4) f(5) Movie Factors (M) Iterate: f[i] = arg min w2r d X j2nbrs(i) r ij w T f[j] 2 + w

46 Predicting User Behavior Liberal onservative Post Post Post Post Post Post Post Post Post Post Post Post Post Post Post onditional Random Field! Belief Propagation! Post 46

47 Finding ommunities ount triangles passing through each vertex: " Measures cohesiveness of local community Fewer Triangles Weaker ommunity More Triangles Stronger ommunity

48 Example Graph nalytics Pipeline Preprocessing ompute Post Proc. < / >! < / >! < / >! XML! Raw Data ETL Slice ompute Initial Graph Subgraph PageRank nalyze Top Users Repeat 48