A System for Distributed Graph-Parallel Machine Learning. The Team: Joseph Gonzalez (Postdoc, UC Berkeley AMPLab; PhD @ CMU), Yucheng Low, Haijie Gu, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Alex Smola, Guy Blelloch
About me: Scalable Machine Learning + Big Graphical Models (BigLearning)
Graphs are Essential to Data Mining and Machine Learning: identify influential people and information, find communities, target ads and products, model complex data dependencies.
Example: Estimate Political Bias. (Figure: a social graph of posts, a few labeled Liberal or Conservative and the rest unknown; the labels are inferred with Loopy Belief Propagation on a Conditional Random Field.)
Collaborative Filtering: Exploiting Dependencies. (Figure: given that a user liked Women on the Verge of a Nervous Breakdown and The Celebration, what do I recommend? City of God, Wild Strawberries, La Dolce Vita.)
Matrix Factorization with Alternating Least Squares (ALS). (Figure: the Netflix users-by-movies rating matrix is factored into user factors U and movie factors M; each observed rating r_ij ties the factor f(i) for user i to the factor f(j) for movie j, and the two factor sets are updated by iterating.)
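A minimal sketch of the alternating update (standard regularized ALS; the regularization weight \lambda is an assumption, not shown on the slide): holding the movie factors fixed, each user factor f(i) is refit by least squares over that user's observed ratings, and symmetrically for the movie factors:

f(i) \;\leftarrow\; \Big(\sum_{j \in \mathrm{rated}(i)} f(j)\, f(j)^{\top} + \lambda I\Big)^{-1} \sum_{j \in \mathrm{rated}(i)} r_{ij}\, f(j)

Alternating these two sweeps until the factors stop changing is the "Iterate" step on the slide.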
PageRank: the rank of user i is a weighted sum of the neighbors' ranks. Everyone starts with equal rank; ranks are updated in parallel and iterated until convergence.
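Written out (matching the vertex-program pseudocode later in this deck, where the w_{ji} are edge weights and 0.15 plays the role of the reset probability):

R_i \;=\; 0.15 + \sum_{j \in N(i)} w_{ji}\, R_j

Every vertex starts with the same rank and repeats this update until the ranks stop changing.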
How should we program graph-parallel algorithms? "Low-level tools like MPI and Pthreads?" - Me, during my first years of grad school
Threads, Locks, and MPI: ML experts repeatedly solve the same parallel design challenges: implement and debug a complex parallel system, then tune it for a single parallel platform. Six months later the conference paper contains: "We implemented ... in parallel." The resulting code is difficult to maintain and extend, and couples the learning model to the implementation.
How should we program graph-parallel algorithms? High-level abstractions! - Me, now
The Graph-Parallel Abstraction: a user-defined Vertex-Program runs on each vertex, and the graph constrains interaction along edges, either using messages (e.g., Pregel [PODC'09, SIGMOD'10]) or through shared state (e.g., GraphLab [UAI'10, VLDB'12]). "Think like a Vertex." - Malewicz et al. [SIGMOD'10]. Parallelism: run multiple vertex-programs simultaneously.
Which graph-parallel abstraction is better for machine learning? (Figure comparing messaging vs. shared state, and synchronous vs. dynamic asynchronous execution.)
The GraphLab Vertex Program: Vertex-Programs directly access adjacent vertices and edges.

GraphLab_PageRank(i)
  // Compute sum over neighbors
  total = 0
  foreach (j in neighbors(i)):
    total = total + R[j] * w_ji
  // Update the PageRank
  R[i] = 0.15 + total
  // Trigger neighbors to run again
  priority = |R[i] - old_R[i]|
  if R[i] not converged then
    signal neighbors(i) with priority
Benefit of Dynamic PageRank. (Plot: number of vertices vs. number of updates, log scale.) 51% of the vertices were updated only once!
GraphLab Asynchronous Execution: a scheduler determines the order in which vertices are executed and hands them to the CPUs; the scheduler can prioritize vertices.
Asynchronous Belief Propagation: the challenge is the boundaries. (Figure: on a synthetic noisy image, the cumulative vertex updates are many near object boundaries and few elsewhere.) The algorithm identifies and focuses on the hidden sequential structure of the graphical model.
GraphLab Ensures a Serializable Execution. Enables: Gauss-Seidel iterations, Gibbs sampling, graph coloring, ...
Never Ending Learner Project (CoEM) — language modeling: named entity recognition. Hadoop (BSP): 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min — 6x fewer CPUs, 15x faster. Distributed GraphLab: 32 EC2 machines, 80 secs — 0.3% of the Hadoop time.
Thus far, GraphLab provided a powerful new abstraction. But we couldn't scale up to the AltaVista web graph from 2002: 1.4B vertices, 6.6B edges.
Natural Graphs: graphs derived from natural phenomena.
Properties of Natural Graphs: unlike a regular mesh, a natural graph has a power-law degree distribution.
Power-Law Degree Distribution. (Log-log plot of vertex count vs. degree for the AltaVista WebGraph: 1.4B vertices, 6.6B edges.) More than 10^8 vertices have one neighbor, while the top 1% of high-degree vertices are adjacent to 50% of the edges!
Power-Law Degree Distribution: high-degree vertices form star-like motifs (e.g., President Obama and his followers).
Challenges of High-Degree Vertices: edges are processed sequentially; many messages are sent (Pregel); a large fraction of the graph is touched (GraphLab); edge meta-data is too large for a single machine; asynchronous execution requires heavy locking (GraphLab); synchronous execution is prone to stragglers (Pregel).
Graph Partitioning: graph-parallel abstractions rely on partitioning to minimize communication and balance computation and storage. Communication cost is O(# edges cut).
Power-Law Graphs are Difficult to Partition: power-law graphs do not have low-cost balanced cuts [Leskovec et al. '08, Lang '04], and traditional graph-partitioning algorithms perform poorly on them [Abou-Rjeili et al. '06].
Random Partitioning: GraphLab resorts to random (hashed) partitioning on natural graphs, which cuts almost every edge (with p machines, an edge keeps both endpoints on the same machine with probability only 1/p). 10 machines: 90% of edges cut. 100 machines: 99% of edges cut!
PowerGraph: program for this (the original graph), run on this (the distributed graph). Split high-degree vertices across machines; the new abstraction maintains equivalence on split vertices.
A Common Pattern for Vertex-Programs:

GraphLab_PageRank(i)
  // Compute sum over neighbors      <- Gather information about neighborhood
  total = 0
  foreach (j in neighbors(i)):
    total = total + R[j] * w_ji
  // Update the PageRank             <- Update vertex
  R[i] = total
  // Trigger neighbors to run again  <- Signal neighbors & modify edge data
  priority = |R[i] - old_R[i]|
  if R[i] not converged then
    signal neighbors(i) with priority
Formal GraphLab2 Semantics:
  Gather(SrcV, Edge, DstV) -> A : collect information from neighbors
  Sum(A, A) -> A : commutative, associative sum
  Apply(V, A) -> V : update the vertex
  Scatter(SrcV, Edge, DstV) -> (Edge, signal) : update edges and signal neighbors
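A minimal, illustrative C++ rendering of these four signatures (the GASProgram name and template parameters are assumptions for this sketch, not the real GraphLab2 API; the actual ivertex_program code appears on the last slide of this deck):

template <typename VertexData, typename EdgeData, typename Accum>
struct GASProgram {
  // Gather(SrcV, Edge, DstV) -> A : collect information from one neighbor.
  virtual Accum gather(const VertexData& src, const EdgeData& edge,
                       const VertexData& dst) const = 0;
  // Sum(A, A) -> A : combine gathered values; must be commutative and associative.
  virtual Accum sum(const Accum& a, const Accum& b) const = 0;
  // Apply(V, A) -> V : update the central vertex with the combined result.
  virtual void apply(VertexData& vertex, const Accum& total) = 0;
  // Scatter(SrcV, Edge, DstV) -> (Edge, signal) : update the edge and report
  // whether the neighbor should be signaled to run again.
  virtual bool scatter(const VertexData& src, EdgeData& edge,
                       const VertexData& dst) const = 0;
  virtual ~GASProgram() = default;
};

A concrete program such as PageRank only fills in these four functions; the runtime is free to decide where and when each phase executes.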
PageRank in GraphLab2:

GraphLab2_PageRank(i)
  Gather(j -> i)  : return w_ji * R[j]
  Sum(a, b)       : return a + b
  Apply(i, Σ)     : R[i] = 0.15 + Σ
  Scatter(i -> j) : if R[i] changed then trigger j to be recomputed
GAS Decomposition. (Figure: a high-degree vertex Y is split across four machines; in the gather phase each machine computes a partial sum Σ1..Σ4 over its local edges, the partial sums are added at the master copy of Y, apply updates the master, the new value is synced to the mirrors, and scatter runs on each machine's local edges.)
Minimizing Communication in PowerGraph: communication is linear in the number of machines each vertex spans, and a vertex-cut minimizes the number of machines each vertex spans. New Theorem: for any edge-cut we can directly construct a vertex-cut which requires strictly less communication and storage. Percolation theory suggests that power-law graphs have good vertex cuts. [Albert et al. 2000]
Constructing Vertex-Cuts: evenly assign edges to machines while minimizing the number of machines spanned by each vertex; assign each edge as it is loaded, touching each edge only once. We propose two distributed approaches: Random Vertex Cut and Greedy Vertex Cut.
Random Vertex-Cut: randomly assign edges to machines, producing a balanced vertex-cut. (Figure: vertex Y spans 3 machines, vertex Z spans 2 machines, and the low-degree vertices are not cut at all.)
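For reference, a standard calculation (not spelled out on the slide, assuming every edge is assigned independently and uniformly at random to one of p machines) gives the expected number of machines A(v) spanned by a vertex v of degree D[v]:

\mathbb{E}\big[\,|A(v)|\,\big] \;=\; p\left(1 - \left(1 - \tfrac{1}{p}\right)^{D[v]}\right)

Low-degree vertices therefore span very few machines; only the rare high-degree vertices approach p.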
Random Vertex-Cuts vs. Edge-Cuts. (Plot: expected reduction in communication and storage vs. number of machines, log scale.) Vertex-cuts give an order-of-magnitude improvement.
Streaming Greedy Vertex-Cuts: place each edge on a machine that already holds the vertices of that edge, as sketched below.
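A minimal single-threaded sketch of this streaming greedy rule (the GreedyPlacer class, its members, and the tie-breaking by current edge load are illustrative assumptions, not the PowerGraph implementation): prefer a machine both endpoints already touch, then a machine either endpoint touches, otherwise the least-loaded machine.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <set>
#include <unordered_map>
#include <vector>

class GreedyPlacer {
 public:
  explicit GreedyPlacer(std::size_t num_machines) : load_(num_machines, 0) {}

  // Assign one edge (u, v) to a machine as it streams in, and record that
  // both endpoints now span that machine.
  std::size_t place(std::uint64_t u, std::uint64_t v) {
    std::set<std::size_t>& au = assigned_[u];
    std::set<std::size_t>& av = assigned_[v];

    // 1) Prefer machines that already hold *both* endpoints.
    std::vector<std::size_t> candidates;
    std::set_intersection(au.begin(), au.end(), av.begin(), av.end(),
                          std::back_inserter(candidates));
    // 2) Otherwise, machines that hold *either* endpoint.
    if (candidates.empty())
      std::set_union(au.begin(), au.end(), av.begin(), av.end(),
                     std::back_inserter(candidates));
    // 3) Otherwise, any machine (neither endpoint has been seen yet).
    if (candidates.empty())
      for (std::size_t m = 0; m < load_.size(); ++m) candidates.push_back(m);

    // Break ties by current edge load to keep the assignment balanced.
    std::size_t best = candidates.front();
    for (std::size_t m : candidates)
      if (load_[m] < load_[best]) best = m;

    au.insert(best);
    av.insert(best);
    ++load_[best];
    return best;
  }

 private:
  std::vector<std::size_t> load_;  // number of edges assigned to each machine
  std::unordered_map<std::uint64_t, std::set<std::size_t>> assigned_;  // machines spanned by each vertex
};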
Greedy Vertex-Cuts Improve Performance. (Plot: runtime relative to random partitioning for PageRank, Collaborative Filtering, and Shortest Path.) Greedy partitioning improves computation performance.
System Design: the PowerGraph (GraphLab2) system sits on MPI/TCP-IP, PThreads, and HDFS, running on EC2 HPC nodes. It is implemented as a C++ API, uses HDFS for graph input and output, and achieves fault tolerance by checkpointing (snapshot time < 5 seconds for the Twitter network).
Implemented Many Algorithms. Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, SVD, Non-negative MF. Statistical Inference: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling. Graph Analytics: PageRank, Triangle Counting, Shortest Path, Graph Coloring, K-core Decomposition. Computer Vision: image stitching. Language Modeling: LDA.
PageRank on the Twitter Follower Graph: a natural graph with 40M users, 1.4 billion links, on 32 nodes x 8 cores (EC2 HPC cc1.4x). (Plots: total network traffic in GB and runtime in seconds for GraphLab, Pregel (Piccolo), and PowerGraph.) PowerGraph reduces communication and runs faster.
PageRank on the Twitter Follower Graph: a natural graph with 40M users, 1.4 billion links. (Plot: runtime per iteration for Hadoop, Twister, Piccolo, GraphLab, and PowerGraph.) An order-of-magnitude improvement by exploiting the properties of natural graphs. Hadoop results from [Kang et al. '11]; Twister (in-memory MapReduce) from [Ekanayake et al. '10].
GraphLab2 is Scalable. Yahoo AltaVista Web Graph (2002), one of the largest publicly available web graphs: 1.4 billion webpages, 6.6 billion links. 7 seconds per iteration — 1B links processed per second — on 64 HPC nodes (1024 cores, 2048 HT), with 30 lines of user code.
Topic Modeling on English-language Wikipedia: 2.6M documents, 8.3M words, 500M tokens — a computationally intensive algorithm. (Plot: throughput in million tokens per second.) Smola et al.: 100 Yahoo! machines, specifically engineered for this task. PowerGraph: 64 cc2.8xlarge EC2 nodes, 200 lines of code and 4 human hours.
Triangle Counting: for each vertex in the graph, count the number of triangles containing it. This measures both the popularity of the vertex and the cohesiveness of the vertex's community: fewer triangles mean a weaker community, more triangles a stronger community.
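A minimal single-machine sketch of the computation (not the distributed PowerGraph implementation; the function name and the sorted-adjacency-list representation are assumptions): for each edge (u, v), the triangles through that edge are the common neighbors of u and v.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// adj[v] must be sorted, with no duplicates or self-loops, and symmetric
// (if v appears in adj[u] then u appears in adj[v]).
std::vector<std::uint64_t> count_triangles(
    const std::vector<std::vector<std::uint32_t>>& adj) {
  std::vector<std::uint64_t> triangles(adj.size(), 0);
  for (std::uint32_t u = 0; u < adj.size(); ++u) {
    for (std::uint32_t v : adj[u]) {
      if (v <= u) continue;  // visit each undirected edge once (u < v)
      // Common neighbors of u and v close a triangle over edge (u, v).
      std::vector<std::uint32_t> common;
      std::set_intersection(adj[u].begin(), adj[u].end(),
                            adj[v].begin(), adj[v].end(),
                            std::back_inserter(common));
      for (std::uint32_t w : common)
        if (w > v) {  // count each triangle {u < v < w} exactly once
          ++triangles[u]; ++triangles[v]; ++triangles[w];
        }
    }
  }
  return triangles;
}

Each vertex's count then feeds directly into the popularity/cohesiveness interpretation above.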
Triangle Counting on the Twitter Graph: identify individuals with strong communities. Counted 34.8 billion triangles. Hadoop [WWW'11]: 1536 machines, 423 minutes. PowerGraph: 64 machines, 1.5 minutes — 282x faster. Why? The wrong abstraction: broadcasting O(degree^2) messages per vertex. (S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," WWW'11.)
Machine Learning and Data-Mining Toolkits: Graph Analytics, Graphical Models, Computer Vision, Clustering, Topic Modeling, and Collaborative Filtering, all built on the GraphLab2 system (MPI/TCP-IP, PThreads, HDFS, EC2 HPC nodes). http://graphlab.org — Apache 2 License.
GraphChi: Going small with GraphLab Solve huge problems on small or embedded devices? Key: Exploit non-volatile memory (starting with SSDs and HDs)
GraphChi: disk-based GraphLab. A novel Parallel Sliding Windows algorithm gives single-machine, parallel, asynchronous execution, solving big problems that are normally solved in the cloud. It efficiently exploits disks: optimized for streaming access and efficient on both SSDs and hard drives.
Triangle Counting on the Twitter Graph (40M users, 1.2B edges; total: 34.8 billion triangles). Hadoop: 1536 machines, 423 minutes. GraphChi: 59 minutes on 1 Mac Mini! PowerGraph: 64 machines (1024 cores), 1.5 minutes. Hadoop results from [Suri & Vassilvitskii '11].
Apache 2 License — http://graphlab.org: documentation, code, tutorials (more on the way).
Active Work: cross-language support (Python/Java), support for incremental graph computation, integration with graph databases, and declarative representations of the GAS decomposition, e.g. my.pr := nbrs.in.map(x => x.pr).reduce((a, b) => a + b). http://graphlab.org — Joseph E. Gonzalez, Postdoc, UC Berkeley. jegonzal@eecs.berkeley.edu, jegonzal@cs.cmu.edu
Why not use Map-Reduce for Graph Parallel algorithms?
Data Dependencies are Difficult: MapReduce assumes independent data records, so dependent data is difficult to express — it requires substantial data transformations, user-managed graph structure, and costly data replication.
Iterative Computation is Difficult: the system is not optimized for iteration. (Figure: every iteration re-reads and re-writes the data, paying a startup penalty and a disk penalty each time.)
The Pregel Abstraction: Vertex-Programs interact by sending messages.

Pregel_PageRank(i, messages):
  // Receive all the messages
  total = 0
  foreach (msg in messages):
    total = total + msg
  // Update the rank of this vertex
  R[i] = total
  // Send new messages to neighbors
  foreach (j in out_neighbors[i]):
    Send msg(R[i]) to vertex j

Malewicz et al. [PODC'09, SIGMOD'10]
Pregel Synchronous Execution: compute, communicate, barrier.
Communication Overhead for High-Degree Vertices: fan-in vs. fan-out.
Pregel Message Combiners on Fan-In: a user-defined commutative, associative (+) message operation lets the messages from A, B, and C on Machine 1 be summed into a single message before it is sent to D on Machine 2.
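A minimal sketch of such a combiner (the SumCombiner name and combine() signature are assumptions for illustration, not the actual Pregel API):

#include <numeric>
#include <vector>

struct SumCombiner {
  // Because (+) is commutative and associative, all messages bound for the
  // same destination vertex can be merged into one message on the sending
  // machine before they cross the network.
  double combine(const std::vector<double>& messages) const {
    return std::accumulate(messages.begin(), messages.end(), 0.0);
  }
};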
Pregel Struggles with Fan-Out: a broadcast sends many copies of the same message to the same machine!
Fan-In and Fan-Out Performance: PageRank on synthetic power-law graphs (Piccolo was used to simulate Pregel with combiners). (Plot: total communication in GB vs. the power-law constant α; smaller α means more high-degree vertices.)
GraphLab Ghosting. (Figure: boundary vertices are replicated as ghosts on the neighboring machine.) Changes to a master vertex are synced to its ghosts.
GraphLab Ghosting: changes to the neighbors of high-degree vertices create substantial network traffic.
Fan-In and Fan-Out Performance: PageRank on synthetic power-law graphs (GraphLab is undirected). (Plot: total communication in GB vs. the power-law constant α; smaller α means more high-degree vertices.)
Comparison with GraphLab & Pregel: PageRank on synthetic power-law graphs. (Plots: total network traffic in GB and runtime in seconds vs. the power-law constant α for Pregel (Piccolo) and GraphLab; smaller α means more high-degree vertices.) GraphLab2 is robust to high-degree vertices.
GraphLab on Spark

GraphLab (C++):

#include <graphlab.hpp>
#include <cmath>
#include <sstream>
#include <string>

// Convergence tolerance; the value is assumed (it is not shown on the slide)
// and matches the 0.01 threshold used in the Spark version below.
const double TOLERANCE = 0.01;

struct vertex_data : public graphlab::IS_POD_TYPE {
  float rank;
  vertex_data() : rank(1) { }
};
typedef graphlab::empty edge_data;
typedef graphlab::distributed_graph<vertex_data, edge_data> graph_type;

class pagerank :
    public graphlab::ivertex_program<graph_type, float>,
    public graphlab::IS_POD_TYPE {
  float last_change;
 public:
  float gather(icontext_type& context, const vertex_type& vertex,
               edge_type& edge) const {
    return edge.source().data().rank / edge.source().num_out_edges();
  }
  void apply(icontext_type& context, vertex_type& vertex,
             const gather_type& total) {
    const double newval = 0.15 + 0.85 * total;
    last_change = std::fabs(newval - vertex.data().rank);
    vertex.data().rank = newval;
  }
  void scatter(icontext_type& context, const vertex_type& vertex,
               edge_type& edge) const {
    if (last_change > TOLERANCE) context.signal(edge.target());
  }
};

struct pagerank_writer {
  std::string save_vertex(graph_type::vertex_type v) {
    std::stringstream strm;
    strm << v.id() << "\t" << v.data().rank << "\n";
    return strm.str();
  }
  std::string save_edge(graph_type::edge_type e) { return ""; }
};

int main(int argc, char** argv) {
  graphlab::mpi_tools::init(argc, argv);
  graphlab::distributed_control dc;
  graphlab::command_line_options clopts("PageRank algorithm.");
  graph_type graph(dc, clopts);
  graph.load_format("biggraph.tsv", "tsv");
  graphlab::omni_engine<pagerank> engine(dc, graph, clopts);
  engine.signal_all();
  engine.start();
  const std::string saveprefix = "pagerank_output";  // output prefix (assumed)
  graph.save(saveprefix, pagerank_writer(), false, true, false);
  graphlab::mpi_tools::finalize();
  return EXIT_SUCCESS;
}

GraphLab on Spark (Scala) — Interactive!

import spark.graphlab._

val sc = spark.SparkContext(master, "pagerank")
val graph = Graph.textFile("biggraph.tsv")
val vertices = graph.outDegree().mapValues((_, 1.0, 1.0))
val pr = Graph(vertices, graph.edges).iterate(
  (meid, e) => e.source.data._2 / e.source.data._1,
  (a: Double, b: Double) => a + b,
  (v, accum) => (v.data._1, 0.15 + 0.85 * accum, v.data._2),
  (meid, e) => abs(e.source.data._2 - e.source.data._3) > 0.01)
pr.vertices.saveAsTextFile("results")