Distributed R for Big Data

Size: px

Start display at page:

Download "Distributed R for Big Data"

Evan Cross
8 years ago
Views:

1 Distributed R for Big Data Indrajit Roy, HP Labs November 2013 Team: Shivara m Erik Kyungyon g Alvin Rob Vanish

2 A Big Data story Once upon a time, a customer in distress had. 2+ billion rows of financial data (TBs of data) wanted to model defaults on mortgage and credit cards by running regression analysis analysis Alas! traditional databases don t support regression custom code can take from hours to days Moral of the story: Customers need platform+programming model for complex analysis 2

cards by running regression analysis analysis Alas!

3 Big data, complex algorithms PageRank (Dominant eigenvector) Recommendations (Matrix factorization) Machine learning + Graph algorithms Anomaly detection (Top-K eigenvalues) Iterative Linear Algebra Operations User Importance (Vertex Centrality) 3

Graph algorithms Anomaly detection (Top-K eigenvalues)

4 Example: PageRank using matrices Simplified algorithm repeat { p = M*p } Linear Algebra Operations on Sparse Matrices p M p Power method Dominant eigenvector M = Web graph matrix p = PageRank vector 4

5 Towards Distributed R Variety (Efficiency) Machine learning, images, graphs SQL, database R/Matlab RDBMS (col. store) Scale+ Complex Analytics search, sort Hadoop 5 Volume (Scalability) * very simplified view

6 Enter the world of R

7 R: Arrays for analysis What is R? R is a programming language and environment for statistical computing Array-oriented Millions of users, thousands of free packages Traditional applications of R Machine Learning Graph Algorithms Bioinformatics 7

computing Array-oriented Millions of users, thousands of free

8 R is used everwhere but R is Not parallel Not distributed Limited by dataset size Few parallel algorithms! 8

9 Enter Distributed R

10 Challenge 1: R has limited support for parallelism R is single-threaded Multi-process solutions offered by extensions Threads/processes share data through pipes or network Time-inefficient (sending copies) Space-inefficient (extra copies) Server 1 Server 2 R process data R process copy of data R process copy of data R process copy of data local copy network copy network copy 10

Time-inefficient (sending copies) Space-inefficient (extra copies) Server 1 Server 2 R process

11 Challenge 2: R is memory bound Data needs to fit DRAM Current research solution: Uses custom bigarray objects with limited functionality Even simple operations like x+y may not work 11

12 Challenge 3: Sparse datasets cause load imbalance Block density (normalized ) LiveJournal Netflix ClueWeb-1B Computation + communication imbalance! Block ID 12

ClueWeb-1B 10000 Computation + communication

13 Distributed R for Big Data Challenges in scaling R Programming model Mechanisms Applications and results 13

14 Large scale analytics frameworks Data-parallel frameworks MapReduce/Dryad (2004) Process each record in parallel Use case: Computing sufficient statistics, analytics queries Graph-centric frameworks Pregel/GraphLab (2010) Process each vertex in parallel Use case: Graphical models Array-based frameworks Process blocks of array in parallel Use case: Linear Algebra Operations Our approach* *Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, R. Schreiber. Eurosys

Graphical models Array-based frameworks Process blocks of array in parallel Use case: Linear Algebra Operations Our approach* *Presto:

15 Enhancement #1: Distributed data structures Relies on user defined partitioning Also support for distributed data-frames darray 15

16 Enhancement #2: Distributed loop Express computations over partitions Execute across the cluster foreach f (x) 16

17 Distributed PageRank N s P 1 P 1 s P 1 P 1 P 2 P 2 P 2 P 2 N P N/s P P N/s M P N/s P_old P N/s Z 17 M ß darray(dim=c(n,n),blocks=(s,n)) P ß darray(dim=c(n,1),blocks=(s,1)) while(..){ foreach(i,1:len, function(p=splits(p,i),m=splits(m,i) }) P_old ß P } x=splits(p_old),z=splits(z,i)) { p ß (m*x)+z update(p) Create Distributed Array

$.){ foreach(i,1:len, function(p=splits(p,i),m=splits(m,i) }) P_old ß P }$

18 Distributed PageRank N s P 1 P 1 s P 1 P 1 P 2 P 2 P 2 P 2 N P N/s P P N/s M P N/s P_old P N/s Z 18 M ß darray(dim=c(n,n),blocks=(s,n)) P ß darray(dim=c(n,1),blocks=(s,1)) while(..){ foreach(i,1:len, function(p=splits(p,i),m=splits(m,i) }) P_old ß P } x=splits(p_old),z=splits(z,i)) { p ß (m*x)+z update(p) Execute function in a cluster Pass array partitions

$.){ foreach(i,1:len, function(p=splits(p,i),m=splits(m,i) }) P_old ß P }$

19 Distributed R for Big Data Challenges in scaling R Programming model Mechanisms Applications and results 19

20 Architecture Scheduler: performs I/O and task scheduling Worker: executes tasks and I/O operations Master Scheduler Worker Worker Executor pool I/O Engine Executor pool I/O Engine R instance R instance R instance DRAM SSDs HDDs R instance R instance R instance DRAM SSDs HDDs 20

Executor pool I/O Engine Executor pool I/O Engine R instance R

21 Locality based computation: Part 1 foreach(i,1:4, function(p=splits(p,i)) { } Ship functions to data Task 1 Task 2 Task 3 Task 4 P 1 P 2 P 3 P 4 21 M 1 M 2 M 3 M 4

22 Locality based computation: Part 2 foreach(i,1:1, function(p=splits(p)) { } Re-assemble data Run Task P 1 P 2 P 3 P 4 22 M 1 M 2 M 3 M 4

23 Distributed R for Big Data Challenges in scaling R Programming model Mechanisms Applications and results 23

24 Applications in Distributed R Application Algorithm LOC PageRank Eigenvector calculation 41 Triangle counting Top-K eigenvalues 121 Netflix recommendation Matrix factorization 130 Fewer than 140 lines of code Centrality measure Graph algorithm 132 SSSP Graph algorithm 62 k-means Clustering 71 Logistic regression Data mining *LOC for core of the applications

25 Distribtued R has good performance Algorithm: PageRank (power method) Dataset: ClueWeb graph, 100M vertices, 1.2B edges, 20GB Setup: 8 SL 390 servers, 8 cores/server, 96GB RAM Distributed R 2.05 Hadoop 161 Spark Time per iteration (s) *Shorter is better 25

26 Summary Big data requires complex analysis : machine learning, graph processing, etc. Matrices and arrays are surprisingly handy data structures Distributed R: simplifies distributed analytics Still, much remains: formalism, single node performance, compiler improvements, package contributions, 26

27 Thank you HPL Team: Alvin AuYoung, Rob Schreiber, Vanish Talwar Interns: Shivaram Venkataraman (UC Berkeley), Erik Bodzsar (U. Chicago), Kyungyong Lee (UFL) HP Vertica Developers Collaborators: Prof. Andrew Chien (U. Chicago)

Presto/Blockus: Towards Scalable R Data Analysis

Presto/Blockus: Towards Scalable R Data Analysis /Blockus: Towards Scalable R Data Analysis Andrew A. Chien University of Chicago and Argonne ational Laboratory IRIA-UIUC-AL Joint Institute Potential Collaboration ovember 19, 2012 ovember 19, 2012 Andrew