Distributed R for Big Data


 Evan Cross
 1 years ago
 Views:
Transcription
1 Distributed R for Big Data Indrajit Roy, HP Labs November 2013 Team: Shivara m Erik Kyungyon g Alvin Rob Vanish
2 A Big Data story Once upon a time, a customer in distress had. 2+ billion rows of financial data (TBs of data) wanted to model defaults on mortgage and credit cards by running regression analysis analysis Alas! traditional databases don t support regression custom code can take from hours to days Moral of the story: Customers need platform+programming model for complex analysis 2
3 Big data, complex algorithms PageRank (Dominant eigenvector) Recommendations (Matrix factorization) Machine learning + Graph algorithms Anomaly detection (TopK eigenvalues) Iterative Linear Algebra Operations User Importance (Vertex Centrality) 3
4 Example: PageRank using matrices Simplified algorithm repeat { p = M*p } Linear Algebra Operations on Sparse Matrices p M p Power method Dominant eigenvector M = Web graph matrix p = PageRank vector 4
5 Towards Distributed R Variety (Efficiency) Machine learning, images, graphs SQL, database R/Matlab RDBMS (col. store) Scale+ Complex Analytics search, sort Hadoop 5 Volume (Scalability) * very simplified view
6 Enter the world of R
7 R: Arrays for analysis What is R? R is a programming language and environment for statistical computing Arrayoriented Millions of users, thousands of free packages Traditional applications of R Machine Learning Graph Algorithms Bioinformatics 7
8 R is used everwhere but R is Not parallel Not distributed Limited by dataset size Few parallel algorithms! 8
9 Enter Distributed R
10 Challenge 1: R has limited support for parallelism R is singlethreaded Multiprocess solutions offered by extensions Threads/processes share data through pipes or network Timeinefficient (sending copies) Spaceinefficient (extra copies) Server 1 Server 2 R process data R process copy of data R process copy of data R process copy of data local copy network copy network copy 10
11 Challenge 2: R is memory bound Data needs to fit DRAM Current research solution: Uses custom bigarray objects with limited functionality Even simple operations like x+y may not work 11
12 Challenge 3: Sparse datasets cause load imbalance Block density (normalized ) LiveJournal Netflix ClueWeb1B Computation + communication imbalance! Block ID 12
13 Distributed R for Big Data Challenges in scaling R Programming model Mechanisms Applications and results 13
14 Large scale analytics frameworks Dataparallel frameworks MapReduce/Dryad (2004) Process each record in parallel Use case: Computing sufficient statistics, analytics queries Graphcentric frameworks Pregel/GraphLab (2010) Process each vertex in parallel Use case: Graphical models Arraybased frameworks Process blocks of array in parallel Use case: Linear Algebra Operations Our approach* *Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, R. Schreiber. Eurosys
15 Enhancement #1: Distributed data structures Relies on user defined partitioning Also support for distributed dataframes darray 15
16 Enhancement #2: Distributed loop Express computations over partitions Execute across the cluster foreach f (x) 16
17 Distributed PageRank N s P 1 P 1 s P 1 P 1 P 2 P 2 P 2 P 2 N P N/s P P N/s M P N/s P_old P N/s Z 17 M ß darray(dim=c(n,n),blocks=(s,n)) P ß darray(dim=c(n,1),blocks=(s,1)) while(..){ foreach(i,1:len, function(p=splits(p,i),m=splits(m,i) }) P_old ß P } x=splits(p_old),z=splits(z,i)) { p ß (m*x)+z update(p) Create Distributed Array
18 Distributed PageRank N s P 1 P 1 s P 1 P 1 P 2 P 2 P 2 P 2 N P N/s P P N/s M P N/s P_old P N/s Z 18 M ß darray(dim=c(n,n),blocks=(s,n)) P ß darray(dim=c(n,1),blocks=(s,1)) while(..){ foreach(i,1:len, function(p=splits(p,i),m=splits(m,i) }) P_old ß P } x=splits(p_old),z=splits(z,i)) { p ß (m*x)+z update(p) Execute function in a cluster Pass array partitions
19 Distributed R for Big Data Challenges in scaling R Programming model Mechanisms Applications and results 19
20 Architecture Scheduler: performs I/O and task scheduling Worker: executes tasks and I/O operations Master Scheduler Worker Worker Executor pool I/O Engine Executor pool I/O Engine R instance R instance R instance DRAM SSDs HDDs R instance R instance R instance DRAM SSDs HDDs 20
21 Locality based computation: Part 1 foreach(i,1:4, function(p=splits(p,i)) { } Ship functions to data Task 1 Task 2 Task 3 Task 4 P 1 P 2 P 3 P 4 21 M 1 M 2 M 3 M 4
22 Locality based computation: Part 2 foreach(i,1:1, function(p=splits(p)) { } Reassemble data Run Task P 1 P 2 P 3 P 4 22 M 1 M 2 M 3 M 4
23 Distributed R for Big Data Challenges in scaling R Programming model Mechanisms Applications and results 23
24 Applications in Distributed R Application Algorithm LOC PageRank Eigenvector calculation 41 Triangle counting TopK eigenvalues 121 Netflix recommendation Matrix factorization 130 Fewer than 140 lines of code Centrality measure Graph algorithm 132 SSSP Graph algorithm 62 kmeans Clustering 71 Logistic regression Data mining *LOC for core of the applications
25 Distribtued R has good performance Algorithm: PageRank (power method) Dataset: ClueWeb graph, 100M vertices, 1.2B edges, 20GB Setup: 8 SL 390 servers, 8 cores/server, 96GB RAM Distributed R 2.05 Hadoop 161 Spark Time per iteration (s) *Shorter is better 25
26 Summary Big data requires complex analysis : machine learning, graph processing, etc. Matrices and arrays are surprisingly handy data structures Distributed R: simplifies distributed analytics Still, much remains: formalism, single node performance, compiler improvements, package contributions, 26
27 Thank you HPL Team: Alvin AuYoung, Rob Schreiber, Vanish Talwar Interns: Shivaram Venkataraman (UC Berkeley), Erik Bodzsar (U. Chicago), Kyungyong Lee (UFL) HP Vertica Developers Collaborators: Prof. Andrew Chien (U. Chicago)
HiTune: DataflowBased Performance Analysis for Big Data Cloud
HiTune: DataflowBased Performance Analysis for Big Data Cloud Jinquan Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel AsiaPacific Research and Development Ltd Shanghai, P.R.China, 200241 {jason.dai,
More informationHaLoop: Efficient Iterative Data Processing on Large Clusters
: Efficient Iterative Data Processing on Large Clusters Yingyi Bu Bill Howe Magdalena Balazinska Michael D. Ernst Department of Computer Science and Engineering University of Washington, Seattle, WA, U.S.A.
More informationPlug Into The Cloud. with Oracle Database 12c ORACLE WHITE PAPER DECEMBER 2014
Plug Into The Cloud with Oracle Database 12c ORACLE WHITE PAPER DECEMBER 2014 Disclaimer The following is intended to outline our general product direction. It is intended for information purposes only,
More informationHow Well do GraphProcessing Platforms Perform? An Empirical Performance Evaluation and Analysis
How Well do GraphProcessing Platforms Perform? An Empirical Performance Evaluation and Analysis Yong Guo, Marcin Biczak, Ana Lucia Varbanescu, Alexandru Iosup, Claudio Martella and Theodore L. Willke
More informationPROCESSING large data sets has become crucial in research
TRANSACTIONS ON CLOUD COMPUTING, VOL. AA, NO. B, JUNE 2014 1 Deploying LargeScale Data Sets ondemand in the Cloud: Treats and Tricks on Data Distribution. Luis M. Vaquero 1, Member, IEEE Antonio Celorio
More informationBuilding Rome in a Day
Building Rome in a Day Sameer Agarwal 1, Noah Snavely 2 Ian Simon 1 Steven. Seitz 1 Richard Szeliski 3 1 University of Washington 2 Cornell University 3 icrosoft Research Abstract We present a system that
More informationConvergence of Big Data and Cloud
American Journal of Engineering Research (AJER) eissn : 23200847 pissn : 23200936 Volume03, Issue05, pp266270 www.ajer.org Research Paper Open Access Convergence of Big Data and Cloud Sreevani.Y.V.
More informationWTF: The Who to Follow Service at Twitter
WTF: The Who to Follow Service at Twitter Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh Twitter, Inc. @pankaj @ashishgoel @lintool @aneeshs @dongwang218 @reza_zadeh ABSTRACT
More informationImplementing Graph Pattern Mining for Big Data in the Cloud
Implementing Graph Pattern Mining for Big Data in the Cloud Chandana Ojah M.Tech in Computer Science & Engineering Department of Computer Science & Engineering, PES College of Engineering, Mandya Ojah.chandana@gmail.com
More informationNo One (Cluster) Size Fits All: Automatic Cluster Sizing for Dataintensive Analytics
No One (Cluster) Size Fits All: Automatic Cluster Sizing for Dataintensive Analytics Herodotos Herodotou Duke University hero@cs.duke.edu Fei Dong Duke University dongfei@cs.duke.edu Shivnath Babu Duke
More informationA Study of NoSQL and NewSQL databases for data aggregation on Big Data
A Study of NoSQL and NewSQL databases for data aggregation on Big Data ANANDA SENTRAYA PERUMAL MURUGAN Master s Degree Project Stockholm, Sweden 2013 TRITAICTEX2013:256 A Study of NoSQL and NewSQL
More informationMAD Skills: New Analysis Practices for Big Data
Whitepaper MAD Skills: New Analysis Practices for Big Data By Jeffrey Cohen, Greenplum; Brian Dolan, Fox Interactive Media; Mark Dunlap, Evergreen Technologies; Joseph M. Hellerstein, U.C. Berkeley and
More informationBig Data Analytics with Small Footprint: Squaring the Cloud
Big Data Analytics with Small Footprint: Squaring the Cloud John Canny Huasha Zhao University of California Berkeley, CA 94720 jfc@cs.berkeley.edu, hzhao@cs.berkeley.edu ABSTRACT This paper describes the
More informationBoosting Retail Revenue and Efficiency with Big Data Analytics
white paper Boosting Retail Revenue and Efficiency with Big Data Analytics A Simplified, Automated Approach to Big Data Applications: StackIQ Enterprise Data Management and Monitoring Abstract Contents
More informationBig Data Mining Services and Knowledge Discovery Applications on Clouds
Big Data Mining Services and Knowledge Discovery Applications on Clouds Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy talia@dimes.unical.it Data Availability or Data Deluge? Some decades
More informationDetecting LargeScale System Problems by Mining Console Logs
Detecting LargeScale System Problems by Mining Console Logs Wei Xu Ling Huang Armando Fox David Patterson Michael I. Jordan EECS Department University of California at Berkeley, USA {xuw,fox,pattrsn,jordan}@cs.berkeley.edu
More informationHur hanterar vi utmaningar inom området  Big Data. Jan Östling Enterprise Technologies Intel Corporation, NER
Hur hanterar vi utmaningar inom området  Big Data Jan Östling Enterprise Technologies Intel Corporation, NER Legal Disclaimers All products, computer systems, dates, and figures specified are preliminary
More informationFederated Cloudbased Big Data Platform in Telecommunications
Federated Cloudbased Big Data Platform in Telecommunications Chao Deng dengchao@chinamobilecom Yujian Du duyujian@chinamobilecom Ling Qian qianling@chinamobilecom Zhiguo Luo luozhiguo@chinamobilecom Meng
More informationCloud Computing and Big Data Analytics: What Is New from Databases Perspective?
Cloud Computing and Big Data Analytics: What Is New from Databases Perspective? Rajeev Gupta, Himanshu Gupta, and Mukesh Mohania IBM Research, India {grajeev,higupta8,mkmukesh}@in.ibm.com Abstract. Many
More informationGreenCloud: Economicsinspired Scheduling, Energy and Resource Management in Cloud Infrastructures
GreenCloud: Economicsinspired Scheduling, Energy and Resource Management in Cloud Infrastructures Rodrigo Tavares Fernandes rodrigo.fernandes@tecnico.ulisboa.pt Instituto Superior Técnico Avenida Rovisco
More informationCloudera Enterprise Reference Architecture for Google Cloud Platform Deployments
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 20102015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and
More informationClustering Versus Shared Nothing: A Case Study
2009 33rd Annual IEEE International Computer Software and Applications Conference Clustering Versus Shared Nothing: A Case Study Jonathan Lifflander, Adam McDonald, Orest Pilskalns School of Engineering
More informationEnergy Consumption Analysis of ARMbased
Aalto University School of Science Degree Programme of Mobile Computing Bo Pang Energy Consumption Analysis of ARMbased System Master s Thesis Espoo, Augest 31, 2011 Supervisor: Instructor: Professor Antti
More informationBIG DATAASASERVICE
White Paper BIG DATAASASERVICE What Big Data is about What service providers can do with Big Data What EMC can do to help EMC Solutions Group Abstract This white paper looks at what service providers
More informationBIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing
More information
Oracle s Big Data solutions. Roger Wullschleger.
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
More informationIntroduction to Data Mining and Knowledge Discovery
Introduction to Data Mining and Knowledge Discovery Third Edition by Two Crows Corporation RELATED READINGS Data Mining 99: Technology Report, Two Crows Corporation, 1999 M. Berry and G. Linoff, Data Mining
More informationAn Oracle White Paper June 2013. Oracle: Big Data for the Enterprise
An Oracle White Paper June 2013 Oracle: Big Data for the Enterprise Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure
More informationScalable Progressive Analytics on Big Data in the Cloud
Scalable Progressive Analytics on Big Data in the Cloud Badrish Chandramouli 1 Jonathan Goldstein 1 Abdul Quamar 2 1 Microsoft Research, Redmond 2 University of Maryland, College Park {badrishc, jongold}@microsoft.com,
More informationTupleware: Big Data, Big Analytics, Small Clusters
Tupleware: Big Data, Big Analytics, Small Clusters Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Ugur Cetintemel, Stan Zdonik Department of Computer Science, Brown University {crottyan, agg,
More information