Distributed R for Big Data




Indrajit Roy, HP Labs
November 2013
Team: Shivaram, Erik, Kyungyong, Alvin, Rob, Vanish

A Big Data story
Once upon a time, a customer in distress:
- had 2+ billion rows of financial data (TBs of data)
- wanted to model defaults on mortgages and credit cards by running regression analysis
Alas!
- traditional databases don't support regression
- custom code can take from hours to days
Moral of the story: customers need a platform and a programming model for complex analysis.

Big data, complex algorithms
Machine learning + graph algorithms: iterative linear algebra operations
- PageRank (dominant eigenvector)
- Recommendations (matrix factorization)
- Anomaly detection (top-K eigenvalues)
- User importance (vertex centrality)

Example: PageRank using matrices
Simplified algorithm (power method, which finds the dominant eigenvector):
    repeat { p = M*p }
where M = web graph matrix and p = PageRank vector.
These are linear algebra operations on sparse matrices.
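The repeat loop above can be sketched in a few lines. This is a minimal illustration in plain Python rather than R; the function name `power_method`, the dictionary encoding of the sparse matrix, the normalization step, and the 3-page graph are all assumptions made for the example, not part of the talk.

```python
def power_method(M, n, iters=200, tol=1e-12):
    """Repeat p = M*p for a sparse matrix M given as {(row, col): value}."""
    p = [1.0 / n] * n                       # start from the uniform vector
    for _ in range(iters):
        new_p = [0.0] * n
        for (i, j), v in M.items():         # sparse matrix-vector product
            new_p[i] += v * p[j]
        total = sum(new_p)
        new_p = [x / total for x in new_p]  # keep the entries summing to 1
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Hypothetical 3-page web graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
# M[i, j] = 1 / outdegree(j) when page j links to page i.
M = {(1, 0): 0.5, (2, 0): 0.5, (2, 1): 1.0, (0, 2): 1.0}
ranks = power_method(M, n=3)
print([round(r, 3) for r in ranks])
```

For this small graph the vector settles near 0.4, 0.2, 0.4: pages 0 and 2 feed each other, while page 1 only receives half of page 0's rank.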

Towards Distributed R
[Figure, a very simplified view: systems placed by volume (scalability) against variety (efficiency). R/Matlab: machine learning, images, graphs. RDBMS (column store): SQL, database. Hadoop: search, sort. Target region: scale + complex analytics.]

Enter the world of R

R: Arrays for analysis
What is R?
- A programming language and environment for statistical computing
- Array-oriented
- Millions of users, thousands of free packages
Traditional applications of R:
- Machine learning
- Graph algorithms
- Bioinformatics

R is used everywhere, but R is:
- Not parallel
- Not distributed
- Limited by dataset size
Few parallel algorithms!

Enter Distributed R

Challenge 1: R has limited support for parallelism
- R is single-threaded
- Multi-process solutions are offered by extensions
- Threads/processes share data through pipes or the network
  - Time-inefficient (sending copies)
  - Space-inefficient (extra copies)
[Figure: R processes on Server 1 and Server 2. One process holds the data; the others hold copies of it, made by local copy or network copy.]

Challenge 2: R is memory bound
- Data needs to fit in DRAM
- Current research solution: custom bigarray objects with limited functionality
- Even simple operations like x+y may not work

Challenge 3: Sparse datasets cause load imbalance
[Figure: normalized block density (log scale, 1 to 10000) by block ID (1 to 91) for LiveJournal, Netflix, and ClueWeb-1B.]
Computation + communication imbalance!
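To make the imbalance concrete, here is a small sketch in plain Python that counts the nonzeros in each row block of a sparse matrix. The helper names, the tiny edge list, and the choice to normalize by the sparsest non-empty block are assumptions for illustration; the slide does not say how its densities were normalized.

```python
def block_density(edges, n, s):
    """Count nonzeros in each of s contiguous row blocks of an n x n matrix."""
    rows_per_block = -(-n // s)            # ceiling division
    counts = [0] * s
    for row, _col in edges:
        counts[row // rows_per_block] += 1
    return counts


def normalize(counts):
    """Scale counts so the sparsest non-empty block has density 1 (assumed convention)."""
    smallest = min(c for c in counts if c > 0)
    return [c / smallest for c in counts]


# Hypothetical skewed graph on 8 vertices: rows 0 and 1 hold most of the edges.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
         (1, 0), (1, 2), (1, 3),
         (2, 0), (4, 0), (5, 1), (7, 0), (7, 6)]
counts = block_density(edges, n=8, s=4)
density = normalize(counts)
print(counts, density)
```

With equal-sized blocks, the task that owns block 0 has eight times the work of the task that owns block 1, which is the kind of skew the figure reports at much larger scale.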

Distributed R for Big Data
- Challenges in scaling R
- Programming model
- Mechanisms
- Applications and results

Large scale analytics frameworks
Data-parallel frameworks: MapReduce/Dryad (2004)
- Process each record in parallel
- Use case: computing sufficient statistics, analytics queries
Graph-centric frameworks: Pregel/GraphLab (2010)
- Process each vertex in parallel
- Use case: graphical models
Array-based frameworks (our approach*)
- Process blocks of an array in parallel
- Use case: linear algebra operations
*Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, R. Schreiber. EuroSys 2013.

Enhancement #1: Distributed data structures
- darray: a distributed array
- Relies on user-defined partitioning
- Also supports distributed data frames
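User-defined partitioning can be pictured with a small stand-in written in plain Python. The class `PartitionedArray` and its methods are hypothetical and only mimic the idea of a darray split into row blocks; they are not the Distributed R API, and everything here lives in one process.

```python
class PartitionedArray:
    """Hypothetical stand-in for a darray: rows grouped into fixed-size blocks."""

    def __init__(self, rows, block_rows):
        if block_rows <= 0:
            raise ValueError("block_rows must be positive")
        # The caller picks the block size, i.e. the partitioning is user defined.
        self.parts = [rows[i:i + block_rows]
                      for i in range(0, len(rows), block_rows)]

    def num_splits(self):
        return len(self.parts)

    def split(self, i):
        """Return partition i (what a task would receive)."""
        return self.parts[i]

    def update(self, i, new_rows):
        """Replace partition i with a result of the same shape."""
        if len(new_rows) != len(self.parts[i]):
            raise ValueError("partition size must not change")
        self.parts[i] = new_rows

    def to_rows(self):
        """Re-assemble the full array from its partitions."""
        return [row for part in self.parts for row in part]


arr = PartitionedArray([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], block_rows=2)
arr.update(2, [[0, 0]])
print(arr.num_splits(), arr.split(0), arr.to_rows())
```

In a real deployment each partition would sit on a different worker; the point of the sketch is only that tasks address the array one partition at a time.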

Enhancement #2: Distributed loop
- foreach: express computations over partitions, f(x)
- Execute across the cluster
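A rough single-machine analogue of such a loop, in plain Python: the function below runs one task per partition index on a thread pool. The name `foreach`, its signature, and the use of threads are assumptions for illustration; the actual system executes tasks on cluster workers.

```python
from concurrent.futures import ThreadPoolExecutor


def foreach(indices, fn, max_workers=4):
    """Run fn(i) once per partition index and return results in index order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, indices))


# Hypothetical partitions of a vector; each task sees only its own partition.
partitions = [[1, 2, 3], [4, 5], [6]]
partial_sums = foreach(range(len(partitions)), lambda i: sum(partitions[i]))
total = sum(partial_sums)
print(partial_sums, total)
```

Each task touches one partition and returns a small result, so only the partial sums need to be combined at the end.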

Distributed PageRank
[Figure: M (N x N) is split into row blocks of s rows; P, P_old, and Z (N x 1) are split into matching blocks P1, P2, ..., PN/s.]
Create distributed arrays:
    M <- darray(dim=c(N,N), blocks=c(s,N))
    P <- darray(dim=c(N,1), blocks=c(s,1))
    while(..){
      foreach(i, 1:len,
        function(p=splits(P,i), m=splits(M,i),
                 x=splits(P_old), z=splits(Z,i)) {
          p <- (m*x)+z    # m*x denotes the matrix-vector product
          update(p)
      })
      P_old <- P
    }

Distributed PageRank (continued)
- foreach executes the function in a cluster
- splits(...) passes array partitions to the function: splits(P,i) and splits(M,i) pass partition i, while splits(P_old) passes the whole array
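The same blocked update, p = (m*x) + z per row block, can be simulated on one machine. The following is a plain Python sketch, not Distributed R: the function name, the damping factor of 0.85 folded into M, the constant vector z, and the 4-page graph are all assumptions chosen for the example.

```python
def pagerank_blocked(M_blocks, Z_blocks, n, iters=50):
    """Each 'task' i computes p_i = M_i * x + z_i from the whole previous vector x."""
    P_blocks = [[1.0 / n] * len(z) for z in Z_blocks]
    for _ in range(iters):
        x = [v for block in P_blocks for v in block]     # P_old, fully assembled
        P_blocks = [
            [sum(m_rc * x_c for m_rc, x_c in zip(row, x)) + z_r
             for row, z_r in zip(m, z)]                  # one row block per task
            for m, z in zip(M_blocks, Z_blocks)
        ]
    return [v for block in P_blocks for v in block]


# Hypothetical 4-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2.
# links[i][j] = 1 / outdegree(j) when page j links to page i.
links = [
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
d, n, s = 0.85, 4, 2                                     # damping, size, rows per block
M = [[d * v for v in row] for row in links]
Z = [(1.0 - d) / n] * n
M_blocks = [M[i:i + s] for i in range(0, n, s)]
Z_blocks = [Z[i:i + s] for i in range(0, n, s)]
ranks = pagerank_blocked(M_blocks, Z_blocks, n)
print([round(r, 4) for r in ranks])
```

Every block needs the whole previous vector but only its own rows of M, which matches the argument pattern in the slide: one partition of M and Z per task, plus all of P_old.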


Architecture
- Scheduler: performs I/O and task scheduling
- Worker: executes tasks and I/O operations
[Figure: a master running the scheduler, and two workers. Each worker has an executor pool of R instances and an I/O engine over DRAM, SSDs, and HDDs.]

Locality based computation: Part 1
    foreach(i, 1:4, function(p=splits(P,i)) { })
Ship functions to data.
[Figure: Tasks 1 to 4, one per partition, over partitions P1 to P4 and M1 to M4.]

Locality based computation: Part 2
    foreach(i, 1:1, function(p=splits(P)) { })
Re-assemble data, then run the task.
[Figure: partitions P1 to P4 gathered for a single task; partitions M1 to M4 stay in place.]
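The two cases can be sketched as placement decisions. This is a plain Python illustration with hypothetical names; in particular, running the gathering task on the worker that already owns the most partitions is an assumption made here, not something the slides state about the scheduler.

```python
from collections import Counter


def place_tasks(split_owner):
    """Part 1: one task per partition, run on the worker that stores it."""
    return {split: worker for split, worker in split_owner.items()}


def gather_plan(split_owner):
    """Part 2: one task needs every partition.

    Assumed policy: run on the worker owning the most partitions and
    fetch the remaining ones over the network.
    """
    target, _ = Counter(split_owner.values()).most_common(1)[0]
    to_fetch = sorted(s for s, w in split_owner.items() if w != target)
    return target, to_fetch


# Hypothetical placement of four partitions on two workers.
split_owner = {"P1": "worker1", "P2": "worker1", "P3": "worker1", "P4": "worker2"}
placement = place_tasks(split_owner)
target, to_fetch = gather_plan(split_owner)
print(placement)
print(target, to_fetch)
```

In Part 1 no partition moves at all; in Part 2 only P4 crosses the network under the assumed policy.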


Applications in Distributed R
Fewer than 140 lines of code each.

Application              Algorithm                 LOC*
PageRank                 Eigenvector calculation    41
Triangle counting        Top-K eigenvalues         121
Netflix recommendation   Matrix factorization      130
Centrality measure       Graph algorithm           132
SSSP                     Graph algorithm            62
k-means                  Clustering                 71
Logistic regression      Data mining               120

*LOC for the core of the applications

Distributed R has good performance
- Algorithm: PageRank (power method)
- Dataset: ClueWeb graph, 100M vertices, 1.2B edges, 20GB
- Setup: 8 SL 390 servers, 8 cores/server, 96GB RAM

Time per iteration in seconds (shorter is better):
Distributed R     2.05
Hadoop          161
Spark            64.85
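The gap between the three systems is easier to read as ratios. The short Python calculation below uses only the per-iteration times reported on the slide; the variable names are illustrative.

```python
# Seconds per PageRank iteration, as reported on the slide.
seconds_per_iteration = {"Distributed R": 2.05, "Hadoop": 161.0, "Spark": 64.85}

baseline = seconds_per_iteration["Distributed R"]
slowdown = {name: t / baseline
            for name, t in seconds_per_iteration.items()
            if name != "Distributed R"}

for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.1f}x the time per iteration of Distributed R")
```

On this benchmark, that works out to roughly 78.5x for Hadoop and 31.6x for Spark.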

Summary
- Big data requires complex analysis: machine learning, graph processing, etc.
- Matrices and arrays are surprisingly handy data structures
- Distributed R simplifies distributed analytics
- Still, much remains: formalism, single node performance, compiler improvements, package contributions

Thank you
HPL team: Alvin AuYoung, Rob Schreiber, Vanish Talwar
Interns: Shivaram Venkataraman (UC Berkeley), Erik Bodzsar (U. Chicago), Kyungyong Lee (UFL)
HP Vertica developers
Collaborators: Prof. Andrew Chien (U. Chicago)
http://www.hpl.hp.com/research/distributedr.html