Presto/Blockus: Towards Scalable R Data Analysis



Similar documents
Distributed R for Big Data

Large-Scale Data Processing

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Big Graph Processing: Some Background

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Hadoop Parallel Data Processing

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Spark: Making Big Data Interactive & Real-Time

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Package HPdcluster. R topics documented: May 29, 2015

Processing NGS Data with Hadoop-BAM and SeqPig

Spark: Cluster Computing with Working Sets

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Machine Learning over Big Data

Distributed R for Big Data

Challenges for Data Driven Systems

Apache Flink Next-gen data analysis. Kostas

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Fast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction

DATA ANALYSIS II. Matrix Algorithms

GraySort on Apache Spark by Databricks

Hadoop-BAM and SeqPig

Introduction to DISC and Hadoop

Scaling Out With Apache Spark. DTL Meeting Slides based on

Spark and the Big Data Library

Big Data Systems CS 5965/6965 FALL 2015

The Stratosphere Big Data Analytics Platform

Unified Big Data Processing with Apache Spark. Matei

Streaming Data, Concurrency And R

Map-Reduce for Machine Learning on Multicore

Hadoop: Embracing future hardware

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein

Big Data and Scripting Systems beyond Hadoop

Unified Big Data Analytics Pipeline. 连 城

HIGH PERFORMANCE BIG DATA ANALYTICS

Bringing Big Data Modelling into the Hands of Domain Experts

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Future Prospects of Scalable Cloud Computing

Big Data Analytics. Lucas Rego Drumond

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Management & Analysis of Big Data in Zenith Team

Parallel Computing for Data Science

Architecture Support for Big Data Analytics

Parallel Computing. Benson Muite. benson.

Hadoop SNS. renren.com. Saturday, December 3, 11

The Hadoop Implementation. Thomas Zimmermann Philipp Berger

Big Data for Big Intel

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits

Architectures for Big Data Analytics A database perspective

Big Data and Scripting Systems build on top of Hadoop

Beyond Hadoop with Apache Spark and BDAS

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Systems and Algorithms for Big Data Analytics

Ali Ghodsi Head of PM and Engineering Databricks

Duke University

Big Data Analytics. Lucas Rego Drumond

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Outdated Architectures Are Holding Back the Cloud

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Big Fast Data Hadoop acceleration with Flash. June 2013

Report: Declarative Machine Learning on MapReduce (SystemML)

MLlib: Scalable Machine Learning on Spark

Development of nosql data storage for the ATLAS PanDA Monitoring System

HPC ABDS: The Case for an Integrating Apache Big Data Stack

CSE-E5430 Scalable Cloud Computing Lecture 11

An Open Source Memory-Centric Distributed Storage System

SIAM PP 2014! MapReduce in Scientific Computing! February 19, 2014

SEIZE THE DATA SEIZE THE DATA. 2015

Fast Analytics on Big Data with H20

YALES2 porting on the Xeon- Phi Early results

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Big-data Analytics: Challenges and Opportunities

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Conquering Big Data with BDAS (Berkeley Data Analytics)

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Analysis of Web Archives. Vinay Goel Senior Data Engineer

BSPCloud: A Hybrid Programming Library for Cloud Computing *

A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Fast Multipole Method for particle interactions: an open source parallel library component

FPGA-based Multithreading for In-Memory Hash Joins

Nobody ever got fired for using Hadoop on a cluster

MATE-EC2: A Middleware for Processing Data with AWS

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

How To Create A Graphlab Framework For Parallel Machine Learning

Search Engine Architecture

Application of Predictive Analytics for Better Alignment of Business and IT

Hadoop & Spark Using Amazon EMR

Apache Hama Design Document v0.6

A survey on platforms for big data analytics

Transcription:

/Blockus: Towards Scalable R Data Analysis Andrew A. Chien University of Chicago and Argonne ational Laboratory IRIA-UIUC-AL Joint Institute Potential Collaboration ovember 19, 2012 ovember 19, 2012 Andrew A. Chien, 2012 LSSG Research Projects Blockus: Big Computations on small Memories GVR: Resilient Computation at massive scale 10x10: Systematic Heterogeneous, Energy-efficient Architecture Page 1

ovember 19, 2012 Andrew A. Chien, 2012 History: Towards Scalable R (2010-) Started by HP, collaboration with Berkeley Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Robert Schreiber, Erik Bodzsar Blockus (2011-) Started by Uchicago, merged as a collaboration with Erik Bodzsar, Andrew Chien, Indrajit Roy, Robert Schreiber, Partha Ranganathan => One larger project /Blockus Outline Motivation Programming model Applications and Results 4 ovember 19, 2012 Andrew A. Chien, 2012 Page 2

Big data analytics in R What is R? R is a programming language and environment for statistical computing Array-oriented Large, diverse user base Examples of big data applications of R Machine Learning Graph Algorithms Bioinformatics 5 Big Data, Complex Algorithms PageRank (Dominant eigenvector) Machine learning + Graph algorithms Recommendations (Matrix factorization) Iterative Linear Algebra Operations Anomaly detection (Top-K eigenvalues) 6 ovember 19, 2012 User Importance (Vertex Centrality) Andrew A. Chien, 2012 Page 3

Challenge 1: R is not an efficient, parallel system R is single-threaded Multi-process solutions offered by extensions Threads/processes share data through pipes or network Time-inefficient (sending copies) Space-inefficient (extra copies) R instance data copy R instance copy of data 7 Challenge 2: R is memory-bound Current research solution: bigmemory package Uses custom bigarray objects Relies on mmap and OS paging à Inefficient eed custom functions to access mmap contents 8 Page 4

Outline Motivation Programming model Applications and Results 9 ovember 19, 2012 Andrew A. Chien, 2012 : distributed execution engine for R Transparent data distribution and parallel execution Workers are still single-threaded and memory-bound Master execution Task dispatch Worker Data transfer R instance Task execution Data storage Worker Data transfer R instance Task execution Data storage 10 Page 5

Added Feature #1: Distributed arrays Relies on user-defined data partitioning Computation is expressed over partitions s s Worker 1 Worker 2 Worker /s 11 Added Feature #2: Versioning + Change Triggered Computation Mechanics of versioning update: Increment version number onchange: Bind a version number for the array before executing the handler 12 ovember 19, 2012 Andrew A. Chien, 2012 Page 6

Incremental Updates ew movie ratings Refine recommendations Better suggestions Incremental computation on consistent view of data 13 ovember 19, 2012 Andrew A. Chien, 2012 PageRank Using s s 14 M darray(dim=c(,),blocks=(s,)) P darray(dim=c(,1),blocks=(s,1)) while(..){ foreach(i,1:len, calculate(p=splits(p,i),m=splits(m,i), x=splits(p_old),z=splits(z,i)) { p ß (m*x)+z } ) P_old P } ovember 19, 2012 P M Andrew A. Chien, 2012 P_ol d Z Create Distributed Array Page 7

PageRank Using s s M darray(dim=c(,),blocks=(s,)) P darray(dim=c(,1),blocks=(s,1)) while(..){ foreach(i,1:len, calculate(p=splits(p,i), m=splits(m,i), x=splits(p_old), z=splits(z,i)) { p ß (m*x)+z } ) P_old P 15 } ovember 19, 2012 P M Andrew A. Chien, 2012 P_ol d Z Execute function in a cluster Pass array partitions Incremental PageRank s s 16 M darray(dim=c(,),blocks=(s,)) P darray(dim=c(,1),blocks=(s1)) onchange(m) { while(..){ foreach(i,1:len, calculate(p=splits(p,i), m=splits(m,i), x=splits(p_old), z=splits(z,i)) { p ß (m*x)+z update(p) }) P_old P ovember }} 19, 2012 P M Andrew A. Chien, 2012 P_ol d Z Execute when data changes Update page rank vector Page 8

Outline Motivation Programming model Applications and Results - Scale Out - Scale Vertical 17 ovember 19, 2012 Andrew A. Chien, 2012 Applications Implemented in 18 Application Algorithm LOC PageRank Eigenvector calculation 41 Triangle counting Top-K eigenvalues 121 Fewer than 140 lines of code etflix recommendation Matrix factorization 130 Centrality measure Graph algorithm 132 k-path connectivity Graph algorithm 30 k-means Clustering 71 Sequence alignment Smith-Waterman 64 ovember 19, 2012 Andrew A. Chien, 2012 Page 9

Achieving High Performance (w/ R runtime) Good Parallelism Load Balance Efficient Exploitation of Multicore (SM, memory management)... A whole list of other things... 19 vs Hadoop: PageRank Hadoop PageRank Time per iteration (sec) 500 200 100 50 20 10 5 2 1 Hadoop mem Hadoop mem Transfers Hadoop mem 8 16 32 64 # cores Compute Hadoop mem 20 Page 10

vs. Spark PageRank Spark PageRank Time per iteration (sec) 500 200 100 50 20 10 5 2 1 Spark Spark Transfers Spark 8 16 32 64 # cores Compute Spark 21 Performance improvement with multi-core support Average iteration time (s) 70 60 50 40 30 20 10 0 Multi-core support Multiple workers per machine Multi-core support Pagerank iteration time (lower is better) Multiple workers per machine Multi-core support Multiple workers per machine Multi-core support 1 2 4 8 umber of cores utilized per machine Multiple workers per machine etwork Series2 Pagerank on 13 GB dataset (100M nodes, 1.2B edges) on 5 machines, varying the number of cores utilized per machine Machine configuration: 12 cores, 96GB RAM 22 Page 11

Outline Motivation Programming model Applications and Results - Scale Out - Scale Vertical 23 ovember 19, 2012 Andrew A. Chien, 2012 Blockus architecture Worker I/O engine: executes all I/O operations Scheduler: performs I/O and task scheduling Master Scheduler Worker Worker Executor pool I/O Engine Executor pool I/O Engine R instance R instance R instance DRAM SSDs HDDs R instance R instance R instance DRAM SSDs HDDs 24 Page 12

Simple Blockus faster than OS paging Pagerank on 24.5GB dataset (134M nodes, 2B edges) In-memory - 4 cores, 96 GB memory Out-of-core 8GB memory, 200MB/s SSD with 197 MB/s read MMAP OS based IO scheduling Blockus simple prefetch IO scheduling mmap Blockus In-memory Average iteration time (seconds) 324 171 113 Speedup over mmap 1 1.89 2.86 25 Blockus Cost-effective Quantify out-of-core computation cost-effectiveness with EC2 prices Data set size is 24.5GB (in memory) Assuming SSD available (not true on EC2) Memory (total) Cores (total) Cost ($/ hour) Approx. iteration time (s) ormalized cost 1 Large Instance 1 High- Memory Double XL Instance 4 Large Instances 7.5GB 4 0.320 171 1 34.2GB 12 0.900 70 1.14 30GB 16 1.280 49 1.14 Out-of-core 26 Page 13

Future Work: Scheduler policies scheduler Assumes everything fits in DRAM Schedules each task on worker which has most bytes of its input arrays Transfers non-local input data greedily (no network scheduling) Blockus: better scheduling policies Load balancing (in memory) Intelligent Prefetching Intelligent computation prioritization Adaptive Caching 27 Summary Broad focus on Big Data and applications Surprising how broadaly useful linear algebra abstraction is Easily express machine learning, graph algorithms Challenges: Sparse matrices, Incremental data /Blockus extends R Exploit for Scale Out and Scale Vertical Open source version soon 28 ovember 19, 2012 Andrew A. Chien, 2012 Page 14

Related Work on-r Systems - SciPy parallel work (IPython1, MPI4py, parallel python, POSH), - Star-P (MIT) - MR/Hadoop, Bloom, Spark(Berkeley), Pig (Yahoo!), Ciel (Cambridge/) - Pregel, Pregel 2? (Google) - Graphlab, (CMU) Parallel/Distributed R systems threads, PVM/MPI, GridR - Multicore: threads - Rmpi, Snow: parallel abstraction over Rmpi, map/reduce - Rmr: interface to Hadoop (by Revolution Analytics) - GridR Globus/Condor access, SwiftR - Bigmemory: mmap arrays -... 29 ovember 19, 2012 Andrew A. Chien, 2012 Page 15