How To Write A Data Processing Pipeline In R




New features and old concepts for handling large and streaming data in practice
Simon Urbanek, R Foundation

Overview
- Motivation
- Custom connections
- Data processing pipelines
- Parallel processing
- Back-end experiments: Hadoop, RDFS
- Call for participation

Motivation
R's in-memory model is fast:
- RAM prices are declining steadily (unlike CPUs) [ca. $8/GB for server RAM now]
- a billion+ rows are workable in R

Problem 1: parallelization

    s <- split(df, ...)   ### slow and inefficient!
    y <- mclapply(s, function(x) ...)

- splitting up the data is expensive

Problem 2: streaming
- conceptually, we cannot have all the data at once
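
For concreteness, here is a runnable version of that inefficient pattern (the data frame and grouping are invented for illustration; mclapply's fork-based parallelism assumes a Unix-like OS):

    library(parallel)

    df <- data.frame(g = rep(1:8, each = 1e5), x = rnorm(8e5))

    ## split() copies the data into per-group pieces -- this is the
    ## slow, memory-hungry step the slide warns about
    s <- split(df, df$g)
    y <- mclapply(s, function(x) mean(x$x), mc.cores = 4)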

Old, simple idea: chunking
Process data in (big) chunks.
Parallelization:
- feed each process/worker with chunks, collect the results
- chunks can be processed in parallel (if the processing can be independent); no copying
Streaming:
- keep a mutable state
- process chunks as they come in, modifying the state and creating results
Issue: R has no explicit framework/API for this (though the loop itself is easy to write by hand, as sketched below).
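
A minimal base-R sketch of the chunking loop, assuming a headerless two-column file big.csv (hypothetical): read a chunk, update a running state, repeat.

    con <- file("big.csv", "r")            # open once, read in pieces
    state <- c(sum = 0, n = 0)             # mutable state across chunks
    repeat {
        d <- tryCatch(read.table(con, nrows = 10000,
                                 col.names = c("key", "x")),
                      error = function(e) NULL)  # NULL once input is exhausted
        if (is.null(d)) break
        state <- state + c(sum(d$x), nrow(d))    # update from this chunk
    }
    close(con)
    state[["sum"]] / state[["n"]]          # e.g. a streaming mean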

Connections
R has connections: an abstraction for data access and transport
- completely back-end opaque!
New in R 3.0.0: custom connection support
- packages can create new connection implementations
- some examples:
  - zmqc: 0MQ PUB/SUB connections; read from 0MQ streaming feeds directly
  - hdfsc: read files from HDFS, just like any other file

    f = HDFS("/data/foo")
    d = read.table(f)
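
The point of back-end opacity is that the reading code does not change with the source. A small base-R illustration (the file names and URL are hypothetical); a zmqc or hdfsc connection would slot into the same place:

    read_head <- function(con) read.table(con, header = TRUE, nrows = 5)

    read_head(file("local.csv"))                    # local file
    read_head(url("http://example.com/data.csv"))   # remote, same code
    read_head(gzfile("local.csv.gz"))               # compressed, same code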

Data pipeline

    mean(read.table(hdfs("foo"))$x)

Source (delivers data)
    -> connection (text or binary)
Data parser (converts the data format to R objects)
    -> data frame (or other R-native object)
Filtering, processing, computing, ...
    -> result (aggregates, models, graphics, ...)
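
With a local file standing in for the HDFS source, the one-liner decomposes into exactly these stages (the file name is hypothetical):

    src <- file("points.csv")              # source: delivers a connection
    d   <- read.table(src, header = TRUE)  # parser: format -> data frame
    d   <- d[d$x > 0, ]                    # filtering / processing
    mean(d$x)                              # result: an aggregate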

Streaming
The same pipeline, with a mutable state attached to the processing stage:

Source (delivers data)
    -> connection (text or binary)
Data parser (converts the data format to R objects)
    -> data frame (or other R-native object)
Filtering, processing, computing, ... [mutable state]
    -> result (aggregates, models, graphics, ...)

Parallel processing
One source fans out to several parser/compute pipelines, whose outputs are collected:

Source (delivers data)
    -> Data parser | Data parser | Data parser
    -> Computing, ... | Computing, ... | Computing, ...
    -> results

Proposal: chunks in a pipeline
- Connections define the available classes of data sources; read from sources in big chunks. (contribute!)
- Parsers transform the data representation into R objects. (contribute!)
- Compute: algorithms that work on chunks. (contribute!)
  (serial processing + mutable state = streaming; independence = parallel)
- Collect: algorithms to combine parallel chunks. (contribute!)
A hypothetical sketch of how these roles could fit together is shown below.
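
This is only an illustration of the proposed division of labor, not an existing API; all names are made up:

    ## hypothetical sketch: source yields raw chunks, parser converts them,
    ## compute works per chunk, collect combines the partial results
    run_pipeline <- function(source, parser, compute, collect) {
        partial <- list()
        repeat {
            raw <- source()                  # next raw chunk, NULL when done
            if (is.null(raw)) break
            partial[[length(partial) + 1L]] <- compute(parser(raw))
        }
        collect(partial)
    }

Running the loop serially against a mutable state gives streaming; handing the compute(parser(raw)) calls to independent workers gives parallelism.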

Example: streaming
Use 0MQ PUB/SUB: it is buffered per subscriber (slow subscribers don't affect others; dropped records can be detected via framing). Read chunks of at most max = 1000 lines, parse them (read.table), update the mutable structure (hash table, model, ...), and serve the results (file output, web service via Rhttpd, ...):

    feed = zmqc("ipc:///my-feed.0mq", "r")
    max = 1000
    state = numeric()
    while (TRUE) {
        d = read.table(feed, FALSE, nrows=max)
        mix = c(state * 0.9, table(d[,2]))
        state = tapply(mix, names(mix), sum)
        if (any(state <= 1)) state = state[state > 1]
    }

Here the state is a decayed running count per key: each iteration multiplies the previous counts by 0.9, adds the counts from the new chunk, and prunes entries that have decayed to 1 or below.

Parallel processing
At least three stages:
- split (often implicit)
- compute
- combine
Define functions using this paradigm. Simple examples:

    cc.sum   <- function(x) cc(x, sum, sum)
    cc.table <- function(x) cc(x, table,
                               function(x) tapply(x, names(x), sum))
    cc.mean  <- function(x) cc(x, function(x) c(sum(x), length(x)),
                               function(x) sum(x[1,]) / sum(x[2,]))
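
The talk does not define cc() itself; one plausible toy reading, purely as an illustration (it binds unnamed per-chunk results into columns and concatenates named ones such as tables, so the combine functions above receive what they expect):

    ## toy cc(): compute on each chunk, then combine -- illustrative only
    cc <- function(chunks, chunk_fn, combine_fn) {
        part <- lapply(chunks, chunk_fn)
        joined <- if (is.null(names(part[[1]])))
            do.call(cbind, part)    # unnamed results: one column per chunk
        else
            do.call(c, part)        # named results (tables): concatenate
        combine_fn(joined)
    }

    x <- rnorm(1e6)
    chunks <- split(x, rep(1:4, each = 2.5e5))
    cc.mean(chunks)    # same value as mean(x)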

Practical considerations
- The implementation can be seamless: use a special distributed vector class and dispatch on it.
- Typically the source is big, so the splitting is implicit, since the data does not reside in R (e.g., a sequence in a file).
- Leverage distributed storage: run the computation where the chunks are stored.
Examples:
- Hadoop
- RDFS
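
The "seamless" dispatch idea can be pictured with a hypothetical S3 class (none of this is an actual package class): a vector-like object that only holds references to chunk files, with methods that compute chunk by chunk:

    ## hypothetical: a "distributed" vector that never loads all data at once
    dvec <- function(files) structure(list(files = files), class = "dvec")

    ## sum() is an internal generic, so it dispatches on the class
    sum.dvec <- function(x, ...)
        sum(vapply(x$files,
                   function(f) sum(scan(f, quiet = TRUE)),
                   numeric(1)))

    ## d <- dvec(c("chunk-1.txt", "chunk-2.txt"))  # hypothetical chunk files
    ## sum(d)                                      # chunk-wise, not in-RAM

In a real system the per-chunk sums would be computed on the nodes that store the chunks, which is the point of the third bullet above.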

Hadoop
- A lot of companies invest in Hadoop clusters (we have to live with it, even if there are many better solutions).
- Literal map/reduce based on key/value pairs is very inefficient for R, since it is not a vector operation.
- Hadoop can be (ab)used for chunk-wise processing via its streaming mode: use HDFS chunks as the input, compute is a map over the entire chunk, combine is the reduce.
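
Hadoop's streaming mode feeds input splits to any executable over stdin, so an R mapper can treat its whole split as one chunk and aggregate it in a single vectorized step. A hedged sketch (the field layout is made up; it would be wired in with the usual hadoop-streaming -mapper invocation):

    #!/usr/bin/env Rscript
    ## mapper.R -- hypothetical streaming mapper: read the whole split,
    ## aggregate it as a vector operation, emit key/value lines
    d   <- read.table(file("stdin"), sep = "\t")
    agg <- table(d[, 2])
    cat(sprintf("%s\t%d\n", names(agg), agg), sep = "")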

Example
Aggregate point locations by ZIP code (match points against ZCTA shapefiles from the 2010 US Census):

    r <- read.table(hmr(hinput("/data/2013/06"),
           function(x) table(zcta2010.db()[inside(zcta2010.shp(),
                                                  x[,4], x[,5]), 1]),
           function(x) ctapply(x, names(x), sum)))

- fairly native R programming
- implicit defaults (read.table parser, conversion of named vectors to key/value entries)
- the result is an HDFS connection

R Distributed File System (RDFS): an experiment
- Purely R-based (R client, R server, R code)
- Uses Rserve for fast access (no setup cost, optional authentication, user switching, transport encryption for free)
- Any storage available (RData, ASCII, ...); all storage is R-native, so the parsing step can be removed
- No name node; all nodes are equal
- Scales only to moderate cluster sizes (hundreds of nodes), but is very fast (milliseconds for job setup, no need to leave R)
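
RDFS itself is experimental, but the Rserve transport it builds on can be sketched with the RSclient package; the node name and chunk path here are hypothetical:

    library(RSclient)                      # client side of Rserve

    node <- RS.connect("node07.cluster")   # connect to a storage node
    ## evaluate where the chunk lives: R-native storage needs no parsing
    part <- RS.eval(node, sum(readRDS("/rdfs/chunks/part-0001.rds")$x))
    RS.close(node)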

Call for participation
- More users, more use cases: is this powerful enough? If not, what is missing?
- Make it part of R, so developers can rely on it.
- Start writing functions and packages; help to create critical mass.
- Theoretical work: methods and approaches that give bounds for approximation error, necessary assumptions, etc.

Related work
- Purdue University: Divide/Recombine
  - results for linear model approximations
  - RHIPE: a very specialized vehicle for the above, tied to a specific version and brand of Hadoop
- Iterators (also used by foreach)
  - the idea of running code in iterations; does include chunks
  - focused on the inner code (chunk processing)

Conclusions
- New in R 3.0.0: custom connections, to be used as building blocks for data pipelines.
- Read from connections in chunks, compute, and collect.
- A generic framework that can be applied to both streaming and parallel processing.
- Let us work together to see if it is powerful enough to build an official R interface that everyone can use and contribute to.
- Back-end agnostic: being tested on Hadoop and RDFS.

Contact
Most packages are available on RForge.net (source also on GitHub):
- http://rforge.net/zmqc
- http://rforge.net/hdfsc
Remaining packages (iotools, rdfs, ...) are in the process of being pushed; check RForge.net and https://github.com/s-u

Simon Urbanek
simon.urbanek@r-project.org
http://urbanek.info