Bigger data analysis. Hadley Wickham (@hadleywickham), Chief Scientist, RStudio. Thursday, July 18, 2013


http://bit.ly/bigrdata3
Bigger data analysis
Hadley Wickham, @hadleywickham
Chief Scientist, RStudio
July 2013

http://bit.ly/bigrdata3
1. What is data analysis?
2. Transforming data
3. Visualising data

What is data analysis?

Data analysis is the process by which data becomes understanding, knowledge and insight.

[Diagram: the analysis cycle: Tidy, Transform, Visualise, Model]

Frequent data analysis: learn to program. (Image: http://www.flickr.com/photos/compleo/5414489782)

Cognition time vs. computation time. (Image: http://www.flickr.com/photos/mutsmuts/4695658106)

[Diagram: the analysis cycle annotated with packages: Tidy (reshape2, stringr, lubridate), Transform (plyr), Visualise (ggplot2), Model]

Computation time vs. cognition time.

[Diagram: the analysis cycle annotated with new packages: Tidy, Transform (dplyr), Visualise (bigvis), Model]

Data: every commercial US flight 2000-2011: ~76 million flights. Total database: ~11 GB. >100 variables, but I'll focus on a handful: airline, delay, distance, flight time and speed.

Transformation

Split, apply, combine: split the data frame by name, apply a summary (total = sum(n)) to each piece, then combine the pieces back together.

  Input:         Split and apply:      Combine:
  name  n        Al: 2        -> 2     name  total
  Al    2        Bo: 4, 0, 5  -> 9     Al    2
  Bo    4        Ed: 5, 10    -> 15    Bo    9
  Bo    0                              Ed    15
  Bo    5
  Ed    5
  Ed    10
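A minimal sketch of the same computation, first by hand in base R and then with plyr (the toy name/n data is taken from the diagram above):

df <- data.frame(
  name = c("Al", "Bo", "Bo", "Bo", "Ed", "Ed"),
  n    = c(2, 4, 0, 5, 5, 10)
)

# Base R: do the three steps by hand
pieces <- split(df$n, df$name)                    # split
totals <- sapply(pieces, sum)                     # apply
data.frame(name = names(totals), total = totals)  # combine

# plyr collapses all three steps into one call
library(plyr)
ddply(df, "name", summarise, total = sum(n))
#   name total
# 1   Al     2
# 2   Bo     9
# 3   Ed    15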

plyr functions are named by input type (rows) and output type (columns):

                      array   data frame   list    nothing
  array               aaply   adply        alply   a_ply
  data frame          daply   ddply        dlply   d_ply
  list                laply   ldply        llply   l_ply
  n replicates        raply   rdply        rlply   r_ply
  function arguments  maply   mdply        mlply   m_ply
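To see the naming scheme in action, a short sketch reusing the toy data frame df from above: the first letter picks the input type, the second the output type.

library(plyr)
# d*ply: data frame in; the second letter picks the output
ddply(df, "name", summarise, total = sum(n))     # data frame -> data frame
dlply(df, "name", summarise, total = sum(n))     # data frame -> list
daply(df, "name", function(piece) sum(piece$n))  # data frame -> array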

[Chart: survey counts (0-150) of how often each plyr function is used, from "Never" to "All the time"; ddply, ldply, dlply and llply are used most, the a*ply and *_ply variants least.]

Data analysis verbs:
- select: subset variables
- filter: subset rows
- mutate: add new columns
- summarise: reduce to a single row
- arrange: re-order the rows
...plus group by, which makes each verb operate per group (see the sketch below).
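A minimal sketch of the verbs on a toy flights-like data frame (the column names here are illustrative, not the real ontime schema):

library(dplyr)

flights <- data.frame(
  Year  = 2011,
  Month = c(1, 1, 2),
  Delay = c(10, -3, 25),
  Dist  = c(224, 1389, 972)
)

select(flights, Year, Month, Delay)          # subset variables
filter(flights, Delay > 0)                   # subset rows
mutate(flights, LongHaul = Dist > 1000)      # add a new column
summarise(flights, MeanDelay = mean(Delay))  # reduce to a single row
arrange(flights, desc(Delay))                # re-order the rows

# group by: the same verbs, now applied per group
by_month <- group_by(flights, Year, Month)
summarise(by_month, n = n(), MeanDelay = mean(Delay))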

h <- readRDS("houston.rdata")
# ~2,100,000 x 6, ~57 MB; not huge, but substantial

library(plyr)
ddply(h, c("Year", "Month", "DayofMonth"), summarise,
      n = length(Year))
#  user  system elapsed
# 2.320   0.330   2.649

count(h, c("Year", "Month", "DayofMonth"))
#  user  system elapsed
# 0.687   0.183   0.869

# Often work with the same grouping variables
# multiple times, so define them upfront. Also refer
# to variables in the same way.
library(dplyr)
daily_df <- group_by(h, Year, Month, DayofMonth)

# Now summarise knows how to deal with grouped
# data frames
summarise(daily_df, n())
#  user  system elapsed
# 0.095   0.015   0.110
# 20x faster!

library(data.table)
h_dt <- data.table(h)
daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())
#  user  system elapsed
# 0.045   0.000   0.045
# Exactly the same syntax, but 2.5x faster!
# Don't need to learn the idiosyncrasies of
# data.table; just 2 lines of code

# And dplyr also works seamlessly with databases:
ontime <- source_sqlite("flights.sqlite3", "ontime")
h_db <- filter(ontime, Origin == "IAH")
daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())
#   user  system elapsed
# 22.190   0.546  22.734
#   user  system elapsed
#  5.565   0.425   5.986
# Much slower, but not restricted to a predefined subset.
# Could speed up by carefully crafting indices.

# Behind the scenes
library(dplyr)
ontime <- source_sqlite("../flights.sqlite3", "ontime")

translate_sql(Year > 2005, ontime)
# <SQL> Year > 2005.0
translate_sql(Year > 2005L, ontime)
# <SQL> Year > 2005
translate_sql(Origin == "IAD" | Dest == "IAD", ontime)
# <SQL> Origin = 'IAD' OR Dest = 'IAD'

years <- 2000:2005
translate_sql(Year %in% years, ontime)
# <SQL> Year IN (2000, 2001, 2002, 2003, 2004, 2005)

Data sources:
- Data frames (dplyr)
- Data tables (dplyr)
- SQLite tables (dplyr)
- PostgreSQL, MySQL, SQL Server, ...
- MonetDB (planned)
- Google BigQuery (bigrquery)
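For readers trying this with the dplyr that later shipped to CRAN, the pre-release source_sqlite() shown above became src_sqlite() plus tbl(); a sketch under that assumption, using the same flights.sqlite3 file:

library(dplyr)
db <- src_sqlite("flights.sqlite3")   # connect to the database
ontime <- tbl(db, "ontime")           # lazily reference a table
h_db <- filter(ontime, Origin == "IAH")
daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n = n())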

daily_df <- group_by(h, Year, Month, DayofMonth)
summarise(daily_df, n())

daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())

daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())

# It doesn't matter how your data is stored

# It might even live on the web
library(bigrquery)
library(dplyr)
h_bq <- source_bigquery(billing_project, "ontime", "houston")
daily_bq <- group_by(h_bq, Year, Month, DayofMonth)
system.time(summarise(daily_bq, n()))
# ~2 seconds

# Storage = $80 / TB / month
# Query = $35 / TB (100 GB free)
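At those rates, storing the ~11 GB flight database costs about $80 x 0.011 TB, roughly $0.88 a month, and a query that scans the whole table falls comfortably within the 100 GB free query tier.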

dplyr is currently experimental and incomplete, but it works, and you're welcome to try it out:

library(devtools)
install_github("assertthat")
install_github("dplyr")
install_github("bigrquery")

Needs a development environment (http://www.rstudio.com/ide/docs/packages/prerequisites).

Google for: split apply combine dplyr

Visualisation

library(ggplot2)
library(bigvis)

# Can't use data frames :(
dist <- readRDS("dist.rds")
delay <- readRDS("delay.rds")
time <- readRDS("time.rds")
speed <- dist / time * 60

# There's always bad data
time[time < 0] <- NA
speed[speed < 0] <- NA
speed[speed > 761.2] <- NA

qplot(dist, speed, colour = delay) + scale_colour_gradient2()

One hour later...

qplot(dist, speed, colour = delay) + scale_colour_gradient2()

x <- runif(2e5)
y <- runif(2e5)
system.time(plot(x, y))
#  user  system elapsed
# 2.785   0.010   2.806

Goals: support exploratory analysis (e.g. in R); fast on commodity hardware: 100,000,000 obs in <5s. 10^8 obs at 8 bytes each = 0.8 GB per variable, so ~20 variables fit in 16 GB of RAM.

Insight: the bottleneck is the number of pixels: 1d: ~3,000; 2d: ~3,000,000. Process: condense (bin & summarise), smooth, visualise.

[Diagram: binning x into fixed-width bins defined by an origin and a width]

Summarise within each bin:
  Count     -> histogram; with smoothing: KDE
  Mean      -> regression; with smoothing: loess
  Std. dev.
  Quantiles -> boxplots; with smoothing: quantile regression
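A minimal base-R sketch of the condense-then-smooth idea (bin_index, the simulated data and the moving-average smooth are illustrative, not the bigvis internals):

# A toy version of condense: fixed-width bins, then per-bin summaries
bin_index <- function(x, origin = 0, width = 1) {
  floor((x - origin) / width)
}

x <- runif(1e6, 0, 5000)   # stand-in for dist
z <- rnorm(1e6)            # stand-in for speed

idx    <- bin_index(x, origin = 0, width = 10)
counts <- tapply(z, idx, length)  # count per bin -> histogram-style summary
means  <- tapply(z, idx, mean)    # mean per bin  -> regression-style summary

# A simple moving-average smooth over the binned counts
smoothed <- stats::filter(as.numeric(counts), rep(1/5, 5))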

dist_s <- condense(bin(dist, 10))
autoplot(dist_s)
#  user  system elapsed
# 2.642   0.972   3.613
[Plot: binned counts of dist; counts up to ~1,500,000 over dist 0-5000]

time_s <- condense(bin(time, 1))
autoplot(time_s)
[Plot: binned counts of time 0-3000, with a prominent NA bin]

autoplot(time_s, na.rm = TRUE)
[Plot: binned counts of time 0-1000, NAs removed]

autoplot(time_s[time_s < 500, ])
[Plot: binned counts of time 0-500]

autoplot(time_s %% 60)
[Plot: binned counts of time modulo 60, 0-60]

sd1 <- condense(bin(dist, 10), z = speed)
autoplot(sd1) + ylab("speed")
#  user  system elapsed
# 2.568   0.767   3.339
[Plot: mean speed vs dist 0-5000, coloured by count on a log scale]

sd2 <- condense(bin(dist, 20), bin(speed, 20))
autoplot(sd2)
#  user  system elapsed
# 7.366   1.190   8.552
[Plot: 2d binned counts of speed (0-800) vs dist (0-5000)]

Demo: shiny::runApp("mt/", 8002)

Google for: bigvis

Conclusions

[Diagram: the analysis cycle annotated with new packages: Tidy, Transform (dplyr), Visualise (bigvis), Model]