Bigger data analysis
Hadley Wickham (@hadleywickham), Chief Scientist, RStudio
July 2013
http://bit.ly/bigrdata3
http://bit.ly/bigrdata3
1. What is data analysis?
2. Transforming data
3. Visualising data
What is data analysis?
Data analysis is the process by which data becomes understanding, knowledge and insight
[Diagram: the data analysis cycle — Tidy, Transform, Visualise, Model]
Frequent data analysis → learn to program (photo: http://www.flickr.com/photos/compleo/5414489782)
Cognition time vs. computation time (photo: http://www.flickr.com/photos/mutsmuts/4695658106)
The cycle with current tools: Tidy (reshape2, stringr, lubridate), Transform (plyr), Visualise (ggplot2), Model
Computation time vs. cognition time
The cycle with new tools: Tidy, Transform (dplyr), Visualise (bigvis), Model
Data
Every commercial US flight, 2000-2011: ~76 million flights.
Total database: ~11 Gb.
>100 variables, but I'll focus on a handful: airline, delay, distance, flight time and speed.
Transformation
Split → Apply → Combine, on a toy table of (name, n): split by name (Al: 2; Bo: 4, 0, 5; Ed: 5, 10), apply sum(n) to each piece, then combine the results (Al 2, Bo 9, Ed 15).
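A minimal sketch of split-apply-combine with plyr, using the toy data from the diagram above:

library(plyr)
df <- data.frame(
  name = c("Al", "Bo", "Bo", "Bo", "Ed", "Ed"),
  n    = c(2, 4, 0, 5, 5, 10)
)
ddply(df, "name", summarise, total = sum(n))
#   name total
# 1   Al     2
# 2   Bo     9
# 3   Ed    15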
plyr functions, named by input (rows) × output (columns):

                 array   data frame   list    nothing
array            aaply   adply        alply   a_ply
data frame       daply   ddply        dlply   d_ply
list             laply   ldply        llply   l_ply
n replicates     raply   rdply        rlply   r_ply
function args    maply   mdply        mlply   m_ply
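The naming convention is mechanical: the first letter gives the input type, the second the output type. A quick illustration with a toy list:

library(plyr)
x <- list(a = 1:3, b = 4:6)
laply(x, mean)   # list in, array out
ldply(x, mean)   # list in, data frame out
llply(x, mean)   # list in, list out
l_ply(x, mean)   # list in, nothing out (side effects only)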
[Chart: survey of how often each plyr function is used (never / occasionally / often / all the time); ddply, ldply, dlply and llply dominate, while the array and side-effect (_ply) variants are rarely used]
Data analysis verbs
select: subset variables
filter: subset rows
mutate: add new columns
summarise: reduce to a single row
arrange: re-order the rows
Data analysis verbs + group by
select: subset variables
filter: subset rows
mutate: add new columns
summarise: reduce to a single row (per group)
arrange: re-order the rows
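A hedged sketch of the verbs in dplyr on a toy data frame (the column names are illustrative; the next slides apply the same verbs to the real flight data):

library(dplyr)
df <- data.frame(
  name  = c("Al", "Bo", "Bo", "Ed"),
  delay = c(10, 25, 5, 40)
)
small   <- select(df, name, delay)             # subset variables
late    <- filter(small, delay > 15)           # subset rows
late    <- mutate(late, delay_hr = delay / 60) # add new columns
arrange(late, desc(delay))                     # re-order the rows
by_name <- group_by(late, name)                # group by
summarise(by_name, n = n())                    # reduce each group to one row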
h <- readRDS("houston.rdata")
# ~2,100,000 x 6, ~57 meg; not huge, but substantial

library(plyr)
ddply(h, c("Year", "Month", "DayofMonth"), summarise, n = length(Year))
#    user  system elapsed
#   2.320   0.330   2.649

count(h, c("Year", "Month", "DayofMonth"))
#    user  system elapsed
#   0.687   0.183   0.869
# We often work with the same grouping variables
# multiple times, so define them up front; this also
# lets us refer to variables in a consistent way.
daily_df <- group_by(h, Year, Month, DayofMonth)

# Now summarise() knows how to deal with grouped
# data frames
summarise(daily_df, n())
#    user  system elapsed
#   0.095   0.015   0.110
# 20x faster!
library(data.table)
h_dt <- data.table(h)

daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())
#    user  system elapsed
#   0.045   0.000   0.045
# Exactly the same syntax, but 2.5x faster!
# No need to learn the idiosyncrasies of
# data.table; just 2 lines of code
# And dplyr also works seamlessly with databases:
ontime <- source_sqlite("flights.sqlite3", "ontime")
h_db <- filter(ontime, Origin == "IAH")

daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())
#    user  system elapsed
#  22.190   0.546  22.734
#    user  system elapsed
#   5.565   0.425   5.986
# Much slower, but not restricted to a predefined
# subset. Could speed up by carefully crafting indices.
# Behind the scenes
library(dplyr)
ontime <- source_sqlite("../flights.sqlite3", "ontime")

translate_sql(Year > 2005, ontime)
# <SQL> Year > 2005.0
translate_sql(Year > 2005L, ontime)
# <SQL> Year > 2005

translate_sql(Origin == "IAD" | Dest == "IAD", ontime)
# <SQL> Origin = 'IAD' OR Dest = 'IAD'

years <- 2000:2005
translate_sql(Year %in% years, ontime)
# <SQL> Year IN (2000, 2001, 2002, 2003, 2004, 2005)
Data sources
Data frames (dplyr)
Data tables (dplyr)
SQLite tables (dplyr)
PostgreSQL, MySQL, SQL Server, ...
MonetDB (planned)
Google BigQuery (bigrquery)
daily_df <- group_by(h, Year, Month, DayofMonth)
summarise(daily_df, n())

daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())

daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())

# It doesn't matter how your data is stored
# It might even live on the web
library(dplyr)
library(bigrquery)

h_bq <- source_bigquery(billing_project, "ontime", "houston")
daily_bq <- group_by(h_bq, Year, Month, DayofMonth)
system.time(summarise(daily_bq, n()))
# ~2 seconds

# Storage = $80 / TB / month
# Query = $35 / TB (100 GB free)
dplyr
Currently experimental and incomplete, but it works, and you're welcome to try it out.

library(devtools)
install_github("assertthat")
install_github("dplyr")
install_github("bigrquery")

Needs a development environment (http://www.rstudio.com/ide/docs/packages/prerequisites)
Google for: split apply combine dplyr
Visualisation
library(ggplot2)
library(bigvis)

# Can't use data frames :(
dist <- readRDS("dist.rds")
delay <- readRDS("delay.rds")
time <- readRDS("time.rds")
speed <- dist / time * 60

# There's always bad data
time[time < 0] <- NA
speed[speed < 0] <- NA
speed[speed > 761.2] <- NA
qplot(dist, speed, colour = delay) + scale_colour_gradient2()
One hour later... qplot(dist, speed, colour = delay) + scale_colour_gradient2()
x <- runif(2e5)
y <- runif(2e5)
system.time(plot(x, y))
#    user  system elapsed
#   2.785   0.010   2.806
Goals
Support exploratory analysis (e.g. in R)
Fast on commodity hardware: 100,000,000 obs in <5s
(10^8 obs = 0.8 Gb, so ~20 vars fit in 16 Gb)
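The arithmetic behind that memory budget, as a quick check:

# 1e8 observations of one double-precision (8-byte) variable:
1e8 * 8 / 1e9   # 0.8 Gb per variable
# so ~20 such variables fill 16 Gb of RAM:
20 * 0.8        # 16 Gb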
Insight
The bottleneck is the number of pixels: 1d ~3,000; 2d ~3,000,000.
Process: condense (bin & summarise), smooth, visualise.
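A minimal sketch of that pipeline with bigvis (the bin width of 10 and bandwidth h = 50 are illustrative guesses, not values from the slides):

library(bigvis)
dist_s  <- condense(bin(dist, 10))  # condense: bin & summarise
dist_sm <- smooth(dist_s, h = 50)   # smooth the binned summaries
autoplot(dist_sm)                   # visualise with ggplot2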
Bin: fixed-width bins of x, defined by an origin and a width.
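Conceptually (a sketch of the idea, not the bigvis internals), fixed-width binning maps each value to an integer bin index:

bin_index <- function(x, width, origin = 0) {
  # values in [origin, origin + width) fall in bin 1, and so on
  floor((x - origin) / width) + 1
}
bin_index(c(3, 17, 29), width = 10)
# [1] 1 2 3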
Summarise: each per-bin statistic generalises a familiar display.
Count → histogram, KDE
Mean → regression, loess smoothing
Std. dev.
Quantiles → boxplots, quantile regression
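A hedged sketch of how condense() computes these summaries (the summary = "sd" argument is my assumption about the bigvis API; only the count and mean forms appear later in these slides):

library(bigvis)
condense(bin(dist, 10))                             # count per bin
condense(bin(dist, 10), z = speed)                  # mean speed per bin
condense(bin(dist, 10), z = speed, summary = "sd")  # sd of speed per bin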
dist_s <- condense(bin(dist, 10))
autoplot(dist_s)
#    user  system elapsed
#   2.642   0.972   3.613
[Plot: binned counts of dist, 0-5000, peaking near 1,500,000]
time_s <- condense(bin(time, 1))
autoplot(time_s)
[Plot: binned counts of time, with a prominent NA bin]
autoplot(time_s, na.rm = TRUE)
[Plot: binned counts of time, NAs removed]
autoplot(time_s[time_s < 500, ])
[Plot: binned counts of time, restricted to time < 500]
autoplot(time_s %% 60)
[Plot: counts of time modulo 60, i.e. minutes past the hour]
sd1 <- condense(bin(dist, 10), z = speed)
autoplot(sd1) + ylab("speed")
#    user  system elapsed
#   2.568   0.767   3.339
[Plot: mean speed by dist, coloured by count on a log scale (1e+00 to 1e+06)]
sd2 <- condense(bin(dist, 20), bin(speed, 20))
autoplot(sd2)
#    user  system elapsed
#   7.366   1.190   8.552
[Plot: 2d binned counts of speed vs dist, counts up to ~6e+05]
Demo
shiny::runApp("mt/", 8002)
Google for: bigvis
Conclusions
The cycle with new tools: Tidy, Transform (dplyr), Visualise (bigvis), Model