Bigger data analysis
Hadley Wickham (@hadleywickham), Chief Scientist, RStudio
July 2013
http://bit.ly/bigrdata3
http://bit.ly/bigrdata3
1. What is data analysis?
2. Transforming data
3. Visualising data
What is data analysis?
Data analysis is the process by which data becomes understanding, knowledge and insight
[Diagram: the data analysis cycle — Tidy, Transform, Visualise, Model]
Frequent data analysis → learn to program (photo: http://www.flickr.com/photos/compleo/5414489782)
Cognition time vs. computation time (photo: http://www.flickr.com/photos/mutsmuts/4695658106)
The cycle with current tools: Tidy (reshape2, stringr, lubridate), Transform (plyr), Visualise (ggplot2), Model
Computation time vs. cognition time
The cycle with new tools: Tidy, Transform (dplyr), Visualise (bigvis), Model
Data
Every commercial US flight, 2000-2011: ~76 million flights.
Total database: ~11 Gb.
>100 variables, but I'll focus on a handful: airline, delay, distance, flight time and speed.
Transformation
Split → Apply → Combine, on a toy table of (name, n): split by name (Al: 2; Bo: 4, 0, 5; Ed: 5, 10), apply sum(n) to each piece, then combine the results (Al 2, Bo 9, Ed 15).
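A minimal sketch of split-apply-combine with plyr, using the toy data from the diagram above:

library(plyr)
df <- data.frame(
  name = c("Al", "Bo", "Bo", "Bo", "Ed", "Ed"),
  n    = c(2, 4, 0, 5, 5, 10)
)
ddply(df, "name", summarise, total = sum(n))
#   name total
# 1   Al     2
# 2   Bo     9
# 3   Ed    15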
plyr functions, named by input (rows) × output (columns):

                 array   data frame   list    nothing
array            aaply   adply        alply   a_ply
data frame       daply   ddply        dlply   d_ply
list             laply   ldply        llply   l_ply
n replicates     raply   rdply        rlply   r_ply
function args    maply   mdply        mlply   m_ply
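The naming convention is mechanical: the first letter gives the input type, the second the output type. A quick illustration with a toy list:

library(plyr)
x <- list(a = 1:3, b = 4:6)
laply(x, mean)   # list in, array out
ldply(x, mean)   # list in, data frame out
llply(x, mean)   # list in, list out
l_ply(x, mean)   # list in, nothing out (side effects only)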
[Chart: survey of how often each plyr function is used (never / occasionally / often / all the time); ddply, ldply, dlply and llply dominate, while the array and side-effect (_ply) variants are rarely used]
Data analysis verbs
select: subset variables
filter: subset rows
mutate: add new columns
summarise: reduce to a single row
arrange: re-order the rows
Data analysis verbs + group by
select: subset variables
filter: subset rows
mutate: add new columns
summarise: reduce to a single row (per group)
arrange: re-order the rows
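A hedged sketch of the verbs in dplyr on a toy data frame (the column names are illustrative; the next slides apply the same verbs to the real flight data):

library(dplyr)
df <- data.frame(
  name  = c("Al", "Bo", "Bo", "Ed"),
  delay = c(10, 25, 5, 40)
)
small   <- select(df, name, delay)             # subset variables
late    <- filter(small, delay > 15)           # subset rows
late    <- mutate(late, delay_hr = delay / 60) # add new columns
arrange(late, desc(delay))                     # re-order the rows
by_name <- group_by(late, name)                # group by
summarise(by_name, n = n())                    # reduce each group to one row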
h <- readRDS("houston.rdata")
# ~2,100,000 x 6, ~57 meg; not huge, but substantial

library(plyr)
ddply(h, c("Year", "Month", "DayofMonth"), summarise, n = length(Year))
#    user  system elapsed
#   2.320   0.330   2.649

count(h, c("Year", "Month", "DayofMonth"))
#    user  system elapsed
#   0.687   0.183   0.869
# We often work with the same grouping variables
# multiple times, so define them up front; this also
# lets us refer to variables in a consistent way.
daily_df <- group_by(h, Year, Month, DayofMonth)

# Now summarise() knows how to deal with grouped
# data frames
summarise(daily_df, n())
#    user  system elapsed
#   0.095   0.015   0.110
# 20x faster!
library(data.table)
h_dt <- data.table(h)

daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())
#    user  system elapsed
#   0.045   0.000   0.045
# Exactly the same syntax, but 2.5x faster!
# No need to learn the idiosyncrasies of
# data.table; just 2 lines of code
# And dplyr also works seamlessly with databases:
ontime <- source_sqlite("flights.sqlite3", "ontime")
h_db <- filter(ontime, Origin == "IAH")

daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())
#    user  system elapsed
#  22.190   0.546  22.734
#    user  system elapsed
#   5.565   0.425   5.986
# Much slower, but not restricted to a predefined
# subset. Could speed up by carefully crafting indices.
# Behind the scenes
library(dplyr)
ontime <- source_sqlite("../flights.sqlite3", "ontime")

translate_sql(Year > 2005, ontime)
# <SQL> Year > 2005.0
translate_sql(Year > 2005L, ontime)
# <SQL> Year > 2005

translate_sql(Origin == "IAD" | Dest == "IAD", ontime)
# <SQL> Origin = 'IAD' OR Dest = 'IAD'

years <- 2000:2005
translate_sql(Year %in% years, ontime)
# <SQL> Year IN (2000, 2001, 2002, 2003, 2004, 2005)
Data sources
Data frames (dplyr)
Data tables (dplyr)
SQLite tables (dplyr)
PostgreSQL, MySQL, SQL Server, ...
MonetDB (planned)
Google BigQuery (bigrquery)
daily_df <- group_by(h, Year, Month, DayofMonth)
summarise(daily_df, n())

daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())

daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())

# It doesn't matter how your data is stored
# It might even live on the web
library(dplyr)
library(bigrquery)

h_bq <- source_bigquery(billing_project, "ontime", "houston")
daily_bq <- group_by(h_bq, Year, Month, DayofMonth)
system.time(summarise(daily_bq, n()))
# ~2 seconds

# Storage = $80 / TB / month
# Query = $35 / TB (100 GB free)
dplyr
Currently experimental and incomplete, but it works, and you're welcome to try it out.

library(devtools)
install_github("assertthat")
install_github("dplyr")
install_github("bigrquery")

Needs a development environment (http://www.rstudio.com/ide/docs/packages/prerequisites)
Google for: split apply combine dplyr
Visualisation
library(ggplot2)
library(bigvis)

# Can't use data frames :(
dist <- readRDS("dist.rds")
delay <- readRDS("delay.rds")
time <- readRDS("time.rds")
speed <- dist / time * 60

# There's always bad data
time[time < 0] <- NA
speed[speed < 0] <- NA
speed[speed > 761.2] <- NA
qplot(dist, speed, colour = delay) + scale_colour_gradient2()
One hour later... qplot(dist, speed, colour = delay) + scale_colour_gradient2()
x <- runif(2e5)
y <- runif(2e5)
system.time(plot(x, y))
#    user  system elapsed
#   2.785   0.010   2.806
Goals
Support exploratory analysis (e.g. in R)
Fast on commodity hardware: 100,000,000 obs in <5s
(10^8 obs = 0.8 Gb, so ~20 vars fit in 16 Gb)
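The arithmetic behind that memory budget, as a quick check:

# 1e8 observations of one double-precision (8-byte) variable:
1e8 * 8 / 1e9   # 0.8 Gb per variable
# so ~20 such variables fill 16 Gb of RAM:
20 * 0.8        # 16 Gb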
Insight
The bottleneck is the number of pixels: 1d ~3,000; 2d ~3,000,000.
Process: condense (bin & summarise), smooth, visualise.
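A minimal sketch of that pipeline with bigvis (the bin width of 10 and bandwidth h = 50 are illustrative guesses, not values from the slides):

library(bigvis)
dist_s  <- condense(bin(dist, 10))  # condense: bin & summarise
dist_sm <- smooth(dist_s, h = 50)   # smooth the binned summaries
autoplot(dist_sm)                   # visualise with ggplot2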
Bin: fixed-width bins of x, defined by an origin and a width.
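Conceptually (a sketch of the idea, not the bigvis internals), fixed-width binning maps each value to an integer bin index:

bin_index <- function(x, width, origin = 0) {
  # values in [origin, origin + width) fall in bin 1, and so on
  floor((x - origin) / width) + 1
}
bin_index(c(3, 17, 29), width = 10)
# [1] 1 2 3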
Summarise: each per-bin statistic generalises a familiar display.
Count → histogram, KDE
Mean → regression, loess smoothing
Std. dev.
Quantiles → boxplots, quantile regression
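A hedged sketch of how condense() computes these summaries (the summary = "sd" argument is my assumption about the bigvis API; only the count and mean forms appear later in these slides):

library(bigvis)
condense(bin(dist, 10))                             # count per bin
condense(bin(dist, 10), z = speed)                  # mean speed per bin
condense(bin(dist, 10), z = speed, summary = "sd")  # sd of speed per bin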
dist_s <- condense(bin(dist, 10))
autoplot(dist_s)
#    user  system elapsed
#   2.642   0.972   3.613
[Plot: binned counts of dist, 0-5000, peaking near 1,500,000]
time_s <- condense(bin(time, 1))
autoplot(time_s)
[Plot: binned counts of time, with a prominent NA bin]
autoplot(time_s, na.rm = TRUE)
[Plot: binned counts of time, NAs removed]
autoplot(time_s[time_s < 500, ])
[Plot: binned counts of time, restricted to time < 500]
autoplot(time_s %% 60)
[Plot: counts of time modulo 60, i.e. minutes past the hour]
sd1 <- condense(bin(dist, 10), z = speed)
autoplot(sd1) + ylab("speed")
#    user  system elapsed
#   2.568   0.767   3.339
[Plot: mean speed by dist, coloured by count on a log scale (1e+00 to 1e+06)]
sd2 <- condense(bin(dist, 20), bin(speed, 20))
autoplot(sd2)
#    user  system elapsed
#   7.366   1.190   8.552
[Plot: 2d binned counts of speed vs dist, counts up to ~6e+05]
Demo
shiny::runApp("mt/", 8002)
Google for: bigvis
Conclusions
The cycle with new tools: Tidy, Transform (dplyr), Visualise (bigvis), Model