Bigger data analysis. Hadley Wickham (@hadleywickham), Chief Scientist, RStudio. Thursday, July 18, 2013

1 Bigger data analysis. Hadley Wickham, Chief Scientist, RStudio. July 2013

2 1. What is data analysis? 2. Transforming data 3. Visualising data

3 What is data analysis?

5 Data analysis is the process by which data becomes understanding, knowledge and insight

6 Visualise Tidy Transform Model

7 Frequent data analysis learn to program

8 Cognition time Computation time

9 Visualise: ggplot2. Tidy: reshape2, stringr, lubridate. Transform: plyr. Model.

10 Computation time Cognition time

11 Visualise: bigvis. Tidy. Transform: dplyr. Model.

12 Data
Every commercial US flight: ~76 million flights
Total database: ~11 Gb
>100 variables, but I'll focus on a handful: airline, delay, distance, flight time and speed.

13 Transformation

14 Split, Apply, Combine
[Diagram: a table of (name, n) rows (Al 2; Bo 4, 0, 5; Ed 5, 10) is split by name, n is totalled within each piece, and the totals are combined into one table: Al 2, Bo 9, Ed 15]
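The split, apply, combine pattern in the diagram can be sketched in base R. This is only an illustration on the diagram's toy numbers, not the plyr or dplyr implementation; the names `df`, `pieces`, `totals` and `result` are made up for the example.

```r
# Toy data matching the diagram: one row per (name, n) pair.
df <- data.frame(
  name = c("Al", "Bo", "Bo", "Bo", "Ed", "Ed"),
  n    = c(2, 4, 0, 5, 5, 10)
)

# Split: one data frame per name.
pieces <- split(df, df$name)

# Apply: reduce each piece to a single row holding its total.
totals <- lapply(pieces, function(piece) {
  data.frame(name = piece$name[1], total = sum(piece$n))
})

# Combine: stack the per-group rows back into one data frame.
result <- do.call(rbind, totals)
rownames(result) <- NULL
print(result)
```

The plyr functions on the following slides wrap these three steps in a single call.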

15 plyr functions, by input (rows) and output (columns):

input \ output      array   data frame   list    nothing
array               aaply   adply        alply   a_ply
data frame          daply   ddply        dlply   d_ply
list                laply   ldply        llply   l_ply
n replicates        raply   rdply        rlply   r_ply
function arguments  maply   mdply        mlply   m_ply

17 [Chart: count of plyr functions (fun) by how often they are used (use): Never, Occasionally, Often, All the time. Functions shown: a_ply, alply, aaply, l_ply, daply, adply, laply, d_ply, llply, dlply, ldply, ddply]

18 Data analysis verbs
select: subset variables
filter: subset rows
mutate: add new columns
summarise: reduce to a single row
arrange: re-order the rows

19 Data analysis verbs + group by
select: subset variables
filter: subset rows
mutate: add new columns
summarise: reduce to a single row
arrange: re-order the rows
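A minimal sketch of the five verbs plus group by, on a made-up table. It assumes a released version of dplyr from CRAN (1.0 or later, for the `.groups` argument) rather than the 2013 development version shown in the talk; the `flights` data frame and its values are hypothetical, with column names chosen to resemble the flight data.

```r
library(dplyr)

# Hypothetical toy table; values are invented for illustration.
flights <- data.frame(
  Year     = c(2011, 2011, 2011, 2011),
  Month    = c(1, 1, 2, 2),
  Origin   = c("IAH", "IAH", "IAH", "HOU"),
  Distance = c(224, 944, 1379, 190),
  AirTime  = c(40, 120, 180, 38)
)

iah <- filter(flights, Origin == "IAH")              # filter: subset rows
iah <- select(iah, Year, Month, Distance, AirTime)   # select: subset variables
iah <- mutate(iah, Speed = Distance / AirTime * 60)  # mutate: add a new column

# group by + summarise: one row per (Year, Month) group.
monthly <- group_by(iah, Year, Month)
monthly <- summarise(monthly, n = n(), mean_speed = mean(Speed),
                     .groups = "drop")

# arrange: re-order the rows, busiest month first.
monthly <- arrange(monthly, desc(n))
print(monthly)
```

Each verb takes a data frame as its first argument and returns a data frame, which is what lets the same code be pointed at other storage back ends.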

20
h <- readRDS("houston.rdata")
# ~2,100,000 x 6, ~57 meg; not huge, but substantial
library(plyr)
ddply(h, c("Year", "Month", "DayofMonth"), summarise, n = length(Year))
# user system elapsed
#
count(h, c("Year", "Month", "DayofMonth"))
# user system elapsed
#

21
# Often work with the same grouping variables
# multiple times, so define upfront. Also refer
# to variables in the same way
daily_df <- group_by(h, Year, Month, DayofMonth)
# Now summarise knows how to deal with grouped
# data frames
summarise(daily_df, n())
# user system elapsed
#
# 20x faster!

22
library(data.table)
h_dt <- data.table(h)
daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())
# user system elapsed
#
# Exactly the same syntax, but 2.5x faster!
# Don't need to learn the idiosyncrasies of
# data.table; just 2 lines of code

23
# And dplyr also works seamlessly with databases:
ontime <- source_sqlite("flights.sqlite3", "ontime")
h_db <- filter(ontime, Origin == "IAH")
daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())
# user system elapsed
#
# user system elapsed
#
# Much slower, but not restricted to a predefined subset
# Could speed up by carefully crafting indices

24
# Behind the scenes
library(dplyr)
ontime <- source_sqlite("../flights.sqlite3", "ontime")
translate_sql(Year > 2005, ontime)
# <SQL> Year >
translate_sql(Year > 2005L, ontime)
# <SQL> Year > 2005
translate_sql(Origin == "IAD" | Dest == "IAD", ontime)
# <SQL> Origin = 'IAD' OR Dest = 'IAD'
years <- 2000:2005
translate_sql(Year %in% years, ontime)
# <SQL> Year IN (2000, 2001, 2002, 2003, 2004, 2005)

25 Data sources
Data frames (dplyr)
Data tables (dplyr)
SQLite tables (dplyr)
PostgreSQL, MySQL, SQL Server, ...
MonetDB (planned)
Google BigQuery (bigrquery)

26
daily_df <- group_by(h, Year, Month, DayofMonth)
summarise(daily_df, n())
daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())
daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())
# It doesn't matter how your data is stored

27
# It might even live on the web
library(dplyr)
library(bigrquery)
h_bq <- source_bigquery(billing_project, "ontime", "houston")
daily_bq <- group_by(h_bq, Year, Month, DayofMonth)
system.time(summarise(daily_bq, n()))
# ~2 seconds
# Storage = $80 / TB / Month
# Query = $35 / TB (100 GB free)

28 dplyr
Currently experimental and incomplete, but it works, and you're welcome to try it out.
library(devtools)
install_github("assertthat")
install_github("dplyr")
install_github("bigrquery")
Needs a development environment (

29 Google for: split apply combine dplyr

30 Visualisation

31
library(ggplot2)
library(bigvis)
# Can't use data frames :(
dist <- readRDS("dist.rds")
delay <- readRDS("delay.rds")
time <- readRDS("time.rds")
speed <- dist / time * 60
# There's always bad data
time[time < 0] <- NA
speed[speed < 0] <- NA
speed[speed > 761.2] <- NA

32 qplot(dist, speed, colour = delay) + scale_colour_gradient2()

33 One hour later... qplot(dist, speed, colour = delay) + scale_colour_gradient2()

34
x <- runif(2e5)
y <- runif(2e5)
system.time(plot(x, y))

36 user system elapsed

37 Goals
Support exploratory analysis (e.g. in R)
Fast on commodity hardware
100,000,000 in <5s
10^8 obs = 0.8 Gb (at 8 bytes per value), ~20 vars in 16 Gb

38 Insight
Bottleneck is number of pixels: 1d: 3,000; 2d: 3,000,000
Process: Condense (bin & summarise), Smooth, Visualise

39 Bin: ⌊(x − origin) / width⌋

40 Summarise
Count: Histogram, KDE
Mean: Regression, Loess
Std. dev.
Quantiles: Boxplots, Quantile regression smoothing
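The condense step (bin, then summarise) can be sketched in base R. This is an illustration only, not the bigvis implementation: it assumes fixed-width bins indexed by floor((x - origin) / width), and `bin_index` and `condense_1d` are hypothetical helper names.

```r
# Fixed-width bin index; assumed formula: floor((x - origin) / width).
bin_index <- function(x, width, origin = 0) {
  floor((x - origin) / width)
}

# Condense: per bin, count the observations and optionally average z.
# Observations with a missing x get no bin and are dropped here.
condense_1d <- function(x, width, z = NULL, origin = 0) {
  idx <- bin_index(x, width, origin)
  counts <- tapply(x, idx, length)
  out <- data.frame(
    left  = origin + as.numeric(names(counts)) * width,  # left edge of bin
    count = as.vector(counts)
  )
  if (!is.null(z)) {
    out$mean <- as.vector(tapply(z, idx, mean, na.rm = TRUE))
  }
  out
}

# Six observations collapse to three bins of width 10.
x <- c(1, 5, 12, 19, 25, NA)
z <- c(10, 20, 30, 50, 40, 60)
small <- condense_1d(x, width = 10, z = z)
print(small)
```

The output has one row per occupied bin rather than one row per observation, so its size is bounded by the number of bins (roughly the number of pixels) instead of the number of data points.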

41
dist_s <- condense(bin(dist, 10))
autoplot(dist_s)
[Plot: count vs dist]

42
dist_s <- condense(bin(dist, 10))
autoplot(dist_s)
# user system elapsed
[Plot: count vs dist]

43
time_s <- condense(bin(time, 1))
autoplot(time_s)
[Plot: count vs time, including an NA bin]

44
autoplot(time_s, na.rm = TRUE)
[Plot: count vs time]

45
autoplot(time_s[time_s < 500, ])
[Plot: count vs time]

46
autoplot(time_s %% 60)
[Plot: count vs time]

47
[Plot: speed vs dist, coloured by count]

48
sd1 <- condense(bin(dist, 10), z = speed)
autoplot(sd1) + ylab("speed")
[Plot: speed vs dist, coloured by count]

49
sd1 <- condense(bin(dist, 10), z = speed)
autoplot(sd1) + ylab("speed")
# user system elapsed
[Plot: speed vs dist, coloured by count]

50
[Plot: speed vs dist, coloured by count]

51
sd2 <- condense(bin(dist, 20), bin(speed, 20))
autoplot(sd2)
[Plot: speed vs dist, coloured by count]

52
sd2 <- condense(bin(dist, 20), bin(speed, 20))
autoplot(sd2)
# user system elapsed
[Plot: speed vs dist, coloured by count]

53 Demo
shiny::runApp("mt/", 8002)

54 Google for: bigvis

55 Conclusions

56 Visualise: bigvis. Tidy. Transform: dplyr. Model.
