# Semiparametric Regression of Big Data in R

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Semiparametric Regression of Big Data in R Nathaniel E. Helwig Department of Statistics University of Illinois at Urbana-Champaign CSE Big Data Workshop: May 29, 2014 Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 1

2 Outline of Talk 1) Introduction to R Downloading R Basic calculations Using R functions Object classes in R 3) Flights Example Reading data into R Parametric analysis Nonparametric analysis Semiparametric analysis 2) High Performance Computing Limitations of R Optimized BLAS Parallel computing Big data issues 4) Miscellaneous El Niño example EEG example Twitter example Future work Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 2

3 Introduction to R Downloading R R = Free and Open-Source Statistics R is a free and open-source software environment for statistics. Created by Ross Ihaka and Robert Gentleman (at the University of Auckland, New Zealand) Based on S language created by John Chambers (at Bell Labs) Currently managed by The R Project for Statistical Computing You can freely download R for various operating systems: Mac Windows Linux Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 3

4 Introduction to R Downloading R RStudio IDE RStudio IDE is a free and open-source integrated development environment (IDE) for R. Basic R interface is a bit rough (particularly on Windows) RStudio offers a nice environment through which you can use R Freely available at You can freely download RStudio IDE for various operating systems: Mac Windows Linux Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 4

5 Introduction to R Basic Calculations Storing and Manipulating Values in R Define objects x and y with values of 3 and 2, respectively: > x=3 > y=2 Some calculations with the defined objects x and y: > x+y [1] 5 > x-y [1] 1 > x*y [1] 6 > x/y [1] 1.5 > x^y [1] 9 > x==y [1] FALSE Warning: R is case sensitve, so x and X are not the same object. Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 5

6 Introduction to R Using R Functions Some Basic R Functions Define objects x and y: > x=c(1,3,4,6,8) > y=c(2,3,5,7,9) Calculate the correlation: > cor(x,y) [1] Calculate the covariance: > cov(x,y) [1] 7.65 Combine as columns > cbind(x,y) x y [1,] 1 2 [2,] 3 3 [3,] 4 5 [4,] 6 7 [5,] 8 9 Combine as rows > rbind(x,y) [,1] [,2] [,3] [,4] [,5] x y Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 6

7 Introduction to R Object Classes in R Object-Oriented Style Programming R is an object-oriented language, where an object is a general term. Any R object X has an associated class, which indicates the type of object that X represents. S3 classes: simple naming convention used by most R packages S4 classes: more advanced (true) object-oriented class system Overall idea of object-oriented style programming: Some R functions are only defined for a particular class of input X Other R functions perform different operations depending on the class of the input object X. Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 7

8 Introduction to R Object Classes in R Some Basic R Classes numeric class: > x=c(1,3,-2) > x [1] > class(x) [1] "numeric" integer class: > x=c(1l,3l,-2l) > x [1] > class(x) [1] "integer" character class: > x=c("a","a","b") > x [1] "a" "a" "b" > class(x) [1] "character" factor class: > x=factor(c("a","a","b")) > x [1] a a b Levels: a b > class(x) [1] "factor" Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 8

9 Introduction to R Object Classes in R Some More R Classes matrix class: > x=c(1,3,-2) > y=c(2,0,7) > z=cbind(x,y) > z x y [1,] 1 2 [2,] 3 0 [3,] -2 7 > class(z) [1] "matrix" data.frame class: > x=c(1,3,-2) > y=c("a","a","b") > z=data.frame(x,y) > z x y 1 1 a 2 3 a 3-2 b > class(z) [1] "data.frame" Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 9

10 High Performance Computing Two Major Limitations of R Limitations of R R is great for statistics and data visualization, but... 1 R is NOT optimized for parallel computing Default build uses single-threaded BLAS Default build has no parallel computing ability 2 R reads all data into virtual memory Default build can NOT read data from file on demand Need to buy more RAM?? Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 10

11 High Performance Computing Optimized Basic Linear Algebra Subprograms (BLAS) Linking R to Faster BLAS R s standard BLAS library is very stable, but single-threaded. Can link R to optimized BLAS. Popular choices include: OpenBLAS: ATLAS: MKL: https://software.intel.com/en-us/intel-mkl For your particular OS and BLAS combination, instructions to link to R can (typically) be found on someone s blog... Google it OpenBLAS is simple to link to R on Unbuntu (I ve heard) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 11

12 High Performance Computing Optimized Basic Linear Algebra Subprograms (BLAS) Linking R to Faster BLAS in Mac OS Mac OS comes with veclib BLAS, which are simple to link to R. Supported by Apple s Accelerate Framework In R 2.15 and before, input the following into Terminal: cd /Library/Frameworks/R.framework/Resources/lib ln -sf librblas.veclib.dylib librblas.dylib In R 3.0 and above, librblas.veclib.dylib is not included with the R download, but you can still link to veclib by replacing librblas.veclib.dylib with /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 12

13 High Performance Computing Optimized Basic Linear Algebra Subprograms (BLAS) Linking R to Faster BLAS in Mac OS (example) Example on MacBook Pro (2.53 GHz Intel Core 2 Duo, 4GB RAM): # with R s standard BLAS > set.seed(123) > x=matrix(rnorm(10^6),10^4,100) > system.time({svd(x)}) user system elapsed # with Apple s veclib BLAS > set.seed(123) > x=matrix(rnorm(10^6),10^4,100) > system.time({svd(x)}) user system elapsed Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 13

14 High Performance Computing Parallel Computing Parallel/Multicore Computing Packages R has many add-on packages for parallel computing: multicore: basic multicore/parallel processing Rmpi: interface (wrapper) to message-passing interface snow: Simple Network Of Workstations snowfall: interface (wrapper) to snow Note that there are several others too (see link below) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 14

15 High Performance Computing Parallel Package Comparisons Parallel Computing 16 State of the Art in Parallel Computing with R Helpful table from Schmidberger et al. (2009): Learnability Efficiency Memorability Errors Satisfaction RmpiR rpvm + nws snow snowft snowfall papply biopara 0 0 taskpr Table 3: Assessment of the usability of the R packages for computer clusters (ranging from ++ to ). ++ is good (well-developed and stable) is bad resources in order to get optimal resource utilization. Both aspects are very important in the field of parallel(under-developed computing. and unstable). Only the packages snowft and biopara have integrated solutions for fault tolerance. Thus for many common use cases the users must accommodate the possibility of failure during lengthy Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 15

16 High Performance Computing Big Data Issues R = Big Data Software R is NOT a big data software R holds all objects in virtual memory R has implicit memory limits Maximum array dimension: Address space memory limits (for all objects) are system-specific Windows - 32-bit: 2Gb OS-imposed limit - 64-bit: 8Tb OS-imposed limit Unix - 32-bit: 4Gb OS-imposed limit - 64-bit: essentially infinite (e.g., 128Tb for Linux on x86_64 cpus) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 16

17 High Performance Computing Big Data Issues Some Notes on bigmemory Package The bigmemory package features the big.matrix object class. Points to data structure in C++ big.matrix objects are call by reference, so need special analysis functions (typical R objects are call by value) biganalytics and bigtabulate packages provide analysis functions for big.matrix objects Uses standard C++ data structures: Type Bytes Approx. Range double 8 1.7E ± 308 integer 4 ±2, 147, 483, 647 short 2 ±32, 767 char 1 ±127 Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 17

18 Reading Data into R Airline On-Time Performance From Statistical Computing and Statistical Graphics 2009 Data Expo, American Statistical Association Full data set contains flight arrival/departure details for all US commercial flights from October 1987 to April Have 29 variables about arrival/departure Have about 120 million records (flights) Data size: 12GB We will focus on data from (approximately 4GB). Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 18

19 Reading Data into R Big Data Problems and Solutions Problem: With 4GB of RAM, how can I even get the 4GB of data into R?? Solution 1: Preprocess the data outside of R, and input the (smaller) subset of data you want to analyze in R. Solution 2: You could use the big.matrix function from bigmemory package (can read data from memory or file, so you can exceed RAM) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 19

20 Reading Data into R Simple BASH Preprocessing (via awk) Shell script that finds all flights with non-missing data and minute delays (and a valid departure time): #!/bin/bash # flights.sh # # # Created by Nathaniel E. Helwig on 5/20/14. # Copyright 2014 University of Illinois. All rights reserved. cd "/Users/Nate/Desktop/CSE_2014/Rcode/data/" pr *.csv cut -d "," -f1,2,5,15,16 \ awk -F, (!/Year/) && (!/NA/) && (\$4 > 0) && (\$4 <= 120) && (\$5 > 0) && (\$5 <= 120) \ { if(length(\$3)==3 && substr(\$3,2,2)<=59) { \ print \$1","\$2","substr(\$3,1,1)","\$4","\$5 \ } else if(length(\$3)==4 && substr(\$3,1,2)<=24 && substr(\$3,3,2)<=59) { \ print \$1","\$2","substr(\$3,1,2)","\$4","\$5 \ } \ } > flights.csv wc -l flights.csv Saves Year, Month, DepHour, DepDelay, and ArrDelay in file flights.csv; then prints number of lines in flights.csv Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 20

21 Reading Data into R Reading Flights Data into R (first time) First we need to open R and load the big packages: > library(bigmemory) # ver Loading required package: bigmemory.sri Loading required package: BH bigmemory >= 4.0 is a major revision since 3.1.2; please see packages biganalytics and and bigtabulate and for more information. > library(biganalytics) # ver > library(biglm) # ver Loading required package: DBI > library(bigsplines) # ver First time you read-in data use read.big.matrix function: > mypath="/users/nate/desktop/cse_2014/rcode/data/" > flights<-read.big.matrix(filename=paste(mypath,"flights.csv",sep=""), col.names=c("year","month","dephour","depdelay","arrdelay"), type="short",backingfile="flights.bin",backingpath=mypath, descriptorfile="flights.desc") and create a backingfile and descriptorfile for later read-ins. Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 21

22 Reading Data into R Reading Flights Data into R (another time) Look at data we just read-in: > class(flights) [1] "big.matrix" attr(,"package") [1] "bigmemory" > flights[1:4,] Year Month DepHour DepDelay ArrDelay [1,] [2,] [3,] [4,] When you reread-in the data, use attach.big.matrix function: > library(bigmemory) > mypath="/users/nate/desktop/cse_2014/rcode/data/" > flights<-attach.big.matrix("flights.desc",backingfile="flights.bin",backingpath=mypath) > flights[1:4,] Year Month DepHour DepDelay ArrDelay [1,] [2,] [3,] [4,] Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 22

23 Some Descriptive Statistics Reading Data into R # print data dimensions and column names > dim(flights) [1] > colnames(flights) [1] "Year" "Month" "DepHour" "DepDelay" "ArrDelay" # look at variable ranges > apply(flights,2,range) Year Month DepHour DepDelay ArrDelay [1,] [2,] # look at variable means > apply(flights,2,mean) Year Month DepHour DepDelay ArrDelay # look at correlation between DepDelay and ArrDelay > cor(flights[,4],flights[,5]) [1] Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 23

24 Parametric Analysis Simple Linear Regression (SLR) Model The simple linear regression model has the form y i = µ + βx i + i or y = Xβ + for i {1,...,n} where y =(y 1,...,y n ) with y i R µ R is the regression intercept β R is the regression slope and β =(µ, β) x i R is the predictor for the i-th observation X =[1 n, x] is the design matrix with x =(x 1,...,x n ) =( 1,..., n ) with i iid N(0, σ 2 ) Ordinary Least-Squares solution: ˆβ =(X X) 1 X y N(β, σ 2 (X X) 1 ) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 24

25 Parametric Analysis SLR of Full Flights Data Predict ArrDelay from DepDelay: # fit big linear regression model (using big.matrix interface) > linmod=biglm.big.matrix(arrdelay~depdelay,data=flights) > linsum=summary(linmod) > linsum Large data regression model: biglm(formula = formula, data = data,...) Sample size = Coef (95% CI) SE p (Intercept) DepDelay > linsum\$rsq [1] ArrDelay = (DepDelay) R 2 = Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 25

26 Plot Full SLR Results Flights Example Parametric Analysis Plot regression line with 95% pointwise confidence interval: Exp. Arrival Delay (min) Departure Delay (min) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 26

27 R Code for Full SLR Plot Parametric Analysis # create prediction function > newdata=data.frame(depdelay=seq(1,120,length=200),arrdelay=rep(0,200)) > linpred=predict(linmod,newdata,se.fit=true,make.function=true) > yhat=linpred[[1]](newdata\$depdelay) > yhatse=sqrt(diag(linpred[[2]](newdata\$depdelay))) # plot regression line with pointwise confidence intervals > x11(width=6,height=6) > plot(newdata\$depdelay,yhat,type="l", + xlab="departure Delay (min)", + ylab="exp. Arrival Delay (min)") > lines(newdata\$depdelay,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$depdelay,yhat-2*yhatse,lty=2,col="red") Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 27

28 Parametric Analysis SLR of Random Subset of Flights Data Predict ArrDelay from DepDelay using 10 6 observations: # get subset of data > set.seed(123) > subidx=sample.int(nrow(flights),10^6) > flightsub=as.data.frame(flights[subidx,]) # fit big linear regression model (using biglm) > linmods=biglm(arrdelay~depdelay,data=flightsub) > linsums=summary(linmods) # compare solutions > linsum Large data regression model: biglm(formula = formula, data = data,...) Sample size = Coef (95% CI) SE p (Intercept) DepDelay > linsums Large data regression model: biglm(arrdelay ~ DepDelay, data = flightsub) Sample size = Coef (95% CI) SE p (Intercept) DepDelay Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 28

29 Parametric Analysis Multiple Linear Regression (MLR) Model The multiple linear regression model has the form y i = µ + p j=1 β jx ij + i or y = Xβ + for i {1,...,n} where y =(y 1,...,y n ) with y i R µ R is the regression intercept β j R is the j-th predictor s slope and β =(µ, β 1,...,β p ) x ij R is the j-th predictor for the i-th observation X =[1 n, x 1,...,x p ] is the design matrix with x j =(x 1j,...,x nj ) =( 1,..., n ) with i iid N(0, σ 2 ) Ordinary Least-Squares solution: ˆβ =(X X) 1 X y N(β, σ 2 (X X) 1 ) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 29

30 Parametric Analysis MLR of Full Flights Data Predict ArrDelay from Year, Month, DepHour, and DepDelay: # fit big linear regression model (using big.matrix interface) > mlrmod=biglm.big.matrix(arrdelay~year+month+dephour+depdelay,data=flights) > mlrsum=summary(mlrmod) > mlrsum Large data regression model: biglm(formula = formula, data = data,...) Sample size = Coef (95% CI) SE p (Intercept) Year Month DepHour DepDelay > mlrsum\$rsq [1] ArrDelay = (Year) (Month) (DepHour) (DepDelay) R 2 = Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 30

31 Plot Full MLR Results Flights Example Parametric Analysis Plot regression lines (with 95% CIs) at average covariate values: Exp. Arrival Delay (min) Exp. Arrival Delay (min) Departure Year Departure Month Exp. Arrival Delay (min) Exp. Arrival Delay (min) Departure Time (24 hr) Departure Delay (min) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 31

32 R Code for Full MLR Plot Parametric Analysis # create new data and prediction function > newdata=data.frame(year=seq(2003,2008,length=200),month=seq(1,12,length=200), + DepHour=seq(1,24,length=200),DepDelay=seq(1,120,length=200),ArrDelay=0) > mlrpred=predict(mlrmod,newdata,se.fit=true,make.function=true) > mfs=apply(flights,2,mean) # get variable means # plot line and 95% pointwise CI for Year > yhat=mlrpred[[1]](cbind(newdata[,1],mfs[2],mfs[3],mfs[4])) > yhatse=sqrt(diag(mlrpred[[2]](cbind(newdata[,1],mfs[2],mfs[3],mfs[4])))) > plot(newdata\$year,yhat,type="l",xlab="departure Year",ylab="Exp. Arrival Delay (min)") > lines(newdata\$year,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$year,yhat-2*yhatse,lty=2,col="red") # plot line and 95% pointwise CI for Month > yhat=mlrpred[[1]](cbind(mfs[1],newdata[,2],mfs[3],mfs[4])) > yhatse=sqrt(diag(mlrpred[[2]](cbind(mfs[1],newdata[,2],mfs[3],mfs[4])))) > plot(newdata\$month,yhat,type="l",xlab="departure Month",ylab="Exp. Arrival Delay (min)") > lines(newdata\$month,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$month,yhat-2*yhatse,lty=2,col="red") # plot line and 95% pointwise CI for DepHour > yhat=mlrpred[[1]](cbind(mfs[1],mfs[2],newdata[,3],mfs[4])) > yhatse=sqrt(diag(mlrpred[[2]](cbind(mfs[1],mfs[2],newdata[,3],mfs[4])))) > plot(newdata\$dephour,yhat,type="l",xlab="departure Time (24 hr)",ylab="exp. Arrival Delay (min)") > lines(newdata\$dephour,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$dephour,yhat-2*yhatse,lty=2,col="red") # plot line and 95% pointwise CI for DepDelay > yhat=mlrpred[[1]](cbind(mfs[1],mfs[2],mfs[3],newdata[,4])) > yhatse=sqrt(diag(mlrpred[[2]](cbind(mfs[1],mfs[2],mfs[3],newdata[,4])))) > plot(newdata\$depdelay,yhat,type="l",xlab="departure Delay (min)",ylab="exp. Arrival Delay (min)") > lines(newdata\$depdelay,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$depdelay,yhat-2*yhatse,lty=2,col="red") Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 32

33 Parametric Analysis MLR of Random Subset of Flights Data Predict ArrDelay from other variables using 10 6 observations: # fit big linear regression model (using biglm) > mlrmods=biglm(arrdelay~year+month+dephour+depdelay,data=flightsub) > mlrsums=summary(mlrmods) # compare solutions > mlrsum Large data regression model: biglm(formula = formula, data = data,...) Sample size = Coef (95% CI) SE p (Intercept) Year Month DepHour DepDelay > mlrsums Large data regression model: biglm(arrdelay ~ Year + Month + DepHour + DepDelay, data = flightsub) Sample size = Coef (95% CI) SE p (Intercept) Year Month DepHour DepDelay Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 33

34 Nonparametric Analysis Nonparametric Regression (NPR) Model The nonparametric regression model has the form where y i = η(x i )+ i y i R is the real-valued response for the i-th observation η is an unknown smooth function x i X is the predictor vector for the i-th observation i iid N(0, σ 2 ) is Gaussian measurement error Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 34

35 Nonparametric Analysis Smoothing Spline Approach to NPR Smoothing spline approach estimates η in tensor product reproducing kernel Hilbert space (see Gu, 2013; Helwig, 2013; Wahba, 1990). Find the η that minimizes the penalized least-squares functional where 1 n n (y i η(x i )) 2 + λj(η) i=1 λ 0 is global smoothing parameter J is nonnegative penalty functional quantifying roughness of η Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 35

36 Smoothing Spline Estimation Optimal η can be approximated as Nonparametric Analysis η λ (x) = u v=1 d vφ v (x)+ q t=1 c tρ c (x, x t ) where {φ v } u v=1 are basis functions spanning null space ρ c denotes the reproducing kernel of contrast space (if x is multidimensional, θ k parameters are buried in ρ c ) { x t } q t=1 {x i} n i=1 are the selected spline knots d = {d v } u 1 and c = {c t } q 1 are unknown coefficients λ =(λ, θ 1,...,θ k ) are unknown smoothing parameters Given fixed λ, there is a closed form solution for optimal coefficients (see Kim & Gu, 2004; Helwig, 2013; Helwig & Ma, in press). Select λ by minimizing GCV score (Craven & Wahba, 1979) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 36

37 Nonparametric Analysis Some Notes on bigsplines Package Currently, four primary functions in bigsplines package: bigspline: Unidimensional cubic smoothing splines - Unconstrained or periodic bigtps: Cubic thin-plate splines - Supports 1-, 2-, or 3-dimensional predictors bigssa: Smoothing Spline Anova (tensor product splines) - Supports cubic, cubic periodic, thin-plate, and nominal splines bigssp: Smoothing Splines with Parametric effects - More general than bigssa; supports parametric predictors too Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 37

38 NPR of Full Flights Data Nonparametric Analysis Four separate models using bigspline function: y i = η(x i )+ i R 2 = R 2 = Exp. Arrival Delay (min) Exp. Arrival Delay (min) Departure Year Departure Month R 2 = R 2 = Exp. Arrival Delay (min) Exp. Arrival Delay (min) Departure Time (24 hr) Departure Delay (min) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 38

39 Nonparametric Analysis NPR of Random Subset of Flights Data Same four models using random subset of 10 6 observations: R 2 = R 2 = Exp. Arrival Delay (min) Exp. Arrival Delay (min) Departure Year Departure Month R 2 = R 2 = Exp. Arrival Delay (min) Exp. Arrival Delay (min) Departure Time (24 hr) Departure Delay (min) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 39

40 R Code for Full NPR Plots Nonparametric Analysis # Year vs. ArrDelay using cubic spline with 4 knots > smod=bigspline(flights[,1],flights[,5],nknots=4) > newdata=data.frame(year=seq(2003,2008,length=50)) > spred=predict(smod,newdata,se.fit=true) > yhat=spred[[1]] > yhatse=spred[[2]] > plot(newdata\$year,yhat,type="l",xlab="departure Year", + ylab="exp. Arrival Delay (min)",main=bquote(r^2==.(round(smod\$info[2],4)))) > lines(newdata\$year,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$year,yhat-2*yhatse,lty=2,col="red") # Month vs. ArrDelay using periodic cubic spline with 6 knots > smod=bigspline(flights[,2],flights[,5],type="per",nknots=6) > newdata=data.frame(month=seq(1,12,length=200)) > spred=predict(smod,newdata,se.fit=true) > yhat=spred[[1]] > yhatse=spred[[2]] > plot(newdata\$month,yhat,type="l",xlab="departure Month", + ylab="exp. Arrival Delay (min)",main=bquote(r^2==.(round(smod\$info[2],4)))) > lines(newdata\$month,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$month,yhat-2*yhatse,lty=2,col="red") Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 40

41 Nonparametric Analysis R Code for Full NPR Plots (continued) # DepHour vs. ArrDelay using periodic cubic spline with 12 knots > smod=bigspline(flights[,3],flights[,5],type="per",nknots=12) > newdata=data.frame(dephour=seq(1,24,length=200)) > spred=predict(smod,newdata,se.fit=true) > yhat=spred[[1]] > yhatse=spred[[2]] > plot(newdata\$dephour,yhat,type="l",xlab="departure Time (24 hr)", + ylab="exp. Arrival Delay (min)",main=bquote(r^2==.(round(smod\$info[2],4)))) > lines(newdata\$dephour,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$dephour,yhat-2*yhatse,lty=2,col="red") # DepDelay vs. ArrDelay using cubic spline with 10 knots > smod=bigspline(flights[,4],flights[,5],type="cub",nknots=10) > newdata=data.frame(depdelay=seq(1,120,length=200)) > spred=predict(smod,newdata,se.fit=true) > yhat=spred[[1]] > yhatse=spred[[2]] > plot(newdata\$depdelay,yhat,type="l",xlab="departure Delay (min)", + ylab="exp. Arrival Delay (min)",main=bquote(r^2==.(round(smod\$info[2],4)))) > lines(newdata\$depdelay,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$depdelay,yhat-2*yhatse,lty=2,col="red") Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 41

42 Semiparametric Analysis Semiparametric Regression (SPR) Model The semiparametric regression model has the form for i {1,...,n} where y i = µ + p j=1 β jx ij + η(z i )+ i y i R is the real-valued response for the i-th observation µ R is the regression intercept β j R is the j-th predictor s regression slope x ij is j-th parametric predictor for i-th observation η is an unknown smooth function z i Z is nonparametric predictor vector for i-th observation i iid N(0, σ 2 ) is Gaussian measurement error Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 42

43 Semiparametric Analysis SPR of Random Subset of Flights Data SPR via bigssp using random subset of 10 6 observations: y i = µ + η 1 (Year i )+η 2 (Month i )+η 3 (DepHour i )+βdepdelay i + i Year effect Month effect Exp. Arrival Delay (min) Exp. Arrival Delay (min) Departure Year Departure Month DepHour effect DepDelay effect Exp. Arrival Delay (min) Exp. Arrival Delay (min) Departure Time (24 hr) Departure Delay (min) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 43

44 R Code for Fitting SPR Model Semiparametric Analysis Cubic spline for Year, periodic cubic splines for Month and DepHour, and parametric effect for DepDelay. # fit semiparametric model > smod=bigssp(arrdelay~year+month+dephour+depdelay,data=flightsub, + type=list(year="cub",month="per",dephour="per",depdelay="prm"),nknots=30, + rparm=list(year=0.01,month=0.01,dephour=0.01,depdelay=5),skip.iter=false) > smod Predictor Types: Year Month DepHour DepDelay cub per per prm Fit Statistics: gcv rsq aic bic e e e e+06 Algorithm Converged: TRUE See Helwig (2013) and Helwig and Ma (in prep) for discussion of rounding parameters. Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 44

45 Semiparametric Analysis NPR of Random Subset of Flights Data NPR via bigssp using random subset of 10 6 observations: y i = µ + η 1 (Year i )+η 2 (Month i )+η 3 (DepHour i )+η 4 (DepDelay i )+ i Year effect Month effect Exp. Arrival Delay (min) Exp. Arrival Delay (min) Departure Year Departure Month DepHour effect DepDelay effect Exp. Arrival Delay (min) Exp. Arrival Delay (min) Departure Time (24 hr) Departure Delay (min) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 45

46 R Code for Fitting NPR Model Semiparametric Analysis Cubic spline for Year, periodic cubic splines for Month and DepHour, and cubic spline for DepDelay. # fit nonparametric model > smod=bigssp(arrdelay~year+month+dephour+depdelay,data=flightsub, + type=list(year="cub",month="per",dephour="per",depdelay="cub"),nknots=30, + rparm=list(year=0.02,month=0.01,dephour=0.01,depdelay=0.02),skip.iter=false) > smod Predictor Types: Year Month DepHour DepDelay cub per per cub Fit Statistics: gcv rsq aic bic e e e e+06 Algorithm Converged: TRUE Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 46

47 R Code for SPR/NPR Plots Semiparametric Analysis # plot line and 95% Bayesian CI for Year > newdata=data.frame(year=seq(2003,2008,length=50)) > spred=predict(smod,newdata,se.fit=true,include="year") > yhat=spred[[1]] > yhatse=spred[[2]] > plot(newdata\$year,yhat,type="l",ylim=c(-.8,.4),xlab="departure Year", + ylab="exp. Arrival Delay (min)",main="year effect") > lines(newdata\$year,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$year,yhat-2*yhatse,lty=2,col="red") # plot line and 95% Bayesian CI for Month > newdata=data.frame(month=seq(1,12,length=50)) > spred=predict(smod,newdata,se.fit=true,include="month") > yhat=spred[[1]] > yhatse=spred[[2]] > plot(newdata\$month,yhat,type="l",ylim=c(-.5,.5),xlab="departure Month", + ylab="exp. Arrival Delay (min)",main="month effect") > lines(newdata\$month,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$month,yhat-2*yhatse,lty=2,col="red") Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 47

48 Semiparametric Analysis R Code for SPR/NPR Plots (continued) # plot line and 95% Bayesian CI for DepHour > newdata=data.frame(dephour=seq(1,24,length=50)) > spred=predict(smod,newdata,se.fit=true,include="dephour") > yhat=spred[[1]] > yhatse=spred[[2]] > plot(newdata\$dephour,yhat,type="l",xlab="departure Time (24 hr)", + ylab="exp. Arrival Delay (min)",main="dephour effect") > lines(newdata\$dephour,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$dephour,yhat-2*yhatse,lty=2,col="red") # plot line and 95% Bayesian CI for DepDelay > newdata=data.frame(depdelay=seq(1,120,length=50)) > spred=predict(smod,newdata,se.fit=true,include="depdelay") > yhat=spred[[1]] > yhatse=spred[[2]] > plot(newdata\$depdelay,yhat,type="l",xlab="departure Delay (min)", + ylab="exp. Arrival Delay (min)",main="depdelay effect") > lines(newdata\$depdelay,yhat+2*yhatse,lty=2,col="red") > lines(newdata\$depdelay,yhat-2*yhatse,lty=2,col="red") Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 48

49 Miscellaneous Future Work Things to Come... Methodological plans for the bigsplines package include: bigssg: Smoothing Spline Generalized regression (extension of bigssp for non-gaussian data) bigssd: Smoothing Spline Density estimation Significance testing for non/parametric effects Support for random effects Computational plans for the bigsplines package include: Further algorithm improvements (can be 10x faster) Support for parallel computation Support for big.matrix objects Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 53

50 Contact Information Summer contact: Nathaniel E. Helwig, Visiting Assistant Professor Department of Statistics University of Illinois at Urbana-Champaign Future contact: Nathaniel E. Helwig, Assistant Professor School of Statistics and Department of Psychology University of Minnesota (Twin Cities) Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 54

51 References Craven, P. and G. Wahba (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31, Gu, C. (2013). Smoothing Spline ANOVA Models (Second ed.). New York: Springer-Verlag. Helwig, N. E. (2013, May). Fast and stable smoothing spline analysis of variance models for large samples with applications to electroencephalography data analysis. Ph.D. thesis, University of Illinois at Urbana-Champaign. Helwig, N. E. (in prep.). Nonparametric analysis of group differences in multivariate time series data: Application to electroencephalography data analysis. Helwig, N. E. and P. Ma (in prep.). Nonparametric Gaussian regression for ultra large samples: Scalable computation via rounding parameters. Helwig, N. E. and P. Ma (in press). Fast and stable multiple smoothing parameter selection in smoothing spline analysis of variance models with large samples. Journal of Computational and Graphical Statistics. Helwig, N. E., P. Ma, and S. Wang (in prep.). Nonparametric analysis of spatiotemporal trends in social media data. Kim, Y.-J. and C. Gu (2004). Smoothing spline gaussian regression: More scalable computation via efficient approximation. Journal of the Royal Statistical Society, Series B 66, Schmidberger, M., M. Morgan, D. Eddelbuettel, H. Yu, L. Tierney, and U. Mansmann (2009). State of the art in parallel computing with R. Journal of Statistical Software 31, Wahba, G. (1990). Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics. Nathaniel E. Helwig (University of Illinois) Semiparametric Regression of Big Data in R CSE Big Data Workshop Slide 55

### Semiparametric Regression of Big Data in R

Semiparametric Regression of Big Data in R Nathaniel E. Helwig Department of Statistics University of Illinois at Urbana-Champaign CSE Big Data Workshop: May 29, 2014 Nathaniel E. Helwig (University of

Your Best Next Business Solution Big Data In R 24/3/2010 Big Data In R R Works on RAM Causing Scalability issues Maximum length of an object is 2^31-1 Some packages developed to help overcome this problem

### Deal with big data in R using bigmemory package

Deal with big data in R using bigmemory package Xiaojuan Hao Department of Statistics University of Nebraska-Lincoln April 28, 2015 Background What Is Big Data Size (We focus on) Complexity Rate of growth

### Big Data and Parallel Work with R

Big Data and Parallel Work with R What We'll Cover Data Limits in R Optional Data packages Optional Function packages Going parallel Deciding what to do Data Limits in R Big Data? What is big data? More

### Large Datasets and You: A Field Guide

Large Datasets and You: A Field Guide Matthew Blackwell m.blackwell@rochester.edu Maya Sen msen@ur.rochester.edu August 3, 2012 A wind of streaming data, social data and unstructured data is knocking at

Lecture 16: Generalized Additive Models Regression III: Advanced Methods Bill Jacoby Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Goals of the Lecture Introduce Additive Models

### Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics

### Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

### Curriculum Vitae. Nathaniel E. Helwig

Contact Information Curriculum Vitae Nathaniel E. Helwig University: Department of Psychology & School of Statistics, University of Minnesota Address: Psych: N650 Elliot Hall, 75 E River Road, Minneapolis,

### Web-based Supplementary Materials for. Modeling of Hormone Secretion-Generating. Mechanisms With Splines: A Pseudo-Likelihood.

Web-based Supplementary Materials for Modeling of Hormone Secretion-Generating Mechanisms With Splines: A Pseudo-Likelihood Approach by Anna Liu and Yuedong Wang Web Appendix A This appendix computes mean

### BayesX - Software for Bayesian Inference in Structured Additive Regression

BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich

### The ff package: Handling Large Data Sets in R with Memory Mapped Pages of Binary Flat Files

UseR! 2007, Iowa State University, Ames, August 8-108 2007 The ff package: Handling Large Data Sets in R with Memory Mapped Pages of Binary Flat Files D. Adler, O. Nenadić, W. Zucchini, C. Gläser Institute

### Smoothing and Non-Parametric Regression

Smoothing and Non-Parametric Regression Germán Rodríguez grodri@princeton.edu Spring, 2001 Objective: to estimate the effects of covariates X on a response y nonparametrically, letting the data suggest

### Joint models for classification and comparison of mortality in different countries.

Joint models for classification and comparison of mortality in different countries. Viani D. Biatat 1 and Iain D. Currie 1 1 Department of Actuarial Mathematics and Statistics, and the Maxwell Institute

### Regression Modeling Strategies

Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

### ON THE DEGREES OF FREEDOM IN RICHLY PARAMETERISED MODELS

COMPSTAT 2004 Symposium c Physica-Verlag/Springer 2004 ON THE DEGREES OF FREEDOM IN RICHLY PARAMETERISED MODELS Salvatore Ingrassia and Isabella Morlini Key words: Richly parameterised models, small data

### Cluster Computing at HRI

Cluster Computing at HRI J.S.Bagla Harish-Chandra Research Institute, Chhatnag Road, Jhunsi, Allahabad 211019. E-mail: jasjeet@mri.ernet.in 1 Introduction and some local history High performance computing

### Adding large data support to R

Adding large data support to R Luke Tierney Department of Statistics & Actuarial Science University of Iowa January 4, 2013 Luke Tierney (U. of Iowa) Large data in R January 4, 2013 1 / 15 Introduction

### Applied Multivariate Analysis - Big data analytics

Applied Multivariate Analysis - Big data analytics Nathalie Villa-Vialaneix nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org M1 in Economics and Economics and Statistics Toulouse School of

### Package dsmodellingclient

Package dsmodellingclient Maintainer Author Version 4.1.0 License GPL-3 August 20, 2015 Title DataSHIELD client site functions for statistical modelling DataSHIELD

### Fitting Subject-specific Curves to Grouped Longitudinal Data

Fitting Subject-specific Curves to Grouped Longitudinal Data Djeundje, Viani Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh, EH14 4AS, UK E-mail: vad5@hw.ac.uk Currie,

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### Parallelization Strategies for Multicore Data Analysis

Parallelization Strategies for Multicore Data Analysis Wei-Chen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management

### Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

### Streaming Data, Concurrency And R

Streaming Data, Concurrency And R Rory Winston rory@theresearchkitchen.com About Me Independent Software Consultant M.Sc. Applied Computing, 2000 M.Sc. Finance, 2008 Apache Committer Working in the financial

### Simple Predictive Analytics Curtis Seare

Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

### Parametric versus Semi/nonparametric Regression Models

Parametric versus Semi/nonparametric Regression Models Hamdy F. F. Mahmoud Virginia Polytechnic Institute and State University Department of Statistics LISA short course series- July 23, 2014 Hamdy Mahmoud

### I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of

Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL

### Introducing the Multilevel Model for Change

Department of Psychology and Human Development Vanderbilt University GCM, 2010 1 Multilevel Modeling - A Brief Introduction 2 3 4 5 Introduction In this lecture, we introduce the multilevel model for change.

### The basic unit in matrix algebra is a matrix, generally expressed as: a 11 a 12. a 13 A = a 21 a 22 a 23

(copyright by Scott M Lynch, February 2003) Brief Matrix Algebra Review (Soc 504) Matrix algebra is a form of mathematics that allows compact notation for, and mathematical manipulation of, high-dimensional

### GLAM Array Methods in Statistics

GLAM Array Methods in Statistics Iain Currie Heriot Watt University A Generalized Linear Array Model is a low-storage, high-speed, GLAM method for multidimensional smoothing, when data forms an array,

### Penalized regression: Introduction

Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

### Psychology 205: Research Methods in Psychology

Psychology 205: Research Methods in Psychology Using R to analyze the data for study 2 Department of Psychology Northwestern University Evanston, Illinois USA November, 2012 1 / 38 Outline 1 Getting ready

### Yiming Peng, Department of Statistics. February 12, 2013

Regression Analysis Using JMP Yiming Peng, Department of Statistics February 12, 2013 2 Presentation and Data http://www.lisa.stat.vt.edu Short Courses Regression Analysis Using JMP Download Data to Desktop

### GETTING STARTED WITH R AND DATA ANALYSIS

GETTING STARTED WITH R AND DATA ANALYSIS [Learn R for effective data analysis] LEARN PRACTICAL SKILLS REQUIRED FOR VISUALIZING, TRANSFORMING, AND ANALYZING DATA IN R One day course for people who are just

### Publication List. Chen Zehua Department of Statistics & Applied Probability National University of Singapore

Publication List Chen Zehua Department of Statistics & Applied Probability National University of Singapore Publications Journal Papers 1. Y. He and Z. Chen (2014). A sequential procedure for feature selection

### Estimation of σ 2, the variance of ɛ

Estimation of σ 2, the variance of ɛ The variance of the errors σ 2 indicates how much observations deviate from the fitted surface. If σ 2 is small, parameters β 0, β 1,..., β k will be reliably estimated

### 2013 MBA Jump Start Program. Statistics Module Part 3

2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just

### Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation

### Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014

### How To Run Statistical Tests in Excel

How To Run Statistical Tests in Excel Microsoft Excel is your best tool for storing and manipulating data, calculating basic descriptive statistics such as means and standard deviations, and conducting

### CE 504 Computational Hydrology Computational Environments and Tools Fritz R. Fiedler

CE 504 Computational Hydrology Computational Environments and Tools Fritz R. Fiedler 1) Operating systems a) Windows b) Unix and Linux c) Macintosh 2) Data manipulation tools a) Text Editors b) Spreadsheets

June 2010 From: StatSoft Analytics White Papers To: Internal release Re: Performance comparison of STATISTICA Version 9 on multi-core 64-bit machines with current 64-bit releases of SAS (Version 9.2) and

### The Not Quite R (NQR) Project: Explorations Using the Parrot Virtual Machine

The Not Quite R (NQR) Project: Explorations Using the Parrot Virtual Machine Michael J. Kane 1 and John W. Emerson 2 1 Yale Center for Analytical Sciences, Yale University 2 Department of Statistics, Yale

### Towards Terrabytes of TAQ

Towards Terrabytes of TAQ John W. Emerson (Jay) and Michael J. Kane (Mike) Yale University john.emerson@yale.edu, michael.kane@yale.edu http://www.stat.yale.edu/~jay/rinfinance2012/ R in Finance 2012 Motivation

### Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round \$200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

### Regression Analysis: A Complete Example

Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

Neural Network Add-in Version 1.5 Software User s Guide Contents Overview... 2 Getting Started... 2 Working with Datasets... 2 Open a Dataset... 3 Save a Dataset... 3 Data Pre-processing... 3 Lagging...

### Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

### Predictor Coef StDev T P Constant 970667056 616256122 1.58 0.154 X 0.00293 0.06163 0.05 0.963. S = 0.5597 R-Sq = 0.0% R-Sq(adj) = 0.

Statistical analysis using Microsoft Excel Microsoft Excel spreadsheets have become somewhat of a standard for data storage, at least for smaller data sets. This, along with the program often being packaged

### business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel

### Package biganalytics

Package biganalytics February 19, 2015 Version 1.1.1 Date 2012-09-20 Title A library of utilities for big.matrix objects of package bigmemory. Author John W. Emerson and Michael

### The importance of graphing the data: Anscombe s regression examples

The importance of graphing the data: Anscombe s regression examples Bruce Weaver Northern Health Research Conference Nipissing University, North Bay May 30-31, 2008 B. Weaver, NHRC 2008 1 The Objective

### Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### High-Performance Processing of Large Data Sets via Memory Mapping A Case Study in R and C++

High-Performance Processing of Large Data Sets via Memory Mapping A Case Study in R and C++ Daniel Adler, Jens Oelschlägel, Oleg Nenadic, Walter Zucchini Georg-August University Göttingen, Germany - Research

### Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Applied Statistics J. Blanchet and J. Wadsworth Institute of Mathematics, Analysis, and Applications EPF Lausanne An MSc Course for Applied Mathematicians, Fall 2012 Outline 1 Model Comparison 2 Model

### Week 5: Multiple Linear Regression

BUS41100 Applied Regression Analysis Week 5: Multiple Linear Regression Parameter estimation and inference, forecasting, diagnostics, dummy variables Robert B. Gramacy The University of Chicago Booth School

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### CSE 494 CSE/CBS 598 (Fall 2007): Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye

CSE 494 CSE/CBS 598 Fall 2007: Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye 1 Introduction One important method for data compression and classification is to organize

### PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0 15 th January 2014 Al Chrosny Director, Software Engineering TreeAge Software, Inc. achrosny@treeage.com Andrew Munzer Director, Training and Customer

### 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

### High Performance Predictive Analytics in R and Hadoop:

High Performance Predictive Analytics in R and Hadoop: Achieving Big Data Big Analytics Presented by: Mario E. Inchiosa, Ph.D. US Chief Scientist August 27, 2013 1 Polling Questions 1 & 2 2 Agenda Revolution

### SAS R IML (Introduction at the Master s Level)

SAS R IML (Introduction at the Master s Level) Anton Bekkerman, Ph.D., Montana State University, Bozeman, MT ABSTRACT Most graduate-level statistics and econometrics programs require a more advanced knowledge

### Random Vectors and the Variance Covariance Matrix

Random Vectors and the Variance Covariance Matrix Definition 1. A random vector X is a vector (X 1, X 2,..., X p ) of jointly distributed random variables. As is customary in linear algebra, we will write

### Introduction to Regression and Data Analysis

Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

### Quick Tour of Mathcad and Examples

Fall 6 Quick Tour of Mathcad and Examples Mathcad provides a unique and powerful way to work with equations, numbers, tests and graphs. Features Arithmetic Functions Plot functions Define you own variables

### Regression Clustering

Chapter 449 Introduction This algorithm provides for clustering in the multiple regression setting in which you have a dependent variable Y and one or more independent variables, the X s. The algorithm

### Big-data Analytics: Challenges and Opportunities

Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University Talk at 台 灣 資 料 科 學 愛 好 者 年 會, August 30, 2014 Chih-Jen Lin (National Taiwan Univ.)

### Principal Component Analysis

Principal Component Analysis ERS70D George Fernandez INTRODUCTION Analysis of multivariate data plays a key role in data analysis. Multivariate data consists of many different attributes or variables recorded

### Statistical Data Analysis via R and PHP: A Case Study Of the Relationship Between GDP and Foreign Direct Investments for The Republic Of Moldova

Statistical Data Analysis via R and PHP: A Case Study Of the Relationship Between GDP and Foreign Direct Investments for The Republic Of Moldova. PhD Candidate Ştefan Cristian CIUCU (stefanciucu@yahoo.com)

### Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs

Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs Andrew Gelman Guido Imbens 2 Aug 2014 Abstract It is common in regression discontinuity analysis to control for high order

### EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

### Statistical machine learning, high dimension and big data

Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,

### R Language Fundamentals

R Language Fundamentals Data Types and Basic Maniuplation Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Where did R come from? Overview Atomic Vectors Subsetting

### Illustration (and the use of HLM)

Illustration (and the use of HLM) Chapter 4 1 Measurement Incorporated HLM Workshop The Illustration Data Now we cover the example. In doing so we does the use of the software HLM. In addition, we will

### An Introduction to Modeling Longitudinal Data

An Introduction to Modeling Longitudinal Data Session I: Basic Concepts and Looking at Data Robert Weiss Department of Biostatistics UCLA School of Public Health robweiss@ucla.edu August 2010 Robert Weiss

### Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

### Technology Step-by-Step Using StatCrunch

Technology Step-by-Step Using StatCrunch Section 1.3 Simple Random Sampling 1. Select Data, highlight Simulate Data, then highlight Discrete Uniform. 2. Fill in the following window with the appropriate

### Inferences About Differences Between Means Edpsy 580

Inferences About Differences Between Means Edpsy 580 Carolyn J. Anderson Department of Educational Psychology University of Illinois at Urbana-Champaign Inferences About Differences Between Means Slide

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### Nonparametric statistics and model selection

Chapter 5 Nonparametric statistics and model selection In Chapter, we learned about the t-test and its variations. These were designed to compare sample means, and relied heavily on assumptions of normality.

### Mathematics Course 111: Algebra I Part IV: Vector Spaces

Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are

### R: A Free Software Project in Statistical Computing

R: A Free Software Project in Statistical Computing Achim Zeileis Institut für Statistik & Wahrscheinlichkeitstheorie http://www.ci.tuwien.ac.at/~zeileis/ Acknowledgments Thanks: Alex Smola & Machine Learning

### Statistical Models in Data Mining

Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

### Spreadsheet software for linear regression analysis

Spreadsheet software for linear regression analysis Robert Nau Fuqua School of Business, Duke University Copies of these slides together with individual Excel files that demonstrate each program are available

### L10: Probability, statistics, and estimation theory

L10: Probability, statistics, and estimation theory Review of probability theory Bayes theorem Statistics and the Normal distribution Least Squares Error estimation Maximum Likelihood estimation Bayesian

### Linear Models and Conjoint Analysis with Nonlinear Spline Transformations

Linear Models and Conjoint Analysis with Nonlinear Spline Transformations Warren F. Kuhfeld Mark Garratt Abstract Many common data analysis models are based on the general linear univariate model, including

### MANIPULATION OF LARGE DATABASES WITH "R"

MANIPULATION OF LARGE DATABASES WITH "R" Ana Maria DOBRE, Andreea GAGIU National Institute of Statistics, Bucharest Abstract Nowadays knowledge is power. In the informational era, the ability to manipulate

### wu.cloud: Insights Gained from Operating a Private Cloud System

wu.cloud: Insights Gained from Operating a Private Cloud System Stefan Theußl, Institute for Statistics and Mathematics WU Wirtschaftsuniversität Wien March 23, 2011 1 / 14 Introduction In statistics we

### Introduction to MSI* for PubH 8403

Introduction to MSI* for PubH 8403 Sep 30, 2015 Nancy Rowe *The Minnesota Supercomputing Institute for Advanced Computational Research Overview MSI at a Glance MSI Resources Access System Access - Physical

### Lecture 20: Clustering

Lecture 20: Clustering Wrap-up of neural nets (from last lecture Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,

### Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

### Notes for STA 437/1005 Methods for Multivariate Data

Notes for STA 437/1005 Methods for Multivariate Data Radford M. Neal, 26 November 2010 Random Vectors Notation: Let X be a random vector with p elements, so that X = [X 1,..., X p ], where denotes transpose.