The R Environment A high-level overview Deepayan Sarkar Indian Statistical Institute, Delhi 22 July 2013
The R language ˆ Tutorials in the Astrostatistics workshop will use R ˆ...in the form of silent hands-on tutorials ˆ This talk gives a very high-level introduction to R
What exactly is R? ˆ R is a language and environment for statistical computing and graphics. ˆ It is a Free Software project which is similar to the S language and environment which was developed at Bell Laboratories by John Chambers and colleagues. ˆ R can be considered as a different implementation of S.
What exactly is R? ˆ R is a language and environment for statistical computing and graphics. ˆ It is a Free Software project which is similar to the S language and environment which was developed at Bell Laboratories by John Chambers and colleagues. ˆ R can be considered as a different implementation of S.
The origins of S ˆ Developed at Bell Labs (statistics research department, 1970s onward) ˆ Primary goals ˆ Interactivity: Exploratory Data Analysis vs batch mode ˆ Flexibility: Novel vs routine methodology ˆ Practical: For actual use, not (just) academic research
R 1993 Started as teaching tool by Robert Gentleman & Ross Ihaka at the Univesity of Auckland... because S didn t run on the Apple computers they had 1995 Convinced by Martin Maechler to release as Free Software 2000 Version 1.0 released
R 1993 Started as teaching tool by Robert Gentleman & Ross Ihaka at the Univesity of Auckland... because S didn t run on the Apple computers they had 1995 Convinced by Martin Maechler to release as Free Software 2000 Version 1.0 released
R 1993 Started as teaching tool by Robert Gentleman & Ross Ihaka at the Univesity of Auckland... because S didn t run on the Apple computers they had 1995 Convinced by Martin Maechler to release as Free Software 2000 Version 1.0 released
ˆ Not really that different from S, but the Free Software/Open Source development model has made it a larger success, particularly in academia R
Why the success? ˆ Rapid prototyping ˆ Interfaces to external software ˆ Easy dissemination of research (through packages)
Why the success? ˆ Rapid prototyping ˆ Interfaces to external software ˆ Easy dissemination of research (through packages)
Why the success? ˆ Rapid prototyping ˆ Interfaces to external software ˆ Easy dissemination of research (through packages)
Rapid prototyping S is a programming language and environment for all kinds of computing involving data. It has a simple goal: To turn ideas into software, quickly and faithfully John Chambers Programming with Data
Example: Fibonacci sequence Sequence {F n } defined by the recurrence relation F n = F n 1 + F n 2 with seed values F 0 = 0, F 1 = 1
S is a programming language > fibonacci <- function(n) { x <- c(0, 1) while (length(x) < n) { x <- c(x, sum(tail(x, 2))) } return (x) } > fib10 <- fibonacci(10) > fib10 [1] 0 1 1 2 3 5 8 13 21 34
Basic usage of R ˆ Think of R as a (sophisticated) calculator ˆ Expressions are typed at the command line ˆ R evaluates it and prints result ˆ...unless result is stored in a variable (using <- or =)
Basic usage of R ˆ Think of R as a (sophisticated) calculator ˆ Expressions are typed at the command line ˆ R evaluates it and prints result ˆ...unless result is stored in a variable (using <- or =)
S is a programming language > fibonacci(20) [1] 0 1 1 2 3 5 8 13 21 34 55 89 [13] 144 233 377 610 987 1597 2584 4181 > fib20 <- fibonacci(20) > fib20 [1] 0 1 1 2 3 5 8 13 21 34 55 89 [13] 144 233 377 610 987 1597 2584 4181
File fib.c: #include <Rdefines.h> Easy to call C for efficiency void cfib(int *x, int n) { int i; x[0] = 0; x[1] = 1; for (i = 2; i < n; i++) x[i] = x[i-1] + x[i-2]; } SEXP do_fibonacci(sexp nr) { SEXP ans = PROTECT(NEW_INTEGER(INTEGER_VALUE(nr))); cfib(integer_pointer(ans), length(ans)); UNPROTECT(1); return ans; }
Easy to call C for efficiency $ R CMD SHLIB fib.c gcc -std=gnu99 -I/home/deepayan/Rinstall/R-devel/lib/R/incl gcc -std=gnu99 -shared -L/usr/local/lib -o fib.so fib.o -L/ > dyn.load("fib.so") > cfib10 =.Call("do_fibonacci", as.integer(10)) > cfib10 [1] 0 1 1 2 3 5 8 13 21 34
Vectorized computation The Fibonacci series has a closed-form expression as well. F (n) = φn (1 φ) n 5, where φ = 1 + 5 2 > phi <- (1 + sqrt(5)) / 2 > n <- 0:9 > n [1] 0 1 2 3 4 5 6 7 8 9 > (phi^n - (1 - phi)^n) / sqrt(5) [1] 0 1 1 2 3 5 8 13 21 34
S is a programming language ˆ...designed for interactive use ˆ...with a focus on data analysis ˆ Basic data structures are vectors ˆ Large collection of statistical functions ˆ Advanced statistical graphics capabilities
S is a programming language ˆ...designed for interactive use ˆ...with a focus on data analysis ˆ Basic data structures are vectors ˆ Large collection of statistical functions ˆ Advanced statistical graphics capabilities
Ideal for simulation studies Example: How many heads when we toss a fair coin 500 times? > rbinom(1, size = 500, prob = 0.5) [1] 234 > rbinom(10, size = 500, prob = 0.5) [1] 260 249 241 234 263 243 256 269 253 271
Ideal for simulation studies > x <- rbinom(1000, size = 500, prob = 0.5) > histogram(x) 20 Percent of Total 15 10 5 0 220 240 260 280 x
What if the coin is not fair? > x <- rbinom(1000, size = 500, prob = 0.1) > histogram(x) 25 20 Percent of Total 15 10 5 0 30 40 50 60 70 x
Statistical models ˆ S can fit all kinds of statistical models ˆ A brief example: ˆ Michelson s 1879 experiment on measuring the speed of light ˆ Measurements of the speed of light in air ˆ Made between 5th June and 2nd July, 1879 ˆ Five experiments, each consisting of 20 consecutive runs ˆ Speed recorded in km/s 299000 ˆ The currently accepted value on this scale is 734.5
Statistical models ˆ S can fit all kinds of statistical models ˆ A brief example: ˆ Michelson s 1879 experiment on measuring the speed of light ˆ Measurements of the speed of light in air ˆ Made between 5th June and 2nd July, 1879 ˆ Five experiments, each consisting of 20 consecutive runs ˆ Speed recorded in km/s 299000 ˆ The currently accepted value on this scale is 734.5
Speed Run Expt 1 850 1 1 2 740 2 1 3 900 3 1 4 1070 4 1 5 930 5 1 6 850 6 1 7 950 7 1 8 980 8 1 9 980 9 1 10 880 10 1 11 1000 11 1 12 980 12 1 13 930 13 1 14 650 14 1 15 760 15 1 The Michelson data
The R approach : A typical analysis > data(michelson, package = "MASS") > michelson Speed Run Expt 1 850 1 1 2 740 2 1 3 900 3 1 4 1070 4 1 5 930 5 1 6 850 6 1 7 950 7 1 8 980 8 1 9 980 9 1 10 880 10 1 11 1000 11 1 12 980 12 1 13 930 13 1 14 650 14 1 15 760 15 1 16 810 16 1
> str(michelson) 'data.frame': 100 obs. of 3 variables: $ Speed: int 850 740 900 1070 930 850 950 980 980 880... $ Run : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 $ Expt : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1
> bwplot(speed ~ Expt, data = michelson, aspect = 1) 1000 900 Speed 800 700 600 1 2 3 4 5
Fitting models Model: y ij = µ + α i + ε ij where y ij = measured speed of light µ = overall mean (actual speed) α i = additive effect of i-th experiment ε ij N (0, σ 2 ) = error To fit this model in R, use > fm <- lm(speed ~ Expt, data = michelson)
Hypothesis testing (ANOVA) Test whether experiments are different (should not be) > anova(fm) Analysis of Variance Table Response: Speed Df Sum Sq Mean Sq F value Pr(>F) Expt 4 94514 23628.5 4.2878 0.003114 Residuals 95 523510 5510.6
> xyplot(speed ~ Run Expt, data = michelson, layout = c(2, 3)) 5 1000 900 800 700 600 3 4 1000 Speed 1 2 900 800 700 600 1000 900 800 700 600 1 2 3 4 5 6 7 8 9 1011121314151617181920 Run
Summary The S approach is to work with objects. ˆ Model fits produce objects, usually stored as variables ˆ Queried interactively for further analysis > anova(fm) > summary(fm) > residuals(fm) ˆ Use graphics as a natural component of workflow
Even graphics is programmable Globular cluster data: data/globclus_prop.dat > GlobClust <- read.table("data/globclus_prop.dat", header = TRUE) > xyplot(mv ~ r.tidal, GlobClust, grid = TRUE) 2 4 Mv 6 8 10 0 50 100 150 200 250 r.tidal
Even graphics is programmable > xyplot(mv ~ r.tidal, GlobClust, grid = TRUE, scales = list(x = list(log = 2))) 2 4 Mv 6 8 10 2^3 2^4 2^5 2^6 2^7 2^8 r.tidal
Adding a regression line > xyplot(mv ~ r.tidal, GlobClust, scales = list(x = list(log = 2) panel = function(x, y,...) { panel.grid(h = -1, v = -1) panel.points(x, y,...) panel.abline(lm(y ~ x), col = "black") }) 2 4 Mv 6 8 10 2^3 2^4 2^5 2^6 2^7 2^8 r.tidal
ˆ Doesn t seem to be a good fit ˆ One solution: use locally linear fits ˆ At each location, fit regression line based on nearby points ˆ A generalization of this idea is known as LOESS 2 4 Mv 6 8 10 2^3 2^4 2^5 2^6 2^7 2^8 r.tidal
ˆ Doesn t seem to be a good fit ˆ One solution: use locally linear fits ˆ At each location, fit regression line based on nearby points ˆ A generalization of this idea is known as LOESS 2 4 Mv 6 8 10 2^3 2^4 2^5 2^6 2^7 2^8 r.tidal
ˆ Doesn t seem to be a good fit ˆ One solution: use locally linear fits ˆ At each location, fit regression line based on nearby points ˆ A generalization of this idea is known as LOESS 2 4 Mv 6 8 10 2^3 2^4 2^5 2^6 2^7 2^8 r.tidal
> xyplot(mv ~ r.tidal, GlobClust, scales = list(x = list(log = 2)), panel = function(x, y,...) { panel.grid(h = -1, v = -1) panel.points(x, y,...) panel.loess(x, y, col = "black") }) 2 4 Mv 6 8 10 2^3 2^4 2^5 2^6 2^7 2^8 r.tidal
ˆ Smooth curve denotes fitted LOESS model ˆ Estimates average acceleration as function of distance ˆ But how confident are we about the fitted estimate? 2 4 Mv 6 8 10 2^3 2^4 2^5 2^6 2^7 2^8 r.tidal
One approach: Bootstrap ˆ Resample with replacement to mimic underlying randomness ˆ Use same modeling process to obtain estimate ˆ Repeat many times to assess variability
> xyplot(mv ~ r.tidal, GlobClust, scales = list(x = list(log = 2)), panel = function(x, y,...) { panel.grid(h = -1, v = -1) n <- length(x) for (i in 1:1000) { bs.id <- sample(1:n, replace = TRUE) ## SRSWR panel.loess(x[bs.id], y[bs.id], col = "red", alpha = 0.02) } panel.points(x, y,...) panel.loess(x, y, col = "black") })
2 4 Mv 6 8 10 2^3 2^4 2^5 2^6 2^7 2^8 r.tidal
Powerful built-in tools + Programming language Flexibility
Dissemination of research ˆ Rapid prototyping quick implementation of research ideas ˆ Well-structured packaging system allows dissemination ˆ CRAN: Comprehensive R Archive Network: > 3000 packages ˆ Other specialized collections (Bioconductor, Omegahat)
Sweave: reproducible documents ˆ Inspired by literate documents (Knuth) ˆ Enables mixing of R code and L A TEX/ HTML (etc.) ˆ Source file reproduces both analysis and report ˆ Reproducible research + convenience ˆ rnw/bootstrap.rnw
Sweave: reproducible documents ˆ Inspired by literate documents (Knuth) ˆ Enables mixing of R code and L A TEX/ HTML (etc.) ˆ Source file reproduces both analysis and report ˆ Reproducible research + convenience ˆ rnw/bootstrap.rnw
Summary ˆ R is a feature-rich interactive language + environment ideally suited to data analysis as well as other kinds of numerical computations ˆ Some learning required before it can be used effectively ˆ Typical mind-blocks for newcomers: ˆ R is not C! More like LISP / Scheme ˆ Vectorization (easy to get past with a little experience) ˆ Functional approach to programming
Summary ˆ R is a feature-rich interactive language + environment ideally suited to data analysis as well as other kinds of numerical computations ˆ Some learning required before it can be used effectively ˆ Typical mind-blocks for newcomers: ˆ R is not C! More like LISP / Scheme ˆ Vectorization (easy to get past with a little experience) ˆ Functional approach to programming
Questions?