The R Environment: A high-level overview. Deepayan Sarkar, Indian Statistical Institute, Delhi, 22 July 2013


1 The R Environment
A high-level overview
Deepayan Sarkar
Indian Statistical Institute, Delhi
22 July 2013

2 The R language
• Tutorials in the Astrostatistics workshop will use R
• ...in the form of silent hands-on tutorials
• This talk gives a very high-level introduction to R

3 What exactly is R?
• R is a language and environment for statistical computing and graphics.
• It is a Free Software project, similar to the S language and environment developed at Bell Laboratories by John Chambers and colleagues.
• R can be considered a different implementation of S.


5 The origins of S
• Developed at Bell Labs (statistics research department, 1970s onward)
• Primary goals
    • Interactivity: Exploratory Data Analysis vs batch mode
    • Flexibility: Novel vs routine methodology
    • Practical: For actual use, not (just) academic research

6 R
• 1993: Started as a teaching tool by Robert Gentleman & Ross Ihaka at the University of Auckland... because S didn't run on the Apple computers they had
• 1995: Convinced by Martin Maechler to release it as Free Software
• 2000: Version 1.0 released

9 R
• Not really that different from S, but the Free Software/Open Source development model has made it a larger success, particularly in academia

10 Why the success?
• Rapid prototyping
• Interfaces to external software
• Easy dissemination of research (through packages)

13 Rapid prototyping
"S is a programming language and environment for all kinds of computing involving data. It has a simple goal: To turn ideas into software, quickly and faithfully."
(John Chambers, Programming with Data)

14 Example: Fibonacci sequence
Sequence $\{F_n\}$ defined by the recurrence relation
$F_n = F_{n-1} + F_{n-2}$
with seed values $F_0 = 0$, $F_1 = 1$

15 S is a programming language
> fibonacci <- function(n) {
      x <- c(0, 1)
      while (length(x) < n) {
          x <- c(x, sum(tail(x, 2)))
      }
      return (x)
  }
> fib10 <- fibonacci(10)
> fib10
[1]  0  1  1  2  3  5  8 13 21 34

16 Basic usage of R
• Think of R as a (sophisticated) calculator
• Expressions are typed at the command line
• R evaluates each expression and prints the result
• ...unless the result is stored in a variable (using <- or =)
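The calculator-style interaction described above can be sketched as a short script. This is only an illustration: the variable names `x` and `y` are made up, and `print()` stands in for the auto-printing that happens at the interactive prompt.

```r
# At the prompt, an expression is evaluated and its value is auto-printed.
# In a script, print() shows the same thing explicitly.
print(2 + 3 * 4)

# Assignment stores the value instead of printing it.
x <- sqrt(16)   # the usual assignment operator
y = x / 2       # '=' also works for top-level assignment

# Naming the variable (or calling print) displays the stored value.
print(x)
print(y)
```

Nothing is shown for the two assignment lines; the values only appear once the variables are printed.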


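18 S is a programming language
> fibonacci(20)
 [1]    0    1    1    2    3    5    8   13   21   34   55   89
[13]  144  233  377  610  987 1597 2584 4181
> fib20 <- fibonacci(20)
> fib20
 [1]    0    1    1    2    3    5    8   13   21   34   55   89
[13]  144  233  377  610  987 1597 2584 4181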

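19 Easy to call C for efficiency
File fib.c:
#include <Rdefines.h>

/* Fill x[0..n-1] with the first n Fibonacci numbers. */
void cfib(int *x, int n)
{
    int i;
    if (n > 0) x[0] = 0;   /* guard against n < 2 */
    if (n > 1) x[1] = 1;
    for (i = 2; i < n; i++)
        x[i] = x[i-1] + x[i-2];
}

/* Entry point for .Call(); nr is the requested length. */
SEXP do_fibonacci(SEXP nr)
{
    SEXP ans = PROTECT(NEW_INTEGER(INTEGER_VALUE(nr)));
    cfib(INTEGER_POINTER(ans), LENGTH(ans));
    UNPROTECT(1);
    return ans;
}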

20 Easy to call C for efficiency
$ R CMD SHLIB fib.c
gcc -std=gnu99 -I/home/deepayan/Rinstall/R-devel/lib/R/incl
gcc -std=gnu99 -shared -L/usr/local/lib -o fib.so fib.o -L/
> dyn.load("fib.so")
> cfib10 = .Call("do_fibonacci", as.integer(10))
> cfib10
[1]  0  1  1  2  3  5  8 13 21 34

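21 Vectorized computation
The Fibonacci sequence has a closed-form expression as well:
$F(n) = \frac{\varphi^n - (1 - \varphi)^n}{\sqrt{5}}$, where $\varphi = \frac{1 + \sqrt{5}}{2}$
> phi <- (1 + sqrt(5)) / 2
> n <- 0:9
> n
[1] 0 1 2 3 4 5 6 7 8 9
> (phi^n - (1 - phi)^n) / sqrt(5)
[1]  0  1  1  2  3  5  8 13 21 34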
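As a rough check that the vectorized closed form agrees with the recurrence, the two can be compared directly. This is a minimal sketch; the helper names `fib_iter` and `fib_closed` are hypothetical and not part of the talk.

```r
# Iterative version, following the recurrence F_n = F_{n-1} + F_{n-2}.
fib_iter <- function(n) {
  x <- numeric(n)                      # starts as all zeros, so F_0 = 0
  if (n >= 2) x[2] <- 1                # F_1 = 1
  if (n >= 3) for (i in 3:n) x[i] <- x[i - 1] + x[i - 2]
  x
}

# Closed form, applied to a whole vector of indices at once.
fib_closed <- function(n) {
  phi <- (1 + sqrt(5)) / 2
  k <- seq_len(n) - 1                  # indices 0, 1, ..., n - 1
  (phi^k - (1 - phi)^k) / sqrt(5)
}

# The closed form uses floating-point arithmetic, so compare with a
# tolerance (all.equal) rather than with exact equality.
agree <- isTRUE(all.equal(fib_iter(30), fib_closed(30)))
print(agree)
```

The closed form needs no loop because `^`, `-` and `/` all operate elementwise on the vector `k`.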

22 S is a programming language
• ...designed for interactive use
• ...with a focus on data analysis
• Basic data structures are vectors
• Large collection of statistical functions
• Advanced statistical graphics capabilities


24 Ideal for simulation studies
Example: How many heads when we toss a fair coin 500 times?
> rbinom(1, size = 500, prob = 0.5)
[1] 234
> rbinom(10, size = 500, prob = 0.5)
[output not preserved in the transcript]

25 Ideal for simulation studies
> library(lattice)  # provides histogram()
> x <- rbinom(1000, size = 500, prob = 0.5)
> histogram(x)
[Figure: histogram of x; y-axis: Percent of Total]

26 What if the coin is not fair?
> x <- rbinom(1000, size = 500, prob = 0.1)
> histogram(x)
[Figure: histogram of x; y-axis: Percent of Total]
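The two simulations above can also be summarized numerically and compared with the binomial theory (mean n p, standard deviation sqrt(n p (1 - p))). This is only a sketch: the seed and the names `probs`, `sim` and `theory` are arbitrary choices, not part of the talk.

```r
set.seed(1)                            # arbitrary seed, for reproducibility only
n_tosses <- 500
probs <- c(fair = 0.5, biased = 0.1)

# For each coin, simulate 1000 experiments of 500 tosses and summarize.
sim <- sapply(probs, function(p) {
  x <- rbinom(1000, size = n_tosses, prob = p)
  c(mean = mean(x), sd = sd(x))
})

# Theoretical values for a Binomial(n, p) count.
theory <- rbind(mean = n_tosses * probs,
                sd   = sqrt(n_tosses * probs * (1 - probs)))

print(round(sim, 2))
print(round(theory, 2))
```

The simulated summaries should land close to the theoretical ones, which is a quick sanity check before trusting a larger simulation study.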

27 Statistical models
• S can fit all kinds of statistical models
• A brief example:
    • Michelson's 1879 experiment on measuring the speed of light
    • Measurements of the speed of light in air
    • Made between 5th June and 2nd July, 1879
    • Five experiments, each consisting of 20 consecutive runs
    • Speed recorded in km/s (with 299,000 subtracted)
    • The currently accepted value on this scale is 734.5


29 The Michelson data
[Figure: axes labelled Speed, Run, Expt]

30 The R approach: A typical analysis
> data(michelson, package = "MASS")
> michelson
    Speed Run Expt
[rows not preserved in the transcript]

31
> str(michelson)
'data.frame': 100 obs. of 3 variables:
 $ Speed: int
 $ Run  : Factor w/ 20 levels "1","2","3","4",..:
 $ Expt : Factor w/ 5 levels "1","2","3","4",..:

32
> bwplot(Speed ~ Expt, data = michelson, aspect = 1)
[Figure: box-and-whisker plot of Speed for each Expt]

33 Fitting models
Model: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$, where
• $y_{ij}$ = measured speed of light
• $\mu$ = overall mean (actual speed)
• $\alpha_i$ = additive effect of the $i$-th experiment
• $\varepsilon_{ij} \sim N(0, \sigma^2)$ = error
To fit this model in R, use
> fm <- lm(Speed ~ Expt, data = michelson)

34 Hypothesis testing (ANOVA)
Test whether experiments are different (should not be)
> anova(fm)
Analysis of Variance Table

Response: Speed
          Df Sum Sq Mean Sq F value Pr(>F)
Expt
Residuals
[numeric entries not preserved in the transcript]
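The same fit-then-test workflow can be tried without the MASS data. The sketch below uses simulated data: the overall mean, experiment effects, error standard deviation, and the names `sim_data` and `fm_sim` are all hypothetical and are not Michelson's measurements.

```r
set.seed(42)                           # arbitrary seed, for reproducibility only

# Five "experiments" of 20 runs each, mimicking the layout of the design.
expt  <- factor(rep(1:5, each = 20))
alpha <- c(0, 10, -5, 5, -10)          # hypothetical experiment effects
y     <- 850 + alpha[as.integer(expt)] + rnorm(100, sd = 70)
sim_data <- data.frame(y = y, expt = expt)

# One-way model y_ij = mu + alpha_i + eps_ij, then the ANOVA table.
fm_sim <- lm(y ~ expt, data = sim_data)
tab <- anova(fm_sim)
print(tab)

# The F statistic is the ratio of between-experiment to within-experiment
# mean squares; recomputing it by hand shows what the table reports.
f_manual <- (tab[["Sum Sq"]][1] / tab$Df[1]) / (tab[["Sum Sq"]][2] / tab$Df[2])
print(f_manual)
```

With 5 groups and 100 observations, the table has 4 degrees of freedom for the experiment effect and 95 for the residuals.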

35
> xyplot(Speed ~ Run | Expt, data = michelson, layout = c(2, 3))
[Figure: Speed against Run, one panel per Expt]

36 Summary
The S approach is to work with objects.
• Model fits produce objects, usually stored as variables
• Queried interactively for further analysis
> anova(fm)
> summary(fm)
> residuals(fm)
• Use graphics as a natural component of workflow
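To make "model fits produce objects" concrete, here is a small sketch using the built-in `cars` data set instead of the Michelson data; the name `fm_cars` is illustrative.

```r
# Fit a simple regression and keep the result as an object.
fm_cars <- lm(dist ~ speed, data = cars)

# The fit is an ordinary R object with a class and named components.
print(class(fm_cars))
print(names(fm_cars))

# Accessor functions query the object instead of re-running the analysis.
print(coef(fm_cars))                   # estimated intercept and slope
s <- summary(fm_cars)                  # summary() returns another object
print(s$r.squared)
print(head(residuals(fm_cars)))
```

Because `summary(fm_cars)` is itself an object, individual pieces such as `s$r.squared` can be extracted and reused in later computations.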

37 Even graphics is programmable
Globular cluster data: data/globclus_prop.dat
> GlobClust <- read.table("data/globclus_prop.dat", header = TRUE)
> xyplot(Mv ~ r.tidal, GlobClust, grid = TRUE)
[Figure: scatter plot of Mv against r.tidal]

38 Even graphics is programmable
> xyplot(Mv ~ r.tidal, GlobClust, grid = TRUE,
         scales = list(x = list(log = 2)))
[Figure: Mv against r.tidal, x-axis on a log base 2 scale]

39 Adding a regression line
> xyplot(Mv ~ r.tidal, GlobClust,
         scales = list(x = list(log = 2)),
         panel = function(x, y, ...) {
             panel.grid(h = -1, v = -1)
             panel.points(x, y, ...)
             panel.abline(lm(y ~ x), col = "black")
         })
[Figure: Mv against r.tidal (log base 2 scale) with fitted regression line]

40
• Doesn't seem to be a good fit
• One solution: use locally linear fits
• At each location, fit a regression line based on nearby points
• A generalization of this idea is known as LOESS
[Figure: Mv against r.tidal, x-axis on a log base 2 scale]

43
> xyplot(Mv ~ r.tidal, GlobClust,
         scales = list(x = list(log = 2)),
         panel = function(x, y, ...) {
             panel.grid(h = -1, v = -1)
             panel.points(x, y, ...)
             panel.loess(x, y, col = "black")
         })
[Figure: Mv against r.tidal (log base 2 scale) with LOESS curve]

44
• Smooth curve denotes fitted LOESS model
• Estimates average Mv as a function of r.tidal
• But how confident are we about the fitted estimate?
[Figure: Mv against r.tidal (log base 2 scale) with LOESS curve]

45 One approach: Bootstrap
• Resample with replacement to mimic underlying randomness
• Use same modeling process to obtain estimate
• Repeat many times to assess variability
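The three bootstrap steps above can be sketched numerically with base R, without any plotting. This is an illustration under stated assumptions: the data are simulated (not the globular cluster data), the names `fit_at` and `boot_est` are hypothetical, and a percentile interval at a single point `x0` stands in for the overlaid curves used in the talk.

```r
set.seed(123)                          # arbitrary seed, for reproducibility only
n  <- 200
x  <- runif(n, 1, 10)                  # hypothetical predictor
y  <- sqrt(x) + rnorm(n, sd = 0.3)     # hypothetical noisy response
x0 <- 5                                # point at which to assess the fit

# The "modeling process": fit LOESS and evaluate the curve at x0.
fit_at <- function(x, y, x0) {
  fit <- loess(y ~ x)
  as.numeric(predict(fit, newdata = data.frame(x = x0)))
}

est <- fit_at(x, y, x0)                # estimate from the original data

# Resample rows with replacement, refit, and repeat many times.
boot_est <- replicate(200, {
  id <- sample(n, replace = TRUE)
  fit_at(x[id], y[id], x0)
})

# The spread of the bootstrap estimates reflects the variability of the fit.
ci <- quantile(boot_est, c(0.025, 0.975))
print(est)
print(ci)
```

The key design point is that `fit_at()` is applied unchanged to every resample, so the variability being measured is that of the whole modeling process.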

46
> xyplot(Mv ~ r.tidal, GlobClust,
         scales = list(x = list(log = 2)),
         panel = function(x, y, ...) {
             panel.grid(h = -1, v = -1)
             n <- length(x)
             for (i in 1:1000) {
                 ## SRSWR: simple random sampling with replacement
                 bs.id <- sample(1:n, replace = TRUE)
                 panel.loess(x[bs.id], y[bs.id], col = "red", alpha = 0.02)
             }
             panel.points(x, y, ...)
             panel.loess(x, y, col = "black")
         })

47
[Figure: Mv against r.tidal (log base 2 scale), with bootstrap LOESS curves in red and the original LOESS fit in black]

48
Powerful built-in tools + Programming language => Flexibility

49 Dissemination of research
• Rapid prototyping: quick implementation of research ideas
• Well-structured packaging system allows dissemination
• CRAN (Comprehensive R Archive Network): > 3000 packages
• Other specialized collections (Bioconductor, Omegahat)

50 Sweave: reproducible documents
• Inspired by literate documents (Knuth)
• Enables mixing of R code and LaTeX / HTML (etc.)
• Source file reproduces both analysis and report
• Reproducible research + convenience
• rnw/bootstrap.rnw


52 Summary
• R is a feature-rich interactive language + environment, ideally suited to data analysis as well as other kinds of numerical computation
• Some learning is required before it can be used effectively
• Typical mental blocks for newcomers:
    • R is not C! More like LISP / Scheme
    • Vectorization (easy to get past with a little experience)
    • Functional approach to programming
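The contrast between a C-style loop, vectorization, and the functional approach can be seen in a few lines. This is a minimal sketch; the data vector and the helper `compose` are made up for illustration.

```r
x <- c(3, 1, 4, 1, 5, 9, 2, 6)

# C-style: an explicit loop with an accumulator.
total <- 0
for (i in seq_along(x)) total <- total + x[i]^2

# Vectorized: arithmetic applies elementwise, so no loop is needed.
total_vec <- sum(x^2)

# Functional: functions are ordinary values that can be passed to,
# and returned from, other functions.
squares <- sapply(x, function(v) v^2)
compose <- function(f, g) function(v) f(g(v))
sum_of_squares <- compose(sum, function(v) v^2)

print(total)
print(total_vec)
print(sum_of_squares(x))
```

All three approaches compute the same sum of squares; the vectorized form is usually the shortest and the fastest in R.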


54 Questions?