The R Environment. A high-level overview. Deepayan Sarkar. 22 July 2013. Indian Statistical Institute, Delhi



The R language
• Tutorials in the Astrostatistics workshop will use R
• ...in the form of silent hands-on tutorials
• This talk gives a very high-level introduction to R

What exactly is R?
• R is a language and environment for statistical computing and graphics.
• It is a Free Software project, similar to the S language and environment developed at Bell Laboratories by John Chambers and colleagues.
• R can be considered a different implementation of S.


The origins of S
• Developed at Bell Labs (statistics research department, 1970s onward)
• Primary goals
  • Interactivity: exploratory data analysis vs batch mode
  • Flexibility: novel vs routine methodology
  • Practicality: for actual use, not (just) academic research

R
1993  Started as a teaching tool by Robert Gentleman & Ross Ihaka at the University of Auckland... because S didn't run on the Apple computers they had
1995  Convinced by Martin Maechler to release it as Free Software
2000  Version 1.0 released


• Not really that different from S, but the Free Software/Open Source development model has made it a larger success, particularly in academia

Why the success?
• Rapid prototyping
• Interfaces to external software
• Easy dissemination of research (through packages)


Rapid prototyping
"S is a programming language and environment for all kinds of computing involving data. It has a simple goal: To turn ideas into software, quickly and faithfully."
John Chambers, Programming with Data

Example: Fibonacci sequence
The sequence $\{F_n\}$ is defined by the recurrence relation
$$F_n = F_{n-1} + F_{n-2}$$
with seed values $F_0 = 0$, $F_1 = 1$.

S is a programming language
> fibonacci <- function(n) {
      x <- c(0, 1)                        # seed values F_0 and F_1
      while (length(x) < n) {
          x <- c(x, sum(tail(x, 2)))      # append the sum of the last two terms
      }
      return (x)
  }
> fib10 <- fibonacci(10)
> fib10
 [1]  0  1  1  2  3  5  8 13 21 34

Basic usage of R
• Think of R as a (sophisticated) calculator
• Expressions are typed at the command line
• R evaluates each expression and prints the result
• ...unless the result is stored in a variable (using <- or =)
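The calculator analogy can be sketched in a few lines. This is only an illustration: the variable names `radius` and `area` are made up for the example and do not come from the talk.

```r
# An expression typed at the prompt is evaluated and printed immediately
2 + 3 * 4

# Assigning with <- stores the value instead of printing it
radius <- 2.5                # hypothetical variable name

# = also works for assignment at the top level
area = pi * radius^2

# Typing the variable name (or calling print) displays the stored value
area
print(round(area, 2))
```

Nothing is printed by the two assignment lines; the value only appears once the variable is asked for by name.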


S is a programming language
> fibonacci(20)
 [1]    0    1    1    2    3    5    8   13   21   34   55   89
[13]  144  233  377  610  987 1597 2584 4181
> fib20 <- fibonacci(20)
> fib20
 [1]    0    1    1    2    3    5    8   13   21   34   55   89
[13]  144  233  377  610  987 1597 2584 4181

Easy to call C for efficiency
File fib.c:
#include <Rdefines.h>

/* Fill x[0..n-1] with the first n Fibonacci numbers (assumes n >= 2) */
void cfib(int *x, int n)
{
    int i;
    x[0] = 0; x[1] = 1;
    for (i = 2; i < n; i++)
        x[i] = x[i-1] + x[i-2];
}

/* Entry point for .Call(): allocate an integer vector of length nr and fill it */
SEXP do_fibonacci(SEXP nr)
{
    SEXP ans = PROTECT(NEW_INTEGER(INTEGER_VALUE(nr)));
    cfib(INTEGER_POINTER(ans), LENGTH(ans));
    UNPROTECT(1);
    return ans;
}

Easy to call C for efficiency
$ R CMD SHLIB fib.c
gcc -std=gnu99 -I/home/deepayan/Rinstall/R-devel/lib/R/incl
gcc -std=gnu99 -shared -L/usr/local/lib -o fib.so fib.o -L/
> dyn.load("fib.so")
> cfib10 = .Call("do_fibonacci", as.integer(10))
> cfib10
 [1]  0  1  1  2  3  5  8 13 21 34

Vectorized computation
The Fibonacci sequence has a closed-form expression as well:
$$F(n) = \frac{\varphi^n - (1 - \varphi)^n}{\sqrt{5}}, \quad \text{where } \varphi = \frac{1 + \sqrt{5}}{2}$$
> phi <- (1 + sqrt(5)) / 2
> n <- 0:9
> n
 [1] 0 1 2 3 4 5 6 7 8 9
> (phi^n - (1 - phi)^n) / sqrt(5)
 [1]  0  1  1  2  3  5  8 13 21 34
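As a rough check that the vectorized closed form and the recurrence agree, the two can be compared term by term. The helper names `binet` and `fib_loop` are invented for this sketch and are not part of the talk.

```r
# Closed-form formula applied to a whole vector of indices at once
binet <- function(n) {
  phi <- (1 + sqrt(5)) / 2
  (phi^n - (1 - phi)^n) / sqrt(5)
}

# Same sequence built one term at a time from the recurrence
fib_loop <- function(n_terms) {
  x <- numeric(n_terms)
  if (n_terms >= 2) x[2] <- 1
  if (n_terms >= 3) {
    for (i in 3:n_terms) x[i] <- x[i - 1] + x[i - 2]
  }
  x
}

n <- 0:29
closed <- binet(n)               # no explicit loop: phi^n works on the vector n
looped <- fib_loop(length(n))

# The closed form uses floating-point arithmetic, so compare with a tolerance
all.equal(closed, looped)
```

The vectorized version has no loop in the R code; the exponentiation is applied element by element to `n`.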

S is a programming language
• ...designed for interactive use
• ...with a focus on data analysis
• Basic data structures are vectors
• Large collection of statistical functions
• Advanced statistical graphics capabilities
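A small illustration of vectors as the basic data structure, using a handful of made-up numbers (the values in `x` are not data from the talk):

```r
x <- c(4.1, 5.3, 3.8, 6.0, 5.2)   # made-up measurements

# Arithmetic and comparisons apply to every element at once
centred <- x - mean(x)
large <- x[x > 5]                 # logical indexing keeps elements above 5

# Statistical functions take the whole vector as a single argument
mean(x)
sd(x)
summary(x)
```

Even a single number is a vector of length one in R, which is why functions such as `mean()` and `sd()` need no special handling for scalars.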


Ideal for simulation studies
Example: How many heads when we toss a fair coin 500 times?
> rbinom(1, size = 500, prob = 0.5)
[1] 234
> rbinom(10, size = 500, prob = 0.5)
 [1] 260 249 241 234 263 243 256 269 253 271

Ideal for simulation studies
> library(lattice)   # provides histogram()
> x <- rbinom(1000, size = 500, prob = 0.5)
> histogram(x)
[Figure: histogram of x; horizontal axis: x (roughly 220 to 280), vertical axis: Percent of Total]

What if the coin is not fair?
> x <- rbinom(1000, size = 500, prob = 0.1)
> histogram(x)
[Figure: histogram of x; horizontal axis: x (roughly 30 to 70), vertical axis: Percent of Total]
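One way to sanity-check such a simulation is to compare the simulated mean and standard deviation with the binomial theory values, $np$ and $\sqrt{np(1-p)}$. The seed and the number of replications below are arbitrary choices for this sketch.

```r
set.seed(2013)          # arbitrary seed, only so the run is repeatable

n_toss <- 500           # tosses per experiment
p <- 0.1                # probability of heads for the unfair coin
x <- rbinom(10000, size = n_toss, prob = p)

# Simulated summaries next to their theoretical counterparts
c(simulated = mean(x), theory = n_toss * p)
c(simulated = sd(x), theory = sqrt(n_toss * p * (1 - p)))
```

With more replications the simulated values settle closer to the theoretical ones, which is the usual argument for using simulation when no formula is available.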

Statistical models
• S can fit all kinds of statistical models
• A brief example: Michelson's 1879 experiment on measuring the speed of light
  • Measurements of the speed of light in air
  • Made between 5th June and 2nd July, 1879
  • Five experiments, each consisting of 20 consecutive runs
  • Speed recorded in km/s, minus 299000
  • The currently accepted value on this scale is 734.5


The Michelson data
   Speed Run Expt
1    850   1    1
2    740   2    1
3    900   3    1
4   1070   4    1
5    930   5    1
6    850   6    1
7    950   7    1
8    980   8    1
9    980   9    1
10   880  10    1
11  1000  11    1
12   980  12    1
13   930  13    1
14   650  14    1
15   760  15    1

The R approach: A typical analysis
> data(michelson, package = "MASS")
> michelson
   Speed Run Expt
1    850   1    1
2    740   2    1
3    900   3    1
4   1070   4    1
5    930   5    1
6    850   6    1
7    950   7    1
8    980   8    1
9    980   9    1
10   880  10    1
11  1000  11    1
12   980  12    1
13   930  13    1
14   650  14    1
15   760  15    1
16   810  16    1

> str(michelson)
'data.frame':	100 obs. of  3 variables:
 $ Speed: int  850 740 900 1070 930 850 950 980 980 880 ...
 $ Run  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8
 $ Expt : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1

> bwplot(Speed ~ Expt, data = michelson, aspect = 1)
[Figure: box-and-whisker plots of Speed (roughly 600 to 1000) for each of Expt 1 to 5]

Fitting models
Model:
$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$
where
  $y_{ij}$ = measured speed of light
  $\mu$ = overall mean (actual speed)
  $\alpha_i$ = additive effect of the $i$-th experiment
  $\varepsilon_{ij} \sim N(0, \sigma^2)$ = error
To fit this model in R, use
> fm <- lm(Speed ~ Expt, data = michelson)

Hypothesis testing (ANOVA)
Test whether experiments are different (should not be)
> anova(fm)
Analysis of Variance Table

Response: Speed
          Df Sum Sq Mean Sq F value   Pr(>F)
Expt       4  94514 23628.5  4.2878 0.003114
Residuals 95 523510  5510.6
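The F value and p-value in the table can be recomputed by hand from the sums of squares and degrees of freedom it reports. The sketch below only uses the rounded numbers printed above, so small rounding differences are to be expected; the variable names are made up.

```r
# Values copied from the ANOVA table (rounded as printed)
ss_expt  <- 94514;  df_expt  <- 4
ss_resid <- 523510; df_resid <- 95

# Mean squares are sums of squares divided by degrees of freedom
ms_expt  <- ss_expt / df_expt
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail probability under F(4, 95)
f_value <- ms_expt / ms_resid
p_value <- pf(f_value, df_expt, df_resid, lower.tail = FALSE)

c(F = f_value, p = p_value)
```

A p-value this small suggests that the five experiments differ by more than run-to-run noise would explain, even though they were measuring the same quantity.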

> xyplot(Speed ~ Run | Expt, data = michelson, layout = c(2, 3))
[Figure: Speed (roughly 600 to 1000) against Run (1 to 20), one panel per Expt (1 to 5)]

Summary
The S approach is to work with objects.
• Model fits produce objects, usually stored as variables
• Queried interactively for further analysis
  > anova(fm)
  > summary(fm)
  > residuals(fm)
• Use graphics as a natural component of workflow
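A short sketch of what "working with objects" can look like in practice, refitting the same model on the Michelson data from the MASS package. It assumes MASS is installed, which is the case for standard R installations.

```r
data(michelson, package = "MASS")

# The fitted model is an ordinary object stored in a variable
fm <- lm(Speed ~ Expt, data = michelson)
class(fm)

# Extractor functions query the object without refitting anything
coef(fm)                  # intercept plus one effect per remaining experiment
df.residual(fm)           # residual degrees of freedom
head(residuals(fm))       # first few residuals

# Results of queries are objects too, and can be stored and reused
group_means <- tapply(fitted(fm), michelson$Expt, mean)
group_means
```

Because `fm` keeps everything needed for later queries, the same fit can feed tables, diagnostics, and plots without repeating the computation.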

Even graphics is programmable
Globular cluster data: data/globclus_prop.dat
> GlobClust <- read.table("data/globclus_prop.dat", header = TRUE)
> xyplot(Mv ~ r.tidal, GlobClust, grid = TRUE)
[Figure: scatter plot of Mv against r.tidal (roughly 0 to 250), with a reference grid]

Even graphics is programmable
> xyplot(Mv ~ r.tidal, GlobClust, grid = TRUE,
         scales = list(x = list(log = 2)))
[Figure: scatter plot of Mv against r.tidal, horizontal axis on a log base 2 scale (2^3 to 2^8)]

Adding a regression line
> xyplot(Mv ~ r.tidal, GlobClust,
         scales = list(x = list(log = 2)),
         panel = function(x, y, ...) {
             panel.grid(h = -1, v = -1)
             panel.points(x, y, ...)
             panel.abline(lm(y ~ x), col = "black")   # least-squares line
         })
[Figure: scatter plot of Mv against r.tidal (log base 2 scale, 2^3 to 2^8) with the fitted regression line]

• Doesn't seem to be a good fit
• One solution: use locally linear fits
  • At each location, fit a regression line based on nearby points
• A generalization of this idea is known as LOESS
[Figure: scatter plot of Mv against r.tidal (log base 2 scale, 2^3 to 2^8)]
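The idea of a locally linear fit can be sketched in base R as a weighted least-squares line refitted at each location. This is a simplified illustration on simulated data, not the globular cluster file, and `local_linear` is a made-up helper; the built-in `loess()` function is the full implementation and handles more cases.

```r
# Simulated data (an assumption for this sketch): a curve plus noise
set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

# Estimate the curve at x0 by fitting a line to nearby points only
local_linear <- function(x0, x, y, span = 0.3) {
  d <- abs(x - x0)                            # distance of every point from x0
  h <- sort(d)[ceiling(span * length(x))]     # neighbourhood radius
  w <- ifelse(d < h, (1 - (d / h)^3)^3, 0)    # tricube weights, zero outside
  fit <- lm(y ~ x, weights = w)               # weighted regression line
  unname(predict(fit, newdata = data.frame(x = x0)))
}

# Repeat the local fit over a grid of locations
grid <- seq(1, 9, by = 0.5)
est <- sapply(grid, local_linear, x = x, y = y)

# Compare with the curve used to simulate the data
round(cbind(x0 = grid, estimate = est, truth = sin(grid)), 2)
```

The `span` argument controls how many neighbours contribute to each local line: a larger span gives a smoother but more biased curve.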


> xyplot(Mv ~ r.tidal, GlobClust,
         scales = list(x = list(log = 2)),
         panel = function(x, y, ...) {
             panel.grid(h = -1, v = -1)
             panel.points(x, y, ...)
             panel.loess(x, y, col = "black")   # smooth LOESS curve
         })
[Figure: scatter plot of Mv against r.tidal (log base 2 scale, 2^3 to 2^8) with the fitted LOESS curve]

• Smooth curve denotes the fitted LOESS model
• Estimates the average response (Mv) as a function of the predictor (r.tidal)
• But how confident are we about the fitted estimate?
[Figure: scatter plot of Mv against r.tidal (log base 2 scale, 2^3 to 2^8) with the fitted LOESS curve]

One approach: Bootstrap
• Resample with replacement to mimic underlying randomness
• Use the same modeling process to obtain an estimate
• Repeat many times to assess variability
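The three steps above can be sketched for a simple statistic before applying them to a fitted curve. The sample below is simulated and the choice of the median as the estimate is an assumption made for this illustration.

```r
set.seed(42)                        # arbitrary seed for repeatability
obs <- rexp(50, rate = 1 / 10)      # simulated sample of size 50

B <- 2000                           # number of bootstrap replications

# Steps 1 to 3: resample with replacement, recompute the estimate, repeat
boot_medians <- replicate(B, median(sample(obs, replace = TRUE)))

# Spread of the replicated estimates approximates the sampling variability
boot_se <- sd(boot_medians)
ci <- quantile(boot_medians, c(0.025, 0.975))   # percentile interval

median(obs)
boot_se
ci
```

The same recipe carries over to the LOESS curve: each resample produces a whole fitted curve rather than a single number, and the cloud of curves shows the variability.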

> xyplot(Mv ~ r.tidal, GlobClust,
         scales = list(x = list(log = 2)),
         panel = function(x, y, ...) {
             panel.grid(h = -1, v = -1)
             n <- length(x)
             for (i in 1:1000) {
                 bs.id <- sample(1:n, replace = TRUE) ## SRSWR: simple random sampling with replacement
                 panel.loess(x[bs.id], y[bs.id], col = "red", alpha = 0.02)
             }
             panel.points(x, y, ...)
             panel.loess(x, y, col = "black")
         })

[Figure: scatter plot of Mv against r.tidal (log base 2 scale, 2^3 to 2^8) with 1000 translucent red bootstrap LOESS curves and the original fit in black]

Powerful built-in tools + Programming language ⇒ Flexibility

Dissemination of research
• Rapid prototyping: quick implementation of research ideas
• Well-structured packaging system allows dissemination
• CRAN (Comprehensive R Archive Network): > 3000 packages
• Other specialized collections (Bioconductor, Omegahat)

Sweave: reproducible documents
• Inspired by literate documents (Knuth)
• Enables mixing of R code and LaTeX / HTML (etc.)
• Source file reproduces both analysis and report
• Reproducible research + convenience
• rnw/bootstrap.rnw


Summary
• R is a feature-rich interactive language + environment, ideally suited to data analysis as well as other kinds of numerical computation
• Some learning is required before it can be used effectively
• Typical mind-blocks for newcomers:
  • R is not C! More like LISP / Scheme
  • Vectorization (easy to get past with a little experience)
  • Functional approach to programming
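The three mind-blocks can be seen side by side in one small task, computing the squares of 1 to 5. The names below (`squares_loop`, `compose`, and so on) are invented for this sketch.

```r
# C-style: preallocate, then fill one element at a time
squares_loop <- numeric(5)
for (i in 1:5) squares_loop[i] <- i^2

# Vectorized: the operator applies to the whole vector
squares_vec <- (1:5)^2

# Functional: pass an anonymous function to sapply()
squares_fun <- sapply(1:5, function(i) i^2)

# Functions are ordinary values: they can be taken as arguments and returned
compose <- function(f, g) function(x) g(f(x))
root_of_sum <- compose(sum, sqrt)
root_of_sum(c(9, 16))            # square root of 9 + 16
```

All three versions give the same squares; the vectorized form is usually the shortest and the one most R code is written in.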


Questions?