The R Environment. A high-level overview. Deepayan Sarkar. 22 July 2013. Indian Statistical Institute, Delhi


1 The R Environment A high-level overview Deepayan Sarkar Indian Statistical Institute, Delhi 22 July 2013

2 The R language
• Tutorials in the Astrostatistics workshop will use R
• ...in the form of silent hands-on tutorials
• This talk gives a very high-level introduction to R

3 What exactly is R?
• R is a language and environment for statistical computing and graphics.
• It is a Free Software project, similar to the S language and environment that was developed at Bell Laboratories by John Chambers and colleagues.
• R can be considered a different implementation of S.

5 The origins of S
• Developed at Bell Labs (statistics research department, 1970s onward)
• Primary goals
    • Interactivity: Exploratory Data Analysis vs batch mode
    • Flexibility: Novel vs routine methodology
    • Practical: For actual use, not (just) academic research

6 R
1993  Started as a teaching tool by Robert Gentleman & Ross Ihaka at the University of Auckland
      ...because S didn't run on the Apple computers they had
1995  Convinced by Martin Maechler to release as Free Software
2000  Version 1.0 released

9 R
• Not really that different from S, but the Free Software/Open Source development model has made it a larger success, particularly in academia

10 Why the success?
• Rapid prototyping
• Interfaces to external software
• Easy dissemination of research (through packages)

13 Rapid prototyping
"S is a programming language and environment for all kinds of computing involving data. It has a simple goal: To turn ideas into software, quickly and faithfully."
    John Chambers, Programming with Data

14 Example: Fibonacci sequence
Sequence $\{F_n\}$ defined by the recurrence relation
    $F_n = F_{n-1} + F_{n-2}$
with seed values $F_0 = 0$, $F_1 = 1$

15 S is a programming language
    > fibonacci <- function(n) {
          x <- c(0, 1)
          while (length(x) < n) {
              x <- c(x, sum(tail(x, 2)))
          }
          return (x)
      }
    > fib10 <- fibonacci(10)
    > fib10
     [1]  0  1  1  2  3  5  8 13 21 34

16 Basic usage of R
• Think of R as a (sophisticated) calculator
• Expressions are typed at the command line
• R evaluates each expression and prints the result
• ...unless the result is stored in a variable (using <- or =)
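
The calculator analogy can be tried directly at the prompt. The following is a minimal sketch; the variable names `total` and `half` are made up for illustration and do not come from the talk.

```r
# An expression typed at the prompt is evaluated and its value is printed
2 + 3 * 4

# Assigning with <- (or =) stores the value; nothing is printed
total <- 2 + 3 * 4

# Typing the variable name prints the stored value
total

# Wrapping an assignment in parentheses stores and prints in one step
(half <- total / 2)
```

The parenthesized form is a convenience for interactive work, where one often wants to keep a result and look at it immediately.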

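18 S is a programming language
    > fibonacci(20)
     [1]    0    1    1    2    3    5    8   13   21   34   55   89
    [13]  144  233  377  610  987 1597 2584 4181
    > fib20 <- fibonacci(20)
    > fib20
     [1]    0    1    1    2    3    5    8   13   21   34   55   89
    [13]  144  233  377  610  987 1597 2584 4181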

19 Easy to call C for efficiency
File fib.c:
    #include <Rdefines.h>

    /* Fill x[0..n-1] with the first n Fibonacci numbers (assumes n >= 2) */
    void cfib(int *x, int n)
    {
        int i;
        x[0] = 0; x[1] = 1;
        for (i = 2; i < n; i++)
            x[i] = x[i-1] + x[i-2];
    }

    /* Entry point for .Call(): allocate an integer vector of length nr */
    SEXP do_fibonacci(SEXP nr)
    {
        SEXP ans = PROTECT(NEW_INTEGER(INTEGER_VALUE(nr)));
        cfib(INTEGER_POINTER(ans), LENGTH(ans));
        UNPROTECT(1);
        return ans;
    }

20 Easy to call C for efficiency
    $ R CMD SHLIB fib.c
    gcc -std=gnu99 -I/home/deepayan/Rinstall/R-devel/lib/R/incl
    gcc -std=gnu99 -shared -L/usr/local/lib -o fib.so fib.o -L/

    > dyn.load("fib.so")
    > cfib10 = .Call("do_fibonacci", as.integer(10))
    > cfib10
     [1]  0  1  1  2  3  5  8 13 21 34

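21 Vectorized computation
The Fibonacci sequence has a closed-form expression as well.
    $F(n) = \frac{\phi^n - (1 - \phi)^n}{\sqrt{5}}$, where $\phi = \frac{1 + \sqrt{5}}{2}$

    > phi <- (1 + sqrt(5)) / 2
    > n <- 0:9
    > n
     [1] 0 1 2 3 4 5 6 7 8 9
    > (phi^n - (1 - phi)^n) / sqrt(5)
     [1]  0  1  1  2  3  5  8 13 21 34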
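
As a quick check that the two approaches agree, the sketch below compares the iterative function from the earlier slide with the closed form, evaluated on a whole vector of indices at once. The helper name `fib_closed` and the choice of 30 terms are assumptions made for illustration.

```r
# Iterative version, as defined earlier in the talk
fibonacci <- function(n) {
    x <- c(0, 1)
    while (length(x) < n) {
        x <- c(x, sum(tail(x, 2)))
    }
    x
}

# Closed-form version, vectorized over n (hypothetical helper name)
fib_closed <- function(n) {
    phi <- (1 + sqrt(5)) / 2
    (phi^n - (1 - phi)^n) / sqrt(5)
}

n <- 0:29
iterative <- fibonacci(length(n))
closed <- fib_closed(n)

# The closed form uses floating point, so compare with a tolerance
all.equal(iterative, closed)

# Rounding recovers exact whole numbers for moderate n
identical(iterative, round(closed))
```

The closed form needs no loop because `^`, `-` and `/` all operate elementwise on vectors, but for large n floating-point error would eventually make the rounded values unreliable.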

22 S is a programming language
• ...designed for interactive use
• ...with a focus on data analysis
• Basic data structures are vectors
• Large collection of statistical functions
• Advanced statistical graphics capabilities

24 Ideal for simulation studies
Example: How many heads when we toss a fair coin 500 times?
    > rbinom(1, size = 500, prob = 0.5)
    [1] 234
    > rbinom(10, size = 500, prob = 0.5)
    [1] ...

25 Ideal for simulation studies
    > x <- rbinom(1000, size = 500, prob = 0.5)
    > histogram(x)
[Figure: histogram of x; vertical axis: Percent of Total]

26 What if the coin is not fair?
    > x <- rbinom(1000, size = 500, prob = 0.1)
    > histogram(x)
[Figure: histogram of x; vertical axis: Percent of Total]
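
The two simulations above can also be checked against theory: a Binomial(n, p) count has mean np and standard deviation sqrt(np(1 - p)). The sketch below is one way to do this; the helper `sim_summary`, the seed, and the number of replications are assumptions for illustration, not part of the talk.

```r
# Summarize simulated head counts next to their theoretical values
sim_summary <- function(p, size = 500, reps = 10000) {
    heads <- rbinom(reps, size = size, prob = p)
    c(sim_mean = mean(heads), theory_mean = size * p,
      sim_sd = sd(heads), theory_sd = sqrt(size * p * (1 - p)))
}

set.seed(1)  # arbitrary seed, only for reproducibility

# One column per coin: fair (p = 0.5) and biased (p = 0.1)
res <- sapply(c(fair = 0.5, biased = 0.1), sim_summary)
round(res, 2)
```

With this many replications the simulated and theoretical rows should agree closely, which is the kind of sanity check a simulation study usually starts with.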

27 Statistical models
• S can fit all kinds of statistical models
• A brief example:
    • Michelson's 1879 experiment on measuring the speed of light
    • Measurements of the speed of light in air
    • Made between 5th June and 2nd July, 1879
    • Five experiments, each consisting of 20 consecutive runs
    • Speed recorded in km/s, less 299,000
    • The currently accepted value on this scale is 734.5

29 The Michelson data
[Table: columns Speed, Run, Expt]

30 The R approach: A typical analysis
    > data(michelson, package = "MASS")
    > michelson
        Speed Run Expt
    ...

31
    > str(michelson)
    'data.frame':   100 obs. of  3 variables:
     $ Speed: int
     $ Run  : Factor w/ 20 levels "1","2","3","4",..:
     $ Expt : Factor w/ 5 levels "1","2","3","4",..:

32
    > bwplot(Speed ~ Expt, data = michelson, aspect = 1)
[Figure: box-and-whisker plots of Speed for each Expt]

33 Fitting models
Model: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$, where
    $y_{ij}$ = measured speed of light
    $\mu$ = overall mean (actual speed)
    $\alpha_i$ = additive effect of the $i$-th experiment
    $\varepsilon_{ij} \sim N(0, \sigma^2)$ = error
To fit this model in R, use
    > fm <- lm(Speed ~ Expt, data = michelson)
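
To see how the fitted coefficients relate to the model terms, the sketch below fits the same kind of one-way model to simulated data. The data frame `sim`, its effect sizes, and the seed are made up for illustration; they stand in for the Michelson data so that the example does not depend on the MASS package.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical stand-in for the Michelson layout: 5 experiments x 20 runs
sim <- data.frame(
    Expt = factor(rep(1:5, each = 20)),
    Speed = 850 + rep(c(60, 0, -5, -30, -20), each = 20) + rnorm(100, sd = 75)
)

fm_sim <- lm(Speed ~ Expt, data = sim)
coef(fm_sim)

# With R's default treatment contrasts, the intercept is the mean of
# experiment 1 and each remaining coefficient is a difference from it
group_means <- sapply(split(sim$Speed, sim$Expt), mean)
group_means
all.equal(unname(coef(fm_sim)[1]), unname(group_means[1]))
all.equal(unname(coef(fm_sim)[-1]), unname(group_means[-1] - group_means[1]))
```

Note that this parametrization differs slightly from the slide's notation: R reports the first group's mean rather than an overall mean, but the fitted values are the same.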

34 Hypothesis testing (ANOVA)
Test whether experiments are different (should not be)
    > anova(fm)
    Analysis of Variance Table

    Response: Speed
              Df Sum Sq Mean Sq F value Pr(>F)
    Expt
    Residuals

35
    > xyplot(Speed ~ Run | Expt, data = michelson, layout = c(2, 3))
[Figure: Speed against Run, one panel per Expt]

36 Summary
The S approach is to work with objects.
• Model fits produce objects, usually stored as variables
• Queried interactively for further analysis
    > anova(fm)
    > summary(fm)
    > residuals(fm)
• Use graphics as a natural component of workflow
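
The object-based workflow summarized above can be sketched with any model fit. The example below uses the `cars` data set that ships with R purely as a stand-in (it is not used in the talk), and the variable names `fit` and `s` are illustrative.

```r
# 'cars' ships with R (package datasets); used here only as a stand-in
fit <- lm(dist ~ speed, data = cars)

class(fit)              # the fit is an object of class "lm"
names(fit)              # components stored inside the object

# Generic functions extract or compute further quantities from the object
coef(fit)
head(residuals(fit))

s <- summary(fit)       # the summary is itself an object
s$r.squared

anova(fit)
```

Because results are objects rather than printed reports, later steps (plots, tests, further models) can reuse them without refitting.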

37 Even graphics is programmable
Globular cluster data: data/globclus_prop.dat
    > GlobClust <- read.table("data/globclus_prop.dat", header = TRUE)
    > xyplot(Mv ~ r.tidal, GlobClust, grid = TRUE)
[Figure: scatter plot of Mv against r.tidal]

38 Even graphics is programmable
    > xyplot(Mv ~ r.tidal, GlobClust, grid = TRUE,
             scales = list(x = list(log = 2)))
[Figure: scatter plot of Mv against r.tidal, with r.tidal on a log (base 2) scale]

39 Adding a regression line
    > xyplot(Mv ~ r.tidal, GlobClust,
             scales = list(x = list(log = 2)),
             panel = function(x, y, ...) {
                 panel.grid(h = -1, v = -1)
                 panel.points(x, y, ...)
                 panel.abline(lm(y ~ x), col = "black")
             })
[Figure: Mv against r.tidal (log base 2 scale) with fitted regression line]

40
• Doesn't seem to be a good fit
• One solution: use locally linear fits
• At each location, fit regression line based on nearby points
• A generalization of this idea is known as LOESS
[Figure: Mv against r.tidal (log base 2 scale)]
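
The idea of a locally linear fit can be sketched in a few lines of base R. The example below uses simulated data rather than the globular cluster data; the helper `local_linear`, the span of 0.2, and the seed are assumptions for illustration. It compares a hand-rolled local fit with `loess()` from the stats package, which additionally weights nearby points by distance.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Simulated data with a clearly nonlinear trend (not the globular cluster data)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

# Local linear fit at x0: regress on the nearest fraction 'span' of points
local_linear <- function(x0, x, y, span = 0.2) {
    k <- ceiling(span * length(x))
    nearest <- order(abs(x - x0))[seq_len(k)]
    fit <- lm(y ~ x, data = data.frame(x = x[nearest], y = y[nearest]))
    unname(predict(fit, newdata = data.frame(x = x0)))
}

grid <- seq(1, 9, by = 0.5)
by_hand <- sapply(grid, local_linear, x = x, y = y)

# loess() builds on the same idea, with distance-based weights
lo <- loess(y ~ x, span = 0.2, degree = 1)
by_loess <- predict(lo, newdata = data.frame(x = grid))

# A single global line cannot follow the curve; both local fits can
global <- predict(lm(y ~ x), newdata = data.frame(x = grid))
round(cbind(grid, truth = sin(grid), global, by_hand, by_loess), 2)
```

The span controls the trade-off: a larger neighbourhood gives a smoother but more biased curve, a smaller one follows the data more closely but is noisier.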

43
    > xyplot(Mv ~ r.tidal, GlobClust,
             scales = list(x = list(log = 2)),
             panel = function(x, y, ...) {
                 panel.grid(h = -1, v = -1)
                 panel.points(x, y, ...)
                 panel.loess(x, y, col = "black")
             })
[Figure: Mv against r.tidal (log base 2 scale) with fitted LOESS curve]

44
• Smooth curve denotes fitted LOESS model
• Estimates average acceleration as function of distance
• But how confident are we about the fitted estimate?
[Figure: Mv against r.tidal (log base 2 scale) with fitted LOESS curve]

45 One approach: Bootstrap
• Resample with replacement to mimic underlying randomness
• Use same modeling process to obtain estimate
• Repeat many times to assess variability
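
The three steps above can be sketched without any graphics, by bootstrapping the LOESS fitted value at a single location. The data are simulated, and the helper `fit_at`, the location `x0`, the span, the seed, and the 500 replications are all assumptions made for illustration.

```r
set.seed(2013)  # arbitrary seed for reproducibility

# Simulated data (an assumption; not the globular cluster data)
n <- 150
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

x0 <- 5  # location at which to assess the fitted curve

# Fit a LOESS curve and return its value at x0
fit_at <- function(x, y, x0) {
    predict(loess(y ~ x, span = 0.5), newdata = data.frame(x = x0))
}

estimate <- fit_at(x, y, x0)

# Resample rows with replacement, refit, and record the estimate each time
boot <- replicate(500, {
    id <- sample(seq_len(n), replace = TRUE)
    fit_at(x[id], y[id], x0)
})

# The spread of the bootstrap estimates approximates sampling variability
c(estimate = unname(estimate), boot_se = sd(boot))
quantile(boot, c(0.025, 0.975))  # simple percentile interval
```

Resampling (x, y) pairs together, rather than x and y separately, keeps the relationship between the two variables intact in each bootstrap sample.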

46
    > xyplot(Mv ~ r.tidal, GlobClust,
             scales = list(x = list(log = 2)),
             panel = function(x, y, ...) {
                 panel.grid(h = -1, v = -1)
                 n <- length(x)
                 for (i in 1:1000) {
                     bs.id <- sample(1:n, replace = TRUE) ## SRSWR
                     panel.loess(x[bs.id], y[bs.id],
                                 col = "red", alpha = 0.02)
                 }
                 panel.points(x, y, ...)
                 panel.loess(x, y, col = "black")
             })

47
[Figure: Mv against r.tidal (log base 2 scale) with LOESS curve and bootstrap LOESS curves]

48
Powerful built-in tools + Programming language ⇒ Flexibility

49 Dissemination of research
• Rapid prototyping: quick implementation of research ideas
• Well-structured packaging system allows dissemination
• CRAN: Comprehensive R Archive Network: > 3000 packages
• Other specialized collections (Bioconductor, Omegahat)

50 Sweave: reproducible documents
• Inspired by literate documents (Knuth)
• Enables mixing of R code and LaTeX / HTML (etc.)
• Source file reproduces both analysis and report
• Reproducible research + convenience
• rnw/bootstrap.rnw

52 Summary
• R is a feature-rich interactive language + environment ideally suited to data analysis as well as other kinds of numerical computations
• Some learning required before it can be used effectively
• Typical mind-blocks for newcomers:
    • R is not C! More like LISP / Scheme
    • Vectorization (easy to get past with a little experience)
    • Functional approach to programming
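
The mind-blocks listed above are easiest to see side by side. The sketch below contrasts a C-style loop with its vectorized equivalent and shows two small examples of the functional style; all names (`squares_loop`, `make_power`, `cube`) are made up for illustration.

```r
x <- c(1.5, 2.0, 3.5, 4.0)

# C-style: an explicit loop over indices
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
    squares_loop[i] <- x[i]^2
}

# Vectorized: arithmetic applies to whole vectors at once
squares_vec <- x^2
identical(squares_loop, squares_vec)

# Functional style: functions are values that can be passed to other functions
sapply(list(a = 1:3, b = 4:10), mean)

# Functions can also build and return other functions (closures)
make_power <- function(p) function(v) v^p
cube <- make_power(3)
cube(x)
```

The loop is not wrong, but the vectorized form is shorter and usually faster, and the closure example is the kind of construct that makes R feel closer to Scheme than to C.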

54 Questions?

R: A Free Software Project in Statistical Computing

R: A Free Software Project in Statistical Computing R: A Free Software Project in Statistical Computing Achim Zeileis Institut für Statistik & Wahrscheinlichkeitstheorie http://www.ci.tuwien.ac.at/~zeileis/ Acknowledgments Thanks: Alex Smola & Machine Learning

More information

R Language Fundamentals

R Language Fundamentals R Language Fundamentals Data Types and Basic Maniuplation Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Where did R come from? Overview Atomic Vectors Subsetting

More information

Silvia Liverani. Department of Statistics University of Warwick. CSC, 24th April 2008. R: A programming environment for Data. Analysis and Graphics

Silvia Liverani. Department of Statistics University of Warwick. CSC, 24th April 2008. R: A programming environment for Data. Analysis and Graphics : A Department of Statistics University of Warwick CSC, 24th April 2008 Outline 1 2 3 4 5 6 What do you need? Performance Functionality Extensibility Simplicity Compatability Interface Low-cost Project

More information

Simple Linear Regression Inference

Simple Linear Regression Inference Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

More information

Testing for Lack of Fit

Testing for Lack of Fit Chapter 6 Testing for Lack of Fit How can we tell if a model fits the data? If the model is correct then ˆσ 2 should be an unbiased estimate of σ 2. If we have a model which is not complex enough to fit

More information

Lecture 8. Confidence intervals and the central limit theorem

Lecture 8. Confidence intervals and the central limit theorem Lecture 8. Confidence intervals and the central limit theorem Mathematical Statistics and Discrete Mathematics November 25th, 2015 1 / 15 Central limit theorem Let X 1, X 2,... X n be a random sample of

More information

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a

More information

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F Random and Mixed Effects Models (Ch. 10) Random effects models are very useful when the observations are sampled in a highly structured way. The basic idea is that the error associated with any linear,

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Linear Models in R Regression Regression analysis is the appropriate

More information

An Introduction to the Use of R for Clinical Research

An Introduction to the Use of R for Clinical Research An Introduction to the Use of R for Clinical Research Dimitris Rizopoulos Department of Biostatistics, Erasmus Medical Center d.rizopoulos@erasmusmc.nl PSDM Event: Open Source Software in Clinical Research

More information

Univariate Regression

Univariate Regression Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

More information

Fitting Subject-specific Curves to Grouped Longitudinal Data

Fitting Subject-specific Curves to Grouped Longitudinal Data Fitting Subject-specific Curves to Grouped Longitudinal Data Djeundje, Viani Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh, EH14 4AS, UK E-mail: vad5@hw.ac.uk Currie,

More information

How to calculate an ANOVA table

How to calculate an ANOVA table How to calculate an ANOVA table Calculations by Hand We look at the following example: Let us say we measure the height of some plants under the effect of different fertilizers. Treatment Measures Mean

More information

Introduction to Hierarchical Linear Modeling with R

Introduction to Hierarchical Linear Modeling with R Introduction to Hierarchical Linear Modeling with R 5 10 15 20 25 5 10 15 20 25 13 14 15 16 40 30 20 10 0 40 30 20 10 9 10 11 12-10 SCIENCE 0-10 5 6 7 8 40 30 20 10 0-10 40 1 2 3 4 30 20 10 0-10 5 10 15

More information

GETTING STARTED WITH R AND DATA ANALYSIS

GETTING STARTED WITH R AND DATA ANALYSIS GETTING STARTED WITH R AND DATA ANALYSIS [Learn R for effective data analysis] LEARN PRACTICAL SKILLS REQUIRED FOR VISUALIZING, TRANSFORMING, AND ANALYZING DATA IN R One day course for people who are just

More information

Outline. Dispersion Bush lupine survival Quasi-Binomial family

Outline. Dispersion Bush lupine survival Quasi-Binomial family Outline 1 Three-way interactions 2 Overdispersion in logistic regression Dispersion Bush lupine survival Quasi-Binomial family 3 Simulation for inference Why simulations Testing model fit: simulating the

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

 Y. Notation and Equations for Regression Lecture 11/4. Notation: Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through

More information

Introduction to Analysis of Variance (ANOVA) Limitations of the t-test

Introduction to Analysis of Variance (ANOVA) Limitations of the t-test Introduction to Analysis of Variance (ANOVA) The Structural Model, The Summary Table, and the One- Way ANOVA Limitations of the t-test Although the t-test is commonly used, it has limitations Can only

More information

Geostatistics Exploratory Analysis

Geostatistics Exploratory Analysis Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

Using Open Source Software to Teach Mathematical Statistics p.1/29

Using Open Source Software to Teach Mathematical Statistics p.1/29 Using Open Source Software to Teach Mathematical Statistics Douglas M. Bates bates@r-project.org University of Wisconsin Madison Using Open Source Software to Teach Mathematical Statistics p.1/29 Outline

More information

How do most businesses analyze data?

How do most businesses analyze data? Marilyn Monda, MA, MBB Say hello to R! And say good bye to expensive stats software R Course 201311 1 How do most businesses analyze data? Excel??? Calculator?? Homegrown analysis packages?? Statistical

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is

More information

The Normal distribution

The Normal distribution The Normal distribution The normal probability distribution is the most common model for relative frequencies of a quantitative variable. Bell-shaped and described by the function f(y) = 1 2σ π e{ 1 2σ

More information

Psychology 205: Research Methods in Psychology

Psychology 205: Research Methods in Psychology Psychology 205: Research Methods in Psychology Using R to analyze the data for study 2 Department of Psychology Northwestern University Evanston, Illinois USA November, 2012 1 / 38 Outline 1 Getting ready

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Research Methods & Experimental Design

Research Methods & Experimental Design Research Methods & Experimental Design 16.422 Human Supervisory Control April 2004 Research Methods Qualitative vs. quantitative Understanding the relationship between objectives (research question) and

More information

Normal Distribution. Definition A continuous random variable has a normal distribution if its probability density. f ( y ) = 1.

Normal Distribution. Definition A continuous random variable has a normal distribution if its probability density. f ( y ) = 1. Normal Distribution Definition A continuous random variable has a normal distribution if its probability density e -(y -µ Y ) 2 2 / 2 σ function can be written as for < y < as Y f ( y ) = 1 σ Y 2 π Notation:

More information

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation

More information

Section 13, Part 1 ANOVA. Analysis Of Variance

Section 13, Part 1 ANOVA. Analysis Of Variance Section 13, Part 1 ANOVA Analysis Of Variance Course Overview So far in this course we ve covered: Descriptive statistics Summary statistics Tables and Graphs Probability Probability Rules Probability

More information

Final Exam Practice Problem Answers

Final Exam Practice Problem Answers Final Exam Practice Problem Answers The following data set consists of data gathered from 77 popular breakfast cereals. The variables in the data set are as follows: Brand: The brand name of the cereal

More information

Difference of Means and ANOVA Problems

Difference of Means and ANOVA Problems Difference of Means and Problems Dr. Tom Ilvento FREC 408 Accounting Firm Study An accounting firm specializes in auditing the financial records of large firm It is interested in evaluating its fee structure,particularly

More information

Sample Size Calculation for Longitudinal Studies

Sample Size Calculation for Longitudinal Studies Sample Size Calculation for Longitudinal Studies Phil Schumm Department of Health Studies University of Chicago August 23, 2004 (Supported by National Institute on Aging grant P01 AG18911-01A1) Introduction

More information

Week 5: Multiple Linear Regression

Week 5: Multiple Linear Regression BUS41100 Applied Regression Analysis Week 5: Multiple Linear Regression Parameter estimation and inference, forecasting, diagnostics, dummy variables Robert B. Gramacy The University of Chicago Booth School

More information

Simulation Exercises to Reinforce the Foundations of Statistical Thinking in Online Classes

Simulation Exercises to Reinforce the Foundations of Statistical Thinking in Online Classes Simulation Exercises to Reinforce the Foundations of Statistical Thinking in Online Classes Simcha Pollack, Ph.D. St. John s University Tobin College of Business Queens, NY, 11439 pollacks@stjohns.edu

More information

Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software

Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software STATA Tutorial Professor Erdinç Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software 1.Wald Test Wald Test is used

More information

Consider a study in which. How many subjects? The importance of sample size calculations. An insignificant effect: two possibilities.

Consider a study in which. How many subjects? The importance of sample size calculations. An insignificant effect: two possibilities. Consider a study in which How many subjects? The importance of sample size calculations Office of Research Protections Brown Bag Series KB Boomer, Ph.D. Director, boomer@stat.psu.edu A researcher conducts

More information

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

More information

Week TSX Index 1 8480 2 8470 3 8475 4 8510 5 8500 6 8480

Week TSX Index 1 8480 2 8470 3 8475 4 8510 5 8500 6 8480 1) The S & P/TSX Composite Index is based on common stock prices of a group of Canadian stocks. The weekly close level of the TSX for 6 weeks are shown: Week TSX Index 1 8480 2 8470 3 8475 4 8510 5 8500

More information

Nonlinear Regression:

Nonlinear Regression: Zurich University of Applied Sciences School of Engineering IDP Institute of Data Analysis and Process Design Nonlinear Regression: A Powerful Tool With Considerable Complexity Half-Day : Improved Inference

More information

MATH 4470/5470 EXPLORATORY DATA ANALYSIS ONLINE COURSE SYLLABUS

MATH 4470/5470 EXPLORATORY DATA ANALYSIS ONLINE COURSE SYLLABUS MATH 4470/5470 EXPLORATORY DATA ANALYSIS ONLINE COURSE SYLLABUS COURSE DESCRIPTION Introduction to modern techniques in data analysis, including stem-and-leafs, box plots, resistant lines, smoothing and

More information

Chapter 7 Notes - Inference for Single Samples. You know already for a large sample, you can invoke the CLT so:

Chapter 7 Notes - Inference for Single Samples. You know already for a large sample, you can invoke the CLT so: Chapter 7 Notes - Inference for Single Samples You know already for a large sample, you can invoke the CLT so: X N(µ, ). Also for a large sample, you can replace an unknown σ by s. You know how to do a

More information

Part 2: Analysis of Relationship Between Two Variables

Part 2: Analysis of Relationship Between Two Variables Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable

More information

Introduction to the R Language

Introduction to the R Language Introduction to the R Language Functions Biostatistics 140.776 Functions Functions are created using the function() directive and are stored as R objects just like anything else. In particular, they are

More information

IAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results

IAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results IAPRI Quantitative Analysis Capacity Building Series Multiple regression analysis & interpreting results How important is R-squared? R-squared Published in Agricultural Economics 0.45 Best article of the

More information

Fairfield Public Schools

Fairfield Public Schools Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

More information

Financial Risk Models in R: Factor Models for Asset Returns. Workshop Overview

Financial Risk Models in R: Factor Models for Asset Returns. Workshop Overview Financial Risk Models in R: Factor Models for Asset Returns and Interest Rate Models Scottish Financial Risk Academy, March 15, 2011 Eric Zivot Robert Richards Chaired Professor of Economics Adjunct Professor,

More information

Quality Assurance for Graphics in R

Quality Assurance for Graphics in R New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

5 Correlation and Data Exploration

5 Correlation and Data Exploration 5 Correlation and Data Exploration Correlation In Unit 3, we did some correlation analyses of data from studies related to the acquisition order and acquisition difficulty of English morphemes by both

More information

5. Linear Regression

5. Linear Regression 5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4

More information

Package pesticides. February 20, 2015

Package pesticides. February 20, 2015 Type Package Package pesticides February 20, 2015 Title Analysis of single serving and composite pesticide residue measurements Version 0.1 Date 2010-11-17 Author David M Diez Maintainer David M Diez

More information

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

Introduction to Hypothesis Testing. Hypothesis Testing. Step 1: State the Hypotheses

Introduction to Hypothesis Testing. Hypothesis Testing. Step 1: State the Hypotheses Introduction to Hypothesis Testing 1 Hypothesis Testing A hypothesis test is a statistical procedure that uses sample data to evaluate a hypothesis about a population Hypothesis is stated in terms of the

More information

Package cpm. July 28, 2015

Package cpm. July 28, 2015 Package cpm July 28, 2015 Title Sequential and Batch Change Detection Using Parametric and Nonparametric Methods Version 2.2 Date 2015-07-09 Depends R (>= 2.15.0), methods Author Gordon J. Ross Maintainer

More information

Recursive Algorithms. Recursion. Motivating Example Factorial Recall the factorial function. { 1 if n = 1 n! = n (n 1)! if n > 1

Recursive Algorithms. Recursion. Motivating Example Factorial Recall the factorial function. { 1 if n = 1 n! = n (n 1)! if n > 1 Recursion Slides by Christopher M Bourke Instructor: Berthe Y Choueiry Fall 007 Computer Science & Engineering 35 Introduction to Discrete Mathematics Sections 71-7 of Rosen cse35@cseunledu Recursive Algorithms

More information

Ordinal Regression. Chapter

Ordinal Regression. Chapter Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

More information







































