Summary of R software commands used to generate bootstrap and permutation test output and figures in Chapter 16



Similar documents
Scatter Plots with Error Bars

Using SPSS, Chapter 2: Descriptive Statistics

4. Descriptive Statistics: Measures of Variability and Central Tendency

Data exploration with Microsoft Excel: analysing more than one variable

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

IBM SPSS Statistics for Beginners for Windows

5 Correlation and Data Exploration

If there is not a Data Analysis option under the DATA menu, you will need to install the Data Analysis ToolPak as an add-in for Microsoft Excel.

STEP TWO: Highlight the data set, then select DATA PIVOT TABLE

5. Correlation. Open HeightWeight.sav. Take a moment to review the data file.

Statistical Data analysis With Excel For HSMG.632 students

Exploratory data analysis (Chapter 2) Fall 2011

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

SPSS Manual for Introductory Applied Statistics: A Variable Approach

STT315 Chapter 4 Random Variables & Probability Distributions KM. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables

TIPS FOR DOING STATISTICS IN EXCEL

A Picture Really Is Worth a Thousand Words

Dongfeng Li. Autumn 2010

Introduction Course in SPSS - Evening 1

Data Analysis Tools. Tools for Summarizing Data

There are six different windows that can be opened when using SPSS. The following will give a description of each of them.

Using Excel in Research. Hui Bian Office for Faculty Excellence

Name: Date: Use the following to answer questions 2-3:

Exercise 1.12 (Pg )

GeoGebra. 10 lessons. Gerrit Stols

Using Excel for descriptive statistics

Data exploration with Microsoft Excel: univariate analysis

Gamma Distribution Fitting

Gestation Period as a function of Lifespan

Chapter 5 Analysis of variance SPSS Analysis of variance

Psychology 205: Research Methods in Psychology

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences.

Estimating a market model: Step-by-step Prepared by Pamela Peterson Drake Florida Atlantic University

Bill Burton Albert Einstein College of Medicine April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

Directions for Frequency Tables, Histograms, and Frequency Bar Charts

Figure 1. An embedded chart on a worksheet.

IBM SPSS Statistics 20 Part 1: Descriptive Statistics

Statgraphics Getting started

Variables. Exploratory Data Analysis

SPSS Workbook 1 Data Entry : Questionnaire Data

SPSS Resources. 1. See website (readings) for SPSS tutorial & Stats handout

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

An introduction to using Microsoft Excel for quantitative data analysis

Using Excel for Statistical Analysis

Multiple Regression. Page 24

Information Technology Services will be updating the mark sense test scoring hardware and software on Monday, May 18, We will continue to score

Chapter 4 Displaying and Describing Categorical Data

PloneSurvey User Guide (draft 3)

Advanced Excel for Institutional Researchers

Cell Phone Impairment?

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Drawing a histogram using Excel

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

3.4 The Normal Distribution

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Describing, Exploring, and Comparing Data

Chapter 7 Section 7.1: Inference for the Mean of a Population

GeoGebra Statistics and Probability

Using MS Excel to Analyze Data: A Tutorial

January 26, 2009 The Faculty Center for Teaching and Learning

Statistics Revision Sheet Question 6 of Paper 2

containing Kendall correlations; and the OUTH = option will create a data set containing Hoeffding statistics.

Introduction to Statistical Computing in Microsoft Excel By Hector D. Flores; and Dr. J.A. Dobelman

Simulating Investment Portfolios

6 3 The Standard Normal Distribution

Dealing with Data in Excel 2010

Below is a very brief tutorial on the basic capabilities of Excel. Refer to the Excel help files for more information.

How To Run Statistical Tests in Excel

The Normal Distribution

Descriptive Statistics

SECTION 2-1: OVERVIEW SECTION 2-2: FREQUENCY DISTRIBUTIONS

Directions for using SPSS

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol

AMS 7L LAB #2 Spring, Exploratory Data Analysis

Package SHELF. February 5, 2016

Simple Predictive Analytics Curtis Seare

Tutorial Segmentation and Classification

AMATH 352 Lecture 3 MATLAB Tutorial Starting MATLAB Entering Variables

General instructions for the content of all StatTools assignments and the use of StatTools:

Excel 2003: Ringtones Task

Using Microsoft Excel to Plot and Analyze Kinetic Data

STAB22 section 1.1. total = 88(200/100) + 85(200/100) + 77(300/100) + 90(200/100) + 80(100/100) = = 837,

Introduction to R and Exploratory data analysis

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Session 10. Laboratory Works

SPSS Explore procedure

Linear Models in STATA and ANOVA

GETTING YOUR DATA INTO SPSS

Beginner s Matlab Tutorial

Multivariate Analysis of Ecological Data

Chapter 7 Section 1 Homework Set A

Getting started in Excel

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

When to use Excel. When NOT to use Excel 9/24/2014

EXST SAS Lab Lab #4: Data input and dataset modifications

Microsoft Excel Tutorial

Transcription:

Summary of R software commands used to generate bootstrap and permutation test output and figures in Chapter 16 Since R is command line driven and the primary software of Chapter 16, this document details all commands used to generate Chapter 16 s output and figures. To help familiarize yourself with the software, we suggest that you try to run these commands and compare your output to those in the chapter before trying to tackle the chapter exercises. Below is a screenshot of R for Windows. To utilize functions from an R package, you need to

load the package. We ll assume you already have installed the boot package, so loading it involves the following set of clicks starting with the Packages item on the R menu. Packages > Load Package > boot Once you have the package loaded, you are now able to utilize the functions within this package. Figure 16.1 First we need to load in the data set into R. R has a built in function read.csv that reads.csv files into R. We recommend that all Excel files to be read into R be saved as a.csv file. Once that is done, the following commands read the file into the data matrix bc and then attach it so that the columns can be referred to by their column names rather than referring to everything using rows and columns of bc. bc <- read.csv("u:/ips8/datasets/chapter16/time50.csv") attach(bc) The variable of interest here is Time. To get the mean and median, we can simply type the following commands. The output is shown below each command. Some of these basic summary statistic commands are helpful in checking that the data set has been read in correctly. mean(time) [1] 23.26 median(time) [1] 12 The next set of commands generates the histogram and Normal quantile plot of Figure 16.1. The default histogram in R is one that displays the counts. To get the percentages, we alter the histogram heights to be the percentages and then use the plot command to create the histogram with appropriate axis labels. The Normal quantile plot utilizes the qqnorm function. h <- hist(time) h$density = h$counts/sum(h$counts) plot(h,freq=f,main="",xlab="time (in days)",ylab="percent",las=1) qqnorm(time,main="",xlab="normal score",ylab="times (in days)")

Figures 16.3 and 16.5 These are the first figures that use a function from the boot package. The boot function requires the sample statistic of interest be defined as an R function, so the first command defines the mean as an R function called theta. The second command requests that 3000 bootstrap sample means be generated from the variable Time. These 3000 sample means are saved in the data object named time.boot. By simply typing time.boot, you get a summary of the bootstrapping. That is what is shown in Figure 16.5. theta <- function(data,indices) { d <- data[indices] mean(d) time.boot <- boot(time,theta,r=3000) time.boot Figure 16.3 is generated by the following set of commands. By requesting prob=t in the hist function, a density histogram (area of the histograms bars sums to 1) is generated. The Normal curve based on the 3000 generated samples is then drawn (need the mean and standard deviation of the bootstrap sample), followed by vertical lines representing the sample mean and bootstrap sample mean. For the Normal quantile plot, we also draw the line the observations should follow. hist(time.boot$t,prob=t,axes=f,main="",xlab="mean times of resamples (in days)",ylab="") axis(1) jcc <- dnorm(seq(12,40,length=500),mean(time.boot$t),sqrt(var(time.boot$t))) lines(seq(12,40,length=500),jcc) lines(c(mean(time),mean(time)),c(0,dnorm(mean(time),mean(time.boot$t),sqrt(var(ti me.boot$t)))),col=2) afc <- mean(time.boot$t) lines(c(afc,afc),c(0,dnorm(afc,mean(time.boot$t),sqrt(var(time.boot$t)))),lty=2) qqnorm(time.boot$t,main="",xlab="normal score",ylab="mean times of resamples (in days)") qqline(time.boot$t, col = 2,lwd=1,lty=1)

A crude version of this plot can also be generated using the plot function. That command is simply plot(time.boot) Figures 16.6 and 16.7 For this figure, we create our variable with the data rather than reading in a data set. This is only a viable option when the data set is small. The rest of the commands are similar to those used with the starting time data set. stout <- c(94,46,78,66,140,24) theta <- function(data,indices) { d <- data[indices] mean(d) stout.boot <- boot(stout,theta,r=3000) stout.boot plot(stout.boot) Figure 16.8 This figure is similar to Figure 16.1 so the commands are similar. If you compare these commands to those for Figure 16.1, only the variable names and data set names have changed. gpa <- read.csv("u:/ips8/datasets/chapter16/gpa.csv") attach(gpa) h = hist(gpa,nclass=20) h$density = h$counts/sum(h$counts) plot(h,freq=f,main="",xlab="gpa",ylab="percent",las=1) qqnorm(gpa,main="",xlab="normal score",ylab="gpa") Figure 16.9 and accompanying output For this figure and output, we are using the trimmed mean statistic. This is available within the mean function. This means we need to tweak the function theta that will be used within the boot function. The rest of the commands are similar to those used for Figure 16.3. mean(gpa, trim=0.25)

theta <- function(data,indices) { d <- data[indices] mean(d,trim=0.25) gpa.boot <- boot(gpa,theta,r=3000) hist(gpa.boot$t,prob=t,axes=f,main="",xlab="means of resamples",ylab="",ylim=c(0,max(dnorm(mean(gpa,trim=0.25),mean(gpa.boot$t),sqrt(v ar(gpa.boot$t)))))) axis(1) jcc <- dnorm(seq(2.5,3.5,length=500),mean(gpa.boot$t),sqrt(var(gpa.boot$t))) lines(seq(2.5,3.5,length=500),jcc) lines(c(mean(gpa,trim=0.25),mean(gpa,trim=0.25)),c(0,dnorm(mean(gpa,trim=0.25),m ean(gpa.boot$t),sqrt(var(gpa.boot$t)))),col=2) afc <- mean(gpa.boot$t) lines(c(afc,afc),c(0,dnorm(afc,mean(gpa.boot$t),sqrt(var(gpa.boot$t)))),lty=2) qqnorm(gpa.boot$t,main="",xlab="normal score",ylab="means of resamples") qqline(gpa.boot$t, col = 2,lwd=1,lty=1) Example 16.6 To generate the values that appear in the table, we use the length, mean, and var functions. Since sex is coded 1=Male and 2=Female, we can restrict attention to GPA scores of just one sex by adding the sex specification within brackets. mean(gpa[sex==1]) sqrt(var(gpa[sex==1])) length(gpa[sex==1]) mean(gpa[sex==2]) sqrt(var(gpa[sex==2])) length(gpa[sex==2])

Figure 16.10 A function from another package is needed to generate the density curves. The first command loads this package. This command works only if the package has already been installed. The function is called sm.density.compare. This function allows you to place a legend anywhere on the plot using a click of the mouse. The other commands set up the legend for this example. Make sure you click on the figure to place the legend. Until you do this, you will not get another command prompt. The last two commands generate the Normal quantile plot. The first plots the data for the males (sex=1) and the last two add the data for the females (sex=2) to this plot. library(sm) # create value labels cyl.f <- factor(sex, levels= c(1,2), labels = c("1 Male", "2 Female")) # plot densities sm.density.compare(gpa, sex, xlab="gpa",xlim=c(0,4)) # add legend via mouse click colfill<-c(2:(2+length(levels(cyl.f)))) legend(locator(1), levels(cyl.f), fill=colfill) qqnorm(gpa[sex==1],main="",xlab="normal score",ylab="gpa",col=2) bcc <- qqnorm(gpa[sex==2],main="",xlab="normal score",ylab="gpa",plot=f) points(bcc$x,bcc$y,col=3) Figure 16.11 and accompanying output The commands for this figure and output are similar to those used for the trimmed mean (Figure 16.9). The key is defining the function to be used with the boot function. Rather than call this function theta, we call it meandiff here to describe the function. Because this example compares the means of two groups, the function involves two columns from the gpa matrix. Column 2 contains the GPAs and Column 9 contains the sex variable.

meandiff <- function(x, w){ y <- tapply(x[w,2], x[w,9], mean) y[1]-y[2] gpa1.boot <- boot(gpa,meandiff,r=3000,strata=sex) gpa1.boot hist(gpa1.boot$t,prob=t,axes=f,main="",xlab="difference in means of resamples",ylab="",ylim=c(0,max(dnorm(mean(gpa1.boot$t),mean(gpa1.boot$t),sqrt(va r(gpa1.boot$t)))))) axis(1) jcc <- dnorm(seq(-0.6,0.4,length=500),mean(gpa1.boot$t),sqrt(var(gpa1.boot$t))) lines(seq(-0.6,0.4,length=500),jcc) lines(c(-0.1490259,-0.1490259),c(0,dnorm(- 0.1490259,mean(gpa1.boot$t),sqrt(var(gpa1.boot$t)))),col=2) afc <- mean(gpa1.boot$t) lines(c(afc,afc),c(0,dnorm(afc,mean(gpa1.boot$t),sqrt(var(gpa1.boot$t)))),lty=2) qqnorm(gpa1.boot$t,main="",xlab="normal score",ylab="difference in means of resamples") qqline(gpa1.boot$t, col = 2,lwd=1,lty=1) Example 16.8 The R command to get the 2.5 and 97.5 percentiles uses the quantile function. You specify the variable containing the bootstrap sample statistics and then a vector containing the percentiles. quantile(gpa.boot$t,c(.025,.975)) Example 16.9 and Figures 16.17 and 16.18 Similar to Figure 16.11 and accompanying output, which looked at the difference in the means of two groups, the function to be used in boot here is a bit more complicated. The function ratvar involves the second column, which contains the GPAs, and the ninth column, which contains the gender information. Rather than take a difference in means here, the statistic is a

ratio of variances. The other addition here is we utilize the function boot.ci to obtain the various confidence intervals based on our bootstrapping results. The remaining commands create the histogram in Figure 6.17. ratvar <- function(x, w) { y <- tapply(x[w,2], x[w,9], var) y[1]/y[2] gpa2.boot <- boot(gpa,ratvar,r=5000,strata=sex) boot.ci(gpa2.boot) hist(gpa2.boot$t,prob=t,axes=f,main="",xlab="ratio of variances (male to female) of resamples",ylab="") axis(1) jcc <- dnorm(seq(0.5,3,length=500),mean(gpa2.boot$t),sqrt(var(gpa2.boot$t))) lines(seq(.5,3,length=500),jcc) lines(c(var(gpa[sex==1])/var(gpa[sex==2]), var(gpa[sex==1])/var(gpa[sex==2])),c(0,dnorm(var(gpa[sex==1])/var(gpa[sex==2]),me an(gpa2.boot$t),sqrt(var(gpa2.boot$t)))),col=2) afc <- mean(gpa2.boot$t) lines(c(afc,afc),c(0,dnorm(afc,mean(gpa2.boot$t),sqrt(var(gpa2.boot$t)))),lty=2) Example 16.10 and Figure 16.20 For this example, we first need to read in the data set and attach it so we can refer to variables by their column names. After that, we define the function to be used for the bootstrapping. In this case, it is the correlation between the rating, which appears in column 3 of the data set, and the price, which appears in column 4. With this function defined, we then call the boot function to do the bootstrapping. The remainder of the commands generates the histogram and Normal quantile plot that we ve seen before. bc <- read.csv("u:/ips8/datasets/ch16/laundry.csv") attach(bc)

theta <- function(data,indices){ d <- data[indices,] cov(d[,3],d[,4])/sqrt(var(d[,3])*var(d[,4])) corr.boot <- boot(bc,theta,r=5000) hist(corr.boot$t,prob=t,axes=f,main="",xlab="correlation coefficient of resamples",ylab="") axis(1) jcc <- dnorm(seq(0.25,.995,length=500),mean(corr.boot$t),sqrt(var(corr.boot$t))) lines(seq(.25,.995,length=500),jcc) lines(c(.6707535,.6707535),c(0,dnorm(.6707535,mean(corr.boot$t),sqrt(var(corr.boot$t )))),col=2) afc <- mean(corr.boot$t) lines(c(afc,afc),c(0,dnorm(afc,mean(corr.boot$t),sqrt(var(corr.boot$t)))),lty=2) qqnorm(corr.boot$t,main="",xlab="normal score",ylab="correlation coefficient") qqline(corr.boot$t, col = 2,lwd=1,lty=1) Software output of Example 16.12 For this example, we again enter the data into two variables directly rather than reading in a file. To do a permutation test, a different package is needed. The first command loads the perm package (assuming is has already been installed). The second and third commands enter the data. The function we need is called permts and is controlled by a set of rules. Here we use the default rules except for the number of permutations, which we set to 5000. library(perm) trtgrp <- c(24,43,58,71,43,49,61,44,67,49,53,56,59,52,62,54,57,33,46,43,57) ctrlgp <- c(42,43,55,26,62,37,33,41,19,54,20,85,46,10,17,60,53,42,37,42,55,28,48) rules <- permcontrol(nmc=5000) permts(trtgrp,ctrlgp,alternative="greater",method="exact.mc",control=rules)

Example 16.13 For this example we compare the male and female GPAs using a permutation test. The commands will be very similar to those for Example 16.12. First the data set is read in. Then we call the function permts using the sex== within brackets to separate the GPAs by gender. bc <- read.csv("u:/ips8/datasets/ch16/gpa.csv") attach(bc) rules <- permcontrol(nmc=5000) permts(gpa[sex==1],gpa[sex==2],method="exact.mc",control=rules)