Bootstrap Example and Sample Code




U.C. Berkeley Stat 135: Concepts of Statistics

1 Bootstrap Example

This section demonstrates how the bootstrap can be used to generate confidence intervals. Suppose we have a sample of data from an exponential distribution with parameter λ, i.e. our data come from a distribution with density f(x | λ) = λe^(−λx). Recall that the MLE is λ̂ = 1/X̄ and that λ̂ has asymptotic variance λ²/n. Since the MLE is asymptotically normally distributed, we can form a (1 − α)100% confidence interval for λ by plugging the MLE into the asymptotic variance and taking

[λ̂ − z(1 − α/2) λ̂/√n,  λ̂ + z(1 − α/2) λ̂/√n].

Here is an example with R code generating the sample, calculating the MLE, and computing a 95% confidence interval for λ:

my.test.data <- rexp(100, 3)
mean(my.test.data)
#[1] 0.30125  (one example of what might happen)
1/mean(my.test.data)
#[1] 3.32
empirical.lambda <- 1/mean(my.test.data)
# Normal-theory quantiles:
CIlow <- empirical.lambda - qnorm(0.975)*empirical.lambda/sqrt(100)
CIhigh <- empirical.lambda + qnorm(0.975)*empirical.lambda/sqrt(100)
#[2.67, 3.97]

We generated a sample of size 100 from the exponential distribution with parameter λ = 3. The mean of our data set is 0.301, so our maximum likelihood estimate of λ is 3.32. Since the MLE is asymptotically normal, we can construct a confidence interval using our estimate of the asymptotic variance of the MLE and the quantiles of the standard normal distribution. In this case, we get [2.67, 3.97] as our confidence interval.

But suppose we were not working with the MLE, or suppose we did not want to rely on the asymptotic distribution to form our confidence intervals. Then we could use the bootstrap to estimate the distribution of λ̂ and create bootstrap confidence intervals for λ.
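The normal-theory interval above can be packaged as a small helper function. This is a sketch of my own, not from the original notes; the function name exp.mle.ci is made up.

```r
# Sketch (not from the notes): normal-theory CI for the exponential rate.
# Uses the MLE lambda.hat = 1/mean(x) and asymptotic sd lambda.hat/sqrt(n).
exp.mle.ci <- function(x, conf = 0.95) {
  lambda.hat <- 1 / mean(x)
  z <- qnorm(1 - (1 - conf) / 2)
  half.width <- z * lambda.hat / sqrt(length(x))
  c(lower = lambda.hat - half.width, upper = lambda.hat + half.width)
}

set.seed(1)
ci <- exp.mle.ci(rexp(100, 3))
ci  # an interval that should cover lambda = 3 most of the time
```

Repeating the last two lines with different seeds shows the interval covering λ = 3 in roughly 95% of repetitions, as the coverage statement promises.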
First, we form our set of bootstrap estimates of the parameter by generating B random samples of size n = 100 from the exponential distribution with parameter λ̂ and using each of these samples to compute a new estimate of the model parameter, λ̂(b), for b = 1, ..., B.

In general, we don't necessarily know anything about the distribution of the bootstrap estimates λ̂(b). But although we may not know the shape of this distribution, we can calculate its quantiles q(α): q(α) is the bootstrap estimate such that a fraction α of the bootstrap estimates are less than or equal to it. [For example, if we had 1000 bootstrap samples, the quantile q(0.05) would be the 50th smallest estimate.] This means we can write

P( q(α/2) ≤ λ̂(b) ≤ q(1 − α/2) ) = 1 − α.

Now suppose we want to look at the distribution of λ̂(b) − λ̂. From the expression above,

P( q(α/2) − λ̂ ≤ λ̂(b) − λ̂ ≤ q(1 − α/2) − λ̂ ) = 1 − α.

In addition, we can argue that the distribution of λ̂ − λ can be estimated by the distribution of λ̂(b) − λ̂. This makes sense by analogy: λ̂ arose from sampling from a distribution with parameter λ, while λ̂(b) arose in exactly the same way from sampling from a distribution with parameter λ̂. This means we can say

P( q(α/2) − λ̂ ≤ λ̂ − λ ≤ q(1 − α/2) − λ̂ ) ≈ 1 − α.

Now we can find our confidence interval for λ by rearranging terms to isolate λ:

P( 2λ̂ − q(1 − α/2) ≤ λ ≤ 2λ̂ − q(α/2) ) ≈ 1 − α.

This means our bootstrap confidence interval for λ is [2λ̂ − q(1 − α/2), 2λ̂ − q(α/2)].

Now some code demonstrating how to find a 95% confidence interval for λ. Look at each line of code carefully to see what it is doing.
# First, initialize a vector that will receive the value of the
# estimate from each bootstrap sample
boot.sampling.dist <- numeric(2000)
# Now create 2000 bootstrap samples and compute the value of the
# statistic for each of them
for (i in 1:2000) {
  boot.sampling.dist[i] <- 1/mean(rexp(100, empirical.lambda))
}
# Look at the sampling distribution of the statistic according to the
# parametric bootstrap (windows() opens a plot window on Windows;
# use dev.new() on other systems)
windows()
hist(boot.sampling.dist, main="Estimate of sampling distribution of lambda", breaks=50)
# Find the quantiles of this distribution
my.quantiles <- quantile(boot.sampling.dist, c(0.025, 0.975))
# Calculate the bootstrap confidence interval boundaries
CIbootlow <- 2*empirical.lambda - my.quantiles[2]
CIboothigh <- 2*empirical.lambda - my.quantiles[1]
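The script above can also be wrapped as a reusable function. This is my own sketch, not from the original notes; the name param.boot.ci is made up.

```r
# Sketch (not from the notes): parametric bootstrap pivot CI for the
# exponential rate, given an estimated rate lambda.hat.
param.boot.ci <- function(lambda.hat, n = 100, B = 2000, alpha = 0.05) {
  # B bootstrap estimates, each from a fresh sample of size n from Exp(lambda.hat)
  boot.est <- replicate(B, 1 / mean(rexp(n, lambda.hat)))
  q <- quantile(boot.est, c(alpha / 2, 1 - alpha / 2))
  # Pivot interval: [2*lambda.hat - q(1 - alpha/2), 2*lambda.hat - q(alpha/2)]
  c(lower = unname(2 * lambda.hat - q[2]),
    upper = unname(2 * lambda.hat - q[1]))
}

set.seed(1)
param.boot.ci(3.32)
```

With B = 2000 the interval endpoints vary only slightly from run to run; increasing B reduces this Monte Carlo noise.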

One more thing the collection of bootstrap parameter estimates can be used for is to calculate an estimate of the standard error of λ̂. The standard error of λ̂ can be estimated by the sample standard deviation of the bootstrap estimates:

boot.estimate.se <- sqrt(var(boot.sampling.dist))

2 Some Helpful Commands

Working with Distributions

There is a set of four functions defined for almost any distribution you will encounter on your homework. I will show you examples for the normal distribution, but analogous functions exist for other distributions as well, such as the exponential (used above), the binomial, etc. These four functions are rnorm(), pnorm(), qnorm() and dnorm(). Each of them takes a series of arguments.

rnorm() is used to generate a set of random draws from a normal distribution of your choice. You pass it three arguments, in this order: n, the number of observations you want it to generate; µ, the mean of the distribution you want to sample from; and σ, the standard deviation of that distribution (note: the standard deviation, not the variance). [The number of arguments may vary if you are working with a distribution that has a different number of parameters. For example, we saw above that rexp() takes only two.]

pnorm() gives the cumulative distribution function (CDF) of the distribution you are working with. You pass it three arguments: q, the quantile below which you want the cumulative probability; and the parameters of the distribution you are working with. For example pnorm(0, 0, 1) = 0.5, since half the area of the standard normal distribution is below 0.

qnorm() is used to calculate a quantile when you know the cumulative probability. You pass it three arguments: p, the cumulative probability; and the parameters of the distribution you are working with. For example, qnorm(0.5, 3, 1) = 3, since 3 is the median (the 0.5 quantile) of the normal distribution with mean 3 and standard deviation 1.
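A few quick sanity checks of these functions (my own examples, not from the notes):

```r
# Quick checks of the distribution functions (my own examples)
set.seed(42)
draws <- rnorm(5, 10, 2)   # five draws from N(mean = 10, sd = 2)
length(draws)              # 5
pnorm(0, 0, 1)             # 0.5: half the standard normal mass lies below 0
qnorm(0.975)               # about 1.96, the z-value used in the CIs above
qnorm(0.5, 3, 1)           # 3: the median of N(3, 1)
```

Note that qnorm() inverts pnorm(): qnorm(pnorm(x)) returns x for any x.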
dnorm() is used to calculate the value of the density function at a given point. You pass it three arguments: x, the point at which you want to compute the density; and the parameters of the distribution you are working with.

Histograms

There are a couple of useful arguments for the hist() function that will come in handy on your homework:

breaks: tells R how many columns you want your histogram to have. If you give it a single number, it will make that many columns. Alternatively, you can pass it a vector of breakpoints, and the column borders will occur at the breakpoints you specify.

freq: if this argument is set to FALSE, the histogram bars will indicate density rather than counts. This is useful if you want to compare the distribution of your data to another distribution: you can superimpose the density of the distribution you want to compare your data to.

main: this argument sets the title for your histogram. Name it something descriptive so whoever is looking at it knows what they're seeing.

When you want to compare the shape of your histogram to the shape of a certain density function (for example, if you're trying to check whether your assumed distribution was reasonable), you may want to superimpose the plot of the density function on top of your histogram. To do this, first create your histogram using the hist() command. Then call the function lines(), passing it two arguments: a vector of values ranging over the width of your histogram, and the density of your desired distribution at each of these values. I'll show an example of this below.

An Example

We will generate a random sample from a certain density, create a histogram of our sample, and superimpose the original density on top of our histogram to check the fit.

# Set the number of sample points to generate
n <- 1000
# Generate n sample points from the exponential distribution with parameter 1
expo.data <- rexp(n, 1)
# Make a histogram of the data (freq=FALSE so the bars show density)
hist(expo.data, breaks=50, main="Histogram of exponential data", freq=FALSE)
# Observing that our histogram ranges over the values 0 to 10,
# we create a vector that also ranges over these values
xvals <- seq(0, 10, by=0.01)
# We then superimpose the density function at these points on our histogram
lines(xvals, dexp(xvals, 1))
# To save a copy of this histogram to a file, you can execute
pdf("myhist.pdf", width=3, height=3)
hist(expo.data, breaks=50, main="Histogram of exponential data", freq=FALSE)
lines(xvals, dexp(xvals, 1))
dev.off()

[Figure 1: Histogram of exponential random variables with density superimposed.]