Monte Carlo, Bootstrap, and Jackknife Estimation

Assume that your true model is

$$y = X\beta + u, \tag{1.1}$$

where $u$ is i.i.d. with 1) $E(u \mid X) = 0$ and 2) $E(uu' \mid X) = \sigma^2 I$; that is, the conditional mean of the error is zero and there is no autocorrelation or heteroskedasticity conditional on $X$. Then, using 1), the ordinary least squares (OLS) estimator of $\beta$, $\hat\beta = (X'X)^{-1}X'y$, is unbiased. You will want to estimate the variance of $\hat\beta$. Using 2), an estimator of $\mathrm{var}(\hat\beta) = \sigma^2 (X'X)^{-1}$ is

$$\hat\sigma^2 (X'X)^{-1}, \tag{1.2}$$

where $\hat\sigma^2 = \hat u'\hat u/(N - K)$, $\hat u = y - X\hat\beta$, $N$ = the number of observations, and $K$ = the number of regressors. To understand what is meant by $\mathrm{var}(\hat\beta)$ and its estimator, consider the following Monte Carlo procedure. Keep in mind that you would never want to apply this procedure to the classical linear model in (1.1) for actual data, because you can easily evaluate (1.2).

1. MONTE CARLO ESTIMATION OF STANDARD ERRORS OF $\hat\beta$ (= positive square root of the estimated variance).

a. Assume a value for $\beta$, which is otherwise unobservable. You also select a matrix of values for $X$, which you hold constant over repeated trials.

b. Draw $u$ randomly from some distribution you assume to be correct, using a random number generator. This use of a random number generator gives the method its name, after Monte Carlo, famed for its roulette wheels and games of chance.

c. Compute $y$ from equation (1.1).

d. Estimate $\hat\beta$ by regressing your generated $y$ on $X$, obtaining

$$\hat\beta = (X'X)^{-1}X'y. \tag{1.3}$$

e. Repeat steps (b)-(d) many times (say 10,000), holding $\beta$ and $X$ fixed at the values chosen in step (a). You have then generated 10,000 drawings of the random variable $u$, and hence of $y$ through equation (1.1), from which you compute 10,000 estimates $\hat\beta$ using (1.3). The sample variance of $\hat\beta$ over these 10,000 outcomes is our sample measure of the population variance of $\hat\beta$; its positive square root is the Monte Carlo standard error.
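Steps a through e can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the notes: a single regressor plus a constant, normal errors, and arbitrary values for the parameters, the sample size, and the number of replications. The helper names `ols` and `monte_carlo_slopes` are hypothetical.

```python
import random
import statistics

def ols(x, y):
    """Return (intercept, slope) from regressing y on a constant and x."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def monte_carlo_slopes(x, beta0, beta1, sigma, reps, rng):
    """Steps b-e: redraw u, rebuild y from the model, re-estimate the slope."""
    slopes = []
    for _ in range(reps):
        u = [rng.gauss(0.0, sigma) for _ in x]                  # step b (assumed normal)
        y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]   # step c
        slopes.append(ols(x, y)[1])                             # step d
    return slopes

rng = random.Random(12345)
x = [rng.uniform(0.0, 10.0) for _ in range(50)]   # step a: X is chosen once and held fixed
slopes = monte_carlo_slopes(x, beta0=1.0, beta1=2.0, sigma=3.0, reps=5000, rng=rng)

mc_se = statistics.stdev(slopes)                  # Monte Carlo standard error of the slope
xbar = statistics.fmean(x)
true_se = 3.0 / sum((xi - xbar) ** 2 for xi in x) ** 0.5   # sqrt of sigma^2 (X'X)^{-1} for the slope
print(mc_se, true_se)
```

Because the true error variance is known here, the Monte Carlo standard error can be compared directly with the square root of the slope element of $\sigma^2 (X'X)^{-1}$; the two should be close when the number of replications is large.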

f. The main use of the Monte Carlo method is to compute the bias and mean-square error of your estimator when it is difficult to do so analytically. However, it is also useful for demonstrating omitted variable bias and related problems to econometrics students. Keep in mind that the assumptions made in the Monte Carlo method may make your results specific to your exact model.

2. BOOTSTRAP ESTIMATION OF THE STANDARD ERRORS OF $\hat\beta$.

The term bootstrap implies that you are going to pull yourself up by your bootstraps. The wide range of bootstrap methods falls into two categories: 1) methods that allow computation of standard errors when the analytical formulas are hard to derive and 2) methods that lead to better small-sample approximations. Here you are confined to one actual data set and wish to resample from the empirical distribution of the residuals or the original $\{X, y\}$ data, rather than assume some distribution of the true error as with Monte Carlo analysis.

Given (1.1) as your true model, you could easily evaluate (1.2) and not do any bootstrapping. With more complex models containing non-normal error terms and non-linearities in $\beta$, or in two-step models where you need to correct the estimated standard errors of the second-step estimators, the derivation of analytical formulas for the variance of $\hat\beta$ is complex. Examples are two-step M-estimators, two-step panel data estimators, and two-step logit or probit estimators. In these cases $\hat\theta$, a second-step estimator, is a function of parameters that are estimated in the first step. The bootstrap will adjust for this. If there is heteroskedasticity in the model and your software does not compute heteroskedasticity-consistent (HC) standard errors, the wild and pairs bootstrap estimators make the HC correction to the estimated standard errors. With clustered data, cluster-robust standard errors can be obtained by resampling the clusters via bootstrapping.
Theory and Monte Carlo evidence indicate that the bootstrap estimates are more accurate in small samples than the asymptotic formula (as measured by the size and power of a t-test based on the estimated standard error) when an asymptotically pivotal statistic is employed, that is, one whose asymptotic normal distribution does not depend on unknown parameters. Otherwise, there is no guarantee of a gain in accuracy. However, there is usually a gain in accuracy even if an asymptotically pivotal statistic is not employed. A nice summary is found in Bootstrap Inference in Econometrics by James MacKinnon, Dept. of Economics Working Paper, Queens Univ., June. Also, see J. L. Horowitz, The Bootstrap, Ch. 52, Handbook of Econometrics, J.J. Heckman and E. Leamer editors, Vol. 5, 2001, for technical derivations.

There are three basic non-parametric bootstrap methods we will focus on: the naive (residual) bootstrap, the pairs bootstrap, and the wild bootstrap. These are in contrast to the less

popular parametric bootstrap. The three methods are most easily explained for the simple model (1.1). As with the Monte Carlo method, you assume that the model generating your data is the same as in (1.1). However, now you do not assume knowledge of $\beta$ or $u$ and do not generate random data from (1.1). Instead you use the estimator $\hat\beta$ and the original data $\{X, y\}$.

1. Bootstrap Methods

1.1 Non-Parametric: Residual Bootstrap

a. Estimate $\hat\beta = (X'X)^{-1}X'y$.

b. Compute $\hat u = y - X\hat\beta$. You work with $\hat u$ instead of assuming the distribution of $u$ as in Monte Carlo estimation.

c. Draw $N$ integers with replacement from the discrete uniform distribution $U[1, N]$, where $N$ is your sample size. Denote these random numbers by $z_1, \ldots, z_N$, and generate element $u^*_n$ as element $z_n$ of $\hat u$, $n = 1, \ldots, N$. This means that each element of $\hat u$ has probability $1/N$ of being drawn on every draw. See the Residual Bootstrap example in the Stata do file called monte carlo.do.

d. Treating $y^* = X\hat\beta + u^*$ as your true model, compute $y^*$.

e. Compute $\beta^* = (X'X)^{-1}X'y^*$.

f. Repeat (c)-(e) $B$ times. See MacKinnon for details.

g. Compute the sample variance of these $\beta^*$ estimates. With $B$ bootstrap replications $\beta^*_1, \ldots, \beta^*_B$, compute

$$s^2_{\hat\beta,Boot} = \frac{1}{B-1} \sum_{b=1}^{B} (\beta^*_b - \bar\beta^*)^2, \quad \text{where } \bar\beta^* = B^{-1} \sum_{b=1}^{B} \beta^*_b.$$

h. Take the square root of $s^2_{\hat\beta,Boot}$ to get the bootstrap estimate of the standard error of $\hat\beta$.

i. This bootstrap provides no asymptotic refinement (an improved approximation to the finite-sample distribution of an asymptotically pivotal statistic), since the distribution of $\hat\beta$ depends on the unknown parameters that define its mean and variance. That is, there is no guarantee of an improvement in finite-sample performance. However, such an improvement usually obtains anyway. This method can be very useful in computing adjusted standard

errors with 2-step models or in computing cluster-robust standard errors by resampling clusters.

1.2 Non-Parametric: Pairs Bootstrap

a. Follow step a of section 1.1.

b. Then draw pairs randomly with replacement from $\{X, y\}$, where the probability of any pair being drawn is equal to $1/N$, to obtain $\{X^*, y^*\}$.

c. Then use the $\{X^*, y^*\}$ data to obtain the pairs estimator $\beta^*_p = (X^{*\prime}X^*)^{-1}X^{*\prime}y^*$. Repeat steps b and c $B$ times and compute the standard error as in steps g and h of section 1.1.

d. Note that the pairs bootstrap produces an HC covariance matrix. See Lancaster (2003) for a proof of this. See the Pairs Estimator in the Stata file called monte carlo.do.

1.3 Non-Parametric: Wild Bootstrap

a. The wild bootstrap also produces an HC covariance matrix; see MacKinnon (2002) for details.

b. The wild bootstrap first generates

$$y^*_n = X_n\hat\beta + f(\hat u_n)v_n, \tag{1.4}$$

where

$$f(\hat u_n) = \frac{\hat u_n}{(1 - h_n)^{1/2}} \tag{1.5}$$

and $h_n$ is the $n$th diagonal element of $X(X'X)^{-1}X'$. We do this normalization so that, if $u_n$ is homoskedastic, then the normalized residual in (1.5) is homoskedastic. To see this, recall that $\hat u = Mu$ with $M = I - X(X'X)^{-1}X'$, so that under homoskedasticity $\mathrm{var}(\hat u_n \mid X) = \sigma^2(1 - h_n)$; dividing $\hat u_n$ by $(1 - h_n)^{1/2}$ therefore gives a residual with variance $\sigma^2$. The term $(1 - h_n)$, the $n$th diagonal element of $M$, is sometimes called $m_n$.

c. The best approach to specifying $v_n$ is to use the Rademacher distribution (see Davidson and Flachaire (2001)):

$$v_n = \begin{cases} 1 & \text{with probability } 1/2, \\ -1 & \text{with probability } 1/2. \end{cases} \tag{1.6}$$

d. Now $v_n$ has $E(v_n) = 0$, $E(v_n^2) = 1$, $E(v_n^3) = 0$, and $E(v_n^4) = 1$. Since $v_n$ and $\hat u_n$ are independent, the mean of the composite residual $f(\hat u_n)v_n$ is zero, which preserves $E(\hat u_n) = 0$. This is a nice property, and if we take $X_n$ as given, it implies unbiasedness of $\beta^*$ for $\hat\beta$.
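The pairs resampling of section 1.2 and the wild data generation in (1.4)-(1.6) can be sketched as follows. This is a minimal illustration, not the contents of monte carlo.do: it assumes a single regressor plus a constant, for which the leverage is $h_n = 1/N + (x_n - \bar x)^2 / \sum_m (x_m - \bar x)^2$, and it uses simulated heteroskedastic data. The function names and all numerical values are hypothetical.

```python
import random
import statistics

def ols(x, y):
    """Return (intercept, slope) from regressing y on a constant and x."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def pairs_bootstrap_se(x, y, B, rng):
    """Resample (x_n, y_n) pairs with probability 1/N each; return the slope SE."""
    n = len(x)
    slopes = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols([x[i] for i in idx], [y[i] for i in idx])[1])
    return statistics.stdev(slopes)   # divisor B - 1, as in section 1.1

def wild_bootstrap_se(x, y, B, rng):
    """Generate y* = fitted + f(u_hat) * v with Rademacher v; return the slope SE."""
    n = len(x)
    a, b = ols(x, y)
    xbar = statistics.fmean(x)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]          # leverages h_n
    fitted = [a + b * xi for xi in x]
    f = [(yi - fi) / (1.0 - hi) ** 0.5                           # equation (1.5)
         for yi, fi, hi in zip(y, fitted, h)]
    slopes = []
    for _ in range(B):
        y_star = [fi + fu * rng.choice((-1.0, 1.0))              # equations (1.4), (1.6)
                  for fi, fu in zip(fitted, f)]
        slopes.append(ols(x, y_star)[1])
    return statistics.stdev(slopes)

rng = random.Random(2024)
x = [rng.uniform(1.0, 10.0) for _ in range(100)]
# Simulated data whose error standard deviation grows with x (heteroskedastic).
y = [1.0 + 2.0 * xi + 0.5 * xi * rng.gauss(0.0, 1.0) for xi in x]
se_pairs = pairs_bootstrap_se(x, y, B=999, rng=rng)
se_wild = wild_bootstrap_se(x, y, B=999, rng=rng)
print(se_pairs, se_wild)
```

Note the design difference: the pairs bootstrap redraws the regressors along with the dependent variable, whereas the wild bootstrap holds $X$ fixed and only flips the sign of each rescaled residual.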

e. One can prove that $\mathrm{var}(wz) = \mathrm{var}(w)\,\mathrm{var}(z)$, assuming independence of $w$ and $z$ and $E(w) = E(z) = 0$. Then the variance of the composite residual is $\mathrm{var}(v_n) = 1$ times the variance of the residual, so the variance of $\hat u_n$ is preserved. The skewness of $\hat u_n$ is eliminated (since $E(v_n^3) = 0$), but the kurtosis of $\hat u_n$ is preserved (since $E(v_n^4) = 1$). Further, Wu and Mammen (1993) show that the asymptotic distribution of their version of the wild bootstrap is the same as the asymptotic distribution of various statistics. These asymptotic refinements are due to their wild bootstrap's taking account of the skewness of $\hat u_n$. However, their version of the wild bootstrap ignores kurtosis.

f. Now follow steps e)-h) of section 1.1 using the wild data for $y^*$ generated in step b) of this section. See the Wild Estimator in the Stata do file called monte carlo.do.

1.4 Pairs vs. Wild

Based on Atkinson and Cornwell, Inference in Two-Step Panel Data Models with Instruments and Time-Invariant Regressors: Bootstrap versus Analytic Estimators, for models with endogeneity, the wild bootstrap has more accurate size and virtually the same power as the pairs estimator in estimation of t-values for the second-step estimators. Both generally outperform the asymptotic formula in terms of size and power. In a linear model context without panel data, Davidson and Flachaire (2001) find that the wild bootstrap often outperforms the pairs bootstrap when the error is heteroskedastic.

1.5 Parametric Bootstrap

If it is known that $y_n \sim \text{Normal}[\mu, \sigma^2]$, then we could obtain $B$ bootstrap samples of size $N$ by drawing from the $\text{Normal}[\hat\mu, s^2]$ distribution. This is an example of a parametric bootstrap.

2. Number of Bootstrap Draws

The bootstrap asymptotics rely on $N$ being large, even if $B$ is small. However, the bootstrap is more accurate with large $B$. How large $B$ should be depends on the simulation error you can accept in your work. Davidson and MacKinnon recommend $B = 399$ for a type I error of .05 and $B = 1{,}499$ for tests at a level of .01.
If you are performing bootstrapping within a Monte Carlo analysis, then $B = 399$ is adequate. You need $\alpha(B + 1)$ to be an integer.

Note: If you assume a two-sided confidence interval with $\alpha = .05$, then for the upper tail, $399 \times .025 = 9.975$ is the theoretical number of significant t-values you would expect if the size were correct. You would array the t-values from high to low. With 400 bootstrap draws, $400 \times .025 = 10$, which says that you should have 10 t-values equal to 1.96 or greater to have correct size. However, if the 10th-ranked t-value is the last t-value greater than or equal to 1.96, it is unclear whether it should belong to one set or the other: it sits on the cusp. With $B = 399$, the required number, $.025 \times 399 = 9.975$, is not an integer, so this ambiguity is eliminated. This is not a major issue in my opinion.

3. Bias Adjustment Using the Bootstrap or Jackknife

In small samples many sandwich estimators may be biased. Weak instruments may also cause bias. We can correct for these biases using the bootstrap or the jackknife via the following:

a. Since the bootstrap estimator of $\hat\beta$ is $\bar\beta^* = \frac{1}{B}\sum_{b=1}^{B} \beta^*_b$, the estimated bias is $\bar\beta^* - \hat\beta$, and we can compute the bias-corrected estimator as

$$\hat\beta - \left(\frac{1}{B}\sum_{b=1}^{B} \beta^*_b - \hat\beta\right) = 2\hat\beta - \frac{1}{B}\sum_{b=1}^{B} \beta^*_b.$$

The intuition is that since we do not know $\beta$, we treat $\hat\beta$ as the true value and determine the bias of the bootstrap estimator relative to this value. We then adjust $\hat\beta$ by this computed bias, assuming that the bias of the bootstrap estimator relative to $\hat\beta$ is the same as the bias of $\hat\beta$ relative to $\beta$.

b. We can compute the jackknife estimator of the standard deviation of $\hat\beta$ for a sample of size $N$ by computing $N$ jackknife estimates of $\beta$, obtained by successively dropping observation $n$, $n = 1, \ldots, N$, and recomputing the estimate, denoted $\beta_{J,n}$, where $J$ stands for jackknife. Then compute the variance of the $N$ estimates (using divisor $N$) and multiply by $N - 1$ to get the estimated variance of $\hat\beta$. Take the square root to get the estimated standard error. We can employ the jackknife two-stage-least-squares (JK2SLS) estimator of Hahn, J., and J. Hausman (2003), Weak Instruments: Diagnosis and Cures in Empirical Econometrics, American Economic Review Papers and Proceedings 93, to correct for the bias caused by weak instruments.
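The jackknife standard error just described can be sketched as below. The sketch assumes the variance of the $N$ leave-one-out estimates is taken with divisor $N$ before multiplying by $N - 1$; the helper names `jackknife_se` and `ols_slope` and the simulated data are hypothetical illustrations.

```python
import random
import statistics

def jackknife_se(data, estimator):
    """Drop each observation in turn, re-estimate, and scale the variance by N - 1."""
    n = len(data)
    estimates = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    mean_est = sum(estimates) / n
    var_n = sum((e - mean_est) ** 2 for e in estimates) / n   # divisor N
    return ((n - 1) * var_n) ** 0.5

def ols_slope(pairs):
    """Slope from regressing y on a constant and x, given a list of (x, y) pairs."""
    xbar = statistics.fmean(p[0] for p in pairs)
    ybar = statistics.fmean(p[1] for p in pairs)
    sxx = sum((xi - xbar) ** 2 for xi, _ in pairs)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in pairs) / sxx

rng = random.Random(7)
xs = [rng.uniform(0.0, 10.0) for _ in range(30)]
data = [(xi, 1.0 + 2.0 * xi + rng.gauss(0.0, 1.0)) for xi in xs]
se_slope = jackknife_se(data, ols_slope)
print(se_slope)

# Sanity check on a case with a known answer: for the sample mean, the
# jackknife standard error equals the usual s / sqrt(N).
sample = [1.0, 4.0, 2.0, 8.0, 5.0]
se_mean = jackknife_se(sample, statistics.fmean)
print(se_mean, statistics.stdev(sample) / len(sample) ** 0.5)
```

The sample-mean check is a convenient way to confirm that the $N - 1$ scaling has been applied to a variance computed with the intended divisor.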
The formula for the jackknife bias correction is given in Shao and Tu (1995). To compute the jackknife bias correction for the estimated coefficients, let $\hat\beta$ be the estimator of $\beta$ for a sample of size $N$. First compute $N$ jackknife estimates obtained by successively dropping one observation and recomputing $\hat\beta$. Call each of these $N$ estimates $\beta_{J,n}$, $n = 1, \ldots, N$, and their average $\bar\beta_J = \frac{1}{N}\sum_{n=1}^{N} \beta_{J,n}$. Define the jackknife bias estimator as

$$BIAS_J = (N - 1)(\bar\beta_J - \hat\beta). \tag{1.7}$$

Then the jackknife bias-adjusted (BA) estimator of $\beta$ is

$$\hat\beta_{BA} = \hat\beta - BIAS_J = N\hat\beta - (N - 1)\bar\beta_J. \tag{1.8}$$

Again, the intuition is that since we do not know $\beta$, we treat $\hat\beta$ as the true value and determine the bias of the jackknife estimator relative to this value. We then adjust $\hat\beta$ by this computed bias, assuming that the bias of the jackknife estimator relative to $\hat\beta$ is the same as the bias of $\hat\beta$ relative to $\beta$.

c. The jackknife uses fewer computations ($N < B$) than the bootstrap, but is outperformed by the bootstrap as $B \to \infty$.

4. Hypothesis Testing

Assume a model $y = \alpha + x\beta + u$. You can compute $t = (\hat\beta - \beta)/s_{\hat\beta,Boot}$, using the bootstrap estimator of the standard deviation. For the specific null hypothesis that $\beta = 0$ you would compute $t = (\hat\beta - 0)/s_{\hat\beta,Boot}$. While this is asymptotically valid so long as $\beta^*$ and $\hat\beta$ approach the true $\beta$, it will not give you asymptotic refinements for any $N$.

To obtain asymptotic refinement, we need to compute asymptotically pivotal test statistics, whose asymptotic normal distribution does not depend on unknown parameters. This requires a studentized test statistic based on the asymptotic standard error of $\hat\beta$. We fashion this after the usual test statistic $t = (\hat\beta - \beta)/s_{\hat\beta} \overset{a}{\sim} N[0, 1]$, which provides asymptotic refinement since it is asymptotically pivotal; that is, its asymptotic distribution does not depend on unknown parameters. To achieve asymptotic refinement, you have to compute, for each bootstrap draw $b$,

$$t^*_b = (\beta^*_b - \hat\beta)/s_{\beta^*_b},$$

where $s_{\beta^*_b}$ is the analytic or asymptotic estimator of the standard error evaluated using the bootstrap data for draw $b$, and then find $t^*_{(1-\alpha/2)}$ and $t^*_{(\alpha/2)}$ after rank ordering the $B$ bootstrap values. Use these to test the null hypothesis. For $\alpha = .05$, take the $(1 - \alpha/2) = .975$

percentile and the $(\alpha/2) = .025$ percentile. These bootstrap critical values can then be compared with the $t$ value from the original sample. If $t > t^*_{(1-\alpha/2)}$ or $t < t^*_{(\alpha/2)}$, then the null hypothesis is rejected. We are comparing one standardized statistic with another. However, computing the analytic formula may be very difficult, and one may have to use the bootstrap estimator based on the standard deviation ($s_{\hat\beta,Boot}$) computed over the $B$ bootstrap trials. This will not yield asymptotic refinements but will probably still be better than using the asymptotic formula.

5. Bootstrapping Time Series Data

The bootstrap does not generally work well with time series data. The reason is that the bootstrap relies on resampling from an i.i.d. distribution. With standard bootstrapping you are randomly selecting among a set of residuals that follow some autocorrelation process, thereby destroying that process. Two alternatives that can be employed are block bootstrapping and the sieve bootstrap. With block bootstrapping, time-series blocks that capture the autoregressive process are randomly selected and the entire block is resampled. The sieve bootstrap works by fitting an autoregressive process of order $p$ to the original data and then generating bootstrap samples by randomly resampling the rescaled residuals, which are assumed to be i.i.d. Since the sieve imposes more structure on the DGP, it should have better performance than the block bootstrap.

As an example of the sieve with $p = 1$, consider the model

$$y_t = \beta x_t + u_t, \tag{1.9}$$

where

$$u_t = \rho u_{t-1} + \epsilon_t, \tag{1.10}$$

and $\epsilon_t$ is white noise. Now estimate $\beta$ and $\rho$ and obtain $\hat\epsilon_t = \hat u_t - \hat\rho\hat u_{t-1}$. Bootstrap these residuals to get $\hat\epsilon^*_t$, $t = 1, \ldots, T$. Then recursively compute $\hat u^*_t = \hat\rho\hat u^*_{t-1} + \hat\epsilon^*_t$ and hence $y^*_t = \hat\beta x_t + \hat u^*_t$. Then regress $y^*_t$ on $x_t$.

The moving block bootstrap constructs overlapping moving blocks. With $n$ observations and block length $b$, there are $n - b + 1$ blocks. The first contains observations 1 through $b$, the second contains observations 2 through $b + 1$, and the last contains observations $n - b + 1$ through $n$. The choice of $b$ is critical. In theory, it must increase as $n$ increases. If blocks are too short, bootstrap samples cannot mimic the original sample, because dependence is broken whenever we start a new block. If blocks are too long, bootstrap samples are not random enough.
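The block construction described above can be sketched as follows. The resampling step makes two assumptions that the notes do not spell out: blocks are drawn with replacement with equal probability, and enough blocks are drawn to reach length $n$, with the concatenated series truncated to $n$ observations. The function names are hypothetical.

```python
import math
import random

def moving_blocks(series, b):
    """Return the n - b + 1 overlapping blocks of length b."""
    n = len(series)
    return [series[i:i + b] for i in range(n - b + 1)]

def moving_block_resample(series, b, rng):
    """Concatenate randomly drawn blocks and truncate to the original length."""
    n = len(series)
    blocks = moving_blocks(series, b)
    draws = math.ceil(n / b)              # assumed: just enough blocks to cover n
    resampled = []
    for _ in range(draws):
        resampled.extend(rng.choice(blocks))
    return resampled[:n]                  # assumed: truncate the overshoot

series = list(range(1, 11))               # observations 1, ..., 10
blocks = moving_blocks(series, b=3)
print(len(blocks), blocks[0], blocks[1], blocks[-1])

rng = random.Random(0)
print(moving_block_resample(series, b=3, rng=rng))
```

With $n = 10$ and $b = 3$ there are $10 - 3 + 1 = 8$ blocks, the first being observations 1 through 3 and the last observations 8 through 10, in line with the description in the text. Within each drawn block the original ordering, and hence the dependence, is kept intact.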

For a nice discussion of the moving block bootstrap and a comparison of this and the other methods for time series, see Bootstrap Methods in Econometrics by James G. MacKinnon, Department of Economics, Queens University, Kingston, Ontario, Canada K7L 3N6, jgm@econ.queensu.ca, September.

6. Bootstrapping Panel Data

With both panel-data bootstrap methods (pairs and wild), three resampling schemes are available. These are cross-sectional resampling (also called the panel bootstrap), temporal resampling (also called block bootstrap resampling), and cross-sectional/temporal resampling. With panel-bootstrap resampling, one randomly selects among the $N$ cross-sectional units and uses all $T$ observations for each. If cross-sectional dependence exists, one can select the relevant blocks of cross-sectional units. With temporal resampling, one randomly selects temporal units and uses all $N$ observations for each. If temporal dependence exists, one can select the relevant blocks of temporal units. Of course, this choice of blocks is critical to the accuracy of the bootstrap. With cross-sectional/temporal resampling, both methods are utilized.

Following Cameron and Trivedi (2005), in the fixed-$T$ case consistent (as $N \to \infty$) standard errors can be obtained using the cross-sectional bootstrap method. Hence, we employ this method for both the pairs and wild methods, where we assume no cross-sectional or temporal dependence. Also, see Kapetanios (2008), A Bootstrap Procedure for Panel Data Sets with Many Cross-Sectional Units, The Econometrics Journal 11, who shows that if the data do not exhibit cross-sectional dependence but exhibit temporal dependence, then cross-sectional resampling is superior to block bootstrap resampling. Further, he shows that cross-sectional resampling provides asymptotic refinements. Monte Carlo results using these assumptions indicate the superiority of the cross-sectional method.
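Cross-sectional (panel bootstrap) resampling combined with the pairs method can be sketched as below. The sketch is illustrative only: it assumes a pooled OLS regression with one regressor and a constant, simulated data with a unit effect that is uncorrelated with the regressor, and no cross-sectional dependence. The function names, the dictionary layout of `panel`, and all numerical values are hypothetical.

```python
import random
import statistics

def ols(x, y):
    """Return (intercept, slope) from regressing y on a constant and x."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def resample_units(panel, rng):
    """Draw N units with replacement and keep all T observations of each."""
    units = list(panel)
    drawn = [rng.choice(units) for _ in units]
    return [row for unit in drawn for row in panel[unit]]

def panel_bootstrap_se(panel, B, rng):
    """Bootstrap standard error of the pooled OLS slope under unit resampling."""
    slopes = []
    for _ in range(B):
        rows = resample_units(panel, rng)
        slopes.append(ols([r[0] for r in rows], [r[1] for r in rows])[1])
    return statistics.stdev(slopes)

rng = random.Random(99)
N, T = 40, 5
panel = {}
for i in range(N):
    alpha_i = rng.gauss(0.0, 1.0)         # unit effect, independent of x by construction
    rows = []
    for _ in range(T):
        x_it = rng.uniform(0.0, 10.0)
        rows.append((x_it, alpha_i + 2.0 * x_it + rng.gauss(0.0, 1.0)))
    panel[i] = rows

se_panel = panel_bootstrap_se(panel, B=499, rng=rng)
print(se_panel)
```

Because whole units are drawn, each bootstrap sample keeps the within-unit time pattern of the original data, which is why this scheme tolerates temporal dependence within units.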