CHAPTER 4 ASYMPTOTIC THEORY FOR LINEAR REGRESSION MODELS WITH I.I.D. OBSERVATIONS


Key words: Asymptotics, almost sure convergence, central limit theorem, convergence in distribution, convergence in quadratic mean, convergence in probability, i.i.d., law of large numbers, the Slutsky theorem, White's heteroskedasticity-consistent variance estimator.

Abstract: When the conditional normality assumption on the regression error does not hold, the OLS estimator no longer has the finite sample normal distribution, and the t-test and F-test statistics no longer follow a Student t-distribution and an F-distribution in finite samples, respectively. In this chapter, we show that under the assumption of i.i.d. observations with conditional homoskedasticity, the classical t-test and F-test are approximately applicable in large samples. However, under conditional heteroskedasticity, the t-test and F-test statistics are not applicable even when the sample size goes to infinity. Instead, White's (1980) heteroskedasticity-consistent variance estimator should be used, which yields asymptotically valid hypothesis test procedures. A direct test for conditional heteroskedasticity due to White (1980) is presented. To facilitate asymptotic analysis in this and subsequent chapters, we first introduce some basic tools of asymptotic analysis.

Motivation

The assumptions of classical linear regression models are rather strong, and one may have a hard time finding practical applications where all these assumptions hold exactly. For example, it has been documented that most economic and financial data have heavy tails and so are not normally distributed. An interesting question now is whether the estimators and tests which are based on the same principles as before still make sense in this more general setting. In particular, what happens to the OLS estimator and the t- and F-tests if any of the following assumptions fails:

- strict exogeneity: $E(\varepsilon_t \mid \mathbf{X}) = 0$?
- normality: $\varepsilon \mid \mathbf{X} \sim N(0, \sigma^2 I)$?
- conditional homoskedasticity: $\mathrm{var}(\varepsilon_t \mid \mathbf{X}) = \sigma^2$?
- serial uncorrelatedness: $\mathrm{cov}(\varepsilon_t, \varepsilon_s \mid \mathbf{X}) = 0$ for $t \neq s$?

When the classical assumptions are violated, we no longer know the finite sample statistical properties of the estimators and test statistics. A useful tool for understanding the properties of estimators and tests in this more general setting is to pretend that we can obtain a limitless number of observations. We can then ask how estimators and test statistics would behave when the number of observations increases without limit. This is called asymptotic analysis. In practice, the sample size is always finite. However, asymptotic properties translate into results that hold true approximately in finite samples, provided that the sample size is large enough. We now introduce some basic analytic tools for asymptotic theory. For a more systematic introduction to asymptotic theory, see, for example, White (1994, 1999).

Introduction to Asymptotic Theory

In this section, we introduce some important convergence concepts and limit theorems. First, we introduce the concept of convergence in mean squares, which is a distance measure of a sequence of random variables from a random variable.

Definition [Convergence in mean squares (or in quadratic mean)]: A sequence of random variables/vectors/matrices $Z_n$, $n = 1, 2, \ldots$, is said to converge to $Z$ in mean squares as $n \to \infty$ if
$E\|Z_n - Z\|^2 \to 0$ as $n \to \infty$,
where $\|\cdot\|$ measures the magnitude of $Z_n - Z$ componentwise.

Remarks:

(i) When $Z_n$ is a vector or matrix, convergence can be understood as convergence in each element of $Z_n$.

(ii) When $Z_n - Z$ is an $l \times m$ matrix, where $l$ and $m$ are fixed positive integers, we can also define the squared norm as
$\|Z_n - Z\|^2 = \sum_{t=1}^{l}\sum_{s=1}^{m} [Z_n - Z]_{(t,s)}^2.$

Example 1: Suppose $\{Z_t\}$ is i.i.d.$(\mu, \sigma^2)$, and $\bar{Z}_n = n^{-1}\sum_{t=1}^{n} Z_t$. Then $\bar{Z}_n \xrightarrow{q.m.} \mu$.

Solution: Because $E(\bar{Z}_n) = \mu$, we have
$E(\bar{Z}_n - \mu)^2 = \mathrm{var}(\bar{Z}_n) = \mathrm{var}\left(n^{-1}\sum_{t=1}^{n} Z_t\right) = \frac{1}{n^2}\sum_{t=1}^{n}\mathrm{var}(Z_t) = \frac{\sigma^2}{n} \to 0$ as $n \to \infty$.
It follows that $E(\bar{Z}_n - \mu)^2 = \sigma^2/n \to 0$ as $n \to \infty$.
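The rate $\sigma^2/n$ in this example can be checked numerically. The following minimal sketch (Python, with hypothetical values of $\mu$ and $\sigma$) estimates $E(\bar{Z}_n - \mu)^2$ by Monte Carlo and compares it with $\sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0      # hypothetical population mean and standard deviation
reps = 10_000             # Monte Carlo replications used to estimate the expectation

for n in [10, 100, 1_000]:
    Zbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    mse = np.mean((Zbar - mu) ** 2)          # estimates E(Zbar_n - mu)^2
    print(f"n={n:5d}  simulated E(Zbar-mu)^2 = {mse:.5f}   sigma^2/n = {sigma**2 / n:.5f}")
```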

Next, we introduce the concept of convergence in probability, another popular distance measure between a sequence of random variables and a random variable.

Definition [Convergence in probability]: $Z_n$ converges to $Z$ in probability if for any given constant $\varepsilon > 0$,
$\Pr[\|Z_n - Z\| > \varepsilon] \to 0$ as $n \to \infty$, or equivalently $\Pr[\|Z_n - Z\| \le \varepsilon] \to 1$ as $n \to \infty$.

Remarks:

(i) We can also write $Z_n - Z \xrightarrow{p} 0$ or $Z_n - Z = o_P(1)$.

(ii) When $Z = b$ is a constant, we can write $Z_n \xrightarrow{p} b$, and $b = \mathrm{plim}\, Z_n$ is called the probability limit of $Z_n$.

(iii) The notation $o_P(1)$ means that $Z_n - b$ vanishes to zero in probability.

(iv) Convergence in probability is also called weak convergence, or convergence with probability approaching one.

(v) When $Z_n \xrightarrow{p} Z$, the probability that the difference $\|Z_n - Z\|$ exceeds any given small constant $\varepsilon$ is rather small for all $n$ sufficiently large. In other words, $Z_n$ will be very close to $Z$ with very high probability when the sample size $n$ is sufficiently large.

Lemma [Weak Law of Large Numbers (WLLN) for I.I.D. Samples]: Suppose $\{Z_t\}$ is i.i.d.$(\mu, \sigma^2)$, and define $\bar{Z}_n = n^{-1}\sum_{t=1}^{n} Z_t$, $n = 1, 2, \ldots$. Then $\bar{Z}_n \xrightarrow{p} \mu$ as $n \to \infty$.

Proof: For any given constant $\varepsilon > 0$, we have by Chebyshev's inequality
$\Pr(|\bar{Z}_n - \mu| > \varepsilon) \le \frac{E(\bar{Z}_n - \mu)^2}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0$ as $n \to \infty$.
Hence, $\bar{Z}_n \xrightarrow{p} \mu$ as $n \to \infty$.

Remark: This is the so-called weak law of large numbers (WLLN). In fact, we can weaken the moment condition.

Lemma [WLLN for I.I.D. Random Samples]: Suppose $\{Z_t\}$ is i.i.d. with $E(Z_t) = \mu$ and $E|Z_t| < \infty$. Define $\bar{Z}_n = n^{-1}\sum_{t=1}^{n} Z_t$. Then $\bar{Z}_n \xrightarrow{p} \mu$ as $n \to \infty$.
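A quick numerical illustration of the WLLN (a minimal sketch; the exponential distribution and seed are arbitrary, and any i.i.d. draws with a finite mean would do):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0                                   # population mean of Exponential(1) draws
for n in [10, 1_000, 100_000]:
    Zbar = rng.exponential(scale=1.0, size=n).mean()
    print(f"n={n:7d}  sample mean = {Zbar:.4f}   |Zbar - mu| = {abs(Zbar - mu):.4f}")
```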

Question: Why do we need the moment condition $E|Z_t| < \infty$?

Answer: A counterexample: suppose $\{Z_t\}$ is a sequence of i.i.d. Cauchy$(0,1)$ random variables, whose moments do not exist. Then $\bar{Z}_n \sim$ Cauchy$(0,1)$ for all $n \ge 1$, and so it does not converge in probability to any constant as $n \to \infty$.

We now introduce a related useful concept.

Definition [Boundedness in Probability]: A sequence of random variables/vectors/matrices $\{Z_n\}$ is bounded in probability if for any small constant $\delta > 0$, there exists a constant $C < \infty$ such that
$P(\|Z_n\| > C) \le \delta$ as $n \to \infty$.
We denote $Z_n = O_P(1)$.

Interpretation: As $n \to \infty$, the probability that $\|Z_n\|$ exceeds a very large constant is small. Or, equivalently, $\|Z_n\|$ is smaller than $C$ with very high probability as $n \to \infty$.

Example: Suppose $Z_n \sim N(\mu, \sigma^2)$ for all $n \ge 1$. Then $Z_n = O_P(1)$.

Solution: For any $\delta > 0$, we can always find a sufficiently large constant $C = C(\delta) > 0$ such that
$P(|Z_n| > C) = 1 - P(-C \le Z_n \le C) = 1 - \Phi\left(\frac{C - \mu}{\sigma}\right) + \Phi\left(\frac{-C - \mu}{\sigma}\right) \le \delta,$
where $\Phi(z) = P(Z \le z)$ is the CDF of $N(0,1)$. [We can choose $C$ such that $\Phi[(C - \mu)/\sigma] \ge 1 - \delta/2$ and $\Phi[-(C + \mu)/\sigma] \le \delta/2$.]

A Special Case: What happens to $C$ if $Z_n \sim N(0,1)$? In this case,
$P(|Z_n| > C) = 1 - \Phi(C) + \Phi(-C) = 2[1 - \Phi(C)].$

Suppose we set $2[1 - \Phi(C)] = \delta$; that is, we set
$C = \Phi^{-1}\left(1 - \frac{\delta}{2}\right),$
where $\Phi^{-1}(\cdot)$ is the inverse function of $\Phi(\cdot)$. Then we have $P(|Z_n| > C) = \delta$.

Remark: $Z_n - Z \xrightarrow{q.m.} 0$ implies $Z_n - Z \xrightarrow{p} 0$ (why? apply Chebyshev's inequality, as in the proof of the WLLN above), but the converse may not be true. We now give a counterexample.

Example: Suppose
$Z_n = 0$ with probability $1 - \frac{1}{n}$, and $Z_n = n$ with probability $\frac{1}{n}$.
Then $Z_n \xrightarrow{p} 0$ as $n \to \infty$, but $E(Z_n - 0)^2 = n \to \infty$. Please verify it.

Solution: (i) For any given $\varepsilon > 0$, we have
$P(|Z_n - 0| > \varepsilon) = P(Z_n = n) = \frac{1}{n} \to 0.$
(ii)
$E(Z_n - 0)^2 = \sum_{z_n \in \{0, n\}} (z_n - 0)^2 f(z_n) = (0 - 0)^2\left(1 - \frac{1}{n}\right) + (n - 0)^2\frac{1}{n} = n \to \infty.$

Next, we provide another convergence concept, called almost sure convergence.

Definition [Almost Sure Convergence]: $\{Z_n\}$ converges to $Z$ almost surely if
$\Pr\left[\lim_{n\to\infty}\|Z_n - Z\| = 0\right] = 1.$
We denote $Z_n - Z \xrightarrow{a.s.} 0$.

Interpretation: Recall the definition of a random variable: any random variable is a mapping from the sample space $\Omega$ to the real line, namely $Z: \Omega \to \mathbb{R}$. Let $\omega$ be a basic outcome in the sample space $\Omega$. Define a subset of $\Omega$:
$A^c = \{\omega \in \Omega : \lim_{n\to\infty} Z_n(\omega) = Z(\omega)\}.$

That is, $A^c$ is the set of basic outcomes on which the sequence $\{Z_n(\cdot)\}$ converges to $Z(\cdot)$ as $n \to \infty$. Then almost sure convergence can be stated as
$P(A^c) = 1.$
In other words, the convergence set $A^c$ occurs with probability one.

Example: Let $\omega$ be uniformly distributed on $[0,1]$, and define $Z(\omega) = \omega$ for all $\omega \in [0,1]$ and $Z_n(\omega) = \omega + \omega^n$ for $\omega \in [0,1]$. Is $Z_n - Z \xrightarrow{a.s.} 0$?

Solution: Consider $A^c = \{\omega \in \Omega : \lim_{n\to\infty}|Z_n(\omega) - Z(\omega)| = 0\}$. For any given $\omega \in [0,1)$, we always have
$\lim_{n\to\infty}|Z_n(\omega) - Z(\omega)| = \lim_{n\to\infty}|(\omega + \omega^n) - \omega| = \lim_{n\to\infty}\omega^n = 0.$
In contrast, for $\omega = 1$, we have
$\lim_{n\to\infty}|Z_n(1) - Z(1)| = \lim_{n\to\infty} 1^n = 1 \neq 0.$
Thus $A^c = [0,1)$ and $P(A^c) = 1$. We also have $P(A) = P(\omega = 1) = 0$.

Remark: Almost sure convergence is closely related to pointwise convergence (almost everywhere). It is also called strong convergence.

Lemma [Strong Law of Large Numbers (SLLN) for I.I.D. Random Samples]: Suppose $\{Z_t\}$ is i.i.d. with $E(Z_t) = \mu$ and $E|Z_t| < \infty$. Then
$\bar{Z}_n \xrightarrow{a.s.} \mu$ as $n \to \infty$.

Remark: Almost sure convergence implies convergence in probability, but not vice versa.
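The example above can also be illustrated numerically: $|Z_n(\omega) - Z(\omega)| = \omega^n$ vanishes for every $\omega < 1$ and stays at 1 only on the probability-zero set $\{\omega = 1\}$. A minimal sketch (the grid of $\omega$ values is arbitrary):

```python
import numpy as np

omegas = np.array([0.0, 0.5, 0.9, 0.99, 1.0])   # points of the sample space [0, 1]
for n in [10, 100, 1_000]:
    diff = omegas ** n                           # |Z_n(omega) - Z(omega)| = omega^n
    print(f"n={n:5d}  |Z_n - Z| = {np.round(diff, 6)}")
```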

Lemma [Continuity]: (i) Suppose $a_n \xrightarrow{p} a$ and $b_n \xrightarrow{p} b$, and $g(\cdot)$ and $h(\cdot)$ are continuous functions. Then
$g(a_n) + h(b_n) \xrightarrow{p} g(a) + h(b)$, and $g(a_n)h(b_n) \xrightarrow{p} g(a)h(b).$
(ii) Similar results hold for almost sure convergence.

The last convergence concept we will introduce is called convergence in distribution.

Definition [Convergence in Distribution]: Suppose $\{Z_n\}$ is a sequence of random variables with CDFs $\{F_n(z)\}$, and $Z$ is a random variable with CDF $F(z)$. Then $Z_n$ converges to $Z$ in distribution if
$F_n(z) \to F(z)$ as $n \to \infty$
for all continuity points of $F(\cdot)$, namely all points $z$ where $F(z)$ is continuous. The distribution $F(z)$ is called the asymptotic distribution or limiting distribution of $Z_n$.

Remark: For convergence in mean squares, convergence in probability, and almost sure convergence, we can view the convergence of $Z_n$ to $Z$ as occurring element by element (that is, each element of $Z_n$ converges to the corresponding element of $Z$). For convergence in distribution of $Z_n$ to $Z$, however, element-by-element convergence does not imply convergence in distribution of $Z_n$ to $Z$, because element-wise convergence in distribution ignores the relationships among the components of $Z_n$. On the other hand, $Z_n \xrightarrow{d} Z$ does imply that each element of $Z_n$ converges in distribution to the corresponding element of $Z$; namely, convergence in joint distribution implies convergence in marginal distribution.

It should be emphasized that convergence in distribution is defined in terms of the closeness of the CDF $F_n(z)$ to the CDF $F(z)$, not the closeness of the random variable $Z_n$ to the random variable $Z$. In contrast, convergence in mean squares, convergence in probability, and almost sure convergence all measure the closeness between the random variable $Z_n$ and the random variable $Z$.

The concept of convergence in distribution is very useful in practice. Generally, the finite sample distribution of a statistic is either complicated or unknown. Suppose we know the asymptotic distribution of $Z_n$. Then we can use it as an approximation to the finite sample distribution and use it in statistical inference. The resulting inference is not exact, but the accuracy improves as the sample size $n$ becomes larger.

With the concept of convergence in distribution, we can now state the Central Limit Theorem (CLT) for i.i.d. random samples, a fundamental limit theorem in probability theory.

Lemma [Central Limit Theorem (CLT) for I.I.D. Random Samples]: Suppose $\{Z_t\}$ is i.i.d.$(\mu, \sigma^2)$, and $\bar{Z}_n = n^{-1}\sum_{t=1}^{n} Z_t$. Then as $n \to \infty$,
$\frac{\bar{Z}_n - E(\bar{Z}_n)}{\sqrt{\mathrm{var}(\bar{Z}_n)}} = \frac{\bar{Z}_n - \mu}{\sqrt{\sigma^2/n}} = \frac{\sqrt{n}(\bar{Z}_n - \mu)}{\sigma} \xrightarrow{d} N(0,1).$

Proof: Put $Y_t = (Z_t - \mu)/\sigma$ and $\bar{Y}_n = n^{-1}\sum_{t=1}^{n} Y_t$. Then
$\sqrt{n}(\bar{Z}_n - \mu)/\sigma = \sqrt{n}\,\bar{Y}_n.$
The characteristic function of $\sqrt{n}\,\bar{Y}_n$ is
$\phi_n(u) = E[\exp(iu\sqrt{n}\,\bar{Y}_n)], \quad i = \sqrt{-1},$
$= E\left[\exp\left(\frac{iu}{\sqrt{n}}\sum_{t=1}^{n} Y_t\right)\right]$
$= \prod_{t=1}^{n} E\left[\exp\left(\frac{iu}{\sqrt{n}}Y_t\right)\right]$ (by independence)
$= \left[\phi_Y\left(\frac{u}{\sqrt{n}}\right)\right]^n$ (by identical distribution)
$= \left[\phi_Y(0) + \phi_Y'(0)\frac{u}{\sqrt{n}} + \frac{1}{2}\phi_Y''(0)\frac{u^2}{n} + \cdots\right]^n$
$= \left[1 - \frac{u^2}{2n}(1 + o(1))\right]^n \to \exp\left(-\frac{u^2}{2}\right)$ as $n \to \infty$,
where $\phi_Y(0) = 1$, $\phi_Y'(0) = 0$, $\phi_Y''(0) = -1$, $o(1)$ denotes a remainder term that vanishes to zero as $n \to \infty$, and we have also made use of the fact that $(1 + \frac{a}{n})^n \to e^a$.

More rigorously, we can show
$\ln\phi_n(u) = n\ln\phi_Y\left(\frac{u}{\sqrt{n}}\right) = \frac{\ln\phi_Y(u/\sqrt{n})}{n^{-1}},$
and applying L'Hopital's rule twice,
$\lim_{n\to\infty}\ln\phi_n(u) = \frac{u}{2}\lim_{n\to\infty}\frac{\phi_Y'(u/\sqrt{n})/\phi_Y(u/\sqrt{n})}{n^{-1/2}} = \frac{u^2}{2}\lim_{n\to\infty}\frac{\phi_Y''(u/\sqrt{n})\,\phi_Y(u/\sqrt{n}) - [\phi_Y'(u/\sqrt{n})]^2}{\phi_Y^2(u/\sqrt{n})} = -\frac{u^2}{2}.$
It follows that
$\lim_{n\to\infty}\phi_n(u) = e^{-\frac{1}{2}u^2}.$
This is the characteristic function of $N(0,1)$. By the uniqueness of the characteristic function, the asymptotic distribution of $\sqrt{n}(\bar{Z}_n - \mu)/\sigma$ is $N(0,1)$. This completes the proof.

Lemma [Cramer-Wold Device]: A $p \times 1$ random vector $Z_n \xrightarrow{d} Z$ if and only if for any nonzero $\lambda \in \mathbb{R}^p$ such that $\lambda'\lambda = \sum_{j=1}^{p}\lambda_j^2 = 1$, we have
$\lambda' Z_n \xrightarrow{d} \lambda' Z.$

Remark: This lemma is useful for obtaining asymptotic multivariate distributions.

Slutsky Theorem: Let $Z_n \xrightarrow{d} Z$, $a_n \xrightarrow{p} a$, and $b_n \xrightarrow{p} b$, where $a$ and $b$ are constants. Then
$a_n + b_n Z_n \xrightarrow{d} a + bZ$ as $n \to \infty$.

Question: If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$, is $X_n + Y_n \xrightarrow{d} X + Y$?

Answer: No. We consider two examples.

Example 1: $X_n$ and $Y_n$ are independent $N(0,1)$. Then $X_n + Y_n \xrightarrow{d} N(0,2)$.

Example 2: $X_n = Y_n \sim N(0,1)$ for all $n \ge 1$. Then $X_n + Y_n = 2X_n \sim N(0,4)$.

Lemma [Delta Method]: Suppose $\sqrt{n}(\bar{Z}_n - \mu)/\sigma \xrightarrow{d} N(0,1)$, and $g(\cdot)$ is continuously differentiable with $g'(\mu) \neq 0$. Then as $n \to \infty$,
$\sqrt{n}\left[g(\bar{Z}_n) - g(\mu)\right] \xrightarrow{d} N(0, [g'(\mu)]^2\sigma^2).$

Remark: The Delta method is a Taylor series approximation in a statistical context. It linearizes a smooth (i.e., differentiable) nonlinear statistic so that the CLT can be applied to the linearized statistic. Therefore, it can be viewed as a generalization of the CLT from a sample average to a nonlinear statistic. This method is very useful when more than one parameter makes up the function to be estimated and more than one random variable is used in the estimator.

Proof: First, because $\sqrt{n}(\bar{Z}_n - \mu)/\sigma \xrightarrow{d} N(0,1)$ implies $\sqrt{n}(\bar{Z}_n - \mu)/\sigma = O_P(1)$, we have $\bar{Z}_n - \mu = O_P(n^{-1/2}) = o_P(1)$. Next, by a first-order Taylor series expansion, we have
$Y_n = g(\bar{Z}_n) = g(\mu) + g'(\bar{\mu}_n)(\bar{Z}_n - \mu),$
where $\bar{\mu}_n = \lambda\mu + (1 - \lambda)\bar{Z}_n$ for some $\lambda \in [0,1]$. It follows by the Slutsky theorem that
$\sqrt{n}\left[g(\bar{Z}_n) - g(\mu)\right] = g'(\bar{\mu}_n)\sqrt{n}(\bar{Z}_n - \mu) \xrightarrow{d} N(0, [g'(\mu)]^2\sigma^2),$
where $g'(\bar{\mu}_n) \xrightarrow{p} g'(\mu)$ given $\bar{\mu}_n \xrightarrow{p} \mu$ and the continuity of $g'(\cdot)$.

Example: Suppose $\sqrt{n}(\bar{Z}_n - \mu)/\sigma \xrightarrow{d} N(0,1)$, $\mu \neq 0$, and $0 < \sigma < \infty$. Find the limiting distribution of $\sqrt{n}(\bar{Z}_n^{-1} - \mu^{-1})$.

Solution: Put $g(\bar{Z}_n) = \bar{Z}_n^{-1}$. Because $\mu \neq 0$, $g(\cdot)$ is continuous at $\mu$. By a first-order Taylor series expansion, we have
$g(\bar{Z}_n) = g(\mu) + g'(\bar{\mu}_n)(\bar{Z}_n - \mu),$
or
$\bar{Z}_n^{-1} - \mu^{-1} = -\bar{\mu}_n^{-2}(\bar{Z}_n - \mu),$

where $\bar{\mu}_n = \lambda\mu + (1 - \lambda)\bar{Z}_n \xrightarrow{p} \mu$ given $\bar{Z}_n \xrightarrow{p} \mu$ and $\lambda \in [0,1]$. It follows that
$\sqrt{n}(\bar{Z}_n^{-1} - \mu^{-1}) = -\bar{\mu}_n^{-2}\sqrt{n}(\bar{Z}_n - \mu) \xrightarrow{d} N(0, \sigma^2/\mu^4).$

We now use these asymptotic tools to investigate the large sample behavior of the OLS estimator and related statistics in subsequent chapters.
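Before turning to regression, the delta-method example just derived can be checked by Monte Carlo. The sketch below (hypothetical $\mu$ and $\sigma$, normal draws for simplicity) compares the simulated variance of $\sqrt{n}(\bar{Z}_n^{-1} - \mu^{-1})$ with the delta-method value $\sigma^2/\mu^4$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0                 # hypothetical values, with mu != 0
reps, n = 20_000, 500

Zbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
T = np.sqrt(n) * (1.0 / Zbar - 1.0 / mu)

print(f"simulated var of sqrt(n)(1/Zbar - 1/mu): {T.var():.5f}")
print(f"delta-method value sigma^2/mu^4        : {sigma**2 / mu**4:.5f}")
```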

Large Sample Theory for Linear Regression Models with I.I.D. Observations

4.1 Assumptions

We first state the assumptions under which we will establish the asymptotic theory for linear regression models.

Assumption 4.1 [I.I.D.]: $\{Y_t, X_t'\}_{t=1}^n$ is an i.i.d. random sample.

Assumption 4.2 [Linearity]: $Y_t = X_t'\beta^0 + \varepsilon_t$, $t = 1, \ldots, n$, for some unknown $K \times 1$ parameter $\beta^0$ and some unobservable random variable $\varepsilon_t$.

Assumption 4.3 [Correct Model Specification]: $E(\varepsilon_t \mid X_t) = 0$ a.s. with $E(\varepsilon_t^2) = \sigma^2 < \infty$.

Assumption 4.4 [Nonsingularity]: The $K \times K$ matrix
$Q = E(X_t X_t')$
is nonsingular and finite.

Assumption 4.5: The $K \times K$ matrix
$V \equiv \mathrm{var}(X_t\varepsilon_t) = E(X_t X_t'\varepsilon_t^2)$
is finite and positive definite (p.d.).

Remarks:

(i) The i.i.d. observations assumption implies that the asymptotic theory developed in this chapter will be applicable to cross-sectional data, but not to time series data. The observations of the latter are usually correlated and will be considered in Chapter 5. Put $Z_t = (Y_t, X_t')'$. Then i.i.d. means that $Z_t$ and $Z_s$ are independent when $t \neq s$, and that the $Z_t$ have the same distribution for all $t$. The identical distribution means that the observations are generated from the same data generating process, and independence means that different observations contain new information about the data generating process.

(ii) Assumptions 4.1 and 4.3 imply that the strict exogeneity condition (Assumption 3.2) holds, because we have
$E(\varepsilon_t \mid \mathbf{X}) = E(\varepsilon_t \mid X_1, X_2, \ldots, X_t, \ldots, X_n) = E(\varepsilon_t \mid X_t) = 0$ a.s.

(iii) As a most important feature of Assumptions 4.1-4.5 taken together, we do not assume normality for the conditional distribution of $\varepsilon_t \mid X_t$, and we allow for conditional heteroskedasticity (i.e., $\mathrm{var}(\varepsilon_t \mid X_t) \neq \sigma^2$ a.s.). Relaxing the normality assumption is more realistic for economic and financial data. For example, it has been well documented (Mandelbrot 1963, Fama 1965, Kon 1984) that returns on financial assets are not normally distributed. However, the i.i.d. assumption implies that $\mathrm{cov}(\varepsilon_t, \varepsilon_s) = 0$ for all $t \neq s$; that is, there is no serial correlation in the regression disturbance.

(iv) Among other things, Assumption 4.4 implies $E(X_{jt}^2) < \infty$ for $0 \le j \le k$. By the SLLN for i.i.d. random samples, we have
$\frac{\mathbf{X}'\mathbf{X}}{n} = \frac{1}{n}\sum_{t=1}^{n} X_t X_t' \xrightarrow{a.s.} E(X_t X_t') = Q$ as $n \to \infty$.
Hence, when $n$ is large, the matrix $\mathbf{X}'\mathbf{X}$ behaves approximately like $nQ$, whose minimum eigenvalue $\lambda_{\min}(nQ) = n\lambda_{\min}(Q) \to \infty$ at the rate of $n$. Assumption 4.4 implies Assumption 3.3.

(v) When $X_{0t} = 1$, Assumption 4.5 implies $E(\varepsilon_t^2) < \infty$.

(vi) If $E(\varepsilon_t^2 \mid X_t) = \sigma^2 < \infty$ a.s., i.e., there is conditional homoskedasticity, then Assumption 4.5 is ensured by Assumption 4.4. More generally, when there is conditional heteroskedasticity, the moment condition in Assumption 4.5 can be ensured by the moment conditions $E(\varepsilon_t^4) < \infty$ and $E(X_{jt}^4) < \infty$ for $0 \le j \le k$, because by using the Cauchy-Schwarz inequality twice, we have
$|E(\varepsilon_t^2 X_{jt}X_{lt})| \le \left[E(\varepsilon_t^4)\right]^{1/2}\left[E(X_{jt}^2 X_{lt}^2)\right]^{1/2} \le \left[E(\varepsilon_t^4)\right]^{1/2}\left[E(X_{jt}^4)E(X_{lt}^4)\right]^{1/4},$
where $0 \le j, l \le k$ and $1 \le t \le n$.

We now address the following questions:

- Consistency of OLS?

- Asymptotic normality?
- Asymptotic efficiency?
- Confidence interval estimation?
- Hypothesis testing?

In particular, we are interested in knowing whether the statistical properties of the OLS estimator $\hat{\beta}$ and related test statistics derived under the classical linear regression setup are still valid under the current setup, at least when $n$ is large.

4.2 Consistency of OLS

Suppose we have a random sample $\{Y_t, X_t'\}_{t=1}^n$. Recall the OLS estimator:
$\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1}\frac{\mathbf{X}'\mathbf{Y}}{n} = \hat{Q}^{-1}\,n^{-1}\sum_{t=1}^{n} X_t Y_t,$
where $\hat{Q} = n^{-1}\sum_{t=1}^{n} X_t X_t'$. Substituting $Y_t = X_t'\beta^0 + \varepsilon_t$, we obtain
$\hat{\beta} = \beta^0 + \hat{Q}^{-1}\,n^{-1}\sum_{t=1}^{n} X_t\varepsilon_t.$
We will consider the consistency of $\hat{\beta}$ directly.

Theorem [Consistency of OLS]: Under Assumptions 4.1-4.4, as $n \to \infty$, $\hat{\beta} \xrightarrow{p} \beta^0$, or $\hat{\beta} - \beta^0 = o_P(1)$.

Proof: Let $C > 0$ be some bounded constant. Also, recall $X_t = (X_{0t}, X_{1t}, \ldots, X_{kt})'$. First, the moment condition holds: for all $0 \le j \le k$,
$E|X_{jt}\varepsilon_t| \le (EX_{jt}^2)^{1/2}(E\varepsilon_t^2)^{1/2} \le C^{1/2}C^{1/2} = C$ by the Cauchy-Schwarz inequality,

where $E(X_{jt}^2) \le C$ by Assumption 4.4 and $E(\varepsilon_t^2) \le C$ by Assumption 4.3. It follows by the WLLN (with $Z_t = X_t\varepsilon_t$) that
$n^{-1}\sum_{t=1}^{n} X_t\varepsilon_t \xrightarrow{p} E(X_t\varepsilon_t) = 0,$
where
$E(X_t\varepsilon_t) = E[E(X_t\varepsilon_t \mid X_t)]$ (by the law of iterated expectations) $= E[X_t E(\varepsilon_t \mid X_t)] = E(X_t \cdot 0) = 0.$
Applying the WLLN again (with $Z_t = X_t X_t'$) and noting that
$E|X_{jt}X_{lt}| \le \left[E(X_{jt}^2)E(X_{lt}^2)\right]^{1/2} \le C$ by the Cauchy-Schwarz inequality
for all pairs $(j, l)$, where $0 \le j, l \le k$, we have
$\hat{Q} \xrightarrow{p} E(X_t X_t') = Q.$
Hence, we have $\hat{Q}^{-1} \xrightarrow{p} Q^{-1}$ by continuity. It follows that
$\hat{\beta} - \beta^0 = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon} = \hat{Q}^{-1}\,n^{-1}\sum_{t=1}^{n} X_t\varepsilon_t \xrightarrow{p} Q^{-1}\cdot 0 = 0.$
This completes the proof.

4.3 Asymptotic Normality of OLS

Next, we derive the asymptotic distribution of $\hat{\beta}$. We first provide a multivariate CLT for i.i.d. random samples.

Lemma [Multivariate Central Limit Theorem (CLT) for I.I.D. Random Samples]: Suppose $\{Z_t\}$ is a sequence of i.i.d. random vectors with $E(Z_t) = 0$ and $\mathrm{var}(Z_t) = E(Z_t Z_t') = V$ finite and positive definite. Define
$\bar{Z}_n = n^{-1}\sum_{t=1}^{n} Z_t.$

Then as $n \to \infty$,
$\sqrt{n}\,\bar{Z}_n \xrightarrow{d} N(0, V),$
or
$V^{-1/2}\sqrt{n}\,\bar{Z}_n \xrightarrow{d} N(0, I).$

Question: What is the variance-covariance matrix of $\sqrt{n}\,\bar{Z}_n$?

Answer: Noting that $E(Z_t) = 0$, we have
$\mathrm{var}(\sqrt{n}\,\bar{Z}_n) = \mathrm{var}\left(n^{-1/2}\sum_{t=1}^{n} Z_t\right) = E\left[\left(n^{-1/2}\sum_{t=1}^{n} Z_t\right)\left(n^{-1/2}\sum_{s=1}^{n} Z_s\right)'\right] = n^{-1}\sum_{t=1}^{n}\sum_{s=1}^{n} E(Z_t Z_s') = n^{-1}\sum_{t=1}^{n} E(Z_t Z_t') = E(Z_t Z_t') = V,$
because $Z_t$ and $Z_s$ are independent for $t \neq s$. In other words, the variance of $\sqrt{n}\,\bar{Z}_n$ is identical to the variance of each individual random vector $Z_t$.

Theorem [Asymptotic Normality of OLS]: Under Assumptions 4.1-4.5, we have
$\sqrt{n}(\hat{\beta} - \beta^0) \xrightarrow{d} N(0, Q^{-1}VQ^{-1})$ as $n \to \infty$,
where $V \equiv \mathrm{var}(X_t\varepsilon_t) = E(X_t X_t'\varepsilon_t^2)$.

Proof: Recall that
$\sqrt{n}(\hat{\beta} - \beta^0) = \hat{Q}^{-1}\,n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t.$
First, we consider the second term, $n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t$. Note that $E(X_t\varepsilon_t) = 0$ by Assumption 4.3, and $\mathrm{var}(X_t\varepsilon_t) = E(X_t X_t'\varepsilon_t^2) = V$, which is finite and p.d. by Assumption 4.5. Then, by the CLT for i.i.d. random sequences

(with $Z_t = X_t\varepsilon_t$), we have
$n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t = \sqrt{n}\left(n^{-1}\sum_{t=1}^{n} X_t\varepsilon_t\right) = \sqrt{n}\,\bar{Z}_n \xrightarrow{d} Z \sim N(0, V).$
On the other hand, as shown earlier, we have $\hat{Q} \xrightarrow{p} Q$, and so $\hat{Q}^{-1} \xrightarrow{p} Q^{-1}$, given that $Q$ is nonsingular so that the inverse function is continuous and well defined. It follows by the Slutsky theorem that
$\sqrt{n}(\hat{\beta} - \beta^0) = \hat{Q}^{-1}\,n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t \xrightarrow{d} Q^{-1}Z \sim N(0, Q^{-1}VQ^{-1}).$
This completes the proof.

Remarks: The theorem implies that:

(i) The asymptotic mean of $\sqrt{n}(\hat{\beta} - \beta^0)$ is 0. That is, the mean of $\sqrt{n}(\hat{\beta} - \beta^0)$ is approximately 0 when $n$ is large.

(ii) The asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^0)$ is $Q^{-1}VQ^{-1}$. That is, the variance of $\sqrt{n}(\hat{\beta} - \beta^0)$ is approximately $Q^{-1}VQ^{-1}$. Because the asymptotic variance is a different concept from the variance of $\sqrt{n}(\hat{\beta} - \beta^0)$, we denote the asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^0)$ as
$\mathrm{avar}(\sqrt{n}\,\hat{\beta}) = Q^{-1}VQ^{-1}.$

We now consider a special case under which we can simplify the expression of $\mathrm{avar}(\sqrt{n}\,\hat{\beta})$.

Special Case: Conditional Homoskedasticity

Assumption 4.6: $E(\varepsilon_t^2 \mid X_t) = \sigma^2$ a.s.

Theorem: Suppose Assumptions 4.1-4.6 hold. Then as $n \to \infty$,
$\sqrt{n}(\hat{\beta} - \beta^0) \xrightarrow{d} N(0, \sigma^2 Q^{-1}).$
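The asymptotic normality result can be checked numerically before turning to the proofs. A minimal Monte Carlo sketch with a hypothetical conditionally heteroskedastic design in which $Q$ and $V$ have closed forms, so the sandwich $Q^{-1}VQ^{-1}$ (rather than $\sigma^2 Q^{-1}$) applies:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0 = np.array([1.0, -0.5])          # hypothetical true parameter, K = 2
n, reps = 2_000, 3_000

draws = np.empty((reps, 2))
for r in range(reps):
    x1 = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1])
    eps = rng.normal(size=n) * np.sqrt(0.5 + x1**2)   # var(eps | X) = 0.5 + x1^2
    Y = X @ beta0 + eps
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    draws[r] = np.sqrt(n) * (b - beta0)

# For this design Q = E(X_t X_t') = I_2 and V = E[X_t X_t'(0.5 + x1^2)] = diag(1.5, 3.5),
# so avar = Q^{-1} V Q^{-1} = diag(1.5, 3.5).
print("Monte Carlo cov of sqrt(n)(b - beta0):\n", np.cov(draws, rowvar=False))
```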

Remark: Under conditional homoskedasticity, the asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^0)$ is $\mathrm{avar}(\sqrt{n}\,\hat{\beta}) = \sigma^2 Q^{-1}$.

Proof: Under Assumption 4.6, we can simplify
$V = E(X_t X_t'\varepsilon_t^2) = E[E(X_t X_t'\varepsilon_t^2 \mid X_t)]$ (by the law of iterated expectations) $= E[X_t X_t' E(\varepsilon_t^2 \mid X_t)] = \sigma^2 E(X_t X_t') = \sigma^2 Q.$
The result follows immediately because $Q^{-1}VQ^{-1} = Q^{-1}\sigma^2 QQ^{-1} = \sigma^2 Q^{-1}$.

Question: Is the OLS estimator $\hat{\beta}$ the BLUE estimator asymptotically (i.e., when $n \to \infty$)?

4.4 Asymptotic Variance Estimator

To construct confidence interval estimators or hypothesis tests, we need to estimate the asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^0)$, $\mathrm{avar}(\sqrt{n}\,\hat{\beta})$. Because the expression of $\mathrm{avar}(\sqrt{n}\,\hat{\beta})$ differs under conditional homoskedasticity and conditional heteroskedasticity, we consider the estimator for $\mathrm{avar}(\sqrt{n}\,\hat{\beta})$ in these two cases separately.

Case I: Conditional Homoskedasticity

In this case, the asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^0)$ is
$\mathrm{avar}(\sqrt{n}\,\hat{\beta}) = Q^{-1}VQ^{-1} = \sigma^2 Q^{-1}.$

Question: How can we estimate $Q$?

Lemma: Suppose Assumptions 4.1, 4.2 and 4.4 hold. Then
$\hat{Q} = n^{-1}\sum_{t=1}^{n} X_t X_t' \xrightarrow{p} Q.$

Question: How can we estimate $\sigma^2$?

Recalling that $\sigma^2 = E(\varepsilon_t^2)$, we use the sample residual variance estimator
$s^2 = \frac{e'e}{n-K} = \frac{1}{n-K}\sum_{t=1}^{n} e_t^2 = \frac{1}{n-K}\sum_{t=1}^{n}(Y_t - X_t'\hat{\beta})^2.$
Is $s^2$ consistent for $\sigma^2$?

Theorem [Consistent Estimator for $\sigma^2$]: Under Assumptions 4.1-4.4, $s^2 \xrightarrow{p} \sigma^2$.

Proof: Given that $s^2 = e'e/(n-K)$ and
$e_t = Y_t - X_t'\hat{\beta} = \varepsilon_t + X_t'\beta^0 - X_t'\hat{\beta} = \varepsilon_t - X_t'(\hat{\beta} - \beta^0),$
we have
$s^2 = \frac{1}{n-K}\sum_{t=1}^{n}\left[\varepsilon_t - X_t'(\hat{\beta} - \beta^0)\right]^2 = \frac{n}{n-K}\left[n^{-1}\sum_{t=1}^{n}\varepsilon_t^2 + (\hat{\beta} - \beta^0)'\left(n^{-1}\sum_{t=1}^{n} X_t X_t'\right)(\hat{\beta} - \beta^0) - 2(\hat{\beta} - \beta^0)'\,n^{-1}\sum_{t=1}^{n} X_t\varepsilon_t\right] \xrightarrow{p} \sigma^2,$
given that $K$ is a fixed number (i.e., $K$ does not grow with the sample size $n$), where we have used the WLLN in three places: $n^{-1}\sum_{t=1}^{n}\varepsilon_t^2 \xrightarrow{p} \sigma^2$, $n^{-1}\sum_{t=1}^{n} X_t X_t' \xrightarrow{p} Q$, and $n^{-1}\sum_{t=1}^{n} X_t\varepsilon_t \xrightarrow{p} 0$.

Remark: We can then consistently estimate $\sigma^2 Q^{-1}$ by $s^2\hat{Q}^{-1}$.

Theorem [Asymptotic Variance Estimator of $\sqrt{n}(\hat{\beta} - \beta^0)$]: Under Assumptions 4.1-4.4 and 4.6, we have
$s^2\hat{Q}^{-1} \xrightarrow{p} \sigma^2 Q^{-1}.$

Remark: The asymptotic variance estimator of $\sqrt{n}(\hat{\beta} - \beta^0)$ is $s^2\hat{Q}^{-1} = s^2(\mathbf{X}'\mathbf{X}/n)^{-1}$. This is equivalent to saying that the variance estimator of $\hat{\beta} - \beta^0$ is $s^2\hat{Q}^{-1}/n = s^2(\mathbf{X}'\mathbf{X})^{-1}$ when $n \to \infty$. Thus, when $n$ is large and there is conditional homoskedasticity, the variance estimator of $\hat{\beta} - \beta^0$ coincides with the form of the variance estimator for $\hat{\beta} - \beta^0$ in the classical regression case. Because of this, as will be seen below, the conventional t-test and F-test are still valid in large samples under conditional homoskedasticity.

Case II: Conditional Heteroskedasticity

In this case,
$\mathrm{avar}(\sqrt{n}\,\hat{\beta}) = Q^{-1}VQ^{-1},$
which cannot be simplified.

Question: We can still use $\hat{Q}$ to estimate $Q$. How can we estimate $V = E(X_t X_t'\varepsilon_t^2)$?

Answer: Use its sample analog
$\hat{V} = n^{-1}\sum_{t=1}^{n} X_t X_t' e_t^2 = \frac{\mathbf{X}'D(e)D(e)'\mathbf{X}}{n},$
where $D(e) = \mathrm{diag}(e_1, e_2, \ldots, e_n)$ is an $n \times n$ diagonal matrix with diagonal elements equal to $e_t$ for $t = 1, \ldots, n$. To ensure the consistency of $\hat{V}$ for $V$, we impose the following additional moment conditions.

Assumption 4.7: (i) $E(X_{jt}^4) < \infty$ for all $0 \le j \le k$, and (ii) $E(\varepsilon_t^4) < \infty$.

Lemma: Suppose Assumptions 4.1-4.5 and 4.7 hold. Then $\hat{V} \xrightarrow{p} V$.

Proof: Because $e_t = \varepsilon_t - (\hat{\beta} - \beta^0)'X_t$, we have
$\hat{V} = n^{-1}\sum_{t=1}^{n} X_t X_t'\varepsilon_t^2 + n^{-1}\sum_{t=1}^{n} X_t X_t'\left[(\hat{\beta} - \beta^0)'X_t X_t'(\hat{\beta} - \beta^0)\right] - 2n^{-1}\sum_{t=1}^{n} X_t X_t'\left[\varepsilon_t X_t'(\hat{\beta} - \beta^0)\right] \xrightarrow{p} V,$

where for the first term we have
$n^{-1}\sum_{t=1}^{n} X_t X_t'\varepsilon_t^2 \xrightarrow{p} E(X_t X_t'\varepsilon_t^2) = V$
by the WLLN and Assumption 4.7, which implies
$E|X_{it}X_{jt}\varepsilon_t^2| \le \left[E(X_{it}^2 X_{jt}^2)E(\varepsilon_t^4)\right]^{1/2} < \infty.$
For the second term, we have
$n^{-1}\sum_{t=1}^{n} X_{it}X_{jt}(\hat{\beta} - \beta^0)'X_t X_t'(\hat{\beta} - \beta^0) = \sum_{l=0}^{k}\sum_{m=0}^{k}(\hat{\beta}_l - \beta_l^0)(\hat{\beta}_m - \beta_m^0)\,n^{-1}\sum_{t=1}^{n} X_{it}X_{jt}X_{lt}X_{mt} \xrightarrow{p} 0,$
given $\hat{\beta} - \beta^0 \xrightarrow{p} 0$ and $n^{-1}\sum_{t=1}^{n} X_{it}X_{jt}X_{lt}X_{mt} \xrightarrow{p} E(X_{it}X_{jt}X_{lt}X_{mt}) = O(1)$ by the WLLN and Assumption 4.7. Similarly, for the last term, we have
$n^{-1}\sum_{t=1}^{n} X_{it}X_{jt}\varepsilon_t X_t'(\hat{\beta} - \beta^0) = \sum_{l=0}^{k}(\hat{\beta}_l - \beta_l^0)\,n^{-1}\sum_{t=1}^{n} X_{it}X_{jt}X_{lt}\varepsilon_t \xrightarrow{p} 0,$
given $\hat{\beta} - \beta^0 \xrightarrow{p} 0$ and $n^{-1}\sum_{t=1}^{n} X_{it}X_{jt}X_{lt}\varepsilon_t \xrightarrow{p} E(X_{it}X_{jt}X_{lt}\varepsilon_t) = 0$ by the WLLN and Assumption 4.7. This completes the proof.

We now construct a consistent estimator for $\mathrm{avar}(\sqrt{n}\,\hat{\beta})$ under conditional heteroskedasticity.

Theorem [Asymptotic Variance Estimator for $\sqrt{n}(\hat{\beta} - \beta^0)$]: Under Assumptions 4.1-4.5 and 4.7, we have
$\hat{Q}^{-1}\hat{V}\hat{Q}^{-1} \xrightarrow{p} Q^{-1}VQ^{-1}.$
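This estimator is easy to compute in matrix form. A minimal sketch (the function name and the data-generating design are hypothetical, not part of the notes):

```python
import numpy as np

def white_vcov(X, e):
    """Heteroskedasticity-consistent covariance estimate of beta_hat in the
    spirit of White (1980): (X'X)^{-1} X'D(e)D(e)'X (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = (X * (e ** 2)[:, None]).T @ X       # X'D(e)D(e)'X = sum_t e_t^2 X_t X_t'
    return XtX_inv @ meat @ XtX_inv

# usage with a hypothetical heteroskedastic design
rng = np.random.default_rng(0)
n = 5_000
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
Y = X @ np.array([1.0, -0.5]) + rng.normal(size=n) * np.sqrt(0.5 + x1**2)
b = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ b
print("White standard errors:", np.sqrt(np.diag(white_vcov(X, e))))
```

The resulting standard errors remain valid whether or not there is conditional heteroskedasticity, in contrast to those from $s^2(\mathbf{X}'\mathbf{X})^{-1}$.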

Remark: This is the so-called White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator for $\sqrt{n}(\hat{\beta} - \beta^0)$. It follows that when there is conditional heteroskedasticity, the estimator for the variance of $\hat{\beta} - \beta^0$ is
$(\mathbf{X}'\mathbf{X}/n)^{-1}\hat{V}(\mathbf{X}'\mathbf{X}/n)^{-1}/n = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'D(e)D(e)'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1},$
which differs from the estimator $s^2(\mathbf{X}'\mathbf{X})^{-1}$ used in the case of conditional homoskedasticity.

Question: What happens if we use $s^2\hat{Q}^{-1}$ as an estimator for $\mathrm{avar}[\sqrt{n}(\hat{\beta} - \beta^0)]$ when there is conditional heteroskedasticity?

Answer: Observe that
$V \equiv E(X_t X_t'\varepsilon_t^2) = \sigma^2 Q + \mathrm{cov}(X_t X_t', \varepsilon_t^2) = \sigma^2 Q + \mathrm{cov}[X_t X_t', \sigma^2(X_t)],$
where $\sigma^2 = E(\varepsilon_t^2)$, $\sigma^2(X_t) = E(\varepsilon_t^2 \mid X_t)$, and the last equality follows from the law of iterated expectations. Thus, if $\sigma^2(X_t)$ is positively correlated with $X_t X_t'$, then $\sigma^2 Q$ will underestimate the true variance-covariance matrix $E(X_t X_t'\varepsilon_t^2)$, in the sense that $V - \sigma^2 Q$ is a positive definite matrix. Consequently, the standard t-test and F-test will overreject the correct null hypothesis at any given significance level; there will be substantial Type I errors.

Question: What happens if one uses the asymptotic variance estimator $\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}$ when there is in fact conditional homoskedasticity?

Answer: The asymptotic variance estimator remains asymptotically valid, but it will not perform as well as the estimator $s^2\hat{Q}^{-1}$ in finite samples, because the latter exploits the information of conditional homoskedasticity.

4.5 Hypothesis Testing

Question: How can we construct a test statistic for the null hypothesis
$H_0: R\beta^0 = r,$
where $R$ is a $J \times K$ constant matrix and $r$ is a $J \times 1$ constant vector?

The test procedures differ depending on whether there is conditional heteroskedasticity. We first consider the case of conditional homoskedasticity.

Case I: Conditional Homoskedasticity

Theorem [Asymptotic $\chi^2$ Test under $H_0$]: Suppose Assumptions 4.1-4.4 and 4.6 hold. Then under $H_0$,
$J \cdot F = (R\hat{\beta} - r)'\left[s^2 R(\mathbf{X}'\mathbf{X})^{-1}R'\right]^{-1}(R\hat{\beta} - r) \xrightarrow{d} \chi_J^2$ as $n \to \infty$.

Proof: Consider $R\hat{\beta} - r = R(\hat{\beta} - \beta^0) + (R\beta^0 - r)$. Under conditional homoskedasticity, we have
$\sqrt{n}(\hat{\beta} - \beta^0) \xrightarrow{d} N(0, \sigma^2 Q^{-1}).$
Thus, under $H_0: R\beta^0 = r$, we have
$\sqrt{n}(R\hat{\beta} - r) = R\sqrt{n}(\hat{\beta} - \beta^0) \xrightarrow{d} N(0, \sigma^2 RQ^{-1}R').$
It follows that the quadratic form
$\sqrt{n}(R\hat{\beta} - r)'\left[\sigma^2 RQ^{-1}R'\right]^{-1}\sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} \chi_J^2.$
Also, because $s^2\hat{Q}^{-1} \xrightarrow{p} \sigma^2 Q^{-1}$, we have by the Slutsky theorem
$\sqrt{n}(R\hat{\beta} - r)'\left[s^2 R\hat{Q}^{-1}R'\right]^{-1}\sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} \chi_J^2,$
namely
$J \cdot F = J \cdot \frac{(R\hat{\beta} - r)'[R(\mathbf{X}'\mathbf{X})^{-1}R']^{-1}(R\hat{\beta} - r)/J}{s^2} \xrightarrow{d} \chi_J^2.$

Remarks: When $\{\varepsilon_t\}$ is not i.i.d. $N(0, \sigma^2)$ conditional on $X_t$, we cannot use the $F$ distribution, but we can still compute the F-statistic, and the appropriate test statistic is $J$ times the F-statistic, which is asymptotically $\chi_J^2$. That is,
$J \cdot F = \frac{\tilde{e}'\tilde{e} - e'e}{e'e/(n-K)} \xrightarrow{d} \chi_J^2,$
where $\tilde{e}$ denotes the restricted residual vector.

An Alternative Interpretation: Because $J \cdot F_{J, n-K}$ approaches $\chi_J^2$ as $n \to \infty$, we may interpret the above theorem in the following way.

That is, the classical results for the F-test are still approximately valid under conditional homoskedasticity when $n$ is large.

When the null hypothesis is that all slope coefficients except the intercept are jointly zero, we can use a test statistic based on $R^2$.

A Special Case: Testing for Joint Significance of All Economic Variables

Theorem [$(n-K)R^2$ Test]: Suppose Assumptions 4.1-4.4 and 4.6 hold, and we are interested in testing the null hypothesis
$H_0: \beta_1^0 = \beta_2^0 = \cdots = \beta_k^0 = 0,$
where the $\beta_j^0$ are the regression coefficients from
$Y_t = \beta_0^0 + \beta_1^0 X_{1t} + \cdots + \beta_k^0 X_{kt} + \varepsilon_t.$
Let $R^2$ be the coefficient of determination from the unrestricted regression model $Y_t = X_t'\beta + \varepsilon_t$. Then under $H_0$,
$(n-K)R^2 \xrightarrow{d} \chi_k^2,$
where $K = k + 1$.

Proof: First, recall that in this special case we have
$F = \frac{R^2/k}{(1-R^2)/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-K)}.$
By the above theorem and noting $J = k$, we have
$k \cdot F = \frac{(n-K)R^2}{1-R^2} \xrightarrow{d} \chi_k^2$ under $H_0$.
This implies that $k \cdot F$ is bounded in probability; that is,
$\frac{(n-K)R^2}{1-R^2} = O_P(1).$
Consequently, given that $k$ is a fixed integer,
$\frac{R^2}{1-R^2} = O_P(n^{-1}) = o_P(1),$

or $R^2 \xrightarrow{p} 0$. Therefore, $1 - R^2 \xrightarrow{p} 1$. By the Slutsky theorem, we have
$(n-K)R^2 = \frac{(n-K)R^2}{1-R^2}(1-R^2) = (k \cdot F)(1-R^2) \xrightarrow{d} \chi_k^2,$
or asymptotically equivalently, $(n-K)R^2 \xrightarrow{d} \chi_k^2$. This completes the proof.

Question: Do we have $nR^2 \xrightarrow{d} \chi_k^2$?

Answer: Yes, because
$nR^2 = \frac{n}{n-K}(n-K)R^2$ and $\frac{n}{n-K} \to 1.$

Case II: Conditional Heteroskedasticity

Recall that under $H_0$,
$\sqrt{n}(R\hat{\beta} - r) = R\sqrt{n}(\hat{\beta} - \beta^0) + \sqrt{n}(R\beta^0 - r) = R\sqrt{n}(\hat{\beta} - \beta^0) \xrightarrow{d} N(0, RQ^{-1}VQ^{-1}R'),$
where $V = E(X_t X_t'\varepsilon_t^2)$. It follows that the robust (infeasible) Wald test statistic
$W = \sqrt{n}(R\hat{\beta} - r)'\left[RQ^{-1}VQ^{-1}R'\right]^{-1}\sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} \chi_J^2.$
Because $\hat{Q} \xrightarrow{p} Q$ and $\hat{V} \xrightarrow{p} V$, we have the feasible Wald test statistic
$W = \sqrt{n}(R\hat{\beta} - r)'\left[R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R'\right]^{-1}\sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} \chi_J^2$

by the Slutsky theorem. By robustness, we mean that $W$ is valid regardless of whether there is conditional heteroskedasticity. We can write $W$ equivalently as
$W = (R\hat{\beta} - r)'\left[R(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'D(e)D(e)'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}R'\right]^{-1}(R\hat{\beta} - r),$
where we have used the fact that
$\hat{V} = \frac{1}{n}\sum_{t=1}^{n} X_t e_t^2 X_t' = \frac{\mathbf{X}'D(e)D(e)'\mathbf{X}}{n},$
where $D(e) = \mathrm{diag}(e_1, e_2, \ldots, e_n)$.

Remarks:

(i) Under conditional heteroskedasticity, the test statistics $J \cdot F$ and $(n-K)R^2$ cannot be used.

Question: What happens if there is conditional heteroskedasticity but $J \cdot F$ or $(n-K)R^2$ is used? There will be Type I errors.

(ii) Although the general form of the Wald test statistic developed here can be used regardless of whether there is conditional homoskedasticity, this general form may perform poorly in small samples. Thus, if one has information that the error term is conditionally homoskedastic, one should use the test statistics derived under conditional homoskedasticity, which will perform better in small samples. For this reason, it is important to test whether conditional homoskedasticity holds.

4.6 Testing Conditional Homoskedasticity

We now introduce a method to test conditional heteroskedasticity.

Question: How can we test conditional homoskedasticity for $\{\varepsilon_t\}$ in a linear regression model?

There have been many tests for conditional homoskedasticity. Here, we introduce a popular one due to White (1980).

White's (1980) Test

The null hypothesis is
$H_0: E(\varepsilon_t^2 \mid X_t) = E(\varepsilon_t^2),$

where $\varepsilon_t$ is the regression error in the linear regression model $Y_t = X_t'\beta^0 + \varepsilon_t$.

First, suppose $\varepsilon_t$ were observed, and consider the auxiliary regression
$\varepsilon_t^2 = \gamma_0 + \sum_{j=1}^{k}\gamma_j X_{jt} + \sum_{1 \le j \le l \le k}\gamma_{jl}X_{jt}X_{lt} + v_t = \gamma'\,\mathrm{vech}(X_t X_t') + v_t = \gamma' U_t + v_t,$
where $\mathrm{vech}(X_t X_t')$ is an operator that stacks all lower-triangular elements of the matrix $X_t X_t'$ into a $\frac{K(K+1)}{2} \times 1$ column vector. There are a total of $\frac{K(K+1)}{2}$ regressors in the auxiliary regression. This is essentially regressing $\varepsilon_t^2$ on the intercept, $X_t$, and the quadratic and cross-product terms of $X_t$. Under $H_0$, all coefficients except the intercept are jointly zero. Any nonzero coefficient would indicate the existence of conditional heteroskedasticity. Thus, we can test $H_0$ by checking whether all coefficients except the intercept are jointly zero.

Assuming that $E(\varepsilon_t^4 \mid X_t) = \mu_4$ is a constant (which implies $E(v_t^2 \mid X_t) = \sigma_v^2$ under $H_0$), we can run an OLS regression and construct an $R^2$-based test statistic. Under $H_0$, we obtain
$(n - J - 1)\tilde{R}^2 \xrightarrow{d} \chi_J^2,$
where $J = \frac{K(K+1)}{2} - 1$ is the number of regressors excluding the intercept.

Unfortunately, $\varepsilon_t$ is not observable. However, we can replace $\varepsilon_t$ with $e_t = Y_t - X_t'\hat{\beta}$ and run the following feasible auxiliary regression:
$e_t^2 = \gamma_0 + \sum_{j=1}^{k}\gamma_j X_{jt} + \sum_{1 \le j \le l \le k}\gamma_{jl}X_{jt}X_{lt} + \tilde{v}_t = \gamma'\,\mathrm{vech}(X_t X_t') + \tilde{v}_t,$
and the resulting test statistic
$(n - J - 1)R^2 \xrightarrow{d} \chi_J^2.$
It can be shown that the replacement of $\varepsilon_t^2$ by $e_t^2$ has no impact on the asymptotic $\chi_J^2$ distribution of $(n - J - 1)R^2$. The proof, however, is rather tedious; for the details, see White (1980). Below, we provide some intuition.

Question: Why does the use of $e_t^2$ in place of $\varepsilon_t^2$ have no impact on the asymptotic distribution of $(n - J - 1)R^2$?

Answer:

Put $U_t = \mathrm{vech}(X_t X_t')$. Then the infeasible auxiliary regression is $\varepsilon_t^2 = U_t'\gamma + v_t$. We have
$\sqrt{n}(\tilde{\gamma} - \gamma) \xrightarrow{d} N(0, \sigma_v^2 Q_{uu}^{-1}),$
and under $H_0: R\gamma = r$, we have
$\sqrt{n}(R\tilde{\gamma} - r) \xrightarrow{d} N(0, \sigma_v^2 RQ_{uu}^{-1}R'),$
where $\tilde{\gamma}$ is the OLS estimator and $\sigma_v^2 = E(v_t^2)$. This implies $R\tilde{\gamma} - r = O_P(n^{-1/2})$, which vanishes to zero in probability at rate $n^{-1/2}$. It is this term that yields the asymptotic $\chi_J^2$ distribution for $(n - J - 1)\tilde{R}^2$, which is asymptotically equivalent to the test statistic
$\sqrt{n}(R\tilde{\gamma} - r)'\left[s_v^2 R\hat{Q}_{uu}^{-1}R'\right]^{-1}\sqrt{n}(R\tilde{\gamma} - r).$

Now suppose we replace $\varepsilon_t^2$ with $e_t^2$ and consider the auxiliary regression $e_t^2 = U_t'\gamma + \tilde{v}_t$. Denote the OLS estimator by $\hat{\gamma}$. We decompose
$e_t^2 = \left[\varepsilon_t - X_t'(\hat{\beta} - \beta^0)\right]^2 = \varepsilon_t^2 + (\hat{\beta} - \beta^0)'X_t X_t'(\hat{\beta} - \beta^0) - 2(\hat{\beta} - \beta^0)'X_t\varepsilon_t.$
Thus, $\hat{\gamma}$ can be written as (with $\hat{\delta}$ and $\hat{\eta}$ denoting the two correction terms)
$\hat{\gamma} = \tilde{\gamma} + \hat{\delta} + \hat{\eta},$
where $\tilde{\gamma}$ is the OLS estimator from the infeasible auxiliary regression, $\hat{\delta}$ is the effect of the second term, and $\hat{\eta}$ is the effect of the third term.

For the third term, $X_t\varepsilon_t$ is uncorrelated with $U_t$ given $E(\varepsilon_t \mid X_t) = 0$. Therefore this term, after being scaled by the factor $\hat{\beta} - \beta^0$ that vanishes to zero in probability at rate $n^{-1/2}$, will vanish to zero in probability at rate $n^{-1}$; that is, $\hat{\eta} = O_P(n^{-1})$. This is expected to have negligible impact on the asymptotic distribution of the test statistic.

For the second term, $X_t X_t'$ is perfectly correlated with $U_t$. However, it is scaled by a factor of $\|\hat{\beta} - \beta^0\|^2$ rather than by $\|\hat{\beta} - \beta^0\|$ only. As a consequence, the regression coefficient of $(\hat{\beta} - \beta^0)'X_t X_t'(\hat{\beta} - \beta^0)$ on $U_t$ will also vanish to zero at rate $n^{-1}$; that is, $\hat{\delta} = O_P(n^{-1})$. Therefore, it also has negligible impact on the asymptotic distribution of $(n - J - 1)R^2$.
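The feasible test is easy to implement by OLS on the auxiliary regression. A minimal sketch, using the asymptotically equivalent $nR^2$ form of the statistic (the function name is hypothetical, and collinear product columns, e.g. from dummy regressors, would require pruning):

```python
import numpy as np
from scipy import stats

def white_test(X, e):
    """White (1980) test for conditional heteroskedasticity: regress e_t^2 on
    U_t = vech(X_t X_t') and use n*R^2, asymptotically chi2_J under H0.
    Assumes the first column of X is the intercept."""
    n, K = X.shape
    # U_t: all products X_jt X_lt with j <= l; the (0,0) product is the intercept
    U = np.column_stack([X[:, j] * X[:, l] for j in range(K) for l in range(j, K)])
    gamma, *_ = np.linalg.lstsq(U, e**2, rcond=None)
    resid = e**2 - U @ gamma
    R2 = 1.0 - resid.var() / (e**2).var()
    J = U.shape[1] - 1                     # number of regressors excluding the intercept
    stat = n * R2
    return stat, 1.0 - stats.chi2.cdf(stat, df=J)
```

Applied to residuals from a design like the heteroskedastic sketch earlier, e.g. `white_test(X, e)`, the statistic would be expected to reject $H_0$, since the error variance there depends on the regressors.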

Question: How can we test conditional homoskedasticity if $E(\varepsilon_t^4 \mid X_t)$ is not a constant (i.e., $E(\varepsilon_t^4 \mid X_t) \neq \mu_4$ for any constant $\mu_4$ under $H_0$)? This corresponds to the case where $v_t$ displays conditional heteroskedasticity.

Hint: Consider a robust Wald test statistic. Please try it.

4.7 Empirical Applications

4.8 Summary and Conclusion

In this chapter, within the context of i.i.d. observations, we have relaxed some key assumptions of the classical linear regression model. In particular, we do not assume conditional normality for $\varepsilon_t$, and we allow for conditional heteroskedasticity. Because the exact finite sample distribution of the OLS estimator is generally unknown, we have relied on asymptotic analysis. It is found that in large samples, the results for the OLS estimator $\hat{\beta}$ and related test statistics (e.g., the t-test and F-test statistics) are still applicable under conditional homoskedasticity. Under conditional heteroskedasticity, however, the statistical properties of $\hat{\beta}$ differ from those under conditional homoskedasticity, and as a consequence, the conventional t-test and F-test are invalid even when the sample size $n \to \infty$. One has to use White's (1980) heteroskedasticity-consistent variance estimator for the OLS estimator $\hat{\beta}$ and use it to construct robust test statistics. A direct test for conditional heteroskedasticity, due to White (1980), is described.

References

Hong, Y. (2006), Probability Theory and Statistics for Economists.
White, H. (1980), "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica 48, 817-838.
White, H. (1999), Asymptotic Theory for Econometricians, 2nd Edition, Academic Press.
Hayashi, F. (2000), Econometrics, Princeton University Press, Ch. 2.

ADVANCED ECONOMETRICS PROBLEM SET

1. Suppose a stochastic process $\{Y_t, X_t'\}_{t=1}^n$ satisfies the following assumptions:

Assumption 1.1 [Linearity]: $\{Y_t, X_t'\}_{t=1}^n$ is an i.i.d. process with $Y_t = X_t'\beta^0 + \varepsilon_t$, $t = 1, \ldots, n$, for some unknown parameter $\beta^0$ and some unobservable disturbance $\varepsilon_t$;

Assumption 1.2 [Nonsingularity]: The $K \times K$ matrix $E(X_t X_t') = Q$ is nonsingular and finite;

Assumption 1.3 [Conditional Heteroskedasticity]: (i) $E(X_t\varepsilon_t) = 0$; (ii) $E(\varepsilon_t^2 \mid X_t) \neq \sigma^2$; (iii) $E(X_{jt}^4) \le C$ for all $0 \le j \le k$, and $E(\varepsilon_t^4) \le C$ for some $C < \infty$.

(a) Show that $\hat{\beta} \xrightarrow{p} \beta^0$.

(b) Show that $\sqrt{n}(\hat{\beta} - \beta^0) \xrightarrow{d} N(0, \Omega)$, where $\Omega = Q^{-1}VQ^{-1}$ and $V = E(X_t X_t'\varepsilon_t^2)$.

(c) Show that the asymptotic variance estimator $\hat{\Omega} = \hat{Q}^{-1}\hat{V}\hat{Q}^{-1} \xrightarrow{p} \Omega$, where $\hat{Q} = n^{-1}\sum_{t=1}^{n} X_t X_t'$ and $\hat{V} = n^{-1}\sum_{t=1}^{n} X_t X_t' e_t^2$. This is called White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator.

(d) Consider a test for the hypothesis $H_0: R\beta^0 = r$. Do we have $J \cdot F \xrightarrow{d} \chi_J^2$, where
$F = \frac{(R\hat{\beta} - r)'[R(\mathbf{X}'\mathbf{X})^{-1}R']^{-1}(R\hat{\beta} - r)/J}{s^2}$
is the usual F-test statistic? If it holds, give the reasoning. If it does not, could you provide an alternative test statistic that converges in distribution to $\chi_J^2$?

2. Suppose the following assumptions hold:

Assumption 2.1: $\{Y_t, X_t'\}_{t=1}^n$ is an i.i.d. random sample with $Y_t = X_t'\beta^0 + \varepsilon_t$ for some unknown parameter $\beta^0$ and unobservable random disturbance $\varepsilon_t$;

Assumption 2.2: $E(\varepsilon_t \mid X_t) = 0$ a.s.

Assumption 2.3: (i) $W_t = W(X_t)$ is a positive function of $X_t$; (ii) the $K \times K$ matrix $E(X_t W_t X_t') = Q_w$ is finite and nonsingular; (iii) $E(W_t^8) \le C < \infty$, $E(X_{jt}^8) \le C < \infty$ for all $0 \le j \le k$, and $E(\varepsilon_t^4) \le C$;

Assumption 2.4: $V_w = E(X_t W_t^2 X_t'\varepsilon_t^2)$ is finite and nonsingular.

We consider the so-called weighted least squares (WLS) estimator for $\beta^0$:
$\hat{\beta}_w = \left(n^{-1}\sum_{t=1}^{n} X_t W_t X_t'\right)^{-1} n^{-1}\sum_{t=1}^{n} X_t W_t Y_t.$

(a) Show that $\hat{\beta}_w$ is the solution to the following problem:
$\min_{\beta}\sum_{t=1}^{n} W_t(Y_t - X_t'\beta)^2.$

(b) Show that $\hat{\beta}_w$ is consistent for $\beta^0$.

(c) Show that $\sqrt{n}(\hat{\beta}_w - \beta^0) \xrightarrow{d} N(0, \Omega_w)$ for some $K \times K$ finite and positive definite matrix $\Omega_w$. Obtain the expression of $\Omega_w$ under (i) conditional homoskedasticity $E(\varepsilon_t^2 \mid X_t) = \sigma^2$ a.s. and (ii) conditional heteroskedasticity $E(\varepsilon_t^2 \mid X_t) \neq \sigma^2$.

(d) Propose an estimator $\hat{\Omega}_w$ for $\Omega_w$, and show that $\hat{\Omega}_w$ is consistent for $\Omega_w$ under conditional homoskedasticity and conditional heteroskedasticity, respectively.

(e) Construct a test statistic for $H_0: R\beta^0 = r$, where $R$ is a $J \times K$ matrix and $r$ is a $J \times 1$ vector, under conditional homoskedasticity and under conditional heteroskedasticity, respectively. Derive the asymptotic distribution of the test statistic under $H_0$ in each case.

(f) Suppose $E(\varepsilon_t^2 \mid X_t) = \sigma^2(X_t)$ is known, and we set $W_t = \sigma^{-2}(X_t)$. Construct a test statistic for $H_0: R\beta^0 = r$, where $R$ is a $J \times K$ matrix and $r$ is a $J \times 1$ vector. Derive the asymptotic distribution of the test statistic under $H_0$.

3. Consider the problem of testing conditional homoskedasticity ($H_0: E(\varepsilon_t^2 \mid X_t) = \sigma^2$) for a linear regression model $Y_t = X_t'\beta^0 + \varepsilon_t$, where $X_t$ is a $K \times 1$ vector consisting of an intercept and explanatory variables. To test conditional homoskedasticity, we consider the auxiliary regression
$\varepsilon_t^2 = \gamma'\,\mathrm{vech}(X_t X_t') + v_t = \gamma' U_t + v_t.$

Suppose Assumptions 4.1, 4.2, 4.3, 4.4 and 4.7 hold, and $E(\varepsilon_t^4 \mid X_t) \neq \mu_4$ for any constant $\mu_4$; that is, $E(\varepsilon_t^4 \mid X_t)$ is a function of $X_t$.

(a) Show that $\mathrm{var}(v_t \mid X_t) \neq \sigma_v^2$ under $H_0$; that is, the disturbance $v_t$ in the auxiliary regression model displays conditional heteroskedasticity.

(b) Suppose $\varepsilon_t$ is directly observable. Construct an asymptotically valid test for the null hypothesis $H_0$ of conditional homoskedasticity. Justify your reasoning and test statistic.


More information

Conditional Investment-Cash Flow Sensitivities and Financing Constraints

Conditional Investment-Cash Flow Sensitivities and Financing Constraints WORING PAPERS IN ECONOMICS No 448 Conditional Investment-Cash Flow Sensitivities and Financing Constraints Stephen R. Bond and Måns Söderbom May 2010 ISSN 1403-2473 (print) ISSN 1403-2465 (online) Department

More information

Identification and inference in a simultaneous equation under alternative information sets and sampling schemes

Identification and inference in a simultaneous equation under alternative information sets and sampling schemes Discussion Paper: 2/2 Identification and inference in a simultaneous equation under alternative information sets and sampling schemes Jan F. Kiviet www.feb.uva.nl/ke/uva-econometrics Amsterdam School of

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

Common sense, and the model that we have used, suggest that an increase in p means a decrease in demand, but this is not the only possibility.

Common sense, and the model that we have used, suggest that an increase in p means a decrease in demand, but this is not the only possibility. Lecture 6: Income and Substitution E ects c 2009 Je rey A. Miron Outline 1. Introduction 2. The Substitution E ect 3. The Income E ect 4. The Sign of the Substitution E ect 5. The Total Change in Demand

More information

y t by left multiplication with 1 (L) as y t = 1 (L) t =ª(L) t 2.5 Variance decomposition and innovation accounting Consider the VAR(p) model where

y t by left multiplication with 1 (L) as y t = 1 (L) t =ª(L) t 2.5 Variance decomposition and innovation accounting Consider the VAR(p) model where . Variance decomposition and innovation accounting Consider the VAR(p) model where (L)y t = t, (L) =I m L L p L p is the lag polynomial of order p with m m coe±cient matrices i, i =,...p. Provided that

More information

FIXED EFFECTS AND RELATED ESTIMATORS FOR CORRELATED RANDOM COEFFICIENT AND TREATMENT EFFECT PANEL DATA MODELS

FIXED EFFECTS AND RELATED ESTIMATORS FOR CORRELATED RANDOM COEFFICIENT AND TREATMENT EFFECT PANEL DATA MODELS FIXED EFFECTS AND RELATED ESTIMATORS FOR CORRELATED RANDOM COEFFICIENT AND TREATMENT EFFECT PANEL DATA MODELS Jeffrey M. Wooldridge Department of Economics Michigan State University East Lansing, MI 48824-1038

More information

Herding, Contrarianism and Delay in Financial Market Trading

Herding, Contrarianism and Delay in Financial Market Trading Herding, Contrarianism and Delay in Financial Market Trading A Lab Experiment Andreas Park & Daniel Sgroi University of Toronto & University of Warwick Two Restaurants E cient Prices Classic Herding Example:

More information

Notes on Factoring. MA 206 Kurt Bryan

Notes on Factoring. MA 206 Kurt Bryan The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

More information

A degrees-of-freedom approximation for a t-statistic with heterogeneous variance

A degrees-of-freedom approximation for a t-statistic with heterogeneous variance The Statistician (1999) 48, Part 4, pp. 495±506 A degrees-of-freedom approximation for a t-statistic with heterogeneous variance Stuart R. Lipsitz and Joseph G. Ibrahim Harvard School of Public Health

More information

Factor analysis. Angela Montanari

Factor analysis. Angela Montanari Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number

More information

Notes on Determinant

Notes on Determinant ENGG2012B Advanced Engineering Mathematics Notes on Determinant Lecturer: Kenneth Shum Lecture 9-18/02/2013 The determinant of a system of linear equations determines whether the solution is unique, without

More information

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2 University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages

More information

x a x 2 (1 + x 2 ) n.

x a x 2 (1 + x 2 ) n. Limits and continuity Suppose that we have a function f : R R. Let a R. We say that f(x) tends to the limit l as x tends to a; lim f(x) = l ; x a if, given any real number ɛ > 0, there exists a real number

More information

Capturing Cross-Sectional Correlation with Time Series: with an Application to Unit Root Test

Capturing Cross-Sectional Correlation with Time Series: with an Application to Unit Root Test Capturing Cross-Sectional Correlation with Time Series: with an Application to Unit Root Test Chor-yiu SIN Wang Yanan Institute of Studies in Economics (WISE) Xiamen University, Xiamen, Fujian, P.R.China,

More information

THE CENTRAL LIMIT THEOREM TORONTO

THE CENTRAL LIMIT THEOREM TORONTO THE CENTRAL LIMIT THEOREM DANIEL RÜDT UNIVERSITY OF TORONTO MARCH, 2010 Contents 1 Introduction 1 2 Mathematical Background 3 3 The Central Limit Theorem 4 4 Examples 4 4.1 Roulette......................................

More information

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d). 1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction

More information

Chapter 7. BANDIT PROBLEMS.

Chapter 7. BANDIT PROBLEMS. Chapter 7. BANDIT PROBLEMS. Bandit problems are problems in the area of sequential selection of experiments, and they are related to stopping rule problems through the theorem of Gittins and Jones (974).

More information

On Marginal Effects in Semiparametric Censored Regression Models

On Marginal Effects in Semiparametric Censored Regression Models On Marginal Effects in Semiparametric Censored Regression Models Bo E. Honoré September 3, 2008 Introduction It is often argued that estimation of semiparametric censored regression models such as the

More information

Chapter 6: Multivariate Cointegration Analysis

Chapter 6: Multivariate Cointegration Analysis Chapter 6: Multivariate Cointegration Analysis 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie VI. Multivariate Cointegration

More information

Chapter 1. Linear Panel Models and Heterogeneity

Chapter 1. Linear Panel Models and Heterogeneity Chapter 1. Linear Panel Models and Heterogeneity Master of Science in Economics - University of Geneva Christophe Hurlin, Université d Orléans Université d Orléans January 2010 Introduction Speci cation

More information

Metric Spaces. Chapter 7. 7.1. Metrics

Metric Spaces. Chapter 7. 7.1. Metrics Chapter 7 Metric Spaces A metric space is a set X that has a notion of the distance d(x, y) between every pair of points x, y X. The purpose of this chapter is to introduce metric spaces and give some

More information

Elements of probability theory

Elements of probability theory 2 Elements of probability theory Probability theory provides mathematical models for random phenomena, that is, phenomena which under repeated observations yield di erent outcomes that cannot be predicted

More information

Discrete Convolution and the Discrete Fourier Transform

Discrete Convolution and the Discrete Fourier Transform Discrete Convolution and the Discrete Fourier Transform Discrete Convolution First of all we need to introduce what we might call the wraparound convention Because the complex numbers w j e i πj N have

More information

Correlated Random Effects Panel Data Models

Correlated Random Effects Panel Data Models INTRODUCTION AND LINEAR MODELS Correlated Random Effects Panel Data Models IZA Summer School in Labor Economics May 13-19, 2013 Jeffrey M. Wooldridge Michigan State University 1. Introduction 2. The Linear

More information

3. Regression & Exponential Smoothing

3. Regression & Exponential Smoothing 3. Regression & Exponential Smoothing 3.1 Forecasting a Single Time Series Two main approaches are traditionally used to model a single time series z 1, z 2,..., z n 1. Models the observation z t as a

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

More information

Portfolio selection based on upper and lower exponential possibility distributions

Portfolio selection based on upper and lower exponential possibility distributions European Journal of Operational Research 114 (1999) 115±126 Theory and Methodology Portfolio selection based on upper and lower exponential possibility distributions Hideo Tanaka *, Peijun Guo Department

More information

Topic 5: Stochastic Growth and Real Business Cycles

Topic 5: Stochastic Growth and Real Business Cycles Topic 5: Stochastic Growth and Real Business Cycles Yulei Luo SEF of HKU October 1, 2015 Luo, Y. (SEF of HKU) Macro Theory October 1, 2015 1 / 45 Lag Operators The lag operator (L) is de ned as Similar

More information

Quadratic forms Cochran s theorem, degrees of freedom, and all that

Quadratic forms Cochran s theorem, degrees of freedom, and all that Quadratic forms Cochran s theorem, degrees of freedom, and all that Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 1, Slide 1 Why We Care Cochran s theorem tells us

More information

Redwood Building, Room T204, Stanford University School of Medicine, Stanford, CA 94305-5405.

Redwood Building, Room T204, Stanford University School of Medicine, Stanford, CA 94305-5405. W hittemoretxt050806.tex A Bayesian False Discovery Rate for Multiple Testing Alice S. Whittemore Department of Health Research and Policy Stanford University School of Medicine Correspondence Address:

More information

Math 120 Final Exam Practice Problems, Form: A

Math 120 Final Exam Practice Problems, Form: A Math 120 Final Exam Practice Problems, Form: A Name: While every attempt was made to be complete in the types of problems given below, we make no guarantees about the completeness of the problems. Specifically,

More information

Out-of-Sample Forecast Tests Robust to the Window Size Choice

Out-of-Sample Forecast Tests Robust to the Window Size Choice Out-of-Sample Forecast Tests Robust to the Window Size Choice Barbara Rossi and Atsushi Inoue Duke University and NC State January 6, 20 Abstract This paper proposes new methodologies for evaluating out-of-sample

More information

The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series.

The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Cointegration The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Economic theory, however, often implies equilibrium

More information

HOMEWORK 5 SOLUTIONS. n!f n (1) lim. ln x n! + xn x. 1 = G n 1 (x). (2) k + 1 n. (n 1)!

HOMEWORK 5 SOLUTIONS. n!f n (1) lim. ln x n! + xn x. 1 = G n 1 (x). (2) k + 1 n. (n 1)! Math 7 Fall 205 HOMEWORK 5 SOLUTIONS Problem. 2008 B2 Let F 0 x = ln x. For n 0 and x > 0, let F n+ x = 0 F ntdt. Evaluate n!f n lim n ln n. By directly computing F n x for small n s, we obtain the following

More information

Solving Systems of Linear Equations

Solving Systems of Linear Equations LECTURE 5 Solving Systems of Linear Equations Recall that we introduced the notion of matrices as a way of standardizing the expression of systems of linear equations In today s lecture I shall show how

More information

Notes on metric spaces

Notes on metric spaces Notes on metric spaces 1 Introduction The purpose of these notes is to quickly review some of the basic concepts from Real Analysis, Metric Spaces and some related results that will be used in this course.

More information

DIFFERENTIABILITY OF COMPLEX FUNCTIONS. Contents

DIFFERENTIABILITY OF COMPLEX FUNCTIONS. Contents DIFFERENTIABILITY OF COMPLEX FUNCTIONS Contents 1. Limit definition of a derivative 1 2. Holomorphic functions, the Cauchy-Riemann equations 3 3. Differentiability of real functions 5 4. A sufficient condition

More information

Chapter 10: Basic Linear Unobserved Effects Panel Data. Models:

Chapter 10: Basic Linear Unobserved Effects Panel Data. Models: Chapter 10: Basic Linear Unobserved Effects Panel Data Models: Microeconomic Econometrics I Spring 2010 10.1 Motivation: The Omitted Variables Problem We are interested in the partial effects of the observable

More information

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING In this lab you will explore the concept of a confidence interval and hypothesis testing through a simulation problem in engineering setting.

More information

Data analysis in supersaturated designs

Data analysis in supersaturated designs Statistics & Probability Letters 59 (2002) 35 44 Data analysis in supersaturated designs Runze Li a;b;, Dennis K.J. Lin a;b a Department of Statistics, The Pennsylvania State University, University Park,

More information