Large Sample Theory

In statistics, we are interested in the properties of particular random variables (or estimators), which are functions of our data. In asymptotic analysis, we focus on describing the properties of estimators when the sample size becomes arbitrarily large. The idea is that, given a reasonably large dataset, the properties of an estimator even when the sample size is finite are similar to the properties of an estimator when the sample size is arbitrarily large.

In these notes we focus on the large sample properties of sample averages formed from i.i.d. data. That is, assume X_i ~ i.i.d. f, for i = 1, ..., n, .... Assume EX_i = µ, for all i. The sample average after n draws is X̄_n ≡ (1/n) Σ_i X_i.

We focus on two important sets of large sample results:

(1) Law of large numbers: X̄_n →p EX_i.

(2) Central limit theorem: √n (X̄_n − EX_i) →d N(0, σ²). That is, √n times a (centered) sample average looks like (in a precise sense to be defined later) a normal random variable as n gets large.

An important endeavor of asymptotic statistics is to show (1) and (2) under various assumptions on the data sampling process.

Consider a sequence of random variables Z_1, Z_2, ....

Convergence in probability: Z_n →p Z ⟺ for all ε > 0, lim_n Prob(|Z_n − Z| < ε) = 1. More formally: for all ε > 0 and δ > 0, there exists n_0(ε, δ) such that for all n > n_0(ε, δ): Prob(|Z_n − Z| < ε) > 1 − δ.

Note that the limiting variable Z can also be a random variable. Furthermore, for convergence in probability, the random variables Z_1, Z_2, ... and Z should be defined on the same probability space: if this common sample space is Ω with elements ω, then the statement of convergence in probability is that Prob(ω ∈ Ω : |Z_n(ω) − Z(ω)| < ε) > 1 − δ.

Corresponding to this convergence concept, we have the Weak Law of Large Numbers (WLLN), which is the result that X̄_n →p µ.
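A quick Monte Carlo sketch of the WLLN (not part of the formal development; the Exponential distribution, seed, and sample sizes are illustrative assumptions):

```python
# Monte Carlo sketch of the WLLN: sample averages of i.i.d. draws settle
# down to the population mean as n grows. The distribution (Exponential
# with mean mu = 2) and the seed are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0
draws = rng.exponential(scale=mu, size=100_000)  # EX_i = mu = 2

# X-bar_n along the sequence: running averages via cumulative sums.
xbar = np.cumsum(draws) / np.arange(1, draws.size + 1)

dev_small_n = abs(xbar[99] - mu)   # deviation |X-bar_n - mu| at n = 100
dev_large_n = abs(xbar[-1] - mu)   # deviation at n = 100,000
print(dev_small_n, dev_large_n)
```

Typically the deviation at n = 100,000 is an order of magnitude smaller than at n = 100, consistent with X̄_n →p µ (and, anticipating the CLT, with fluctuations of order 1/√n).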
Earlier, we had used Chebyshev's inequality to prove a version of the WLLN, under the assumption that the X_i are i.i.d. with mean µ and variance σ². Khinchine's law of large numbers only requires that the mean exists (i.e., is finite), but does not require existence of variances.

Recall Markov's inequality: for positive random variables Z, and p > 0, ε > 0, we have P(Z > ε) ≤ E(Z^p)/ε^p. Take Z = |X_n − X|. Then if we have E|X_n − X|^p → 0 — that is, X_n converges in p-th mean to X — then by Markov's inequality we also have P(|X_n − X| > ε) → 0. That is, convergence in p-th mean implies convergence in probability.

Some definitions and results:

Define: the probability limit of a random sequence Z_n, denoted plim Z_n, is a non-random quantity α such that Z_n →p α.

Stochastic orders of magnitude:

big-O_p: Z_n = O_p(n^λ) ⟺ there exists an O(1) nonstochastic sequence a_n such that Z_n/n^λ − a_n →p 0. Z_n = O_p(1) ⟺ for every δ, there exists a finite Δ(δ) and n*(δ) such that Prob(|Z_n| > Δ(δ)) < δ for all n ≥ n*(δ).

little-o_p: Z_n = o_p(n^λ) ⟺ Z_n/n^λ →p 0. Z_n = o_p(1) ⟺ Z_n →p 0.

Plim operator theorem: Let Z_n be a k-dimensional random vector, and g(·) be a function which is continuous at a constant k-vector point α. Then Z_n →p α ⟹ g(Z_n) →p g(α). In other words: g(plim Z_n) = plim g(Z_n). Proof: See Serfling, Approximation Theorems of Mathematical Statistics, p. 24.

Don't confuse this: plim (1/n) Σ_i g(Z_i) = Eg(Z_i), from the LLN, but this is distinct from plim g((1/n) Σ_i Z_i) = g(EZ_i), using the plim operator result. The two are generally not the same; if g(·) is convex, then plim (1/n) Σ_i g(Z_i) ≥ plim g((1/n) Σ_i Z_i).
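The distinction between plim (1/n) Σ_i g(Z_i) and plim g((1/n) Σ_i Z_i) is easy to see numerically. A hedged sketch (the U[0,1] distribution, the convex choice g(z) = z², and the seed are my own illustrative assumptions):

```python
# For convex g, the LLN limit E[g(Z)] exceeds the plim-operator limit
# g(E[Z]) (Jensen's inequality). Here Z ~ U[0,1]: E[g(Z)] = 1/3, g(E[Z]) = 1/4.
import numpy as np

rng = np.random.default_rng(1)
z = rng.uniform(size=200_000)

def g(v):
    return v ** 2  # a convex function

mean_of_g = g(z).mean()   # (1/n) sum g(Z_i)  ->  E g(Z) = 1/3 by the LLN
g_of_mean = g(z.mean())   # g((1/n) sum Z_i)  ->  g(EZ) = 1/4 by the plim theorem
print(mean_of_g, g_of_mean)
```

The two limits differ, and the simulated ordering matches the Jensen inequality stated above.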
A useful intermediate result ("squeezing"): if X_n →p α (a constant) and Z_n lies between X_n and α with probability 1, then Z_n →p α. (Quick proof: for all ε > 0, |X_n − α| < ε ⟹ |Z_n − α| < ε. Hence Prob(|Z_n − α| < ε) ≥ Prob(|X_n − α| < ε) → 1.)

Convergence almost surely: Z_n →a.s. Z ⟺ Prob(ω : lim_n Z_n(ω) = Z(ω)) = 1.

As with convergence in probability, almost sure convergence makes sense when the random variables Z_1, ..., Z_n, ... and Z are all defined on the same sample space. Hence, for each element ω ∈ Ω of the sample space, we can associate the sequence Z_1(ω), ..., Z_n(ω), ... as well as the limit point Z(ω).

To understand the probability statement more clearly, assume that all these RVs are defined on the probability space (Ω, B(Ω), P). Define some set-theoretic concepts. Consider a sequence of sets S_1, ..., S_n, ..., all ∈ B(Ω). Unless this sequence is monotonic, it's difficult to talk about a limit of this sequence. So we define the liminf and limsup of this set sequence.

lim inf S_n ≡ lim_n ∩_{m=n}^∞ S_m = ∪_{n=1}^∞ ∩_{m=n}^∞ S_m = {ω ∈ Ω : ω ∈ S_n, ∀ n ≥ n_0(ω)}.

Note that the sequence of sets ∩_{m=n}^∞ S_m, for n = 1, 2, 3, ..., is a non-decreasing sequence of sets. By taking the union, the liminf is thus the limit of this monotone sequence of sets. That is, for all ω ∈ lim inf S_n, there exists some number n_0(ω) such that ω ∈ S_n for all n ≥ n_0(ω). Hence we can say that lim inf S_n is the set of outcomes which occur "eventually."

lim sup S_n ≡ lim_n ∪_{m=n}^∞ S_m = ∩_{n=1}^∞ ∪_{m=n}^∞ S_m = {ω ∈ Ω : for every m, ω ∈ S_{n_0(ω,m)} for some n_0(ω, m) ≥ m}.
Note that the sequence of sets ∪_{m=n}^∞ S_m, for n = 1, 2, 3, ..., is a non-increasing sequence of sets. Hence, the limsup is the limit of this monotone sequence of sets. The lim sup S_n is the set of outcomes ω which occur within every tail of the set sequence S_n. Hence we say that an outcome ω ∈ lim sup S_n occurs "infinitely often."¹

Note that lim inf S_n ⊆ lim sup S_n. Hence

P(lim sup S_n) ≥ P(lim inf S_n)

and

P(lim sup S_n) = 0 ⟹ P(lim inf S_n) = 0
P(lim inf S_n) = 1 ⟹ P(lim sup S_n) = 1.

Borel-Cantelli Lemma: if Σ_{i=1}^∞ P(S_i) < ∞ then P(lim sup S_n) = 0.

Proof:

P(lim sup S_n) = P(lim_n {∪_{m=n}^∞ S_m}) = lim_n P(∪_{m=n}^∞ S_m) ≤ lim_n Σ_{m=n}^∞ P(S_m),

which equals zero (the tail sums vanish, by the assumption that Σ_{i=1}^∞ P(S_i) < ∞). The first equality (interchange of limit and Prob operations) is an application of the Monotone Convergence Theorem, which holds only because the sequence ∪_{m=n}^∞ S_m is monotone.

To apply this to almost-sure convergence, consider the sequence of sets S_n ≡ {ω : |Z_n(ω) − Z(ω)| > ε}, for ε > 0. Almost sure convergence is the statement that P(lim sup S_n) = 0. Then ∪_{m=n}^∞ S_m denotes the ω's for which the sequence |Z_m(ω) − Z(ω)| exceeds ε somewhere in the tail beyond n.

¹ However, note that all outcomes ω ∈ lim inf S_n also occur an infinite number of times along the infinite sequence S_n.
Then lim_n ∪_{m=n}^∞ S_m denotes all ω's such that the sequence |Z_n(ω) − Z(ω)| exceeds ε in every tail: those ω for which Z_n(ω) escapes the ε-ball around Z(ω) infinitely often. For these ω's, lim_n Z_n(ω) ≠ Z(ω). For almost-sure convergence, we require the probability of this set to equal zero, i.e. Prob(lim_n ∪_{m=n}^∞ S_m) = 0.

Corresponding to this convergence concept, we have the Strong Law of Large Numbers (SLLN), which is the result that X̄_n →a.s. µ. Consider that the sample averages are random variables X̄_1(ω), X̄_2(ω), ... defined on the same probability space, say (Ω, B, P), where each ω indexes a sequence X_1(ω), X_2(ω), X_3(ω), .... Consider the set sequence S_n = {ω : |X̄_n(ω) − µ| > ε}. The SLLN is the statement that P(lim sup S_n) = 0.

Proof (sketch; Davidson, pg. 296): Assume without loss of generality that µ = 0. From the above discussion, we see that the two statements X̄_n →a.s. 0 and P(|X̄_n| > ε, i.o.) = 0 for any ε > 0 are equivalent. Therefore, one way of proving a SLLN is to verify that Σ_{n=1}^∞ P(|X̄_n| > ε) < ∞ for all ε > 0, and then apply the Borel-Cantelli Lemma.

As a starting point, we see that Chebyshev's inequality is not good enough by itself; it tells us P(|X̄_n| > ε) ≤ σ²/(nε²); but Σ_{n=1}^∞ 1/(nε²) = ∞.

However, consider the subsequence with indices n²: {1, 4, 9, 16, ...}. Again using Chebyshev's inequality, we have²

P(|X̄_{n²}| > ε) ≤ σ²/(n²ε²); and Σ_{n=1}^∞ 1/(n²ε²) = (1/ε²)(π²/6) ≈ 1.64/ε² < ∞.

Hence, by the BC lemma, we have that X̄_{n²} →a.s. 0.

Next, examine the omitted terms from the sum Σ_{n=1}^∞ P(|X̄_n| > ε). Define D_{n²} ≡ max_{n² ≤ k < (n+1)²} |X̄_k − X̄_{n²}|. Note that:

X̄_k − X̄_{n²} = (n²/k − 1) X̄_{n²} + (1/k) Σ_{t=n²+1}^{k} X_t

² Another of Euler's greatest hits: Σ_{n=1}^∞ 1/n² = π²/6 (the value ζ(2) of the Riemann zeta function).
and, by independence of the X_i, the two terms on the RHS are uncorrelated. Hence

Var(X̄_k − X̄_{n²}) = (1 − n²/k)² (σ²/n²) + ((k − n²)/k²) σ² = σ² (1/n² − 1/k).

By Chebyshev's inequality, then (strictly, bounding the maximum D_{n²} uses Kolmogorov's maximal inequality, which yields a bound of the same form), and since 1/n² − 1/k ≤ 1/n² − 1/(n+1)² for k < (n+1)², we have

P(D_{n²} > ε) ≤ (σ²/ε²) (1/n² − 1/(n+1)²) = (σ²/ε²) · (2n+1)/(n²(n+1)²),

which is summable over n (the terms 1/n² − 1/(n+1)² telescope to 1), implying, using the BC Lemma, that D_{n²} →a.s. 0.

Now, consider, for n² ≤ l < (n+1)²,

|X̄_l| = |X̄_l − X̄_{n²} + X̄_{n²}| ≤ |X̄_l − X̄_{n²}| + |X̄_{n²}| ≤ D_{n²} + |X̄_{n²}|.

By the above discussion, the RHS converges a.s. to 0; that is, almost surely, for every ε > 0, there exists N(ε) such that the RHS is < ε for n² > N(ε). Consequently, for l > N(ε), |X̄_l| < ε almost surely. That is, X̄_l →a.s. 0.

Example: X_i ~ i.i.d. U[0, 1]. Show that X_(1:n) ≡ min_{i=1,...,n} X_i →a.s. 0. Take S_n ≡ {X_(1:n) > ε}. For all ε ∈ (0, 1),

P(|X_(1:n)| > ε) = P(X_(1:n) > ε) = P(X_i > ε, i = 1, ..., n) = (1 − ε)^n.

Hence Σ_{n=1}^∞ (1 − ε)^n = (1 − ε)/ε < ∞. So the conclusion follows by the BC Lemma.

Theorem: Z_n →a.s. Z ⟹ Z_n →p Z.
Proof:

0 = P(lim_n ∪_{m≥n} {ω : |Z_m(ω) − Z(ω)| > ε})
  = lim_n P(∪_{m≥n} {ω : |Z_m(ω) − Z(ω)| > ε})
  ≥ lim_n P({ω : |Z_n(ω) − Z(ω)| > ε}).

Convergence in Distribution: Z_n →d Z. A sequence of real-valued random variables Z_1, Z_2, Z_3, ... converges in distribution to a random variable Z if

lim_n F_{Z_n}(z) = F_Z(z)  ∀ z s.t. F_Z(z) is continuous.  (1)

This is a statement about the CDFs of the random variables Z and Z_1, Z_2, .... These random variables do not need to be defined on the same probability space. F_Z is also called the limiting distribution of the random sequence.

Alternative definitions of convergence in distribution ("Portmanteau theorem"):

Letting F([a, b]) ≡ F(b) − F(a) for a < b, convergence (1) is equivalent to

F_{Z_n}([a, b]) → F_Z([a, b])  ∀ [a, b] s.t. F_Z continuous at a, b.

For all bounded, continuous functions g(·),

∫ g(z) dF_{Z_n}(z) → ∫ g(z) dF_Z(z), i.e. Eg(Z_n) → Eg(Z).  (2)

This definition of distributional convergence is more useful in advanced settings, because it is extendable to settings where Z_n and Z are general random elements taking values in a metric space.

Levy's continuity theorem: A sequence {X_n} of random variables converges in distribution to a random variable X if and only if the sequence of characteristic functions {φ_{X_n}(t)} converges pointwise (in t) to a function φ_X(t) which is continuous at the origin. Then φ_X is the characteristic function of X.
Hence, this ties together convergence of characteristic functions and convergence in distribution. This theorem will be used to prove the CLT later. For proofs of most results here, see Serfling, Approximation Theorems of Mathematical Statistics, ch. 1.

Distributional convergence, defined above, is called "weak convergence." This is because there are stronger notions of distributional convergence. One such notion is

sup_{A ∈ B(R)} |P_{Z_n}(A) − P_Z(A)| → 0,

which is convergence in total variation norm.

Example: multinomial distribution on [0, 1]. X_n ∈ {1/n, 2/n, 3/n, ..., n/n = 1}, each point with probability 1/n. Then X_n →d U[0, 1]. But consider A = Q ∩ [0, 1] (the set of rational numbers in [0, 1]): P_{X_n}(A) = 1 for every n, while the uniform limit assigns A probability zero, so convergence in total variation fails.

Some definitions and results:

Slutsky Theorem: If Z_n →d Z, and Y_n →p α (a constant), then (a) Y_n Z_n →d αZ; (b) Z_n + Y_n →d Z + α.

Theorem *: (a) Z_n →p Z ⟹ Z_n →d Z. (b) Z_n →d Z ⟹ Z_n →p Z if Z is a constant.

Note: convergence in probability implies that (roughly speaking) the random variables Z_n (for n large enough) and Z frequently have the same numerical value. Convergence in distribution need not imply this, only that the CDFs of Z_n and Z are similar.

Z_n →d Z ⟹ Z_n = O_p(1). Use Δ = max(|F_Z^{-1}(1 − ε)|, |F_Z^{-1}(ε)|) in the definition of O_p(1).

Note that the LLN tells us that X̄_n →p µ, which implies (trivially) that X̄_n →d µ, a degenerate limiting distribution.
This is not very useful for our purposes, because we are interested in knowing (say) how far X̄_n is from µ, which is unknown. How do we rescale X̄_n so that it has a non-degenerate limiting distribution?

Central Limit Theorem (Lindeberg-Levy): Let X_i be i.i.d. with mean µ and variance σ². Then

√n (X̄_n − µ)/σ →d N(0, 1).

√n is also called the rate of convergence of the sequence X̄_n. By definition, the rate of convergence is the power n^λ for which n^λ (X̄_n − µ) converges in distribution to a non-degenerate distribution. The rate of convergence √n makes sense: if you blow up X̄_n − µ by a constant (no matter how big), you still get a degenerate limiting distribution. If you blow up by n, then the sequence S_n ≡ n X̄_n = Σ_{i=1}^n X_i will diverge. Proof via characteristic functions.

The LLCLT assumes independence (across i) as well as identically distributed draws. We extend this to the independent, non-identically distributed setting.

Lindeberg-Feller CLT: For each n, let Y_{n,1}, ..., Y_{n,n} be independent random variables with finite (possibly non-identical) means and variances σ²_{n,i}, such that

(1/C_n²) Σ_{i=1}^n E[ |Y_{n,i} − EY_{n,i}|² 1{|Y_{n,i} − EY_{n,i}|/C_n > ε} ] → 0 for every ε > 0,

where C_n ≡ [Σ_{i=1}^n σ²_{n,i}]^{1/2}. Then

[Σ_{i=1}^n (Y_{n,i} − EY_{n,i})] / C_n →d N(0, 1).

The sampling framework is known as a triangular array. Note that the (n−1)-th observation in the n-th sample (Y_{n,n−1}) need not coincide with Y_{n−1,n−1}, the (n−1)-th observation in the (n−1)-th sample.

C_n is just the standard deviation of Σ_{i=1}^n Y_{n,i}. Hence [Σ_{i=1}^n (Y_{n,i} − EY_{n,i})] / C_n is a standardized sum.

This is useful for showing asymptotic normality of Least-Squares regression. Let y = β_0 x + ε, with the ε_i i.i.d. with mean zero and variance σ², and x being an n × 1 vector of covariates (here we have just 1 RHS variable).
The OLS estimator is β̂ = (X'X)^{-1} X'Y. To apply the LFCLT, we consider the normalized difference between the estimated and true β's:

((X'X)^{1/2}/σ) (β̂_n − β_0) = (1/σ) (X'X)^{-1/2} X'ε = Σ_i a_{n,i} ε_i,

where a_{n,i} corresponds to the i-th component of the 1 × n vector (1/σ) (X'X)^{-1/2} X'. So we just need to show that the Lindeberg condition applies to Y_{n,i} = a_{n,i} ε_i. Note that

Σ_{i=1}^n V(a_{n,i} ε_i) = V(Σ_{i=1}^n a_{n,i} ε_i) = (1/σ²) V((X'X)^{-1/2} X'ε) = σ²/σ² · 1 = 1 = C_n².

Asymptotic approximations for X̄_n

The CLT tells us that √n (X̄_n − µ)/σ has a limiting standard normal distribution. We can use this true result to say something about the distribution of X̄_n, even when n is finite. That is, we use the CLT to derive an asymptotic approximation for the finite-sample distribution of X̄_n. The approximation is as follows: we "flip over" the result of the CLT:

X̄_n ∼a (σ/√n) N(0, 1) + µ ⟺ X̄_n ∼a N(µ, σ²/n).

The notation ∼a makes explicit that what is on the RHS is an approximation. Note that X̄_n →d N(µ, σ²/n) is definitely not true! This approximation intuitively makes sense: under the assumptions of the LLCLT, we know that E X̄_n = µ and Var(X̄_n) = σ²/n. What the asymptotic approximation tells us is that the distribution of X̄_n is approximately normal.³

Asymptotic approximations for functions of X̄_n

Oftentimes, we are not interested per se in approximating the finite-sample distribution of the sample mean X̄_n, but rather functions of a sample mean. (Later, you will see that the asymptotic approximations for many statistics and estimators that you run across are derived by expressing them as sample averages.)

Continuous mapping theorem: Let g(·) be a continuous function. Then Z_n →d Z ⟹ g(Z_n) →d g(Z). Proof: Serfling, Approximation Theorems of Mathematical Statistics, p. 25.

³ Note that the approximation is exactly right if we assumed that X_i ~ N(µ, σ²), for all i.
(Note: you still have to figure out what the limiting distribution of g(Z) is. But if you know F_Z, then you can get F_{g(Z)} by the change of variables formula.)

Note that for any linear function g(X̄_n) = a X̄_n + b, deriving the limiting distribution of g(X̄_n) is no problem (just use Slutsky's Theorem to get a X̄_n + b ∼a N(aµ + b, a²σ²/n)). The problem in deriving the distribution of g(X̄_n) is that the X̄_n is inside the g function:

1. Use the Mean Value Theorem: Let g be a continuous function on [a, b] that is differentiable on (a, b). Then, there exists (at least one) λ ∈ (a, b) such that

g'(λ) = (g(b) − g(a))/(b − a) ⟺ g(b) − g(a) = g'(λ)(b − a).

Using the MVT, we can write

g(X̄_n) − g(µ) = g'(X*_n)(X̄_n − µ),  (3)

where X*_n is an RV lying strictly between X̄_n and µ.

2. On the RHS of Eq. (3):

(a) g'(X*_n) →p g'(µ), by the squeezing result and the plim operator theorem.

(b) If we multiply by √n and divide by σ, we can apply the CLT to get √n (X̄_n − µ)/σ →d N(0, 1).

(c) Hence,

√n · RHS = √n [g(X̄_n) − g(µ)] →d N(0, g'(µ)² σ²),  (4)

using Slutsky's theorem.

3. Now, in order to get the asymptotic approximation for the distribution of g(X̄_n), we flip over to get g(X̄_n) ∼a N(g(µ), g'(µ)² σ²/n).

4. Check that g satisfies the assumptions for these results to go through: continuity and differentiability for the MVT, and g' continuous at µ for the plim operator theorem.

Examples: 1/X̄_n, exp(X̄_n), (X̄_n)², etc.

Eq. (4) is a general result, known as the Delta method. For the purposes of this class, I want you to derive the approximate distributions for g(X̄_n) from first principles (as we did above).
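A numerical check of the Delta method recipe above (a sketch under my own illustrative assumptions: X_i ~ Exponential with mean µ = 2, so σ² = 4, and g(x) = x², so g'(µ) = 4):

```python
# Delta method check: g(X-bar_n) should be approximately
# N(g(mu), g'(mu)^2 sigma^2 / n) = N(4, (4^2)(4)/n) = N(4, 64/n).
import numpy as np

rng = np.random.default_rng(5)
n, reps = 1_000, 10_000
x = rng.exponential(scale=2.0, size=(reps, n))   # mu = 2, sigma^2 = 4
g_xbar = x.mean(axis=1) ** 2                     # g(X-bar_n) with g(x) = x^2

approx_mean = g_xbar.mean()   # should be near g(mu) = 4
approx_sd = g_xbar.std()      # should be near |g'(mu)| sigma / sqrt(n) = 8/sqrt(n)
print(approx_mean, approx_sd, 8 / np.sqrt(n))
```

The simulated mean and standard deviation of g(X̄_n) line up with the Delta-method approximation N(4, 64/n).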
1 Some extra topics (CAN SKIP)

1.1 Convergence in distribution vs. a.s.

Combining two results above, we have Z_n →a.s. Z ⟹ Z_n →d Z. The converse is not generally true. (Indeed, {Z_1, Z_2, Z_3, ...} need not be defined on the same probability space, in which case a.s. convergence makes no sense.)

However, consider Z*_n = F_{Z_n}^{-1}(U), Z* = F_Z^{-1}(U), with U ~ U[0, 1]. Here F^{-1}(·) denotes the quantile function corresponding to the CDF F(z):

F^{-1}(τ) = inf{z : F(z) > τ}, τ ∈ [0, 1].

We have F(F^{-1}(τ)) = τ. (Note that the quantile function is also right-continuous; discontinuity points of the quantile function arise where the CDF F is flat.)

Then

P(Z*_n ≤ z) = P(F_{Z_n}^{-1}(U) ≤ z) = P(U ≤ F_{Z_n}(z)) = F_{Z_n}(z),

so that Z*_n =d Z_n (even though their domains are different!). Similarly, Z* =d Z. The notation =d means "identically distributed."

Moreover, it turns out (quite intuitively) that the convergence F_{Z_n}^{-1}(U) → F_Z^{-1}(U) fails only at points where F_Z^{-1}(U) is discontinuous (corresponding to flat portions of F_Z(z)). Since these points of discontinuity form a countable set, their probability (under U[0, 1]) is equal to zero, so that F_{Z_n}^{-1}(U) → F_Z^{-1}(U) for almost all U.

So what we have here is a result that, for real-valued random variables with Z_n →d Z, we can construct identically distributed variables such that both Z*_n →d Z* and Z*_n →a.s. Z*. This is called the Skorokhod construction.

Skorokhod representation: Let Z_n (n = 1, 2, ...) be random elements defined on probability spaces (Ω_n, B(Ω_n), P_n), with Z_n →d Z, where Z is defined on (Ω, B(Ω), P). Then there exist random variables Z*_n (n = 1, 2, ...) and Z* defined on a common probability space (Ω̃, B(Ω̃), P̃) such that Z*_n =d Z_n (n = 1, 2, ...); Z* =d Z; Z*_n →a.s. Z*.

Applications:
Continuous mapping theorem: Z_n →d Z ⟹ Z*_n →a.s. Z*; for h(·) continuous, we have

h(Z*_n) →a.s. h(Z*) ⟹ h(Z*_n) →d h(Z*),

which implies h(Z_n) →d h(Z).

Building on the above, if h is bounded, then we get Eh(Z_n) = Eh(Z*_n) → Eh(Z*) = Eh(Z) by the bounded convergence theorem, which shows one direction of the Portmanteau theorem, Eq. (2).

1.2 Functional Central limit theorems

There is a set of distributional convergence results known as functional CLTs (or Donsker theorems), because they deal with convergence of random functions (or, interchangeably, processes). These are indispensable tools in finance.

One of the simplest random functions is the Wiener process W(t), which is viewed as a random function on the unit interval t ∈ [0, 1]. This is also known as a Brownian motion process. Features of the Wiener process:

1. W(0) = 0.

2. Gaussian marginals: W(t) =d N(0, t); that is,

P(W(t) ≤ a) = (1/√(2πt)) ∫_{−∞}^{a} exp(−u²/(2t)) du.

3. Independent increments: Define x_t ≡ W(t). For any set 0 ≤ t_0 ≤ t_1 ≤ ... ≤ 1, the differences x_{t_1} − x_{t_0}, x_{t_2} − x_{t_1}, ... are independent.

4. Given the two features above, we have that the increments are themselves normally distributed: x_{t_i} − x_{t_{i−1}} =d N(0, t_i − t_{i−1}). Moreover, from

t_1 − t_0 = V(x_{t_1} − x_{t_0}) = E[(x_{t_1} − x_{t_0})²] = Ex²_{t_1} − 2E[x_{t_1} x_{t_0}] + Ex²_{t_0} = t_1 + t_0 − 2E[x_{t_1} x_{t_0}],
implying E[x_{t_1} x_{t_0}] = Cov(x_{t_1}, x_{t_0}) = t_0 (for t_0 ≤ t_1).

5. Furthermore, we know that any finite collection (x_{t_1}, x_{t_2}, ...) is jointly distributed multivariate normal, with mean 0 and variance matrix Σ as described above.

Take the conditions of the LLCLT: X_1, X_2, ... are i.i.d. with mean zero and finite variance σ². For a given n, define the partial sum S_k = X_1 + X_2 + ... + X_k for k ≤ n. Now, for t ∈ [0, 1], define the normalized partial sum process

S_n(t) = S_[tn]/(σ√n) + (tn − [tn]) X_{[tn]+1}/(σ√n),

where [a] denotes the largest integer ≤ a. S_n(t) is a random (continuous, piecewise linear) function on [0, 1]. (Graph.)

Then the functional CLT is the following result:

Functional CLT: S_n →d W.

Note that S_n is a random element taking values in C[0, 1], the space of continuous functions on [0, 1]. Proving this result is beyond our focus in this class.⁴

Note that the functional CLT implies more than the LLCLT. For example, take t = 1. Then S_n(1) = Σ_{i=1}^n X_i/(σ√n) = √n X̄_n/σ. By the FCLT, we have

P(S_n(1) ≤ a) → P(W(1) ≤ a) = P(N(0, 1) ≤ a),

which is the same result as the LLCLT.

Similarly, for k_n < n but k_n/n → τ ∈ [0, 1], we have

S_n(k_n/n) = Σ_{i=1}^{k_n} X_i/(σ√n) = √(k_n/n) · √k_n X̄_{k_n}/σ →d N(0, τ).

Also, it turns out, by a functional version of the continuous mapping theorem, we can get convergence results for functionals of the partial-sum process. For instance, sup_t S_n(t) →d sup_t W(t), and it turns out that

P(sup_t W(t) ≥ a) = (2/√(2π)) ∫_a^∞ exp(−u²/2) du, a ≥ 0.

(Recall that W(0) = 0, so a < 0 makes no sense.)

⁴ See, for instance, Billingsley, Convergence of Probability Measures, ch. 2, section 10.
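A simulation sketch of the partial-sum process (the coin-flip steps, grid size, and threshold are my own illustrative assumptions; the discrete-time maximum only approximates sup_t W(t), so the match to the reflection-principle formula is approximate):

```python
# Partial-sum process S_n(t) built from i.i.d. Rademacher steps (mean 0,
# variance 1). Checks: S_n(1) should be approximately N(0,1), and the
# running maximum should roughly follow P(sup_t W(t) >= a) = 2 P(N(0,1) >= a).
import numpy as np

rng = np.random.default_rng(6)
n, reps = 1_000, 10_000
steps = rng.choice([-1.0, 1.0], size=(reps, n))
paths = np.cumsum(steps, axis=1) / np.sqrt(n)    # S_n(k/n), k = 1, ..., n

endpoint_sd = paths[:, -1].std()                 # S_n(1): should be near 1
sup_frac = np.mean(paths.max(axis=1) >= 1.0)     # near 2*(1 - Phi(1)) ~ 0.317
print(endpoint_sd, sup_frac)
```

The endpoint matches the ordinary CLT, while the fraction of paths whose maximum exceeds 1 is close to (slightly below, due to discretization) the reflection-principle value 2·(1 − Φ(1)) ≈ 0.317.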