Lecture 3. Properties of MLE: consistency, asymptotic normality. Fisher information.




In this section we will try to understand why MLEs are "good". Let us recall two facts from probability that will be used often throughout this course.

Law of Large Numbers (LLN): If the distribution of the i.i.d. sample $X_1, \ldots, X_n$ is such that $X_1$ has finite expectation, i.e. $\mathbb{E}|X_1| < \infty$, then the sample average

$$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n} \to \mathbb{E}X_1$$

converges to its expectation in probability, which means that for any arbitrarily small $\varepsilon > 0$,

$$\mathbb{P}\big(|\bar{X}_n - \mathbb{E}X_1| > \varepsilon\big) \to 0 \ \text{ as } \ n \to \infty.$$

Note. Whenever we use the LLN below we will simply say that the average converges to its expectation, and we will not mention in what sense. More mathematically inclined students are welcome to carry out these steps more rigorously, especially when we use the LLN in combination with the Central Limit Theorem.

Central Limit Theorem (CLT): If the distribution of the i.i.d. sample $X_1, \ldots, X_n$ is such that $X_1$ has finite expectation and variance, i.e. $\mathbb{E}|X_1| < \infty$ and $\sigma^2 = \mathrm{Var}(X_1) < \infty$, then

$$\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1) \stackrel{d}{\to} N(0, \sigma^2)$$

converges in distribution to a normal distribution with mean zero and variance $\sigma^2$, which means that for any interval $[a, b]$,

$$\mathbb{P}\big(\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1) \in [a, b]\big) \to \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}\,dx.$$
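The LLN can be sketched numerically. Below is a minimal simulation, assuming Bernoulli(p) samples; the parameter p, the tolerance eps, and the sample sizes are arbitrary demo choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
p, eps, reps = 0.3, 0.05, 10_000  # demo choices: Bernoulli parameter, tolerance, repetitions

tails = []
for n in (100, 1_000, 10_000):
    # A Binomial(n, p) draw divided by n is the average of n Bernoulli(p)'s.
    xbar = rng.binomial(n, p, size=reps) / n
    # Estimate P(|X̄_n - EX_1| > eps) by the fraction of repetitions exceeding eps.
    tails.append(np.mean(np.abs(xbar - p) > eps))
print(tails)  # the estimated tail probabilities shrink toward 0 as n grows
```

The shrinking tail probabilities are exactly the convergence in probability that the LLN asserts.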

In other words, the random variable $\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1)$ will behave like a random variable drawn from a normal distribution when $n$ gets large.

Exercise. Illustrate the CLT by generating $n$ Bernoulli random variables $B(p)$ (or one Binomial r.v. $B(n, p)$) and then computing $\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1)$. Repeat this many times and use dfittool to see that this random quantity is well approximated by a normal distribution.

We will prove that the MLE satisfies (usually) the following two properties, called consistency and asymptotic normality.

1. Consistency. We say that an estimate $\hat{\phi}$ is consistent if $\hat{\phi} \to \phi_0$ in probability as $n \to \infty$, where $\phi_0$ is the true unknown parameter of the distribution of the sample.

2. Asymptotic normality. We say that $\hat{\phi}$ is asymptotically normal if

$$\sqrt{n}\,(\hat{\phi} - \phi_0) \stackrel{d}{\to} N(0, \sigma^2_{\phi_0}),$$

where $\sigma^2_{\phi_0}$ is called the asymptotic variance of the estimate $\hat{\phi}$. Asymptotic normality says that the estimator not only converges to the unknown parameter, but that it converges fast enough, at a rate $1/\sqrt{n}$.

Consistency of MLE. To make our discussion as simple as possible, let us assume that the likelihood function is smooth and behaves in a nice way, as shown in Figure 3.1, i.e. its maximum is achieved at a unique point $\hat{\phi}$.

[Figure 3.1: Maximum Likelihood Estimator (MLE)]

Suppose that the data $X_1, \ldots, X_n$ is generated from a distribution with unknown parameter $\phi_0$ and $\hat{\phi}$ is the MLE. Why does $\hat{\phi}$ converge to the unknown parameter $\phi_0$? This is not immediately obvious, and in this section we will give a sketch of why this happens.
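The exercise above suggests MATLAB's dfittool; here is a rough numpy analogue (all parameter values are arbitrary demo choices), which checks the normal fit through coverage fractions instead of a fitted density plot.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 0.3, 2_000, 50_000          # demo choices
sigma = np.sqrt(p * (1 - p))             # sd of Bernoulli(p); the CLT limit is N(0, sigma^2)

# One Binomial(n, p) draw divided by n equals the average of n Bernoulli(p)'s.
z = np.sqrt(n) * (rng.binomial(n, p, size=reps) / n - p)

# Crude normality checks in place of dfittool's fitted-density plot:
within_1sd = np.mean(np.abs(z) < sigma)       # about 0.683 for N(0, sigma^2)
within_2sd = np.mean(np.abs(z) < 2 * sigma)   # about 0.954
print(z.mean(), z.std(), within_1sd, within_2sd)
```

The empirical mean, standard deviation, and coverage fractions match the $N(0, p(1-p))$ limit closely, as the CLT predicts.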

First of all, the MLE $\hat{\phi}$ is the maximizer of

$$L_n(\phi) = \frac{1}{n}\sum_{i=1}^n \log f(X_i|\phi),$$

which is the log-likelihood function normalized by $\frac{1}{n}$ (of course, this does not affect the maximization). Notice that the function $L_n(\phi)$ depends on the data. Let us consider the function $l(X|\phi) = \log f(X|\phi)$ and define $L(\phi) = \mathbb{E}_{\phi_0} l(X|\phi)$, where $\mathbb{E}_{\phi_0}$ denotes the expectation with respect to the true unknown parameter $\phi_0$ of the sample $X_1, \ldots, X_n$. If we deal with continuous distributions, then

$$L(\phi) = \int \big(\log f(x|\phi)\big) f(x|\phi_0)\,dx.$$

By the law of large numbers, for any $\phi$,

$$L_n(\phi) \to \mathbb{E}_{\phi_0} l(X|\phi) = L(\phi).$$

Note that $L(\phi)$ does not depend on the sample; it depends only on $\phi$. We will need the following

Lemma. For any $\phi$, $L(\phi) \le L(\phi_0)$. Moreover, the inequality is strict, $L(\phi) < L(\phi_0)$, unless

$$\mathbb{P}_{\phi_0}\big(f(X|\phi) = f(X|\phi_0)\big) = 1,$$

which means that $\mathbb{P}_\phi = \mathbb{P}_{\phi_0}$.

Proof. Let us consider the difference

$$L(\phi) - L(\phi_0) = \mathbb{E}_{\phi_0}\big(\log f(X|\phi) - \log f(X|\phi_0)\big) = \mathbb{E}_{\phi_0} \log \frac{f(X|\phi)}{f(X|\phi_0)}.$$

Since $\log t \le t - 1$, we can write

$$\mathbb{E}_{\phi_0} \log \frac{f(X|\phi)}{f(X|\phi_0)} \le \mathbb{E}_{\phi_0}\Big(\frac{f(X|\phi)}{f(X|\phi_0)} - 1\Big) = \int \frac{f(x|\phi)}{f(x|\phi_0)}\, f(x|\phi_0)\,dx - 1 = \int f(x|\phi)\,dx - 1 = 1 - 1 = 0.$$

Both integrals are equal to $1$ because we are integrating probability density functions. This proves that $L(\phi) \le L(\phi_0)$. The second statement of the Lemma is also clear, since the inequality $\log t \le t - 1$ is strict unless $t = 1$.
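The Lemma can be checked numerically. For the Bernoulli family the expectation defining $L(\phi)$ can be computed exactly: $L(\phi) = p_0 \log\phi + (1 - p_0)\log(1 - \phi)$. A minimal sketch (the value $p_0 = 0.3$ and the grid are demo choices):

```python
import numpy as np

p0 = 0.3                                  # demo choice for the true parameter
phi = np.linspace(0.01, 0.99, 99)         # grid of candidate parameters (step 0.01)

# Exact L(phi) = E_{p0} log f(X|phi) for Bernoulli: p0*log(phi) + (1-p0)*log(1-phi)
L = p0 * np.log(phi) + (1 - p0) * np.log(1 - phi)
L0 = p0 * np.log(p0) + (1 - p0) * np.log(1 - p0)

print(bool(np.all(L <= L0 + 1e-12)))      # L(phi) <= L(p0) everywhere on the grid
print(phi[np.argmax(L)])                  # the grid maximizer sits at p0
```

This is the key fact used below: the limiting function $L(\phi)$ is maximized at the true parameter.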

We will use this Lemma to sketch the consistency of the MLE.

Theorem. Under some regularity conditions on the family of distributions, the MLE $\hat{\phi}$ is consistent, i.e. $\hat{\phi} \to \phi_0$ as $n \to \infty$.

The statement of this Theorem is not very precise, but rather than proving a rigorous mathematical statement, our goal here is to illustrate the main idea. Mathematically inclined students are welcome to come up with a precise statement.

[Figure 3.2: Illustration to Theorem — as $L_n(\phi)$ approaches $L(\phi)$, the maximizer $\hat{\phi}$ approaches the maximizer $\phi_0$.]

Proof. We have the following facts:

1. $\hat{\phi}$ is the maximizer of $L_n(\phi)$ (by definition).

2. $\phi_0$ is the maximizer of $L(\phi)$ (by the Lemma).

3. For any $\phi$, $L_n(\phi) \to L(\phi)$ by the LLN.

This situation is illustrated in Figure 3.2. Therefore, since the two functions $L_n$ and $L$ are getting closer, their points of maximum should also get closer, which is exactly what it means that $\hat{\phi} \to \phi_0$.

Asymptotic normality of MLE. Fisher information. We want to show the asymptotic normality of the MLE, i.e. to show that

$$\sqrt{n}\,(\hat{\phi} - \phi_0) \stackrel{d}{\to} N(0, \sigma^2_{\mathrm{MLE}})$$

for some $\sigma^2_{\mathrm{MLE}}$, and to compute $\sigma^2_{\mathrm{MLE}}$. This asymptotic variance in some sense measures the quality of the MLE. First, we need to introduce the notion of Fisher information. Let us recall that above we defined the function $l(X|\phi) = \log f(X|\phi)$. To simplify the notation, we will denote by $l'(X|\phi)$, $l''(X|\phi)$, etc. the derivatives of $l(X|\phi)$ with respect to $\phi$.

Definition. (Fisher information.) The Fisher information of a random variable $X$ with distribution $\mathbb{P}_{\phi_0}$ from the family $\{\mathbb{P}_\phi : \phi \in \Theta\}$ is defined by

$$I(\phi_0) = \mathbb{E}_{\phi_0}\big(l'(X|\phi_0)\big)^2 = \mathbb{E}_{\phi_0}\Big(\frac{\partial}{\partial\phi}\log f(X|\phi)\Big)^2\Big|_{\phi = \phi_0}.$$
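The consistency sketched above is easy to observe in simulation. A minimal demo, assuming Exponential($\alpha_0$) data, for which the MLE of the rate is $\hat{\alpha} = 1/\bar{X}$ (the values of $\alpha_0$, the sample sizes, and the number of repetitions are demo choices):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha0, reps = 2.0, 2_000                 # demo choices

mean_abs_err = []
for n in (100, 2_500):
    x = rng.exponential(scale=1 / alpha0, size=(reps, n))
    alpha_hat = 1 / x.mean(axis=1)        # MLE of the rate in each repetition
    mean_abs_err.append(np.mean(np.abs(alpha_hat - alpha0)))
print(mean_abs_err)  # the average error shrinks as n grows
```

Averaging the absolute error over many repetitions shows it decreasing roughly like $1/\sqrt{n}$, in line with the rate claimed by asymptotic normality.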

Remark. Let us give a very informal interpretation of Fisher information. The derivative

$$l'(X|\phi_0) = \big(\log f(X|\phi_0)\big)' = \frac{f'(X|\phi_0)}{f(X|\phi_0)}$$

can be interpreted as a measure of how quickly the distribution density or p.f. will change when we slightly change the parameter $\phi$ near $\phi_0$. When we square this and take the expectation, i.e. average over $X$, we get an averaged version of this measure. So if the Fisher information is large, this means that the distribution will change quickly when we move the parameter, so the distribution with parameter $\phi_0$ is quite different from, and can be well distinguished from, the distributions with parameters not so close to $\phi_0$. This means that we should be able to estimate $\phi_0$ well based on the data. On the other hand, if the Fisher information is small, this means that the distribution is very similar to distributions with parameters not so close to $\phi_0$ and, thus, more difficult to distinguish, so our estimation will be worse. We will see precisely this behavior in the Theorem below.

The next lemma gives another, often convenient, way to compute Fisher information.

Lemma. We have

$$\mathbb{E}_{\phi_0} l''(X|\phi_0) \equiv \mathbb{E}_{\phi_0} \frac{\partial^2}{\partial\phi^2}\log f(X|\phi)\Big|_{\phi = \phi_0} = -I(\phi_0).$$

Proof. First of all, we have

$$l''(X|\phi) = \big(\log f(X|\phi)\big)'' = \Big(\frac{f'(X|\phi)}{f(X|\phi)}\Big)' = \frac{f''(X|\phi)}{f(X|\phi)} - \Big(\frac{f'(X|\phi)}{f(X|\phi)}\Big)^2.$$

Also, since the p.d.f. integrates to $1$,

$$\int f(x|\phi)\,dx = 1,$$

if we take derivatives of this equation with respect to $\phi$ (and interchange derivative and integral, which can usually be done), we get

$$\int \frac{\partial}{\partial\phi} f(x|\phi)\,dx = 0 \quad \text{and} \quad \int \frac{\partial^2}{\partial\phi^2} f(x|\phi)\,dx = 0.$$

To finish the proof, we write the following computation:

$$\mathbb{E}_{\phi_0} l''(X|\phi_0) = \int \big(\log f(x|\phi_0)\big)'' f(x|\phi_0)\,dx = \int \Big(\frac{f''(x|\phi_0)}{f(x|\phi_0)} - \Big(\frac{f'(x|\phi_0)}{f(x|\phi_0)}\Big)^2\Big) f(x|\phi_0)\,dx$$

$$= \int f''(x|\phi_0)\,dx - \mathbb{E}_{\phi_0}\big(l'(X|\phi_0)\big)^2 = 0 - I(\phi_0) = -I(\phi_0).$$
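The two formulas for Fisher information, $I(\phi_0) = \mathbb{E}(l')^2$ and $I(\phi_0) = -\mathbb{E}\,l''$, can be verified exactly for the Bernoulli family, where the expectation over $X \in \{0, 1\}$ is a finite sum (the parameter values below are demo choices):

```python
import numpy as np

vals = []
for p in (0.1, 0.3, 0.5, 0.8):            # demo parameter values
    probs = np.array([1 - p, p])           # P(X=0), P(X=1)
    x = np.array([0.0, 1.0])
    lp = x / p - (1 - x) / (1 - p)         # l'(x|p)  = x/p - (1-x)/(1-p)
    lpp = -x / p**2 - (1 - x) / (1 - p)**2 # l''(x|p) = -x/p^2 - (1-x)/(1-p)^2
    I_sq = np.sum(probs * lp**2)           # E (l')^2
    I_dd = -np.sum(probs * lpp)            # -E l''
    vals.append((I_sq, I_dd, 1 / (p * (1 - p))))
print(vals)  # both formulas agree with 1/(p(1-p)) for every p
```

Both expectations reduce to $1/p + 1/(1-p) = 1/(p(1-p))$, matching the Bernoulli example worked out later in this section.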

We are now ready to prove the main result of this section.

Theorem. (Asymptotic normality of MLE.) We have

$$\sqrt{n}\,(\hat{\phi} - \phi_0) \stackrel{d}{\to} N\Big(0, \frac{1}{I(\phi_0)}\Big).$$

As we can see, the asymptotic variance (the dispersion of the estimate around the true parameter) will be smaller when the Fisher information is larger.

Proof. Since the MLE $\hat{\phi}$ is the maximizer of $L_n(\phi) = \frac{1}{n}\sum_{i=1}^n \log f(X_i|\phi)$, we have

$$L_n'(\hat{\phi}) = 0.$$

Let us use the Mean Value Theorem,

$$\frac{f(a) - f(b)}{a - b} = f'(c) \quad \text{or} \quad f(a) = f(b) + f'(c)(a - b) \quad \text{for some } c \in [a, b],$$

with $f(\phi) = L_n'(\phi)$, $a = \hat{\phi}$ and $b = \phi_0$. Then we can write

$$0 = L_n'(\hat{\phi}) = L_n'(\phi_0) + L_n''(\hat{\phi}_1)(\hat{\phi} - \phi_0)$$

for some $\hat{\phi}_1 \in [\hat{\phi}, \phi_0]$. From here we get that

$$\hat{\phi} - \phi_0 = -\frac{L_n'(\phi_0)}{L_n''(\hat{\phi}_1)} \quad \text{and} \quad \sqrt{n}\,(\hat{\phi} - \phi_0) = -\frac{\sqrt{n}\,L_n'(\phi_0)}{L_n''(\hat{\phi}_1)}. \qquad (3.0.1)$$

Since by the Lemma in the previous section we know that $\phi_0$ is the maximizer of $L(\phi)$, we have

$$L'(\phi_0) = \mathbb{E}_{\phi_0} l'(X|\phi_0) = 0. \qquad (3.0.2)$$

Therefore, the numerator in (3.0.1),

$$\sqrt{n}\,L_n'(\phi_0) = \sqrt{n}\,\frac{1}{n}\sum_{i=1}^n l'(X_i|\phi_0) = \sqrt{n}\,\Big(\frac{1}{n}\sum_{i=1}^n l'(X_i|\phi_0) - \mathbb{E}_{\phi_0} l'(X_1|\phi_0)\Big) \stackrel{d}{\to} N\big(0, \mathrm{Var}_{\phi_0}(l'(X_1|\phi_0))\big), \qquad (3.0.3)$$

converges in distribution by the Central Limit Theorem; here we used (3.0.2) to subtract the mean, which is zero.

Next, let us consider the denominator in (3.0.1). First of all, we have that for all $\phi$,

$$L_n''(\phi) = \frac{1}{n}\sum_{i=1}^n l''(X_i|\phi) \to \mathbb{E}_{\phi_0} l''(X_1|\phi) \quad \text{by the LLN}. \qquad (3.0.4)$$

Also, since $\hat{\phi}_1 \in [\hat{\phi}, \phi_0]$ and, by the consistency result of the previous section, $\hat{\phi} \to \phi_0$, we have $\hat{\phi}_1 \to \phi_0$. Using this together with (3.0.4) we get

$$L_n''(\hat{\phi}_1) \to \mathbb{E}_{\phi_0} l''(X_1|\phi_0) = -I(\phi_0) \quad \text{by the Lemma above}.$$
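The theorem can be checked by simulation. A minimal sketch, assuming Exponential($\alpha_0$) data, where $I(\alpha) = 1/\alpha^2$ so the asymptotic variance of $\sqrt{n}\,(\hat{\alpha} - \alpha_0)$ should be $\alpha_0^2$ (all numeric values below are demo choices):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha0, n, reps = 2.0, 2_000, 4_000       # demo choices

x = rng.exponential(scale=1 / alpha0, size=(reps, n))
alpha_hat = 1 / x.mean(axis=1)            # MLE of the rate in each repetition
z = np.sqrt(n) * (alpha_hat - alpha0)

# The theorem predicts mean near 0 and variance near 1/I(alpha0) = alpha0^2 = 4.
print(z.mean(), z.var())
```

The empirical variance of $\sqrt{n}\,(\hat{\alpha} - \alpha_0)$ comes out close to $\alpha_0^2 = 1/I(\alpha_0)$, as the theorem predicts.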

Combining this with (3.0.3) we get

$$-\frac{\sqrt{n}\,L_n'(\phi_0)}{L_n''(\hat{\phi}_1)} \stackrel{d}{\to} N\Big(0, \frac{\mathrm{Var}_{\phi_0}(l'(X_1|\phi_0))}{(I(\phi_0))^2}\Big).$$

Finally, the variance

$$\mathrm{Var}_{\phi_0}\big(l'(X_1|\phi_0)\big) = \mathbb{E}_{\phi_0}\big(l'(X_1|\phi_0)\big)^2 - \big(\mathbb{E}_{\phi_0} l'(X_1|\phi_0)\big)^2 = I(\phi_0) - 0 = I(\phi_0),$$

where in the last equality we used the definition of Fisher information and (3.0.2). Therefore, the asymptotic variance is $I(\phi_0)/(I(\phi_0))^2 = 1/I(\phi_0)$, which proves the Theorem.

Let us compute the Fisher information for some particular distributions.

Example 1. The family of Bernoulli distributions $B(p)$ has p.f.

$$f(x|p) = p^x (1 - p)^{1 - x},$$

and, taking the logarithm,

$$\log f(x|p) = x\log p + (1 - x)\log(1 - p).$$

The first and second derivatives with respect to the parameter $p$ are

$$\frac{\partial}{\partial p}\log f(x|p) = \frac{x}{p} - \frac{1 - x}{1 - p}, \qquad \frac{\partial^2}{\partial p^2}\log f(x|p) = -\frac{x}{p^2} - \frac{1 - x}{(1 - p)^2}.$$

Then the Fisher information can be computed as

$$I(p) = -\mathbb{E}\,\frac{\partial^2}{\partial p^2}\log f(X|p) = \frac{\mathbb{E}X}{p^2} + \frac{1 - \mathbb{E}X}{(1 - p)^2} = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)}.$$

The MLE of $p$ is $\hat{p} = \bar{X}$, and the asymptotic normality result states that

$$\sqrt{n}\,(\hat{p} - p_0) \stackrel{d}{\to} N\big(0, p_0(1 - p_0)\big),$$

which, of course, also follows directly from the CLT.

Example 2. The family of exponential distributions $E(\alpha)$ has p.d.f.

$$f(x|\alpha) = \begin{cases} \alpha e^{-\alpha x}, & x \ge 0, \\ 0, & x < 0, \end{cases}$$

and, therefore,

$$\log f(x|\alpha) = \log\alpha - \alpha x \quad \text{and} \quad \frac{\partial^2}{\partial\alpha^2}\log f(x|\alpha) = -\frac{1}{\alpha^2}.$$

This does not depend on $X$, and we get

$$I(\alpha) = -\mathbb{E}\,\frac{\partial^2}{\partial\alpha^2}\log f(X|\alpha) = \frac{1}{\alpha^2}.$$

Therefore, the MLE $\hat{\alpha} = 1/\bar{X}$ is asymptotically normal and

$$\sqrt{n}\,(\hat{\alpha} - \alpha_0) \stackrel{d}{\to} N(0, \alpha_0^2).$$
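The two Fisher informations computed in the examples can be sanity-checked with the formula $I = -\mathbb{E}\,l''(X|\phi_0)$: by Monte Carlo for the Bernoulli case, and exactly for the exponential case, where $l''$ does not depend on $X$ (the parameter values and sample size are demo choices):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 1_000_000                              # Monte Carlo sample size (demo choice)

p = 0.3
xb = rng.binomial(1, p, size=m).astype(float)
# -l''(x|p) = x/p^2 + (1-x)/(1-p)^2, averaged over a large Bernoulli sample
I_bern = np.mean(xb / p**2 + (1 - xb) / (1 - p)**2)

alpha = 2.0
# For the exponential family, l''(x|alpha) = -1/alpha^2 does not depend on x,
# so -E l'' is exactly 1/alpha^2 and no sampling is needed.
I_exp = 1 / alpha**2

print(I_bern, 1 / (p * (1 - p)), I_exp)
```

The Monte Carlo estimate lands close to $1/(p(1-p))$, confirming Example 1, while Example 2's information is obtained in closed form.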