Lecture 3. Properties of MLE: consistency, asymptotic normality. Fisher information.




In this section we will try to understand why MLEs are "good". Let us recall two facts from probability that will be used often throughout this course.

Law of Large Numbers (LLN): If the distribution of the i.i.d. sample $X_1, \ldots, X_n$ is such that $X_1$ has finite expectation, i.e. $\mathbb{E}|X_1| < \infty$, then the sample average

$$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n} \to \mathbb{E}X_1$$

converges to its expectation in probability, which means that for any arbitrarily small $\varepsilon > 0$,

$$\mathbb{P}\big(|\bar{X}_n - \mathbb{E}X_1| > \varepsilon\big) \to 0 \ \text{ as } \ n \to \infty.$$

Note. Whenever we use the LLN below we will simply say that the average converges to its expectation, and we will not mention in what sense. More mathematically inclined students are welcome to carry out these steps more rigorously, especially when we use the LLN in combination with the Central Limit Theorem.

Central Limit Theorem (CLT): If the distribution of the i.i.d. sample $X_1, \ldots, X_n$ is such that $X_1$ has finite expectation and variance, i.e. $\mathbb{E}|X_1| < \infty$ and $\sigma^2 = \mathrm{Var}(X_1) < \infty$, then

$$\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1) \stackrel{d}{\to} N(0, \sigma^2)$$

converges in distribution to a normal distribution with mean zero and variance $\sigma^2$, which means that for any interval $[a, b]$,

$$\mathbb{P}\big(\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1) \in [a, b]\big) \to \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}\,dx.$$
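The LLN can be sketched numerically. Below is a minimal simulation, assuming Bernoulli(p) samples; the parameter p, the tolerance eps, and the sample sizes are arbitrary demo choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
p, eps, reps = 0.3, 0.05, 10_000  # demo choices: Bernoulli parameter, tolerance, repetitions

tails = []
for n in (100, 1_000, 10_000):
    # A Binomial(n, p) draw divided by n is the average of n Bernoulli(p)'s.
    xbar = rng.binomial(n, p, size=reps) / n
    # Estimate P(|X̄_n - EX_1| > eps) by the fraction of repetitions exceeding eps.
    tails.append(np.mean(np.abs(xbar - p) > eps))
print(tails)  # the estimated tail probabilities shrink toward 0 as n grows
```

The shrinking tail probabilities are exactly the convergence in probability that the LLN asserts.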

In other words, the random variable $\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1)$ will behave like a random variable drawn from a normal distribution when $n$ gets large.

Exercise. Illustrate the CLT by generating $n$ Bernoulli random variables $B(p)$ (or one Binomial r.v. $B(n, p)$) and then computing $\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1)$. Repeat this many times and use dfittool to see that this random quantity is well approximated by a normal distribution.

We will prove that the MLE satisfies (usually) the following two properties, called consistency and asymptotic normality.

1. Consistency. We say that an estimate $\hat{\phi}$ is consistent if $\hat{\phi} \to \phi_0$ in probability as $n \to \infty$, where $\phi_0$ is the true unknown parameter of the distribution of the sample.

2. Asymptotic normality. We say that $\hat{\phi}$ is asymptotically normal if

$$\sqrt{n}\,(\hat{\phi} - \phi_0) \stackrel{d}{\to} N(0, \sigma^2_{\phi_0}),$$

where $\sigma^2_{\phi_0}$ is called the asymptotic variance of the estimate $\hat{\phi}$. Asymptotic normality says that the estimator not only converges to the unknown parameter, but that it converges fast enough, at a rate $1/\sqrt{n}$.

Consistency of MLE. To make our discussion as simple as possible, let us assume that the likelihood function is smooth and behaves in a nice way, as shown in Figure 3.1, i.e. its maximum is achieved at a unique point $\hat{\phi}$.

[Figure 3.1: Maximum Likelihood Estimator (MLE)]

Suppose that the data $X_1, \ldots, X_n$ is generated from a distribution with unknown parameter $\phi_0$ and $\hat{\phi}$ is the MLE. Why does $\hat{\phi}$ converge to the unknown parameter $\phi_0$? This is not immediately obvious, and in this section we will give a sketch of why this happens.
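The exercise above suggests MATLAB's dfittool; here is a rough numpy analogue (all parameter values are arbitrary demo choices), which checks the normal fit through coverage fractions instead of a fitted density plot.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 0.3, 2_000, 50_000          # demo choices
sigma = np.sqrt(p * (1 - p))             # sd of Bernoulli(p); the CLT limit is N(0, sigma^2)

# One Binomial(n, p) draw divided by n equals the average of n Bernoulli(p)'s.
z = np.sqrt(n) * (rng.binomial(n, p, size=reps) / n - p)

# Crude normality checks in place of dfittool's fitted-density plot:
within_1sd = np.mean(np.abs(z) < sigma)       # about 0.683 for N(0, sigma^2)
within_2sd = np.mean(np.abs(z) < 2 * sigma)   # about 0.954
print(z.mean(), z.std(), within_1sd, within_2sd)
```

The empirical mean, standard deviation, and coverage fractions match the $N(0, p(1-p))$ limit closely, as the CLT predicts.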

First of all, the MLE $\hat{\phi}$ is the maximizer of

$$L_n(\phi) = \frac{1}{n}\sum_{i=1}^n \log f(X_i|\phi),$$

which is the log-likelihood function normalized by $\frac{1}{n}$ (of course, this does not affect the maximization). Notice that the function $L_n(\phi)$ depends on the data. Let us consider the function $l(X|\phi) = \log f(X|\phi)$ and define $L(\phi) = \mathbb{E}_{\phi_0} l(X|\phi)$, where $\mathbb{E}_{\phi_0}$ denotes the expectation with respect to the true unknown parameter $\phi_0$ of the sample $X_1, \ldots, X_n$. If we deal with continuous distributions, then

$$L(\phi) = \int \big(\log f(x|\phi)\big) f(x|\phi_0)\,dx.$$

By the law of large numbers, for any $\phi$,

$$L_n(\phi) \to \mathbb{E}_{\phi_0} l(X|\phi) = L(\phi).$$

Note that $L(\phi)$ does not depend on the sample; it depends only on $\phi$. We will need the following

Lemma. For any $\phi$, $L(\phi) \le L(\phi_0)$. Moreover, the inequality is strict, $L(\phi) < L(\phi_0)$, unless

$$\mathbb{P}_{\phi_0}\big(f(X|\phi) = f(X|\phi_0)\big) = 1,$$

which means that $\mathbb{P}_\phi = \mathbb{P}_{\phi_0}$.

Proof. Let us consider the difference

$$L(\phi) - L(\phi_0) = \mathbb{E}_{\phi_0}\big(\log f(X|\phi) - \log f(X|\phi_0)\big) = \mathbb{E}_{\phi_0} \log \frac{f(X|\phi)}{f(X|\phi_0)}.$$

Since $\log t \le t - 1$, we can write

$$\mathbb{E}_{\phi_0} \log \frac{f(X|\phi)}{f(X|\phi_0)} \le \mathbb{E}_{\phi_0}\Big(\frac{f(X|\phi)}{f(X|\phi_0)} - 1\Big) = \int \frac{f(x|\phi)}{f(x|\phi_0)}\, f(x|\phi_0)\,dx - 1 = \int f(x|\phi)\,dx - 1 = 1 - 1 = 0.$$

Both integrals are equal to $1$ because we are integrating probability density functions. This proves that $L(\phi) \le L(\phi_0)$. The second statement of the Lemma is also clear, since the inequality $\log t \le t - 1$ is strict unless $t = 1$.
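The Lemma can be checked numerically. For the Bernoulli family the expectation defining $L(\phi)$ can be computed exactly: $L(\phi) = p_0 \log\phi + (1 - p_0)\log(1 - \phi)$. A minimal sketch (the value $p_0 = 0.3$ and the grid are demo choices):

```python
import numpy as np

p0 = 0.3                                  # demo choice for the true parameter
phi = np.linspace(0.01, 0.99, 99)         # grid of candidate parameters (step 0.01)

# Exact L(phi) = E_{p0} log f(X|phi) for Bernoulli: p0*log(phi) + (1-p0)*log(1-phi)
L = p0 * np.log(phi) + (1 - p0) * np.log(1 - phi)
L0 = p0 * np.log(p0) + (1 - p0) * np.log(1 - p0)

print(bool(np.all(L <= L0 + 1e-12)))      # L(phi) <= L(p0) everywhere on the grid
print(phi[np.argmax(L)])                  # the grid maximizer sits at p0
```

This is the key fact used below: the limiting function $L(\phi)$ is maximized at the true parameter.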

We will use this Lemma to sketch the consistency of the MLE.

Theorem. Under some regularity conditions on the family of distributions, the MLE $\hat{\phi}$ is consistent, i.e. $\hat{\phi} \to \phi_0$ as $n \to \infty$.

The statement of this Theorem is not very precise, but rather than proving a rigorous mathematical statement, our goal here is to illustrate the main idea. Mathematically inclined students are welcome to come up with a precise statement.

[Figure 3.2: Illustration to Theorem — as $L_n(\phi)$ approaches $L(\phi)$, the maximizer $\hat{\phi}$ approaches the maximizer $\phi_0$.]

Proof. We have the following facts:

1. $\hat{\phi}$ is the maximizer of $L_n(\phi)$ (by definition).

2. $\phi_0$ is the maximizer of $L(\phi)$ (by the Lemma).

3. For any $\phi$, $L_n(\phi) \to L(\phi)$ by the LLN.

This situation is illustrated in Figure 3.2. Therefore, since the two functions $L_n$ and $L$ are getting closer, their points of maximum should also get closer, which is exactly what it means that $\hat{\phi} \to \phi_0$.

Asymptotic normality of MLE. Fisher information. We want to show the asymptotic normality of the MLE, i.e. to show that

$$\sqrt{n}\,(\hat{\phi} - \phi_0) \stackrel{d}{\to} N(0, \sigma^2_{\mathrm{MLE}})$$

for some $\sigma^2_{\mathrm{MLE}}$, and to compute $\sigma^2_{\mathrm{MLE}}$. This asymptotic variance in some sense measures the quality of the MLE. First, we need to introduce the notion of Fisher information. Let us recall that above we defined the function $l(X|\phi) = \log f(X|\phi)$. To simplify the notation, we will denote by $l'(X|\phi)$, $l''(X|\phi)$, etc. the derivatives of $l(X|\phi)$ with respect to $\phi$.

Definition. (Fisher information.) The Fisher information of a random variable $X$ with distribution $\mathbb{P}_{\phi_0}$ from the family $\{\mathbb{P}_\phi : \phi \in \Theta\}$ is defined by

$$I(\phi_0) = \mathbb{E}_{\phi_0}\big(l'(X|\phi_0)\big)^2 = \mathbb{E}_{\phi_0}\Big(\frac{\partial}{\partial\phi}\log f(X|\phi)\Big)^2\Big|_{\phi = \phi_0}.$$
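The consistency sketched above is easy to observe in simulation. A minimal demo, assuming Exponential($\alpha_0$) data, for which the MLE of the rate is $\hat{\alpha} = 1/\bar{X}$ (the values of $\alpha_0$, the sample sizes, and the number of repetitions are demo choices):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha0, reps = 2.0, 2_000                 # demo choices

mean_abs_err = []
for n in (100, 2_500):
    x = rng.exponential(scale=1 / alpha0, size=(reps, n))
    alpha_hat = 1 / x.mean(axis=1)        # MLE of the rate in each repetition
    mean_abs_err.append(np.mean(np.abs(alpha_hat - alpha0)))
print(mean_abs_err)  # the average error shrinks as n grows
```

Averaging the absolute error over many repetitions shows it decreasing roughly like $1/\sqrt{n}$, in line with the rate claimed by asymptotic normality.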

Remark. Let us give a very informal interpretation of Fisher information. The derivative

$$l'(X|\phi_0) = \big(\log f(X|\phi_0)\big)' = \frac{f'(X|\phi_0)}{f(X|\phi_0)}$$

can be interpreted as a measure of how quickly the distribution density or p.f. will change when we slightly change the parameter $\phi$ near $\phi_0$. When we square this and take the expectation, i.e. average over $X$, we get an averaged version of this measure. So if the Fisher information is large, this means that the distribution will change quickly when we move the parameter, so the distribution with parameter $\phi_0$ is quite different from, and can be well distinguished from, the distributions with parameters not so close to $\phi_0$. This means that we should be able to estimate $\phi_0$ well based on the data. On the other hand, if the Fisher information is small, this means that the distribution is very similar to distributions with parameters not so close to $\phi_0$ and, thus, more difficult to distinguish, so our estimation will be worse. We will see precisely this behavior in the Theorem below.

The next lemma gives another, often convenient, way to compute Fisher information.

Lemma. We have

$$\mathbb{E}_{\phi_0} l''(X|\phi_0) \equiv \mathbb{E}_{\phi_0} \frac{\partial^2}{\partial\phi^2}\log f(X|\phi)\Big|_{\phi = \phi_0} = -I(\phi_0).$$

Proof. First of all, we have

$$l''(X|\phi) = \big(\log f(X|\phi)\big)'' = \Big(\frac{f'(X|\phi)}{f(X|\phi)}\Big)' = \frac{f''(X|\phi)}{f(X|\phi)} - \Big(\frac{f'(X|\phi)}{f(X|\phi)}\Big)^2.$$

Also, since the p.d.f. integrates to $1$,

$$\int f(x|\phi)\,dx = 1,$$

if we take derivatives of this equation with respect to $\phi$ (and interchange derivative and integral, which can usually be done), we get

$$\int \frac{\partial}{\partial\phi} f(x|\phi)\,dx = 0 \quad \text{and} \quad \int \frac{\partial^2}{\partial\phi^2} f(x|\phi)\,dx = 0.$$

To finish the proof, we write the following computation:

$$\mathbb{E}_{\phi_0} l''(X|\phi_0) = \int \big(\log f(x|\phi_0)\big)'' f(x|\phi_0)\,dx = \int \Big(\frac{f''(x|\phi_0)}{f(x|\phi_0)} - \Big(\frac{f'(x|\phi_0)}{f(x|\phi_0)}\Big)^2\Big) f(x|\phi_0)\,dx$$

$$= \int f''(x|\phi_0)\,dx - \mathbb{E}_{\phi_0}\big(l'(X|\phi_0)\big)^2 = 0 - I(\phi_0) = -I(\phi_0).$$
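The two formulas for Fisher information, $I(\phi_0) = \mathbb{E}(l')^2$ and $I(\phi_0) = -\mathbb{E}\,l''$, can be verified exactly for the Bernoulli family, where the expectation over $X \in \{0, 1\}$ is a finite sum (the parameter values below are demo choices):

```python
import numpy as np

vals = []
for p in (0.1, 0.3, 0.5, 0.8):            # demo parameter values
    probs = np.array([1 - p, p])           # P(X=0), P(X=1)
    x = np.array([0.0, 1.0])
    lp = x / p - (1 - x) / (1 - p)         # l'(x|p)  = x/p - (1-x)/(1-p)
    lpp = -x / p**2 - (1 - x) / (1 - p)**2 # l''(x|p) = -x/p^2 - (1-x)/(1-p)^2
    I_sq = np.sum(probs * lp**2)           # E (l')^2
    I_dd = -np.sum(probs * lpp)            # -E l''
    vals.append((I_sq, I_dd, 1 / (p * (1 - p))))
print(vals)  # both formulas agree with 1/(p(1-p)) for every p
```

Both expectations reduce to $1/p + 1/(1-p) = 1/(p(1-p))$, matching the Bernoulli example worked out later in this section.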

We are now ready to prove the main result of this section.

Theorem. (Asymptotic normality of MLE.) We have

$$\sqrt{n}\,(\hat{\phi} - \phi_0) \stackrel{d}{\to} N\Big(0, \frac{1}{I(\phi_0)}\Big).$$

As we can see, the asymptotic variance (the dispersion of the estimate around the true parameter) will be smaller when the Fisher information is larger.

Proof. Since the MLE $\hat{\phi}$ is the maximizer of $L_n(\phi) = \frac{1}{n}\sum_{i=1}^n \log f(X_i|\phi)$, we have

$$L_n'(\hat{\phi}) = 0.$$

Let us use the Mean Value Theorem,

$$\frac{f(a) - f(b)}{a - b} = f'(c) \quad \text{or} \quad f(a) = f(b) + f'(c)(a - b) \quad \text{for some } c \in [a, b],$$

with $f(\phi) = L_n'(\phi)$, $a = \hat{\phi}$ and $b = \phi_0$. Then we can write

$$0 = L_n'(\hat{\phi}) = L_n'(\phi_0) + L_n''(\hat{\phi}_1)(\hat{\phi} - \phi_0)$$

for some $\hat{\phi}_1 \in [\hat{\phi}, \phi_0]$. From here we get that

$$\hat{\phi} - \phi_0 = -\frac{L_n'(\phi_0)}{L_n''(\hat{\phi}_1)} \quad \text{and} \quad \sqrt{n}\,(\hat{\phi} - \phi_0) = -\frac{\sqrt{n}\,L_n'(\phi_0)}{L_n''(\hat{\phi}_1)}. \qquad (3.0.1)$$

Since by the Lemma in the previous section we know that $\phi_0$ is the maximizer of $L(\phi)$, we have

$$L'(\phi_0) = \mathbb{E}_{\phi_0} l'(X|\phi_0) = 0. \qquad (3.0.2)$$

Therefore, the numerator in (3.0.1),

$$\sqrt{n}\,L_n'(\phi_0) = \sqrt{n}\,\frac{1}{n}\sum_{i=1}^n l'(X_i|\phi_0) = \sqrt{n}\,\Big(\frac{1}{n}\sum_{i=1}^n l'(X_i|\phi_0) - \mathbb{E}_{\phi_0} l'(X_1|\phi_0)\Big) \stackrel{d}{\to} N\big(0, \mathrm{Var}_{\phi_0}(l'(X_1|\phi_0))\big), \qquad (3.0.3)$$

converges in distribution by the Central Limit Theorem; here we used (3.0.2) to subtract the mean, which is zero.

Next, let us consider the denominator in (3.0.1). First of all, we have that for all $\phi$,

$$L_n''(\phi) = \frac{1}{n}\sum_{i=1}^n l''(X_i|\phi) \to \mathbb{E}_{\phi_0} l''(X_1|\phi) \quad \text{by the LLN}. \qquad (3.0.4)$$

Also, since $\hat{\phi}_1 \in [\hat{\phi}, \phi_0]$ and, by the consistency result of the previous section, $\hat{\phi} \to \phi_0$, we have $\hat{\phi}_1 \to \phi_0$. Using this together with (3.0.4) we get

$$L_n''(\hat{\phi}_1) \to \mathbb{E}_{\phi_0} l''(X_1|\phi_0) = -I(\phi_0) \quad \text{by the Lemma above}.$$
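The theorem can be checked by simulation. A minimal sketch, assuming Exponential($\alpha_0$) data, where $I(\alpha) = 1/\alpha^2$ so the asymptotic variance of $\sqrt{n}\,(\hat{\alpha} - \alpha_0)$ should be $\alpha_0^2$ (all numeric values below are demo choices):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha0, n, reps = 2.0, 2_000, 4_000       # demo choices

x = rng.exponential(scale=1 / alpha0, size=(reps, n))
alpha_hat = 1 / x.mean(axis=1)            # MLE of the rate in each repetition
z = np.sqrt(n) * (alpha_hat - alpha0)

# The theorem predicts mean near 0 and variance near 1/I(alpha0) = alpha0^2 = 4.
print(z.mean(), z.var())
```

The empirical variance of $\sqrt{n}\,(\hat{\alpha} - \alpha_0)$ comes out close to $\alpha_0^2 = 1/I(\alpha_0)$, as the theorem predicts.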

Combining this with (3.0.3) we get

$$-\frac{\sqrt{n}\,L_n'(\phi_0)}{L_n''(\hat{\phi}_1)} \stackrel{d}{\to} N\Big(0, \frac{\mathrm{Var}_{\phi_0}(l'(X_1|\phi_0))}{(I(\phi_0))^2}\Big).$$

Finally, the variance

$$\mathrm{Var}_{\phi_0}\big(l'(X_1|\phi_0)\big) = \mathbb{E}_{\phi_0}\big(l'(X_1|\phi_0)\big)^2 - \big(\mathbb{E}_{\phi_0} l'(X_1|\phi_0)\big)^2 = I(\phi_0) - 0 = I(\phi_0),$$

where in the last equality we used the definition of Fisher information and (3.0.2). Therefore, the asymptotic variance is $I(\phi_0)/(I(\phi_0))^2 = 1/I(\phi_0)$, which proves the Theorem.

Let us compute the Fisher information for some particular distributions.

Example 1. The family of Bernoulli distributions $B(p)$ has p.f.

$$f(x|p) = p^x (1 - p)^{1 - x},$$

and, taking the logarithm,

$$\log f(x|p) = x\log p + (1 - x)\log(1 - p).$$

The first and second derivatives with respect to the parameter $p$ are

$$\frac{\partial}{\partial p}\log f(x|p) = \frac{x}{p} - \frac{1 - x}{1 - p}, \qquad \frac{\partial^2}{\partial p^2}\log f(x|p) = -\frac{x}{p^2} - \frac{1 - x}{(1 - p)^2}.$$

Then the Fisher information can be computed as

$$I(p) = -\mathbb{E}\,\frac{\partial^2}{\partial p^2}\log f(X|p) = \frac{\mathbb{E}X}{p^2} + \frac{1 - \mathbb{E}X}{(1 - p)^2} = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)}.$$

The MLE of $p$ is $\hat{p} = \bar{X}$, and the asymptotic normality result states that

$$\sqrt{n}\,(\hat{p} - p_0) \stackrel{d}{\to} N\big(0, p_0(1 - p_0)\big),$$

which, of course, also follows directly from the CLT.

Example 2. The family of exponential distributions $E(\alpha)$ has p.d.f.

$$f(x|\alpha) = \begin{cases} \alpha e^{-\alpha x}, & x \ge 0, \\ 0, & x < 0, \end{cases}$$

and, therefore,

$$\log f(x|\alpha) = \log\alpha - \alpha x \quad \text{and} \quad \frac{\partial^2}{\partial\alpha^2}\log f(x|\alpha) = -\frac{1}{\alpha^2}.$$

This does not depend on $X$, and we get

$$I(\alpha) = -\mathbb{E}\,\frac{\partial^2}{\partial\alpha^2}\log f(X|\alpha) = \frac{1}{\alpha^2}.$$

Therefore, the MLE $\hat{\alpha} = 1/\bar{X}$ is asymptotically normal and

$$\sqrt{n}\,(\hat{\alpha} - \alpha_0) \stackrel{d}{\to} N(0, \alpha_0^2).$$
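The two Fisher informations computed in the examples can be sanity-checked with the formula $I = -\mathbb{E}\,l''(X|\phi_0)$: by Monte Carlo for the Bernoulli case, and exactly for the exponential case, where $l''$ does not depend on $X$ (the parameter values and sample size are demo choices):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 1_000_000                              # Monte Carlo sample size (demo choice)

p = 0.3
xb = rng.binomial(1, p, size=m).astype(float)
# -l''(x|p) = x/p^2 + (1-x)/(1-p)^2, averaged over a large Bernoulli sample
I_bern = np.mean(xb / p**2 + (1 - xb) / (1 - p)**2)

alpha = 2.0
# For the exponential family, l''(x|alpha) = -1/alpha^2 does not depend on x,
# so -E l'' is exactly 1/alpha^2 and no sampling is needed.
I_exp = 1 / alpha**2

print(I_bern, 1 / (p * (1 - p)), I_exp)
```

The Monte Carlo estimate lands close to $1/(p(1-p))$, confirming Example 1, while Example 2's information is obtained in closed form.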