Econ 514: Probability and Statistics. Lecture 9: Point estimation.

Point estimators

In Lecture 7 we discussed the setup of a study of the income distribution in LA. Regarding the population we considered the following possibilities:
- No assumption on the population distribution (except a finite population mean and variance): the nonparametric approach.
- The population distribution has density $f(x;\theta)$, i.e. the density is known up to a vector of parameters: the parametric approach.

In this lecture we take the parametric approach.

We have a random sample $X_1,\dots,X_n$ from a distribution with density $f(x;\theta)$. Upon observation we have the numbers $x_1,\dots,x_n$ and we use these numbers to estimate $\theta$.

Definition: A (point) estimator of $\theta$ is a statistic $\hat\theta = t(X_1,\dots,X_n)$.

Example: If the population distribution is $N(\mu,\sigma^2)$, then $\theta_1 = \mu$ and $\theta_2 = \sigma^2$, and we estimate $\theta$ with the estimators $\hat\theta_1 = \bar X_n$ and $\hat\theta_2 = S_n^2$. We could also write $\hat\mu$ and $\hat\sigma^2$ for these estimators.

This raises two questions:
- How do we find (good) estimators?
- How do we evaluate estimators, i.e. how do we decide that an estimator is a good one?

Finding estimators: the method of moments

Any statistic $t(X_1,\dots,X_n)$ can be an estimator for $\theta$. Natural estimators have a relation to $\theta$.

Example: In an $N(\mu,\sigma^2)$ population we have $\mu = E(X)$ and $\sigma^2 = \mathrm{var}(X)$, so that the sample mean and sample variance are natural estimators.

General procedure: Let $\theta$ be a $K$-vector of parameters
$$\theta = (\theta_1,\dots,\theta_K)'.$$

Let, for $r = 1,\dots,K$,
$$E(X^r) = \mu_r(\theta_1,\dots,\theta_K). \qquad (1)$$
Consider the system of equations
$$\frac{1}{n}\sum_{i=1}^n X_i^r = \mu_r(\theta_1,\dots,\theta_K), \qquad r = 1,\dots,K, \qquad (2)$$
which equates the sample and population $r$-th moments. If (1) has a unique solution for $\theta$, then by the strong law of large numbers (2) has a unique solution for $n$ large enough.

Definition: If (1) has a unique solution for $\theta$, then a solution $\hat\theta$ of (2) is called the method of moments (MM) estimator of $\theta$.

Example: $X_1,\dots,X_n$ is a random sample from the exponential distribution with density
$$f(x;\lambda) = \lambda e^{-\lambda x}, \quad 0 < x < \infty, \qquad f(x;\lambda) = 0 \text{ otherwise.}$$
We have $E(X) = 1/\lambda$ and the MM estimator is the solution to
$$\frac{1}{n}\sum_{i=1}^n X_i = \frac{1}{\hat\lambda} \quad\text{or}\quad \hat\lambda = \frac{1}{\bar X_n}.$$
Also $E(X^2) = 2/\lambda^2$, so that another MM estimator is
$$\hat\lambda = \sqrt{\frac{2}{\frac{1}{n}\sum_{i=1}^n X_i^2}}.$$
Conclusion: MM estimators are not unique.
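
The two MM estimators can be compared numerically. A minimal sketch, assuming NumPy is available and using an arbitrary true rate of 2.0 for illustration: it simulates exponential data and computes both estimators, which differ in finite samples but are both close to the true $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true = 2.0                      # assumed true rate, for illustration only
x = rng.exponential(scale=1.0 / lam_true, size=1000)

# MM estimator based on the first moment: E(X) = 1/lambda
lam_mm1 = 1.0 / x.mean()

# MM estimator based on the second moment: E(X^2) = 2/lambda^2
lam_mm2 = np.sqrt(2.0 / np.mean(x ** 2))

print(lam_mm1, lam_mm2)             # two different, but both consistent, estimates
```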

Theorem: If (1) has a unique solution $\theta$, the functions $\mu_r$ are continuous for $r = 1,\dots,K$, and $E(|X|^K) < \infty$, then
$$\hat\theta_n \stackrel{p}{\to} \theta.$$
Proof: By the (weak) law of large numbers, for $r = 1,\dots,K$,
$$\frac{1}{n}\sum_{i=1}^n X_i^r \stackrel{p}{\to} E(X^r),$$
so that from (2)
$$\mu_r(\hat\theta_n) \stackrel{p}{\to} E(X^r).$$
By the continuous mapping theorem this implies that $\hat\theta_n$ has a probability limit equal to $\theta$, because the solution of (1) is unique.

Maximum likelihood

Example: Population density, e.g. coin tossing:
$$f(x;p) = p^x (1-p)^{1-x}, \quad x = 0,1, \qquad f(x;p) = 0 \text{ otherwise.}$$
We do 3 independent tosses, which is a random sample $X_1, X_2, X_3$. For the observed values $x_1, x_2, x_3$ we have
$$\Pr(X_1 = x_1, X_2 = x_2, X_3 = x_3) = \prod_{i=1}^3 p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^3 x_i}(1-p)^{3 - \sum_{i=1}^3 x_i}.$$
Note that $\sum_{i=1}^3 X_i$ is a sufficient statistic for $p$. We think that $p = \frac12$ or $p = \frac14$.

Possible observations and their probability:

sum of x_i     0       1       2       3
p = 1/4      27/64    9/64    3/64    1/64
p = 1/2       1/8     1/8     1/8     1/8

If we observe $x_1 = x_2 = x_3 = 0$, would you choose $\hat p = \frac14$ or $\hat p = \frac12$? An obvious method to select the estimate is to maximize the probability of the observed sample:
$$\hat p = \tfrac14 \text{ if } \sum_{i=1}^3 x_i = 0,1, \qquad \hat p = \tfrac12 \text{ if } \sum_{i=1}^3 x_i = 2,3.$$
This is a function of the observations only and hence an estimator of $p$. Note that $\hat p$ is a function of the sufficient statistic.
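
The table can be reproduced in a few lines. This is a minimal sketch using exact fractions; the candidate values of $p$ and the three tosses are those of the example above.

```python
from fractions import Fraction

# probability of a particular sample with s successes out of 3 tosses
for p in (Fraction(1, 4), Fraction(1, 2)):
    probs = [p**s * (1 - p)**(3 - s) for s in range(4)]
    print(f"p = {p}:", probs)
# p = 1/4: [27/64, 9/64, 3/64, 1/64]
# p = 1/2: [1/8, 1/8, 1/8, 1/8]
```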

General case

In the example the estimator maximized the joint density of the random sample. If $X_1,\dots,X_n$ is a random sample from a population with density $f(x;\theta)$, the joint density is
$$f(x_1,\dots,x_n;\theta) = \prod_{i=1}^n f(x_i;\theta) = L(\theta; x_1,\dots,x_n).$$
If we consider this as a function of $\theta$ for fixed (observed) $x_1,\dots,x_n$, this function is called the likelihood function. The maximum likelihood estimator (MLE) is
$$\hat\theta = \arg\max_\theta L(\theta; x)$$
with (note the abuse of notation) $x = (x_1,\dots,x_n)'$.

The solution is a function of $x_1,\dots,x_n$:
$$\hat\theta = t(x_1,\dots,x_n).$$
Before observation,
$$\hat\theta = t(X_1,\dots,X_n).$$
Because the maximizing value is unaffected by monotone transformations of the maximand,
$$\hat\theta = \arg\max_\theta \ln L(\theta; x) = \arg\min_\theta\,\big(-\ln L(\theta; x)\big).$$
For a random sample the loglikelihood is
$$\ln L(\theta; x) = \sum_{i=1}^n \ln f(x_i;\theta).$$
Finding the first-order condition is easier with the loglikelihood, and the set of equations (as many as there are parameters)
$$\frac{\partial \ln L}{\partial \theta}(\theta; x) = 0$$
are called the likelihood equations. The derivative of the loglikelihood, $\frac{\partial \ln L}{\partial \theta}(\theta; x)$, is called the score function.

Example: Same population as in the previous example, but a random sample of size $n$. Likelihood function:
$$L(p; x) = \prod_{i=1}^n f(x_i;p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i}.$$
Loglikelihood function:
$$\ln L(p; x) = \sum_{i=1}^n x_i \ln p + \Big(n - \sum_{i=1}^n x_i\Big)\ln(1-p).$$
Score function:
$$\frac{\partial \ln L}{\partial p}(p; x) = \frac{1}{p}\sum_{i=1}^n x_i - \frac{1}{1-p}\Big(n - \sum_{i=1}^n x_i\Big).$$
MLE:
$$\hat p = \frac{1}{n}\sum_{i=1}^n x_i = \bar x_n.$$
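
As a sanity check, one can maximize the loglikelihood numerically and compare with the closed form $\hat p = \bar x_n$. The sketch below assumes SciPy is available and uses an illustrative true value $p = 0.3$; `minimize_scalar` with a bounded search is applied to the negative loglikelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=200)   # Bernoulli sample; true p = 0.3 is illustrative

def neg_loglik(p):
    # minus the Bernoulli loglikelihood ln L(p; x)
    return -(x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())               # numerical MLE vs closed-form MLE (they agree)
```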

Invariance of the MLE

Let $L(\theta; x)$ be a likelihood function and define $\tau = h(\theta)$. Define $A_\tau = \{\theta \mid h(\theta) = \tau\}$ and define the induced likelihood
$$L^*(\tau; x) = \sup_{\theta \in A_\tau} L(\theta; x).$$
If $\hat\theta$ is the MLE of $\theta$, then
$$\sup_\tau L^*(\tau; x) \le L(\hat\theta; x),$$
with equality if $\hat\tau = h(\hat\theta)$.

Conclusion: The MLE of $\tau$ is $\hat\tau = h(\hat\theta)$. This is called the invariance property of the MLE.

Evaluation of estimators: the sampling distribution

The population parameter $\theta$ is unknown, but often there are restrictions on $\theta$, i.e. $\theta \in \Theta$ with $\Theta \subset \mathbb{R}^K$. $\Theta$ is called the parameter space.

Example: For $N(\mu,\sigma^2)$,
$$\Theta = \{(\mu,\sigma^2) \mid -\infty < \mu < \infty,\ \sigma^2 > 0\}.$$
Because an estimator is a statistic $\hat\theta = t(X_1,\dots,X_n)$, it has a sampling distribution derived from the joint distribution of $X_1,\dots,X_n$,
$$f(x_1,\dots,x_n;\theta) = \prod_{i=1}^n f(x_i;\theta).$$

Ideally this sampling distribution should be concentrated around $\theta$, irrespective of what the population value of $\theta$ is.

Example: Random sample of size $n$ from $N(\theta,1)$ with $\Theta = \{\theta \mid -\infty < \theta < \infty\}$. For the estimator $\hat\theta = \bar X_n$ the sampling distribution is
$$\hat\theta \sim N\Big(\theta, \frac{1}{n}\Big).$$
For this sampling distribution: $E(\hat\theta) = \theta$ and $\mathrm{Var}(\hat\theta) \to 0$ as $n \to \infty$.
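
The concentration of the sampling distribution of $\bar X_n$ around $\theta$ can be illustrated by simulation. A minimal sketch, assuming $\theta = 0$ and a handful of illustrative sample sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.0                                  # illustrative true value
for n in (10, 100, 1000):
    means = rng.normal(theta, 1.0, size=(5000, n)).mean(axis=1)
    # variance of the sampling distribution should be close to 1/n
    print(n, means.mean().round(3), means.var().round(4), 1.0 / n)
```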

Compare the sampling distributions in the graph. Ranking of estimators: $\hat\theta_1$ dominates $\hat\theta_2$, and $\hat\theta_1$ dominates $\hat\theta_3$. What about $\hat\theta_1$ and $\hat\theta_4$?

Performance measure: MSE

Ranking is possible if the performance of an estimator is captured by a single number. That is always somewhat arbitrary. The estimation error is $\hat\theta - \theta$, and the squared estimation error $(\hat\theta - \theta)^2$ treats positive and negative errors in the same way and penalizes large errors more. The mean squared error (MSE) is the average squared error over the sampling distribution of $\hat\theta$:
$$\mathrm{MSE}(\hat\theta, \theta) = E\big[(\hat\theta - \theta)^2\big].$$
The MSE depends in general on the population parameter $\theta$.

Unbiased estimators

Definition: An estimator $\hat\theta$ is unbiased if for all $\theta \in \Theta$
$$E(\hat\theta) = \theta.$$
Consider the MSE:
$$\mathrm{MSE}(\hat\theta,\theta) = E\Big[\big((\hat\theta - E(\hat\theta)) + (E(\hat\theta) - \theta)\big)^2\Big] = E\big[(\hat\theta - E(\hat\theta))^2\big] + (E(\hat\theta) - \theta)^2 + 2E\big[(\hat\theta - E(\hat\theta))(E(\hat\theta) - \theta)\big] = \mathrm{Var}(\hat\theta) + (E(\hat\theta) - \theta)^2,$$
where the cross term vanishes because $E(\hat\theta - E(\hat\theta)) = 0$. $E(\hat\theta) - \theta$ is the bias of the estimator, so that we have
$$\mathrm{MSE}(\hat\theta,\theta) = \mathrm{Var}(\hat\theta) + \mathrm{bias}^2,$$
and if the estimator is unbiased then
$$\mathrm{MSE}(\hat\theta,\theta) = \mathrm{Var}(\hat\theta).$$
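
The decomposition MSE = variance + bias$^2$ can be checked by simulation for a deliberately biased estimator. The sketch below uses the shrunken mean $\tilde\theta = 0.9\,\bar X_n$ as an arbitrary biased example; all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 2.0, 50, 20000              # illustrative values
est = 0.9 * rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)  # biased estimator

mse = np.mean((est - theta) ** 2)
var, bias = est.var(), est.mean() - theta
print(mse, var + bias ** 2)                  # the two numbers should nearly coincide
```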

If we only consider unbiased estimators, ranking by MSE is ranking by variance.

Example: For a random sample from $N(\theta,1)$ and $\hat\theta = \bar X_n$, $E(\hat\theta) = \theta$, i.e. the estimator is unbiased, and
$$\mathrm{MSE}(\hat\theta,\theta) = \mathrm{Var}(\bar X_n) = \frac{1}{n},$$
independent of $\theta$.

Best estimators

Definition: Let $X_1,\dots,X_n$ be a random sample from a population with density $f(x;\theta)$. The estimator $\hat\theta$ is a uniformly minimum variance unbiased (UMVU) estimator if for all $\theta \in \Theta$
(i) $E(\hat\theta) = \theta$;
(ii) for all unbiased estimators $\tilde\theta$, $\mathrm{Var}(\tilde\theta) \ge \mathrm{Var}(\hat\theta)$.

Instead of looking for UMVU estimators directly we use a different approach:
(i) Find a lower bound on the variance of all unbiased estimators.
(ii) Verify that for some estimator the variance is equal to the lower bound. That estimator is UMVU.

Theorem (Cramér-Rao): Let $X_1,\dots,X_n$ be a random sample from a population with density $f(x;\theta)$, with $\theta$ a scalar parameter. Then for all unbiased estimators $\hat\theta$ of $\theta$
$$\mathrm{Var}(\hat\theta) \ge \frac{1}{n\,E\!\left[\left(\frac{\partial \ln f}{\partial\theta}(X;\theta)\right)^2\right]}.$$
Proof: We consider the case of a density w.r.t. Lebesgue measure. For all $\theta \in \Theta$
$$\int\!\cdots\!\int \prod_{i=1}^n f(x_i;\theta)\,dx_1\cdots dx_n \equiv 1,$$
$$\int\!\cdots\!\int t(x_1,\dots,x_n) \prod_{i=1}^n f(x_i;\theta)\,dx_1\cdots dx_n \equiv \theta.$$
We want to interchange differentiation w.r.t. $\theta$ and integration. From Lecture 2 a sufficient condition is that $\left|\frac{\partial f}{\partial\theta}(x;\theta)\right| \le M(x)$ with $M(x)$ integrable (check this!).

If this (or another sufficient condition) holds,
$$\int\!\cdots\!\int \frac{\partial}{\partial\theta}\Big(\prod_{i=1}^n f(x_i;\theta)\Big)\,dx_1\cdots dx_n = 0 \qquad (3)$$
and
$$\int\!\cdots\!\int t(x_1,\dots,x_n)\,\frac{\partial}{\partial\theta}\Big(\prod_{i=1}^n f(x_i;\theta)\Big)\,dx_1\cdots dx_n = 1. \qquad (4)$$
Hence, using (3),
$$1 = \int\!\cdots\!\int t(x_1,\dots,x_n)\,\frac{\partial}{\partial\theta}\Big(\prod_{i=1}^n f(x_i;\theta)\Big)\,dx_1\cdots dx_n = \int\!\cdots\!\int \big(t(x_1,\dots,x_n) - \theta\big)\,\frac{\partial}{\partial\theta}\Big(\ln \prod_{i=1}^n f(x_i;\theta)\Big)\prod_{i=1}^n f(x_i;\theta)\,dx_1\cdots dx_n

= E\left[\big(t(X_1,\dots,X_n) - \theta\big)\,\frac{\partial}{\partial\theta}\Big(\sum_{i=1}^n \ln f(X_i;\theta)\Big)\right] \le \sqrt{E\big[(t(X_1,\dots,X_n) - \theta)^2\big]}\;\sqrt{E\left[\left(\frac{\partial}{\partial\theta}\sum_{i=1}^n \ln f(X_i;\theta)\right)^2\right]}, \qquad (5)$$
so that
$$\mathrm{Var}(\hat\theta) \ge \frac{1}{E\left[\left(\frac{\partial}{\partial\theta}\sum_{i=1}^n \ln f(X_i;\theta)\right)^2\right]}. \qquad (6)$$
In (5) we used the Cauchy-Schwarz inequality (see Lecture 6).

Because $X_1,\dots,X_n$ are independently and identically distributed (abbreviated i.i.d.),
$$E\left[\left(\frac{\partial}{\partial\theta}\sum_{i=1}^n \ln f(X_i;\theta)\right)^2\right] = E\left[\sum_{i=1}^n \left(\frac{\partial \ln f}{\partial\theta}(X_i;\theta)\right)^2\right] = n\,E\left[\left(\frac{\partial \ln f}{\partial\theta}(X_1;\theta)\right)^2\right],$$
because from
$$E\left[\frac{\partial \ln f}{\partial\theta}(X_1;\theta)\right] = \int \frac{\partial f}{\partial\theta}(x;\theta)\,dx = \frac{\partial}{\partial\theta}\int f(x;\theta)\,dx = 0$$
the expectation of the cross-products is
$$E\left[\frac{\partial \ln f}{\partial\theta}(X_i;\theta)\,\frac{\partial \ln f}{\partial\theta}(X_j;\theta)\right] = 0, \quad i \ne j.$$

The lower bound on the variance of an unbiased estimator is called the Cramér-Rao lower bound. If $X_1,\dots,X_n$ is not a random sample but has joint density $f(x_1,\dots,x_n;\theta)$, then the lower bound is
$$\frac{1}{E\left[\left(\frac{\partial}{\partial\theta}\ln f(X_1,\dots,X_n;\theta)\right)^2\right]}.$$
The inequality in (5) is an equality if and only if
$$\frac{\partial}{\partial\theta}\ln L(\theta; x_1,\dots,x_n) = c(\theta)\,\big(t(x_1,\dots,x_n) - \theta\big).$$
Hence the estimator has a variance that reaches the lower bound if this equation holds.

Example: For an $N(\theta,1)$ population,
$$f(x;\theta) = \frac{1}{\sqrt{2\pi}} e^{-\frac12 (x-\theta)^2}.$$
Hence
$$\ln f(x;\theta) = -\tfrac12 \ln(2\pi) - \tfrac12 (x-\theta)^2, \qquad \frac{\partial \ln f}{\partial\theta}(x;\theta) = x - \theta,$$
and
$$E\left[\left(\frac{\partial \ln f}{\partial\theta}(X_1;\theta)\right)^2\right] = E[(X_1 - \theta)^2] = 1.$$
Conclusion: The Cramér-Rao lower bound is $\frac{1}{n}$. The estimator $\hat\theta = \bar X_n$ is unbiased and has a variance equal to this bound, so $\hat\theta = \bar X_n$ is UMVU.

Information matrix

We have
$$\int\!\cdots\!\int \frac{\partial \ln f}{\partial\theta}(x_1,\dots,x_n;\theta)\, f(x_1,\dots,x_n;\theta)\,dx_1\cdots dx_n \equiv 0.$$
Differentiating with respect to $\theta$ and interchanging differentiation and integration (suggest a sufficient condition that allows this),
$$0 = \int\!\cdots\!\int \frac{\partial^2 \ln f}{\partial\theta^2}(x_1,\dots,x_n;\theta)\, f(x_1,\dots,x_n;\theta)\,dx_1\cdots dx_n + \int\!\cdots\!\int \frac{\partial \ln f}{\partial\theta}(x_1,\dots,x_n;\theta)\,\frac{\partial f}{\partial\theta}(x_1,\dots,x_n;\theta)\,dx_1\cdots dx_n =$$
$$= E\left[\frac{\partial^2 \ln f}{\partial\theta^2}(X_1,\dots,X_n;\theta)\right] + E\left[\left(\frac{\partial \ln f}{\partial\theta}(X_1,\dots,X_n;\theta)\right)^2\right].$$

Conclusion:
$$E\left[\left(\frac{\partial \ln f}{\partial\theta}(X_1,\dots,X_n;\theta)\right)^2\right] = -E\left[\frac{\partial^2 \ln f}{\partial\theta^2}(X_1,\dots,X_n;\theta)\right].$$
The left-hand side is called the Fisher information (matrix). The equation shows that this is the variance of the score function and that it is equal to minus the expected value of the Hessian of the loglikelihood. This relation is called the information matrix equality.

Example 1: Random sample of size $n$ from $N(\mu,\sigma^2)$. The sample variance
$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X_n)^2$$
is an unbiased estimator of $\sigma^2$. It can be shown that
$$\mathrm{Var}(S_n^2) = \frac{2\sigma^4}{n-1}.$$
The second derivative of the log density w.r.t. $\sigma^2$ is
$$\frac{\partial^2}{\partial(\sigma^2)^2} \ln\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x-\mu_0)^2}\right) = \frac{1}{2\sigma^4} - \frac{(x-\mu_0)^2}{\sigma^6}.$$
By the information matrix equality the information (per observation) is
$$E\left[-\frac{1}{2\sigma^4} + \frac{(X-\mu_0)^2}{\sigma^6}\right] = \frac{1}{2\sigma^4},$$
and the Cramér-Rao lower bound is $\frac{2\sigma^4}{n}$.

Conclusion: The MSE of $S_n^2$ is strictly greater than the lower bound.

The MLE of $\sigma^2$ is
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2 = \frac{n-1}{n} S_n^2,$$
with
$$E(\hat\sigma^2) = \frac{n-1}{n}\sigma^2 < \sigma^2, \qquad \mathrm{Var}(\hat\sigma^2) = \frac{(n-1)^2}{n^2}\mathrm{Var}(S_n^2) = \frac{2(n-1)}{n^2}\sigma^4.$$
Conclusion: The MLE is biased. However,
$$\mathrm{MSE}(S_n^2;\sigma^2) = \frac{2\sigma^4}{n-1} > \frac{2n-1}{n^2}\sigma^4 = \mathrm{MSE}(\hat\sigma^2;\sigma^2).$$
Consider the case that $\mu$ is known. We have
$$\frac{\partial \ln L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i-\mu)^2 = \frac{n}{2\sigma^4}\left(\frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2 - \sigma^2\right) = c(\sigma^2)\,(\tilde\sigma^2 - \sigma^2).$$
Conclusion: If $\mu$ is known, the estimator
$$\tilde\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2$$
is unbiased and reaches the Cramér-Rao lower bound.
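
A short simulation can confirm the analytical comparison of the two MSEs. This sketch assumes $N(0,\sigma^2)$ data with $\sigma^2 = 4$ and $n = 20$ chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, n, reps = 4.0, 20, 50000
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

s2 = x.var(axis=1, ddof=1)                  # unbiased sample variance S_n^2
sig2_mle = x.var(axis=1, ddof=0)            # MLE, divides by n

print(np.mean((s2 - sigma2) ** 2), 2 * sigma2**2 / (n - 1))             # MSE of S_n^2
print(np.mean((sig2_mle - sigma2) ** 2), (2*n - 1) * sigma2**2 / n**2)  # MSE of MLE
```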

Example 2: Random sample of size $n$ from a uniform distribution with density
$$f(x;\theta) = \frac{1}{\theta} I_{[0,\theta]}(x).$$
Because
$$\left(\frac{\partial \ln f}{\partial\theta}(x;\theta)\right)^2 = \frac{1}{\theta^2},$$
the Cramér-Rao lower bound seems to be $\frac{\theta^2}{n}$. Consider the estimator $\tilde\theta = \max\{X_1,\dots,X_n\}$. The statistic $T = \max\{X_1,\dots,X_n\}$ has density
$$f_T(t) = \frac{n t^{n-1}}{\theta^n}, \quad 0 < t < \theta, \qquad f_T(t) = 0 \text{ otherwise.}$$
We find
$$E(\tilde\theta) = E(T) = \frac{n}{n+1}\theta,$$
so that $\hat\theta = \frac{n+1}{n}\tilde\theta$ is an unbiased estimator with
$$\mathrm{Var}(\hat\theta) = \frac{1}{n(n+2)}\theta^2 < \frac{\theta^2}{n}.$$
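
A simulation of the unbiased estimator $\hat\theta = \frac{n+1}{n}\max_i X_i$ illustrates that its variance indeed falls below the (invalid) bound $\theta^2/n$. The values of $\theta$ and $n$ below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 3.0, 25, 50000
x = rng.uniform(0.0, theta, size=(reps, n))

theta_hat = (n + 1) / n * x.max(axis=1)        # unbiased estimator based on the maximum
print(theta_hat.mean())                        # close to theta: unbiased
print(theta_hat.var())                         # close to theta^2 / (n (n+2))
print(theta**2 / (n * (n + 2)), theta**2 / n)  # true variance vs the apparent "lower bound"
```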

It seems that we have an unbiased estimator with a variance that is smaller than the lower bound. The problem is that
$$\int_0^\theta f(x;\theta)\,dx \equiv 1.$$
Taking the derivative w.r.t. $\theta$ (by Leibniz's rule),
$$\frac{d}{d\theta}\int_0^\theta f(x;\theta)\,dx = f(\theta;\theta) + \int_0^\theta \frac{\partial f}{\partial\theta}(x;\theta)\,dx \ne \int_0^\theta \frac{\partial f}{\partial\theta}(x;\theta)\,dx,$$
i.e. we cannot interchange differentiation and integration, because the support of the density depends on $\theta$.

Uniqueness of UMVU estimators and the Rao-Blackwell theorem

We show that UMVU estimators are unique. Let $\hat\theta$ be UMVU and let $\tilde\theta$ be another unbiased estimator of $\theta$. Define a third estimator $\theta^* = \hat\theta + t(\hat\theta - \tilde\theta)$ with $t$ some real number; note that $\theta^*$ is unbiased for any $t$. We have
$$\mathrm{Var}(\theta^*) = \mathrm{Var}(\hat\theta) + t^2\,\mathrm{Var}(\hat\theta - \tilde\theta) + 2t\,\mathrm{Cov}(\hat\theta - \tilde\theta, \hat\theta).$$
This variance is minimal if
$$t = -\frac{\mathrm{Cov}(\hat\theta - \tilde\theta, \hat\theta)}{\mathrm{Var}(\hat\theta - \tilde\theta)}.$$
Because $\hat\theta$ is UMVU, this $t$ has to be 0 (otherwise $\theta^*$ would be an unbiased estimator with a smaller variance than $\hat\theta$).

Hence
$$0 = \mathrm{Cov}(\hat\theta - \tilde\theta, \hat\theta) = \mathrm{Var}(\hat\theta) - \mathrm{Cov}(\hat\theta, \tilde\theta),$$
so that
$$\mathrm{Var}(\tilde\theta - \hat\theta) = \mathrm{Var}(\hat\theta) + \mathrm{Var}(\tilde\theta) - 2\,\mathrm{Cov}(\hat\theta, \tilde\theta) = \mathrm{Var}(\tilde\theta) - \mathrm{Var}(\hat\theta). \qquad (7)$$
Conclusions: If $\tilde\theta$ is also UMVU, then $\mathrm{Var}(\tilde\theta - \hat\theta) = 0$ and hence $\hat\theta = \tilde\theta$ with probability 1. Equation (7) also simplifies the calculation of the variance of the difference between an unbiased and a UMVU estimator.

The next theorem shows how to improve an unbiased estimator.

Rao-Blackwell theorem: If $\tilde\theta$ is an unbiased estimator of $\theta$ and $T$ is a sufficient statistic for $\theta$, then
$$\hat\theta = E(\tilde\theta \mid T)$$
is unbiased and $\mathrm{Var}(\hat\theta) \le \mathrm{Var}(\tilde\theta)$.

Proof: For all $\theta \in \Theta$, by the law of iterated expectations
$$E(\hat\theta) = E\big(E(\tilde\theta \mid T)\big) = E(\tilde\theta) = \theta,$$
and by the law of iterated variances
$$\mathrm{Var}(\tilde\theta) = \mathrm{Var}\big(E(\tilde\theta \mid T)\big) + E\big(\mathrm{Var}(\tilde\theta \mid T)\big) = \mathrm{Var}(\hat\theta) + E\big(\mathrm{Var}(\tilde\theta \mid T)\big) \ge \mathrm{Var}(\hat\theta).$$
Finally, because $T$ is sufficient, $E(\tilde\theta \mid T)$ does not depend on $\theta$, so $\hat\theta$ is indeed an estimator.
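
A concrete instance of Rao-Blackwellization: for a Bernoulli($p$) sample, $\tilde\theta = X_1$ is unbiased, $T = \sum_i X_i$ is sufficient, and $E(X_1 \mid T) = T/n = \bar X_n$. The simulation below (with an arbitrary $p$ and $n$) shows the variance reduction.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 0.3, 10, 50000
x = rng.binomial(1, p, size=(reps, n))

crude = x[:, 0]                 # unbiased but crude estimator: the first observation
rao_blackwell = x.mean(axis=1)  # E(X_1 | sum X_i) = sample mean

print(crude.mean(), rao_blackwell.mean())   # both approximately p
print(crude.var(), rao_blackwell.var())     # p(1-p) vs p(1-p)/n: much smaller
```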

Asymptotic properties of estimators

In most cases the sampling distribution of an estimator is too complicated to compute its mean, variance, and MSE. In that case we use asymptotic analysis, i.e. we let the sample size $n \to \infty$. Consider a random sample $X_1,\dots,X_n$ and let $\hat\theta_n = t(X_1,\dots,X_n)$ be an estimator. What happens to the sampling distribution of $\hat\theta_n$ if the sample size becomes large? Because as $n$ becomes larger we know more and more about the population, the sampling distribution should behave as in the figure.

[Figure: sampling distributions of $\hat\theta_n$ concentrating around $\theta$ as $n$ increases.]

If $n = \infty$ we know the population, and the sampling distribution should be degenerate at $\theta$. Let $\hat\theta_n$ be the sequence of estimators for increasing sample size. We say that $\hat\theta_n$ is (weakly) consistent if
$$\hat\theta_n \stackrel{p}{\to} \theta.$$
If the convergence is a.s. we say that the estimator (sequence) is strongly consistent. Why does it not make sense to consider convergence in distribution?

Example: For an $N(\theta,\sigma^2)$ population the sample mean is strongly and weakly consistent for $\theta$.

Large sample behavior of the MLE

If for all $\theta \in \Theta$
$$E\left|\frac{\partial \ln f}{\partial\theta}(X;\theta)\right| < \infty,$$
then by the law of large numbers
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial \ln f}{\partial\theta}(X_i;\theta) \stackrel{p}{\to} E\left[\frac{\partial \ln f}{\partial\theta}(X_1;\theta)\right].$$
Also
$$\frac{\partial \ln f}{\partial\theta}(X_1;\theta) = \frac{\frac{\partial f}{\partial\theta}(X_1;\theta)}{f(X_1;\theta)},$$
so that, when the expectation is taken at the true parameter value $\theta$,
$$E\left[\frac{\partial \ln f}{\partial\theta}(X_1;\theta)\right] = \int \frac{\frac{\partial f}{\partial\theta}(x;\theta)}{f(x;\theta)}\,f(x;\theta)\,dx = \int \frac{\partial f}{\partial\theta}(x;\theta)\,dx = \frac{\partial}{\partial\theta}\int f(x;\theta)\,dx = 0,$$
where we have interchanged differentiation and integration, which is allowed e.g. if, for $\tilde\theta$ in a small interval around $\theta$, $\left|\frac{\partial f}{\partial\theta}(x;\tilde\theta)\right| \le g(x)$ with $g$ integrable.

For the MLE $\hat\theta_n$,
$$0 = \frac{1}{n}\sum_{i=1}^n \frac{\partial \ln f}{\partial\theta}(X_i;\hat\theta_n), \qquad 0 = E\left[\frac{\partial \ln f}{\partial\theta}(X_1;\theta)\right].$$
This suggests that
$$\hat\theta_n \stackrel{p}{\to} \theta,$$
i.e. the MLE is weakly consistent.

Asymptotic distribution of the MLE

By Taylor's theorem,
$$0 = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{\partial \ln f}{\partial\theta}(X_i;\hat\theta_n) = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{\partial \ln f}{\partial\theta}(X_i;\theta) + \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f}{\partial\theta^2}(X_i;\bar\theta_n)\,\sqrt n(\hat\theta_n - \theta),$$
with $\theta \le \bar\theta_n \le \hat\theta_n$ or $\hat\theta_n \le \bar\theta_n \le \theta$.

Consider the first term on the rhs and define
$$Y_i = \frac{\partial \ln f}{\partial\theta}(X_i;\theta);$$
then $E(Y_i) = 0$ and
$$\mathrm{Var}(Y_i) = E\left[\left(\frac{\partial \ln f}{\partial\theta}(X_i;\theta)\right)^2\right] = I(\theta).$$
By the Central Limit Theorem,
$$\frac{1}{\sqrt n}\sum_{i=1}^n \frac{\partial \ln f}{\partial\theta}(X_i;\theta) \stackrel{d}{\to} N(0, I(\theta)).$$

Next consider the second term on the rhs. Because $\hat\theta_n \stackrel{p}{\to} \theta$, also $\bar\theta_n \stackrel{p}{\to} \theta$. Further,
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f}{\partial\theta^2}(X_i;\theta) \stackrel{p}{\to} E\left[\frac{\partial^2 \ln f}{\partial\theta^2}(X_1;\theta)\right]$$
if
$$E\left|\frac{\partial^2 \ln f}{\partial\theta^2}(X_1;\theta)\right| < \infty.$$
This suggests that
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f}{\partial\theta^2}(X_i;\bar\theta_n) \stackrel{p}{\to} E\left[\frac{\partial^2 \ln f}{\partial\theta^2}(X_1;\theta)\right] = A(\theta).$$
By the Slutsky theorem,
$$0 = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{\partial \ln f}{\partial\theta}(X_i;\theta) + \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f}{\partial\theta^2}(X_i;\bar\theta_n)\,\sqrt n(\hat\theta_n - \theta)$$
implies that
$$\sqrt n(\hat\theta_n - \theta) \stackrel{d}{\to} N\big(0,\, A(\theta)^{-1} I(\theta) A(\theta)^{-1}\big).$$
This is the limit distribution of the MLE.

By the information matrix equality $A(\theta) = -I(\theta)$, so that this simplifies to
$$\sqrt n(\hat\theta_n - \theta) \stackrel{d}{\to} N\big(0, I(\theta)^{-1}\big).$$
Note that we consider the sequence $\sqrt n(\hat\theta_n - \theta)$, which is similar to $\sqrt n(\bar X_n - \mu)$ in the CLT. The variance of the limit distribution is equal to the Cramér-Rao lower bound (after scaling by $\sqrt n$). We say that the MLE is asymptotically efficient. The asymptotic normal distribution can be used to compute a confidence interval for $\theta$ of the form
$$\hat\theta_n - c\,\sqrt{\frac{I(\hat\theta_n)^{-1}}{n}} < \theta < \hat\theta_n + c\,\sqrt{\frac{I(\hat\theta_n)^{-1}}{n}}.$$
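
As an illustration of the asymptotic confidence interval, consider the exponential model of the MM example: the MLE is $\hat\lambda = 1/\bar X_n$ and the Fisher information per observation is $I(\lambda) = 1/\lambda^2$, so the interval is $\hat\lambda \pm c\,\hat\lambda/\sqrt n$. A minimal sketch, with $c = 1.96$ for a 95% interval and an arbitrary true rate:

```python
import numpy as np

rng = np.random.default_rng(7)
lam_true, n, c = 2.0, 400, 1.96              # illustrative values; c from N(0,1)
x = rng.exponential(scale=1.0 / lam_true, size=n)

lam_hat = 1.0 / x.mean()                     # MLE of lambda
se = lam_hat / np.sqrt(n)                    # sqrt(I(lam_hat)^{-1} / n) with I = 1/lambda^2
print(lam_hat - c * se, lam_hat + c * se)    # asymptotic 95% confidence interval
```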

Bootstrap

To obtain the sampling distribution of a statistic for finite $n$ one can use the computer. Consider the sample mean
$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i.$$
$\bar X_n$ is the mean over the empirical distribution $P_n$ that assigns probability $\frac{1}{n}$ to each of $X_1,\dots,X_n$:
$$\bar X_n = \int x\,dP_n.$$
The observed sample $x_1,\dots,x_n$ is a realization of $X_1,\dots,X_n$.

Now draw from $x_1,\dots,x_n$ a random sample of size $n$ with replacement, $x_1^*,\dots,x_n^*$. This is a draw from the empirical distribution that gives
$$\bar x_n^* = \frac{1}{n}\sum_{i=1}^n x_i^*.$$
Do this $N$ times and consider the $\bar x_n^*$'s as draws from the sampling distribution of $\bar X_n$. The justification of this procedure is that as $n \to \infty$ the empirical distribution converges to the population distribution, so that averaging over the empirical distribution is, in large samples, the same as averaging over the population distribution. This method of approximating the sampling distribution is called the bootstrap method, after a tall tale about the (in)famous Baron von Münchhausen (1720-1797).
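
A minimal bootstrap sketch for the sample mean, assuming NumPy and an arbitrary stand-in for the observed sample: the spread of the $N$ resampled means approximates the sampling distribution of $\bar X_n$.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.exponential(scale=1.0, size=100)     # stand-in for the observed sample

N = 5000                                     # number of bootstrap replications
boot_means = np.array([
    rng.choice(x, size=len(x), replace=True).mean() for _ in range(N)
])

print(x.mean(), boot_means.std())            # point estimate and bootstrap std. error
print(x.std(ddof=1) / np.sqrt(len(x)))       # compare with the usual standard error
```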