Econ 514: Probability and Statistics. Lecture 9: Point estimation.

Point estimators

In Lecture 7 we discussed the setup of a study of the income distribution in LA. Regarding the population we considered the following possibilities:
- No assumption on the population distribution (except a finite population mean and variance): the nonparametric approach.
- The population distribution has density $f(x;\theta)$, i.e. the density is known up to a vector of parameters: the parametric approach.

In this lecture we take the parametric approach.

We have a random sample $X_1,\dots,X_n$ from a distribution with density $f(x;\theta)$. Upon observation we have the numbers $x_1,\dots,x_n$ and we use these numbers to estimate $\theta$.

Definition: A (point) estimator of $\theta$ is a statistic $\hat\theta = t(X_1,\dots,X_n)$.

Example: If the population distribution is $N(\mu,\sigma^2)$, then $\theta_1 = \mu$ and $\theta_2 = \sigma^2$, and we estimate $\theta$ with the estimators $\hat\theta_1 = \bar X_n$ and $\hat\theta_2 = S_n^2$. We could also write $\hat\mu$ and $\hat\sigma^2$ for these estimators.

This raises two questions:
- How do we find (good) estimators?
- How do we evaluate estimators, i.e. how do we decide that an estimator is a good one?

Finding estimators: the method of moments

Any statistic $t(X_1,\dots,X_n)$ can be an estimator for $\theta$. Natural estimators have a relation to $\theta$.

Example: In an $N(\mu,\sigma^2)$ population we have $\mu = E(X)$ and $\sigma^2 = \mathrm{var}(X)$, so that the sample mean and sample variance are natural estimators.

General procedure: Let $\theta$ be a $K$-vector of parameters
$$\theta = (\theta_1,\dots,\theta_K)'.$$

Let, for $r = 1,\dots,K$,
$$E(X^r) = \mu_r(\theta_1,\dots,\theta_K). \qquad (1)$$
Consider the system of equations
$$\frac{1}{n}\sum_{i=1}^n X_i^r = \mu_r(\theta_1,\dots,\theta_K), \qquad r = 1,\dots,K, \qquad (2)$$
which equates the sample and population $r$-th moments. If (1) has a unique solution for $\theta$, then by the strong law of large numbers (2) has a unique solution for $n$ large enough.

Definition: If (1) has a unique solution for $\theta$, then a solution $\hat\theta$ of (2) is called the method of moments (MM) estimator of $\theta$.

Example: $X_1,\dots,X_n$ is a random sample from the exponential distribution with density
$$f(x;\lambda) = \lambda e^{-\lambda x}, \quad 0 < x < \infty, \qquad f(x;\lambda) = 0 \text{ otherwise.}$$
We have $E(X) = 1/\lambda$ and the MM estimator is the solution to
$$\frac{1}{n}\sum_{i=1}^n X_i = \frac{1}{\hat\lambda} \quad\text{or}\quad \hat\lambda = \frac{1}{\bar X_n}.$$
Also $E(X^2) = 2/\lambda^2$, so that another MM estimator is
$$\hat\lambda = \sqrt{\frac{2}{\frac{1}{n}\sum_{i=1}^n X_i^2}}.$$
Conclusion: MM estimators are not unique.
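
The two MM estimators can be compared numerically. A minimal sketch, assuming NumPy is available and using an arbitrary true rate of 2.0 for illustration: it simulates exponential data and computes both estimators, which differ in finite samples but are both close to the true $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true = 2.0                      # assumed true rate, for illustration only
x = rng.exponential(scale=1.0 / lam_true, size=1000)

# MM estimator based on the first moment: E(X) = 1/lambda
lam_mm1 = 1.0 / x.mean()

# MM estimator based on the second moment: E(X^2) = 2/lambda^2
lam_mm2 = np.sqrt(2.0 / np.mean(x ** 2))

print(lam_mm1, lam_mm2)             # two different, but both consistent, estimates
```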

Theorem: If (1) has a unique solution $\theta$, the functions $\mu_r$ are continuous for $r = 1,\dots,K$, and $E(|X|^K) < \infty$, then
$$\hat\theta_n \stackrel{p}{\to} \theta.$$
Proof: By the (weak) law of large numbers, for $r = 1,\dots,K$,
$$\frac{1}{n}\sum_{i=1}^n X_i^r \stackrel{p}{\to} E(X^r),$$
so that from (2)
$$\mu_r(\hat\theta_n) \stackrel{p}{\to} E(X^r).$$
By the continuous mapping theorem this implies that $\hat\theta_n$ has a probability limit equal to $\theta$, because the solution of (1) is unique.

Maximum likelihood

Example: Population density, e.g. coin tossing:
$$f(x;p) = p^x (1-p)^{1-x}, \quad x = 0,1, \qquad f(x;p) = 0 \text{ otherwise.}$$
We do 3 independent tosses, which is a random sample $X_1, X_2, X_3$. For the observed values $x_1, x_2, x_3$ we have
$$\Pr(X_1 = x_1, X_2 = x_2, X_3 = x_3) = \prod_{i=1}^3 p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^3 x_i}(1-p)^{3 - \sum_{i=1}^3 x_i}.$$
Note that $\sum_{i=1}^3 X_i$ is a sufficient statistic for $p$. We think that $p = \frac12$ or $p = \frac14$.

Possible observations and their probability:

sum of x_i     0       1       2       3
p = 1/4      27/64    9/64    3/64    1/64
p = 1/2       1/8     1/8     1/8     1/8

If we observe $x_1 = x_2 = x_3 = 0$, would you choose $\hat p = \frac14$ or $\hat p = \frac12$? An obvious method to select the estimate is to maximize the probability of the observed sample:
$$\hat p = \tfrac14 \text{ if } \sum_{i=1}^3 x_i = 0,1, \qquad \hat p = \tfrac12 \text{ if } \sum_{i=1}^3 x_i = 2,3.$$
This is a function of the observations only and hence an estimator of $p$. Note that $\hat p$ is a function of the sufficient statistic.
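
The table can be reproduced in a few lines. This is a minimal sketch using exact fractions; the candidate values of $p$ and the three tosses are those of the example above.

```python
from fractions import Fraction

# probability of a particular sample with s successes out of 3 tosses
for p in (Fraction(1, 4), Fraction(1, 2)):
    probs = [p**s * (1 - p)**(3 - s) for s in range(4)]
    print(f"p = {p}:", probs)
# p = 1/4: [27/64, 9/64, 3/64, 1/64]
# p = 1/2: [1/8, 1/8, 1/8, 1/8]
```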

General case

In the example the estimator maximized the joint density of the random sample. If $X_1,\dots,X_n$ is a random sample from a population with density $f(x;\theta)$, the joint density is
$$f(x_1,\dots,x_n;\theta) = \prod_{i=1}^n f(x_i;\theta) = L(\theta; x_1,\dots,x_n).$$
If we consider this as a function of $\theta$ for fixed (observed) $x_1,\dots,x_n$, this function is called the likelihood function. The maximum likelihood estimator (MLE) is
$$\hat\theta = \arg\max_\theta L(\theta; x)$$
with (note the abuse of notation) $x = (x_1,\dots,x_n)'$.

The solution is a function of $x_1,\dots,x_n$:
$$\hat\theta = t(x_1,\dots,x_n).$$
Before observation,
$$\hat\theta = t(X_1,\dots,X_n).$$
Because the maximizing value is unaffected by monotone transformations of the maximand,
$$\hat\theta = \arg\max_\theta \ln L(\theta; x) = \arg\min_\theta\,\big(-\ln L(\theta; x)\big).$$
For a random sample the loglikelihood is
$$\ln L(\theta; x) = \sum_{i=1}^n \ln f(x_i;\theta).$$
Finding the first-order condition is easier with the loglikelihood, and the set of equations (as many as there are parameters)
$$\frac{\partial \ln L}{\partial \theta}(\theta; x) = 0$$
are called the likelihood equations. The derivative of the loglikelihood, $\frac{\partial \ln L}{\partial \theta}(\theta; x)$, is called the score function.

Example: Same population as in the previous example, but a random sample of size $n$. Likelihood function:
$$L(p; x) = \prod_{i=1}^n f(x_i;p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i}.$$
Loglikelihood function:
$$\ln L(p; x) = \sum_{i=1}^n x_i \ln p + \Big(n - \sum_{i=1}^n x_i\Big)\ln(1-p).$$
Score function:
$$\frac{\partial \ln L}{\partial p}(p; x) = \frac{1}{p}\sum_{i=1}^n x_i - \frac{1}{1-p}\Big(n - \sum_{i=1}^n x_i\Big).$$
MLE:
$$\hat p = \frac{1}{n}\sum_{i=1}^n x_i = \bar x_n.$$
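
As a sanity check, one can maximize the loglikelihood numerically and compare with the closed form $\hat p = \bar x_n$. The sketch below assumes SciPy is available and uses an illustrative true value $p = 0.3$; `minimize_scalar` with a bounded search is applied to the negative loglikelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=200)   # Bernoulli sample; true p = 0.3 is illustrative

def neg_loglik(p):
    # minus the Bernoulli loglikelihood ln L(p; x)
    return -(x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())               # numerical MLE vs closed-form MLE (they agree)
```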

Invariance of the MLE

Let $L(\theta; x)$ be a likelihood function and define $\tau = h(\theta)$. Define $A_\tau = \{\theta \mid h(\theta) = \tau\}$ and define the induced likelihood
$$L^*(\tau; x) = \sup_{\theta \in A_\tau} L(\theta; x).$$
If $\hat\theta$ is the MLE of $\theta$, then
$$\sup_\tau L^*(\tau; x) \le L(\hat\theta; x),$$
with equality if $\hat\tau = h(\hat\theta)$.

Conclusion: The MLE of $\tau$ is $\hat\tau = h(\hat\theta)$. This is called the invariance property of the MLE.

Evaluation of estimators: the sampling distribution

The population parameter $\theta$ is unknown, but often there are restrictions on $\theta$, i.e. $\theta \in \Theta$ with $\Theta \subset \mathbb{R}^K$. $\Theta$ is called the parameter space.

Example: For $N(\mu,\sigma^2)$,
$$\Theta = \{(\mu,\sigma^2) \mid -\infty < \mu < \infty,\ \sigma^2 > 0\}.$$
Because an estimator is a statistic $\hat\theta = t(X_1,\dots,X_n)$, it has a sampling distribution derived from the joint distribution of $X_1,\dots,X_n$,
$$f(x_1,\dots,x_n;\theta) = \prod_{i=1}^n f(x_i;\theta).$$

Ideally this sampling distribution should be concentrated around $\theta$, irrespective of what the population value of $\theta$ is.

Example: Random sample of size $n$ from $N(\theta,1)$ with $\Theta = \{\theta \mid -\infty < \theta < \infty\}$. For the estimator $\hat\theta = \bar X_n$ the sampling distribution is
$$\hat\theta \sim N\Big(\theta, \frac{1}{n}\Big).$$
For this sampling distribution: $E(\hat\theta) = \theta$ and $\mathrm{Var}(\hat\theta) \to 0$ as $n \to \infty$.
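
The concentration of the sampling distribution of $\bar X_n$ around $\theta$ can be illustrated by simulation. A minimal sketch, assuming $\theta = 0$ and a handful of illustrative sample sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.0                                  # illustrative true value
for n in (10, 100, 1000):
    means = rng.normal(theta, 1.0, size=(5000, n)).mean(axis=1)
    # variance of the sampling distribution should be close to 1/n
    print(n, means.mean().round(3), means.var().round(4), 1.0 / n)
```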

Compare the sampling distributions in the graph. Ranking of estimators: $\hat\theta_1$ dominates $\hat\theta_2$, and $\hat\theta_1$ dominates $\hat\theta_3$. What about $\hat\theta_1$ and $\hat\theta_4$?

Performance measure: MSE

Ranking is possible if the performance of an estimator is captured by a single number. That is always somewhat arbitrary. The estimation error is $\hat\theta - \theta$, and the squared estimation error $(\hat\theta - \theta)^2$ treats positive and negative errors in the same way and penalizes large errors more. The mean squared error (MSE) is the average squared error over the sampling distribution of $\hat\theta$:
$$\mathrm{MSE}(\hat\theta, \theta) = E\big[(\hat\theta - \theta)^2\big].$$
The MSE depends in general on the population parameter $\theta$.

Unbiased estimators

Definition: An estimator $\hat\theta$ is unbiased if for all $\theta \in \Theta$
$$E(\hat\theta) = \theta.$$
Consider the MSE:
$$\mathrm{MSE}(\hat\theta,\theta) = E\Big[\big((\hat\theta - E(\hat\theta)) + (E(\hat\theta) - \theta)\big)^2\Big] = E\big[(\hat\theta - E(\hat\theta))^2\big] + (E(\hat\theta) - \theta)^2 + 2E\big[(\hat\theta - E(\hat\theta))(E(\hat\theta) - \theta)\big] = \mathrm{Var}(\hat\theta) + (E(\hat\theta) - \theta)^2,$$
where the cross term vanishes because $E(\hat\theta - E(\hat\theta)) = 0$. $E(\hat\theta) - \theta$ is the bias of the estimator, so that we have
$$\mathrm{MSE}(\hat\theta,\theta) = \mathrm{Var}(\hat\theta) + \mathrm{bias}^2,$$
and if the estimator is unbiased then
$$\mathrm{MSE}(\hat\theta,\theta) = \mathrm{Var}(\hat\theta).$$
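
The decomposition MSE = variance + bias$^2$ can be checked by simulation for a deliberately biased estimator. The sketch below uses the shrunken mean $\tilde\theta = 0.9\,\bar X_n$ as an arbitrary biased example; all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 2.0, 50, 20000              # illustrative values
est = 0.9 * rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)  # biased estimator

mse = np.mean((est - theta) ** 2)
var, bias = est.var(), est.mean() - theta
print(mse, var + bias ** 2)                  # the two numbers should nearly coincide
```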

If we only consider unbiased estimators, ranking by MSE is ranking by variance.

Example: For a random sample from $N(\theta,1)$ and $\hat\theta = \bar X_n$, $E(\hat\theta) = \theta$, i.e. the estimator is unbiased, and
$$\mathrm{MSE}(\hat\theta,\theta) = \mathrm{Var}(\bar X_n) = \frac{1}{n},$$
independent of $\theta$.

Best estimators

Definition: Let $X_1,\dots,X_n$ be a random sample from a population with density $f(x;\theta)$. The estimator $\hat\theta$ is a uniformly minimum variance unbiased (UMVU) estimator if for all $\theta \in \Theta$
(i) $E(\hat\theta) = \theta$;
(ii) for all unbiased estimators $\tilde\theta$, $\mathrm{Var}(\tilde\theta) \ge \mathrm{Var}(\hat\theta)$.

Instead of looking for UMVU estimators directly we use a different approach:
(i) Find a lower bound on the variance of all unbiased estimators.
(ii) Verify that for some estimator the variance is equal to the lower bound. That estimator is UMVU.

Theorem (Cramér-Rao): Let $X_1,\dots,X_n$ be a random sample from a population with density $f(x;\theta)$, with $\theta$ a scalar parameter. Then for all unbiased estimators $\hat\theta$ of $\theta$
$$\mathrm{Var}(\hat\theta) \ge \frac{1}{n\,E\!\left[\left(\frac{\partial \ln f}{\partial\theta}(X;\theta)\right)^2\right]}.$$
Proof: We consider the case of a density w.r.t. Lebesgue measure. For all $\theta \in \Theta$
$$\int\!\cdots\!\int \prod_{i=1}^n f(x_i;\theta)\,dx_1\cdots dx_n \equiv 1,$$
$$\int\!\cdots\!\int t(x_1,\dots,x_n) \prod_{i=1}^n f(x_i;\theta)\,dx_1\cdots dx_n \equiv \theta.$$
We want to interchange differentiation w.r.t. $\theta$ and integration. From Lecture 2 a sufficient condition is that $\left|\frac{\partial f}{\partial\theta}(x;\theta)\right| \le M(x)$ with $M(x)$ integrable (check this!).

If this (or another sufficient condition) holds,
$$\int\!\cdots\!\int \frac{\partial}{\partial\theta}\Big(\prod_{i=1}^n f(x_i;\theta)\Big)\,dx_1\cdots dx_n = 0 \qquad (3)$$
and
$$\int\!\cdots\!\int t(x_1,\dots,x_n)\,\frac{\partial}{\partial\theta}\Big(\prod_{i=1}^n f(x_i;\theta)\Big)\,dx_1\cdots dx_n = 1. \qquad (4)$$
Hence, using (3),
$$1 = \int\!\cdots\!\int t(x_1,\dots,x_n)\,\frac{\partial}{\partial\theta}\Big(\prod_{i=1}^n f(x_i;\theta)\Big)\,dx_1\cdots dx_n = \int\!\cdots\!\int \big(t(x_1,\dots,x_n) - \theta\big)\,\frac{\partial}{\partial\theta}\Big(\ln \prod_{i=1}^n f(x_i;\theta)\Big)\prod_{i=1}^n f(x_i;\theta)\,dx_1\cdots dx_n

= E\left[\big(t(X_1,\dots,X_n) - \theta\big)\,\frac{\partial}{\partial\theta}\Big(\sum_{i=1}^n \ln f(X_i;\theta)\Big)\right] \le \sqrt{E\big[(t(X_1,\dots,X_n) - \theta)^2\big]}\;\sqrt{E\left[\left(\frac{\partial}{\partial\theta}\sum_{i=1}^n \ln f(X_i;\theta)\right)^2\right]}, \qquad (5)$$
so that
$$\mathrm{Var}(\hat\theta) \ge \frac{1}{E\left[\left(\frac{\partial}{\partial\theta}\sum_{i=1}^n \ln f(X_i;\theta)\right)^2\right]}. \qquad (6)$$
In (5) we used the Cauchy-Schwarz inequality (see Lecture 6).

Because $X_1,\dots,X_n$ are independently and identically distributed (abbreviated i.i.d.),
$$E\left[\left(\frac{\partial}{\partial\theta}\sum_{i=1}^n \ln f(X_i;\theta)\right)^2\right] = E\left[\sum_{i=1}^n \left(\frac{\partial \ln f}{\partial\theta}(X_i;\theta)\right)^2\right] = n\,E\left[\left(\frac{\partial \ln f}{\partial\theta}(X_1;\theta)\right)^2\right],$$
because from
$$E\left[\frac{\partial \ln f}{\partial\theta}(X_1;\theta)\right] = \int \frac{\partial f}{\partial\theta}(x;\theta)\,dx = \frac{\partial}{\partial\theta}\int f(x;\theta)\,dx = 0$$
the expectation of the cross-products is
$$E\left[\frac{\partial \ln f}{\partial\theta}(X_i;\theta)\,\frac{\partial \ln f}{\partial\theta}(X_j;\theta)\right] = 0, \quad i \ne j.$$

The lower bound on the variance of an unbiased estimator is called the Cramér-Rao lower bound. If $X_1,\dots,X_n$ is not a random sample but has joint density $f(x_1,\dots,x_n;\theta)$, then the lower bound is
$$\frac{1}{E\left[\left(\frac{\partial}{\partial\theta}\ln f(X_1,\dots,X_n;\theta)\right)^2\right]}.$$
The inequality in (5) is an equality if and only if
$$\frac{\partial}{\partial\theta}\ln L(\theta; x_1,\dots,x_n) = c(\theta)\,\big(t(x_1,\dots,x_n) - \theta\big).$$
Hence the estimator has a variance that reaches the lower bound if this equation holds.

Example: For an $N(\theta,1)$ population,
$$f(x;\theta) = \frac{1}{\sqrt{2\pi}} e^{-\frac12 (x-\theta)^2}.$$
Hence
$$\ln f(x;\theta) = -\tfrac12 \ln(2\pi) - \tfrac12 (x-\theta)^2, \qquad \frac{\partial \ln f}{\partial\theta}(x;\theta) = x - \theta,$$
and
$$E\left[\left(\frac{\partial \ln f}{\partial\theta}(X_1;\theta)\right)^2\right] = E[(X_1 - \theta)^2] = 1.$$
Conclusion: The Cramér-Rao lower bound is $\frac{1}{n}$. The estimator $\hat\theta = \bar X_n$ is unbiased and has a variance equal to this bound, so $\hat\theta = \bar X_n$ is UMVU.

Information matrix

We have
$$\int\!\cdots\!\int \frac{\partial \ln f}{\partial\theta}(x_1,\dots,x_n;\theta)\, f(x_1,\dots,x_n;\theta)\,dx_1\cdots dx_n \equiv 0.$$
Differentiating with respect to $\theta$ and interchanging differentiation and integration (suggest a sufficient condition that allows this),
$$0 = \int\!\cdots\!\int \frac{\partial^2 \ln f}{\partial\theta^2}(x_1,\dots,x_n;\theta)\, f(x_1,\dots,x_n;\theta)\,dx_1\cdots dx_n + \int\!\cdots\!\int \frac{\partial \ln f}{\partial\theta}(x_1,\dots,x_n;\theta)\,\frac{\partial f}{\partial\theta}(x_1,\dots,x_n;\theta)\,dx_1\cdots dx_n =$$
$$= E\left[\frac{\partial^2 \ln f}{\partial\theta^2}(X_1,\dots,X_n;\theta)\right] + E\left[\left(\frac{\partial \ln f}{\partial\theta}(X_1,\dots,X_n;\theta)\right)^2\right].$$

Conclusion:
$$E\left[\left(\frac{\partial \ln f}{\partial\theta}(X_1,\dots,X_n;\theta)\right)^2\right] = -E\left[\frac{\partial^2 \ln f}{\partial\theta^2}(X_1,\dots,X_n;\theta)\right].$$
The left-hand side is called the Fisher information (matrix). The equation shows that this is the variance of the score function and that it is equal to minus the expected value of the Hessian of the loglikelihood. This relation is called the information matrix equality.

Example 1: Random sample of size $n$ from $N(\mu,\sigma^2)$. The sample variance
$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X_n)^2$$
is an unbiased estimator of $\sigma^2$. It can be shown that
$$\mathrm{Var}(S_n^2) = \frac{2\sigma^4}{n-1}.$$
The second derivative of the log density w.r.t. $\sigma^2$ is
$$\frac{\partial^2}{\partial(\sigma^2)^2} \ln\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x-\mu_0)^2}\right) = \frac{1}{2\sigma^4} - \frac{(x-\mu_0)^2}{\sigma^6}.$$
By the information matrix equality the information (per observation) is
$$E\left[-\frac{1}{2\sigma^4} + \frac{(X-\mu_0)^2}{\sigma^6}\right] = \frac{1}{2\sigma^4},$$
and the Cramér-Rao lower bound is $\frac{2\sigma^4}{n}$.

Conclusion: The MSE of $S_n^2$ is strictly greater than the lower bound.

The MLE of $\sigma^2$ is
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2 = \frac{n-1}{n} S_n^2,$$
with
$$E(\hat\sigma^2) = \frac{n-1}{n}\sigma^2 < \sigma^2, \qquad \mathrm{Var}(\hat\sigma^2) = \frac{(n-1)^2}{n^2}\mathrm{Var}(S_n^2) = \frac{2(n-1)}{n^2}\sigma^4.$$
Conclusion: The MLE is biased. However,
$$\mathrm{MSE}(S_n^2;\sigma^2) = \frac{2\sigma^4}{n-1} > \frac{2n-1}{n^2}\sigma^4 = \mathrm{MSE}(\hat\sigma^2;\sigma^2).$$
Consider the case that $\mu$ is known. We have
$$\frac{\partial \ln L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i-\mu)^2 = \frac{n}{2\sigma^4}\left(\frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2 - \sigma^2\right) = c(\sigma^2)\,(\tilde\sigma^2 - \sigma^2).$$
Conclusion: If $\mu$ is known, the estimator
$$\tilde\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2$$
is unbiased and reaches the Cramér-Rao lower bound.
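
A short simulation can confirm the analytical comparison of the two MSEs. This sketch assumes $N(0,\sigma^2)$ data with $\sigma^2 = 4$ and $n = 20$ chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, n, reps = 4.0, 20, 50000
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

s2 = x.var(axis=1, ddof=1)                  # unbiased sample variance S_n^2
sig2_mle = x.var(axis=1, ddof=0)            # MLE, divides by n

print(np.mean((s2 - sigma2) ** 2), 2 * sigma2**2 / (n - 1))             # MSE of S_n^2
print(np.mean((sig2_mle - sigma2) ** 2), (2*n - 1) * sigma2**2 / n**2)  # MSE of MLE
```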

Example 2: Random sample of size $n$ from a uniform distribution with density
$$f(x;\theta) = \frac{1}{\theta} I_{[0,\theta]}(x).$$
Because
$$\left(\frac{\partial \ln f}{\partial\theta}(x;\theta)\right)^2 = \frac{1}{\theta^2},$$
the Cramér-Rao lower bound seems to be $\frac{\theta^2}{n}$. Consider the estimator $\tilde\theta = \max\{X_1,\dots,X_n\}$. The statistic $T = \max\{X_1,\dots,X_n\}$ has density
$$f_T(t) = \frac{n t^{n-1}}{\theta^n}, \quad 0 < t < \theta, \qquad f_T(t) = 0 \text{ otherwise.}$$
We find
$$E(\tilde\theta) = E(T) = \frac{n}{n+1}\theta,$$
so that $\hat\theta = \frac{n+1}{n}\tilde\theta$ is an unbiased estimator with
$$\mathrm{Var}(\hat\theta) = \frac{1}{n(n+2)}\theta^2 < \frac{\theta^2}{n}.$$
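
A simulation of the unbiased estimator $\hat\theta = \frac{n+1}{n}\max_i X_i$ illustrates that its variance indeed falls below the (invalid) bound $\theta^2/n$. The values of $\theta$ and $n$ below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 3.0, 25, 50000
x = rng.uniform(0.0, theta, size=(reps, n))

theta_hat = (n + 1) / n * x.max(axis=1)        # unbiased estimator based on the maximum
print(theta_hat.mean())                        # close to theta: unbiased
print(theta_hat.var())                         # close to theta^2 / (n (n+2))
print(theta**2 / (n * (n + 2)), theta**2 / n)  # true variance vs the apparent "lower bound"
```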

It seems that we have an unbiased estimator with a variance that is smaller than the lower bound. The problem is that
$$\int_0^\theta f(x;\theta)\,dx \equiv 1.$$
Taking the derivative w.r.t. $\theta$ (by Leibniz's rule),
$$\frac{d}{d\theta}\int_0^\theta f(x;\theta)\,dx = f(\theta;\theta) + \int_0^\theta \frac{\partial f}{\partial\theta}(x;\theta)\,dx \ne \int_0^\theta \frac{\partial f}{\partial\theta}(x;\theta)\,dx,$$
i.e. we cannot interchange differentiation and integration, because the support of the density depends on $\theta$.

Uniqueness of UMVU estimators and the Rao-Blackwell theorem

We show that UMVU estimators are unique. Let $\hat\theta$ be UMVU and let $\tilde\theta$ be another unbiased estimator of $\theta$. Define a third estimator $\theta^* = \hat\theta + t(\hat\theta - \tilde\theta)$ with $t$ some real number; note that $\theta^*$ is unbiased for any $t$. We have
$$\mathrm{Var}(\theta^*) = \mathrm{Var}(\hat\theta) + t^2\,\mathrm{Var}(\hat\theta - \tilde\theta) + 2t\,\mathrm{Cov}(\hat\theta - \tilde\theta, \hat\theta).$$
This variance is minimal if
$$t = -\frac{\mathrm{Cov}(\hat\theta - \tilde\theta, \hat\theta)}{\mathrm{Var}(\hat\theta - \tilde\theta)}.$$
Because $\hat\theta$ is UMVU, this $t$ has to be 0 (otherwise $\theta^*$ would be an unbiased estimator with a smaller variance than $\hat\theta$).

Hence
$$0 = \mathrm{Cov}(\hat\theta - \tilde\theta, \hat\theta) = \mathrm{Var}(\hat\theta) - \mathrm{Cov}(\hat\theta, \tilde\theta),$$
so that
$$\mathrm{Var}(\tilde\theta - \hat\theta) = \mathrm{Var}(\hat\theta) + \mathrm{Var}(\tilde\theta) - 2\,\mathrm{Cov}(\hat\theta, \tilde\theta) = \mathrm{Var}(\tilde\theta) - \mathrm{Var}(\hat\theta). \qquad (7)$$
Conclusions: If $\tilde\theta$ is also UMVU, then $\mathrm{Var}(\tilde\theta - \hat\theta) = 0$ and hence $\hat\theta = \tilde\theta$ with probability 1. Equation (7) also simplifies the calculation of the variance of the difference between an unbiased and a UMVU estimator.

The next theorem shows how to improve an unbiased estimator.

Rao-Blackwell theorem: If $\tilde\theta$ is an unbiased estimator of $\theta$ and $T$ is a sufficient statistic for $\theta$, then
$$\hat\theta = E(\tilde\theta \mid T)$$
is unbiased and $\mathrm{Var}(\hat\theta) \le \mathrm{Var}(\tilde\theta)$.

Proof: For all $\theta \in \Theta$, by the law of iterated expectations
$$E(\hat\theta) = E\big(E(\tilde\theta \mid T)\big) = E(\tilde\theta) = \theta,$$
and by the law of iterated variances
$$\mathrm{Var}(\tilde\theta) = \mathrm{Var}\big(E(\tilde\theta \mid T)\big) + E\big(\mathrm{Var}(\tilde\theta \mid T)\big) = \mathrm{Var}(\hat\theta) + E\big(\mathrm{Var}(\tilde\theta \mid T)\big) \ge \mathrm{Var}(\hat\theta).$$
Finally, because $T$ is sufficient, $E(\tilde\theta \mid T)$ does not depend on $\theta$, so $\hat\theta$ is indeed an estimator.
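
A concrete instance of Rao-Blackwellization: for a Bernoulli($p$) sample, $\tilde\theta = X_1$ is unbiased, $T = \sum_i X_i$ is sufficient, and $E(X_1 \mid T) = T/n = \bar X_n$. The simulation below (with an arbitrary $p$ and $n$) shows the variance reduction.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 0.3, 10, 50000
x = rng.binomial(1, p, size=(reps, n))

crude = x[:, 0]                 # unbiased but crude estimator: the first observation
rao_blackwell = x.mean(axis=1)  # E(X_1 | sum X_i) = sample mean

print(crude.mean(), rao_blackwell.mean())   # both approximately p
print(crude.var(), rao_blackwell.var())     # p(1-p) vs p(1-p)/n: much smaller
```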

Asymptotic properties of estimators

In most cases the sampling distribution of an estimator is too complicated to compute its mean, variance, and MSE. In that case we use asymptotic analysis, i.e. we let the sample size $n \to \infty$. Consider a random sample $X_1,\dots,X_n$ and let $\hat\theta_n = t(X_1,\dots,X_n)$ be an estimator. What happens to the sampling distribution of $\hat\theta_n$ if the sample size becomes large? Because as $n$ becomes larger we know more and more about the population, the sampling distribution should behave as in the figure.

[Figure: sampling distributions of $\hat\theta_n$ concentrating around $\theta$ as $n$ increases.]

If $n = \infty$ we know the population, and the sampling distribution should be degenerate at $\theta$. Let $\hat\theta_n$ be the sequence of estimators for increasing sample size. We say that $\hat\theta_n$ is (weakly) consistent if
$$\hat\theta_n \stackrel{p}{\to} \theta.$$
If the convergence is a.s. we say that the estimator (sequence) is strongly consistent. Why does it not make sense to consider convergence in distribution?

Example: For an $N(\theta,\sigma^2)$ population the sample mean is strongly and weakly consistent for $\theta$.

Large sample behavior of the MLE

If for all $\theta \in \Theta$
$$E\left|\frac{\partial \ln f}{\partial\theta}(X;\theta)\right| < \infty,$$
then by the law of large numbers
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial \ln f}{\partial\theta}(X_i;\theta) \stackrel{p}{\to} E\left[\frac{\partial \ln f}{\partial\theta}(X_1;\theta)\right].$$
Also
$$\frac{\partial \ln f}{\partial\theta}(X_1;\theta) = \frac{\frac{\partial f}{\partial\theta}(X_1;\theta)}{f(X_1;\theta)},$$
so that, when the expectation is taken at the true parameter value $\theta$,
$$E\left[\frac{\partial \ln f}{\partial\theta}(X_1;\theta)\right] = \int \frac{\frac{\partial f}{\partial\theta}(x;\theta)}{f(x;\theta)}\,f(x;\theta)\,dx = \int \frac{\partial f}{\partial\theta}(x;\theta)\,dx = \frac{\partial}{\partial\theta}\int f(x;\theta)\,dx = 0,$$
where we have interchanged differentiation and integration, which is allowed e.g. if, for $\tilde\theta$ in a small interval around $\theta$, $\left|\frac{\partial f}{\partial\theta}(x;\tilde\theta)\right| \le g(x)$ with $g$ integrable.

For the MLE $\hat\theta_n$,
$$0 = \frac{1}{n}\sum_{i=1}^n \frac{\partial \ln f}{\partial\theta}(X_i;\hat\theta_n), \qquad 0 = E\left[\frac{\partial \ln f}{\partial\theta}(X_1;\theta)\right].$$
This suggests that
$$\hat\theta_n \stackrel{p}{\to} \theta,$$
i.e. the MLE is weakly consistent.

Asymptotic distribution of the MLE

By Taylor's theorem,
$$0 = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{\partial \ln f}{\partial\theta}(X_i;\hat\theta_n) = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{\partial \ln f}{\partial\theta}(X_i;\theta) + \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f}{\partial\theta^2}(X_i;\bar\theta_n)\,\sqrt n(\hat\theta_n - \theta),$$
with $\theta \le \bar\theta_n \le \hat\theta_n$ or $\hat\theta_n \le \bar\theta_n \le \theta$.

Consider the first term on the rhs and define
$$Y_i = \frac{\partial \ln f}{\partial\theta}(X_i;\theta);$$
then $E(Y_i) = 0$ and
$$\mathrm{Var}(Y_i) = E\left[\left(\frac{\partial \ln f}{\partial\theta}(X_i;\theta)\right)^2\right] = I(\theta).$$
By the Central Limit Theorem,
$$\frac{1}{\sqrt n}\sum_{i=1}^n \frac{\partial \ln f}{\partial\theta}(X_i;\theta) \stackrel{d}{\to} N(0, I(\theta)).$$

Next consider the second term on the rhs. Because $\hat\theta_n \stackrel{p}{\to} \theta$, also $\bar\theta_n \stackrel{p}{\to} \theta$. Further,
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f}{\partial\theta^2}(X_i;\theta) \stackrel{p}{\to} E\left[\frac{\partial^2 \ln f}{\partial\theta^2}(X_1;\theta)\right]$$
if
$$E\left|\frac{\partial^2 \ln f}{\partial\theta^2}(X_1;\theta)\right| < \infty.$$
This suggests that
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f}{\partial\theta^2}(X_i;\bar\theta_n) \stackrel{p}{\to} E\left[\frac{\partial^2 \ln f}{\partial\theta^2}(X_1;\theta)\right] = A(\theta).$$
By the Slutsky theorem,
$$0 = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{\partial \ln f}{\partial\theta}(X_i;\theta) + \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f}{\partial\theta^2}(X_i;\bar\theta_n)\,\sqrt n(\hat\theta_n - \theta)$$
implies that
$$\sqrt n(\hat\theta_n - \theta) \stackrel{d}{\to} N\big(0,\, A(\theta)^{-1} I(\theta) A(\theta)^{-1}\big).$$
This is the limit distribution of the MLE.

By the information matrix equality $A(\theta) = -I(\theta)$, so that this simplifies to
$$\sqrt n(\hat\theta_n - \theta) \stackrel{d}{\to} N\big(0, I(\theta)^{-1}\big).$$
Note that we consider the sequence $\sqrt n(\hat\theta_n - \theta)$, which is similar to $\sqrt n(\bar X_n - \mu)$ in the CLT. The variance of the limit distribution is equal to the Cramér-Rao lower bound (after scaling by $\sqrt n$). We say that the MLE is asymptotically efficient. The asymptotic normal distribution can be used to compute a confidence interval for $\theta$ of the form
$$\hat\theta_n - c\,\sqrt{\frac{I(\hat\theta_n)^{-1}}{n}} < \theta < \hat\theta_n + c\,\sqrt{\frac{I(\hat\theta_n)^{-1}}{n}}.$$
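
As an illustration of the asymptotic confidence interval, consider the exponential model of the MM example: the MLE is $\hat\lambda = 1/\bar X_n$ and the Fisher information per observation is $I(\lambda) = 1/\lambda^2$, so the interval is $\hat\lambda \pm c\,\hat\lambda/\sqrt n$. A minimal sketch, with $c = 1.96$ for a 95% interval and an arbitrary true rate:

```python
import numpy as np

rng = np.random.default_rng(7)
lam_true, n, c = 2.0, 400, 1.96              # illustrative values; c from N(0,1)
x = rng.exponential(scale=1.0 / lam_true, size=n)

lam_hat = 1.0 / x.mean()                     # MLE of lambda
se = lam_hat / np.sqrt(n)                    # sqrt(I(lam_hat)^{-1} / n) with I = 1/lambda^2
print(lam_hat - c * se, lam_hat + c * se)    # asymptotic 95% confidence interval
```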

Bootstrap

To obtain the sampling distribution of a statistic for finite $n$ one can use the computer. Consider the sample mean
$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i.$$
$\bar X_n$ is the mean over the empirical distribution $P_n$ that assigns probability $\frac{1}{n}$ to each of $X_1,\dots,X_n$:
$$\bar X_n = \int x\,dP_n.$$
The observed sample $x_1,\dots,x_n$ is a realization of $X_1,\dots,X_n$.

Now draw from $x_1,\dots,x_n$ a random sample of size $n$ with replacement, $x_1^*,\dots,x_n^*$. This is a draw from the empirical distribution that gives
$$\bar x_n^* = \frac{1}{n}\sum_{i=1}^n x_i^*.$$
Do this $N$ times and consider the $\bar x_n^*$'s as draws from the sampling distribution of $\bar X_n$. The justification of this procedure is that as $n \to \infty$ the empirical distribution converges to the population distribution, so that averaging over the empirical distribution is, in large samples, the same as averaging over the population distribution. This method of approximating the sampling distribution is called the bootstrap method, after a tall tale about the (in)famous Baron von Münchhausen (1720-1797).
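
A minimal bootstrap sketch for the sample mean, assuming NumPy and an arbitrary stand-in for the observed sample: the spread of the $N$ resampled means approximates the sampling distribution of $\bar X_n$.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.exponential(scale=1.0, size=100)     # stand-in for the observed sample

N = 5000                                     # number of bootstrap replications
boot_means = np.array([
    rng.choice(x, size=len(x), replace=True).mean() for _ in range(N)
])

print(x.mean(), boot_means.std())            # point estimate and bootstrap std. error
print(x.std(ddof=1) / np.sqrt(len(x)))       # compare with the usual standard error
```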