CHAPTER 5: MAXIMUM LIKELIHOOD ESTIMATION
Introduction to Efficient Estimation
Goal: the MLE is an asymptotically efficient estimator under some regularity conditions.
Basic setting
Suppose $X_1, \ldots, X_n$ are i.i.d. from $P_{\theta_0}$ in the model $\mathcal{P}$.
(A0). $\theta \neq \theta'$ implies $P_\theta \neq P_{\theta'}$ (identifiability).
(A1). $P_\theta$ has a density function $p_\theta$ with respect to a dominating $\sigma$-finite measure $\mu$.
(A2). The set $\{x : p_\theta(x) > 0\}$ does not depend on $\theta$.
MLE definition
$$L_n(\theta) = \prod_{i=1}^n p_\theta(X_i), \qquad l_n(\theta) = \sum_{i=1}^n \log p_\theta(X_i).$$
$L_n(\theta)$ and $l_n(\theta)$ are called the likelihood function and the log-likelihood function of $\theta$, respectively. An estimator $\hat\theta_n$ of $\theta_0$ is the maximum likelihood estimator (MLE) of $\theta_0$ if it maximizes the likelihood function $L_n(\theta)$.
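As a small illustration (not part of the original notes), for an i.i.d. $\text{Exp}(\lambda)$ sample the log-likelihood is $l_n(\lambda) = n\log\lambda - \lambda\sum_i X_i$, maximized in closed form at $\hat\lambda_n = 1/\bar{X}_n$; a quick grid check confirms this. The data below are made up for illustration.

```python
import math

def exp_log_lik(lam, xs):
    """Log-likelihood l_n(lambda) = n*log(lambda) - lambda*sum(x) for an Exp(lambda) sample."""
    return len(xs) * math.log(lam) - lam * sum(xs)

xs = [0.5, 1.2, 0.3, 2.0, 0.8, 1.5]   # illustrative data, not from the notes
mle = len(xs) / sum(xs)               # closed-form MLE: 1 / sample mean

# The closed form should dominate every candidate rate on a crude grid.
grid = [0.1 * k for k in range(1, 40)]
assert all(exp_log_lik(mle, xs) >= exp_log_lik(lam, xs) for lam in grid)
```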
Ad Hoc Arguments
$$\sqrt{n}(\hat\theta_n - \theta_0) \rightarrow_d N(0, I(\theta_0)^{-1})$$
Consistency: $\hat\theta_n \rightarrow \theta_0$ (no asymptotic bias).
Efficiency: the asymptotic variance attains the efficiency bound $I(\theta_0)^{-1}$.
Consistency
Definition 5.1 Let $P$ be a probability measure and let $Q$ be another measure on $(\Omega, \mathcal{A})$, with densities $p$ and $q$ with respect to a $\sigma$-finite measure $\mu$ ($\mu = P + Q$ always works), where $P(\Omega) = 1$ and $Q(\Omega) \le 1$. Then the Kullback-Leibler information $K(P, Q)$ is
$$K(P, Q) = E_P\left[\log \frac{p(X)}{q(X)}\right].$$
Proposition 5.1 $K(P, Q)$ is well-defined, and $K(P, Q) \ge 0$. $K(P, Q) = 0$ if and only if $P = Q$.

Proof By Jensen's inequality,
$$K(P, Q) = E_P\left[-\log \frac{q(X)}{p(X)}\right] \ge -\log E_P\left[\frac{q(X)}{p(X)}\right] = -\log Q(\Omega) \ge 0.$$
The equality holds if and only if $p(x) = M q(x)$ almost surely with respect to $P$ and $Q(\Omega) = 1$, which together imply $P = Q$.
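Proposition 5.1 is easy to check numerically for discrete distributions; the following sketch (not from the notes) verifies that $K(P, Q) > 0$ for two distinct distributions and that $K(P, P) = 0$.

```python
import math

def kl(p, q):
    """Kullback-Leibler information K(P, Q) = sum_x p(x)*log(p(x)/q(x)) for discrete P, Q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

assert kl(p, q) > 0            # distinct distributions: strictly positive
assert abs(kl(p, p)) < 1e-12   # P = Q: the information vanishes
```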
Why is MLE consistent?
$\hat\theta_n$ maximizes $l_n(\theta)$, so
$$\frac{1}{n}\sum_{i=1}^n \log p_{\hat\theta_n}(X_i) \ge \frac{1}{n}\sum_{i=1}^n \log p_{\theta_0}(X_i).$$
Suppose $\hat\theta_n \rightarrow \theta^*$. Then we would expect both sides to converge, giving
$$E_{\theta_0}[\log p_{\theta^*}(X)] \ge E_{\theta_0}[\log p_{\theta_0}(X)],$$
which implies $K(P_{\theta_0}, P_{\theta^*}) \le 0$. From Prop. 5.1, $P_{\theta_0} = P_{\theta^*}$. From (A0), $\theta^* = \theta_0$. That is, $\hat\theta_n$ converges to $\theta_0$.
Why is MLE efficient?
Suppose $\hat\theta_n \rightarrow \theta_0$. $\hat\theta_n$ solves the likelihood (or score) equations
$$\dot{l}_n(\hat\theta_n) = \sum_{i=1}^n \dot{l}_{\hat\theta_n}(X_i) = 0.$$
Taylor expansion at $\theta_0$:
$$0 = \sum_{i=1}^n \dot{l}_{\theta_0}(X_i) + \sum_{i=1}^n \ddot{l}_{\theta^*}(X_i)(\hat\theta_n - \theta_0),$$
where $\theta^*$ is between $\theta_0$ and $\hat\theta_n$. Hence
$$\sqrt{n}(\hat\theta_n - \theta_0) = \left\{-\frac{1}{n}\sum_{i=1}^n \ddot{l}_{\theta^*}(X_i)\right\}^{-1} \left\{\frac{1}{\sqrt{n}}\sum_{i=1}^n \dot{l}_{\theta_0}(X_i)\right\}.$$
$\sqrt{n}(\hat\theta_n - \theta_0)$ is asymptotically equivalent to
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n I(\theta_0)^{-1}\dot{l}_{\theta_0}(X_i).$$
Then $\hat\theta_n$ is an asymptotically linear estimator of $\theta_0$ with influence function $I(\theta_0)^{-1}\dot{l}_{\theta_0} = \tilde{l}(\cdot, P_{\theta_0} \mid \theta, \mathcal{P})$.
Consistency Results
Theorem 5.1 (Consistency with dominating function) Suppose that
(a) $\Theta$ is compact.
(b) $\log p_\theta(x)$ is continuous in $\theta$ for all $x$.
(c) There exists a function $F(x)$ such that $E_{\theta_0}[F(X)] < \infty$ and $|\log p_\theta(x)| \le F(x)$ for all $x$ and $\theta$.
Then $\hat\theta_n \rightarrow_{a.s.} \theta_0$.
Proof For any fixed sample $\omega \in \Omega$, since $\Theta$ is compact, by choosing a subsequence we may assume $\hat\theta_n \rightarrow \theta^*$. If
$$\frac{1}{n}\sum_{i=1}^n l_{\hat\theta_n}(X_i) \rightarrow E_{\theta_0}[l_{\theta^*}(X)],$$
then since
$$\frac{1}{n}\sum_{i=1}^n l_{\hat\theta_n}(X_i) \ge \frac{1}{n}\sum_{i=1}^n l_{\theta_0}(X_i) \rightarrow E_{\theta_0}[l_{\theta_0}(X)],$$
we obtain $E_{\theta_0}[l_{\theta^*}(X)] \ge E_{\theta_0}[l_{\theta_0}(X)]$, hence $\theta^* = \theta_0$. Done!
It remains to show $\mathbb{P}_n[l_{\hat\theta_n}(X)] \equiv \frac{1}{n}\sum_{i=1}^n l_{\hat\theta_n}(X_i) \rightarrow E_{\theta_0}[l_{\theta^*}(X)]$. It suffices to show $\mathbb{P}_n[l_{\hat\theta_n}(X)] - E_{\theta_0}[l_{\hat\theta_n}(X)] \rightarrow 0$.
We can even prove the following uniform convergence result:
$$\sup_{\theta \in \Theta}\left|\mathbb{P}_n[l_\theta(X)] - E_{\theta_0}[l_\theta(X)]\right| \rightarrow 0.$$
Define
$$\psi(x, \theta, \rho) = \sup_{|\theta' - \theta| < \rho}\left(l_{\theta'}(x) - E_{\theta_0}[l_{\theta'}(X)]\right).$$
Since $l_\theta$ is continuous, $\psi(x, \theta, \rho)$ is measurable, and by the dominated convergence theorem, $E_{\theta_0}[\psi(X, \theta, \rho)]$ decreases, as $\rho \downarrow 0$, to
$$E_{\theta_0}\left[l_\theta(X) - E_{\theta_0}[l_\theta(X)]\right] = 0.$$
Thus for any $\epsilon > 0$ and any $\theta \in \Theta$, there exists a $\rho_\theta$ such that $E_{\theta_0}[\psi(X, \theta, \rho_\theta)] < \epsilon$.
The union of $\{\theta' : |\theta' - \theta| < \rho_\theta\}$ over $\theta$ covers $\Theta$. By the compactness of $\Theta$, there exist finitely many $\theta_1, \ldots, \theta_m$ such that
$$\Theta \subset \bigcup_{i=1}^m \{\theta : |\theta - \theta_i| < \rho_{\theta_i}\}.$$
Then
$$\sup_{\theta \in \Theta}\left\{\mathbb{P}_n[l_\theta(X)] - E_{\theta_0}[l_\theta(X)]\right\} \le \max_{1 \le i \le m} \mathbb{P}_n[\psi(X, \theta_i, \rho_{\theta_i})],$$
and therefore
$$\limsup_n \sup_{\theta \in \Theta}\left\{\mathbb{P}_n[l_\theta(X)] - E_{\theta_0}[l_\theta(X)]\right\} \le \max_{1 \le i \le m} E_{\theta_0}[\psi(X, \theta_i, \rho_{\theta_i})] \le \epsilon.$$
Similarly, $\limsup_n \sup_{\theta \in \Theta}\left\{-\mathbb{P}_n[l_\theta(X)] + E_{\theta_0}[l_\theta(X)]\right\} \le \epsilon$. Since $\epsilon$ is arbitrary,
$$\lim_n \sup_{\theta \in \Theta}\left|\mathbb{P}_n[l_\theta(X)] - E_{\theta_0}[l_\theta(X)]\right| = 0.$$
Theorem 5.2 (Wald's Consistency) Suppose $\Theta$ is compact, and $\theta \mapsto l_\theta(x) = \log p_\theta(x)$ is upper-semicontinuous for all $x$, in the sense that
$$\limsup_{\theta' \rightarrow \theta} l_{\theta'}(x) \le l_\theta(x).$$
Suppose for every sufficiently small ball $U \subset \Theta$, $E_{\theta_0}[\sup_{\theta \in U} l_\theta(X)] < \infty$. Then $\hat\theta_n \rightarrow_p \theta_0$.
Proof We first show that $E_{\theta_0}[l_{\theta_0}(X)] > E_{\theta_0}[l_\theta(X)]$ for any $\theta \neq \theta_0$ implies that there exists a ball $U_\theta$ containing $\theta$ such that
$$E_{\theta_0}[l_{\theta_0}(X)] > E_{\theta_0}\left[\sup_{\theta' \in U_\theta} l_{\theta'}(X)\right].$$
Otherwise, there would exist a sequence $\theta_m \rightarrow \theta$ with $E_{\theta_0}[l_{\theta_0}(X)] \le E_{\theta_0}[l_{\theta_m}(X)]$. Since $l_{\theta_m}(x) \le \sup_{\theta' \in U} l_{\theta'}(x)$, where $U$ is a small ball satisfying the integrability condition, upper-semicontinuity and dominated convergence give
$$\limsup_m E_{\theta_0}[l_{\theta_m}(X)] \le E_{\theta_0}\left[\limsup_m l_{\theta_m}(X)\right] \le E_{\theta_0}[l_\theta(X)].$$
Thus $E_{\theta_0}[l_{\theta_0}(X)] \le E_{\theta_0}[l_\theta(X)]$, a contradiction!
For any $\epsilon$, the balls $U_\theta$ cover the compact set $\Theta \cap \{|\theta - \theta_0| > \epsilon\}$, so there exists a finite covering by balls $U_1, \ldots, U_m$. Then
$$P(|\hat\theta_n - \theta_0| > \epsilon) \le P\left(\sup_{|\theta - \theta_0| > \epsilon} \mathbb{P}_n[l_\theta(X)] \ge \mathbb{P}_n[l_{\theta_0}(X)]\right) \le P\left(\max_{1 \le i \le m} \mathbb{P}_n\left[\sup_{\theta \in U_i} l_\theta(X)\right] \ge \mathbb{P}_n[l_{\theta_0}(X)]\right)$$
$$\le \sum_{i=1}^m P\left(\mathbb{P}_n\left[\sup_{\theta \in U_i} l_\theta(X)\right] \ge \mathbb{P}_n[l_{\theta_0}(X)]\right).$$
Since
$$\mathbb{P}_n\left[\sup_{\theta \in U_i} l_\theta(X)\right] \rightarrow_{a.s.} E_{\theta_0}\left[\sup_{\theta \in U_i} l_\theta(X)\right] < E_{\theta_0}[l_{\theta_0}(X)],$$
the right-hand side converges to zero. Hence $\hat\theta_n \rightarrow_p \theta_0$.
Asymptotic Efficiency Result
Theorem 5.3 Suppose that the model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is Hellinger differentiable at an inner point $\theta_0$ of $\Theta \subset R^k$. Furthermore, suppose that there exists a measurable function $F$ with $E_{\theta_0}[F^2] < \infty$ such that for every $\theta_1$ and $\theta_2$ in a neighborhood of $\theta_0$,
$$|\log p_{\theta_1}(x) - \log p_{\theta_2}(x)| \le F(x)\,|\theta_1 - \theta_2|.$$
If the Fisher information matrix $I(\theta_0)$ is nonsingular and $\hat\theta_n$ is consistent, then
$$\sqrt{n}(\hat\theta_n - \theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n I(\theta_0)^{-1}\dot{l}_{\theta_0}(X_i) + o_{P_{\theta_0}}(1).$$
In particular, $\sqrt{n}(\hat\theta_n - \theta_0) \rightarrow_d N(0, I(\theta_0)^{-1})$.
Proof For any $h_n \rightarrow h$, by the Hellinger differentiability,
$$W_n = 2\sqrt{n}\left(\sqrt{\frac{p_{\theta_0 + h_n/\sqrt{n}}}{p_{\theta_0}}} - 1\right) \rightarrow h^T\dot{l}_{\theta_0} \quad \text{in } L_2(P_{\theta_0}),$$
and hence
$$\sqrt{n}\left(\log p_{\theta_0 + h_n/\sqrt{n}} - \log p_{\theta_0}\right) = 2\sqrt{n}\log\left(1 + \frac{W_n}{2\sqrt{n}}\right) \rightarrow_p h^T\dot{l}_{\theta_0}.$$
Both
$$E_{\theta_0}\left[\sqrt{n}(\mathbb{P}_n - P)\left[\sqrt{n}(\log p_{\theta_0+h_n/\sqrt{n}} - \log p_{\theta_0}) - h^T\dot{l}_{\theta_0}\right]\right] \rightarrow 0$$
and
$$Var_{\theta_0}\left[\sqrt{n}(\mathbb{P}_n - P)\left[\sqrt{n}(\log p_{\theta_0+h_n/\sqrt{n}} - \log p_{\theta_0}) - h^T\dot{l}_{\theta_0}\right]\right] \rightarrow 0,$$
so
$$\sqrt{n}(\mathbb{P}_n - P)\left[\sqrt{n}(\log p_{\theta_0+h_n/\sqrt{n}} - \log p_{\theta_0}) - h^T\dot{l}_{\theta_0}\right] \rightarrow_p 0.$$
From Step I in proving Theorem 4.1,
$$\sum_{i=1}^n \log\frac{p_{\theta_0 + h_n/\sqrt{n}}}{p_{\theta_0}}(X_i) = \frac{1}{\sqrt{n}}\sum_{i=1}^n h^T\dot{l}_{\theta_0}(X_i) - \frac{1}{2}h^T I(\theta_0)h + o_{P_{\theta_0}}(1),$$
and
$$n E_{\theta_0}\left[\log p_{\theta_0 + h_n/\sqrt{n}} - \log p_{\theta_0}\right] \rightarrow -\frac{1}{2}h^T I(\theta_0)h.$$
Combining,
$$n\mathbb{P}_n\left[\log p_{\theta_0 + h_n/\sqrt{n}} - \log p_{\theta_0}\right] = -\frac{1}{2}h_n^T I(\theta_0)h_n + h_n^T\sqrt{n}(\mathbb{P}_n - P)[\dot{l}_{\theta_0}] + o_{P_{\theta_0}}(1).$$
Choose $h_n = \sqrt{n}(\hat\theta_n - \theta_0)$ and, separately, $\tilde{h}_n = I(\theta_0)^{-1}\sqrt{n}(\mathbb{P}_n - P)[\dot{l}_{\theta_0}]$. The latter choice gives
$$n\mathbb{P}_n\left[\log p_{\theta_0 + \tilde{h}_n/\sqrt{n}} - \log p_{\theta_0}\right] = \frac{1}{2}\left\{\sqrt{n}(\mathbb{P}_n - P)[\dot{l}_{\theta_0}]\right\}^T I(\theta_0)^{-1}\left\{\sqrt{n}(\mathbb{P}_n - P)[\dot{l}_{\theta_0}]\right\} + o_{P_{\theta_0}}(1).$$
Comparing the two expansions and using that $\hat\theta_n$ maximizes the likelihood,
$$-\frac{1}{2}\left\{\sqrt{n}(\hat\theta_n - \theta_0) - I(\theta_0)^{-1}\sqrt{n}(\mathbb{P}_n - P)[\dot{l}_{\theta_0}]\right\}^T I(\theta_0)\left\{\sqrt{n}(\hat\theta_n - \theta_0) - I(\theta_0)^{-1}\sqrt{n}(\mathbb{P}_n - P)[\dot{l}_{\theta_0}]\right\} + o_{P_{\theta_0}}(1) \ge 0.$$
Hence
$$\sqrt{n}(\hat\theta_n - \theta_0) = I(\theta_0)^{-1}\sqrt{n}(\mathbb{P}_n - P)[\dot{l}_{\theta_0}] + o_{P_{\theta_0}}(1).$$
Theorem 5.4 Suppose $\theta$ ranges over an open subset of a Euclidean space, and let $\theta \mapsto l_\theta(x) = \log p_\theta(x)$ be twice continuously differentiable for every $x$. Suppose $E_{\theta_0}[\dot{l}_{\theta_0}\dot{l}_{\theta_0}^T] < \infty$ and $E_{\theta_0}[\ddot{l}_{\theta_0}]$ exists and is nonsingular. Assume that the second partial derivative of $l_\theta(x)$ is dominated by a fixed integrable function $F(x)$ for every $\theta$ in a neighborhood of $\theta_0$. Suppose $\hat\theta_n \rightarrow_p \theta_0$. Then
$$\sqrt{n}(\hat\theta_n - \theta_0) = -\left(E_{\theta_0}[\ddot{l}_{\theta_0}]\right)^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n \dot{l}_{\theta_0}(X_i) + o_{P_{\theta_0}}(1).$$
Proof $\hat\theta_n$ solves $0 = \sum_{i=1}^n \dot{l}_{\hat\theta_n}(X_i)$. A Taylor expansion around $\theta_0$ gives
$$0 = \sum_{i=1}^n \dot{l}_{\theta_0}(X_i) + \left\{\sum_{i=1}^n \ddot{l}_{\theta_0}(X_i)\right\}(\hat\theta_n - \theta_0) + \frac{1}{2}(\hat\theta_n - \theta_0)^T\left\{\sum_{i=1}^n l^{(3)}_{\theta^*}(X_i)\right\}(\hat\theta_n - \theta_0),$$
with $\theta^*$ between $\theta_0$ and $\hat\theta_n$. Dividing by $n$, the remainder term is bounded by $\frac{1}{n}\sum_{i=1}^n F(X_i)\,|\hat\theta_n - \theta_0|^2$, which is $o_p(|\hat\theta_n - \theta_0|)$ since $\hat\theta_n - \theta_0 = o_p(1)$. Therefore
$$\sqrt{n}(\hat\theta_n - \theta_0)\left\{\frac{1}{n}\sum_{i=1}^n \ddot{l}_{\theta_0}(X_i) + o_p(1)\right\} = -\frac{1}{\sqrt{n}}\sum_{i=1}^n \dot{l}_{\theta_0}(X_i),$$
and the result follows because $\frac{1}{n}\sum_{i=1}^n \ddot{l}_{\theta_0}(X_i) \rightarrow_{a.s.} E_{\theta_0}[\ddot{l}_{\theta_0}(X)]$.
Computation of MLE
Solve the likelihood equation
$$\sum_{i=1}^n \dot{l}_\theta(X_i) = 0.$$
Newton-Raphson iteration: at the $k$th iteration,
$$\theta^{(k+1)} = \theta^{(k)} - \left\{\frac{1}{n}\sum_{i=1}^n \ddot{l}_{\theta^{(k)}}(X_i)\right\}^{-1}\left\{\frac{1}{n}\sum_{i=1}^n \dot{l}_{\theta^{(k)}}(X_i)\right\}.$$
Note $-\frac{1}{n}\sum_{i=1}^n \ddot{l}_{\theta^{(k)}}(X_i) \approx I(\theta^{(k)})$. Fisher scoring algorithm:
$$\theta^{(k+1)} = \theta^{(k)} + I(\theta^{(k)})^{-1}\left\{\frac{1}{n}\sum_{i=1}^n \dot{l}_{\theta^{(k)}}(X_i)\right\}.$$
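Both iterations can be sketched for a simple concrete case, the Poisson rate $\theta$ (chosen here for illustration; it is not an example from the notes). For Poisson data, $\dot{l}_\theta(x) = x/\theta - 1$, $\ddot{l}_\theta(x) = -x/\theta^2$, $I(\theta) = 1/\theta$, and the MLE is the sample mean:

```python
# Newton-Raphson vs Fisher scoring for the Poisson rate theta; the MLE is xbar.
xs = [2, 3, 0, 4, 1, 2, 5, 3]          # fixed illustrative counts
xbar = sum(xs) / len(xs)

def score(theta):                      # (1/n) * sum of l-dot_theta(X_i)
    return xbar / theta - 1.0

def hessian(theta):                    # (1/n) * sum of l-ddot_theta(X_i)
    return -xbar / theta**2

def fisher_info(theta):                # I(theta) = 1/theta for the Poisson model
    return 1.0 / theta

theta_nr = theta_fs = 1.0
for _ in range(25):
    theta_nr = theta_nr - score(theta_nr) / hessian(theta_nr)       # Newton-Raphson
    theta_fs = theta_fs + score(theta_fs) / fisher_info(theta_fs)   # Fisher scoring

assert abs(theta_nr - xbar) < 1e-8
assert abs(theta_fs - xbar) < 1e-8
```

For this particular model Fisher scoring reaches the MLE in a single step, since $\theta + \theta(\bar{x}/\theta - 1) = \bar{x}$, while Newton-Raphson takes several iterations.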
Optimize the likelihood function directly with an optimum search algorithm: grid search, quasi-Newton methods, gradient descent, MCMC, simulated annealing.
EM Algorithm for Missing Data
When part of the data is missing, or some mis-measured data is observed, a commonly used algorithm is the expectation-maximization (EM) algorithm.

Framework of the EM algorithm
$Y = (Y_{mis}, Y_{obs})$. $R$ is a vector of 0/1 indicating which components are missing/not missing, so that $Y_{obs} = RY$. The density function for the observed data $(Y_{obs}, R)$ is
$$\int_{Y_{mis}} f(Y; \theta) P(R \mid Y)\, dY_{mis}.$$
Missing mechanism
Missing at random assumption (MAR): $P(R \mid Y) = P(R \mid Y_{obs})$ and $P(R \mid Y)$ does not depend on $\theta$; i.e., the missing probability only depends on the observed data and is not informative about $\theta$. Under MAR, the observed-data likelihood factors as
$$\left\{\int_{Y_{mis}} f(Y; \theta)\, dY_{mis}\right\} P(R \mid Y_{obs}),$$
so we maximize
$$\int_{Y_{mis}} f(Y; \theta)\, dY_{mis} \quad \text{or} \quad \log \int_{Y_{mis}} f(Y; \theta)\, dY_{mis}.$$
Details of the EM algorithm
We start from any initial value $\theta^{(1)}$ and use the following iterations. The $k$th iteration consists of an E-step and an M-step.

E-step. We evaluate the conditional expectation
$$E\left[\log f(Y; \theta) \mid Y_{obs}, \theta^{(k)}\right] = \frac{\int_{Y_{mis}} [\log f(Y; \theta)]\, f(Y; \theta^{(k)})\, dY_{mis}}{\int_{Y_{mis}} f(Y; \theta^{(k)})\, dY_{mis}}.$$
M-step. We obtain $\theta^{(k+1)}$ by maximizing $E\left[\log f(Y; \theta) \mid Y_{obs}, \theta^{(k)}\right]$ over $\theta$. We then iterate until convergence of $\theta$; i.e., until the difference between $\theta^{(k+1)}$ and $\theta^{(k)}$ is less than a given criterion.
Rationale: why EM works
Theorem 5.5 At each iteration of the EM algorithm,
$$\log f(Y_{obs}; \theta^{(k+1)}) \ge \log f(Y_{obs}; \theta^{(k)}),$$
and the equality holds if and only if $\theta^{(k+1)} = \theta^{(k)}$.
Proof By the M-step,
$$E\left[\log f(Y; \theta^{(k+1)}) \mid Y_{obs}, \theta^{(k)}\right] \ge E\left[\log f(Y; \theta^{(k)}) \mid Y_{obs}, \theta^{(k)}\right].$$
Writing $f(Y; \theta) = f(Y_{mis} \mid Y_{obs}; \theta) f(Y_{obs}; \theta)$, this becomes
$$E\left[\log f(Y_{mis} \mid Y_{obs}; \theta^{(k+1)}) \mid Y_{obs}, \theta^{(k)}\right] + \log f(Y_{obs}; \theta^{(k+1)}) \ge E\left[\log f(Y_{mis} \mid Y_{obs}; \theta^{(k)}) \mid Y_{obs}, \theta^{(k)}\right] + \log f(Y_{obs}; \theta^{(k)}).$$
By the nonnegativity of the Kullback-Leibler information (Proposition 5.1),
$$E\left[\log f(Y_{mis} \mid Y_{obs}; \theta^{(k+1)}) \mid Y_{obs}, \theta^{(k)}\right] \le E\left[\log f(Y_{mis} \mid Y_{obs}; \theta^{(k)}) \mid Y_{obs}, \theta^{(k)}\right],$$
so $\log f(Y_{obs}; \theta^{(k+1)}) \ge \log f(Y_{obs}; \theta^{(k)})$. The equality holds iff $f(Y_{mis} \mid Y_{obs}; \theta^{(k+1)}) = f(Y_{mis} \mid Y_{obs}; \theta^{(k)})$, i.e., $f(Y; \theta^{(k+1)}) = f(Y; \theta^{(k)})$.
Incorporating Newton-Raphson in EM
E-step. We evaluate the conditional expectations
$$E\left[\frac{\partial}{\partial\theta}\log f(Y; \theta) \,\Big|\, Y_{obs}, \theta^{(k)}\right] \quad \text{and} \quad E\left[\frac{\partial^2}{\partial\theta^2}\log f(Y; \theta) \,\Big|\, Y_{obs}, \theta^{(k)}\right].$$
M-step. We obtain $\theta^{(k+1)}$ by solving
$$0 = E\left[\frac{\partial}{\partial\theta}\log f(Y; \theta) \,\Big|\, Y_{obs}, \theta^{(k)}\right]$$
using a one-step Newton-Raphson iteration:
$$\theta^{(k+1)} = \theta^{(k)} - \left\{E\left[\frac{\partial^2}{\partial\theta^2}\log f(Y; \theta) \,\Big|\, Y_{obs}, \theta^{(k)}\right]\right\}^{-1} E\left[\frac{\partial}{\partial\theta}\log f(Y; \theta) \,\Big|\, Y_{obs}, \theta^{(k)}\right]\Bigg|_{\theta = \theta^{(k)}}.$$
Example Suppose a random vector $Y$ has a multinomial distribution with $n = 197$ and
$$p = \left(\frac{1}{2} + \frac{\theta}{4},\; \frac{1-\theta}{4},\; \frac{1-\theta}{4},\; \frac{\theta}{4}\right).$$
Then the probability for $Y = (y_1, y_2, y_3, y_4)$ is given by
$$\frac{n!}{y_1!\, y_2!\, y_3!\, y_4!}\left(\frac{1}{2} + \frac{\theta}{4}\right)^{y_1}\left(\frac{1-\theta}{4}\right)^{y_2}\left(\frac{1-\theta}{4}\right)^{y_3}\left(\frac{\theta}{4}\right)^{y_4}.$$
Suppose we observe $Y = (125, 18, 20, 34)$. If we start with $\theta^{(1)} = 0.5$, the Newton-Raphson iteration converges to the MLE $\hat\theta_n$.
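The Newton-Raphson iteration for this example is easy to reproduce; the sketch below (not from the notes) uses the observed-data score $\dot{l}(\theta) = 125/(2+\theta) - 38/(1-\theta) + 34/\theta$, which follows from the likelihood above after dropping constants.

```python
# Newton-Raphson for the multinomial example with Y = (125, 18, 20, 34).
y1, y2, y3, y4 = 125, 18, 20, 34

def score(t):
    # Observed-data score: derivative of y1*log(1/2+t/4) + (y2+y3)*log((1-t)/4) + y4*log(t/4).
    return y1 / (2 + t) - (y2 + y3) / (1 - t) + y4 / t

def dscore(t):
    return -y1 / (2 + t)**2 - (y2 + y3) / (1 - t)**2 - y4 / t**2

theta = 0.5
for _ in range(10):
    theta = theta - score(theta) / dscore(theta)

# The score equation reduces to 197*theta^2 - 15*theta - 68 = 0 on (0, 1).
assert abs(197 * theta**2 - 15 * theta - 68) < 1e-8
print(round(theta, 4))   # approximately 0.6268
```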
EM algorithm: the full data $X$ has a multinomial distribution with $n$ and
$$p = \left(\frac{1}{2},\; \frac{\theta}{4},\; \frac{1-\theta}{4},\; \frac{1-\theta}{4},\; \frac{\theta}{4}\right),$$
and we observe $Y = (X_1 + X_2, X_3, X_4, X_5)$.
The score equation for the complete data $X$ is simple:
$$0 = \frac{X_2 + X_5}{\theta} - \frac{X_3 + X_4}{1 - \theta}.$$
The M-step of the EM algorithm solves the equation
$$0 = E\left[\frac{X_2 + X_5}{\theta} - \frac{X_3 + X_4}{1 - \theta} \,\Big|\, Y, \theta^{(k)}\right],$$
while the E-step evaluates this expectation. Here
$$E[X \mid Y, \theta^{(k)}] = \left(Y_1\frac{1/2}{1/2 + \theta^{(k)}/4},\; Y_1\frac{\theta^{(k)}/4}{1/2 + \theta^{(k)}/4},\; Y_2,\; Y_3,\; Y_4\right),$$
so
$$\theta^{(k+1)} = \frac{E[X_2 + X_5 \mid Y, \theta^{(k)}]}{E[X_2 + X_5 + X_3 + X_4 \mid Y, \theta^{(k)}]} = \frac{Y_1\frac{\theta^{(k)}/4}{1/2 + \theta^{(k)}/4} + Y_4}{Y_1\frac{\theta^{(k)}/4}{1/2 + \theta^{(k)}/4} + Y_2 + Y_3 + Y_4}.$$
We start from $\theta^{(1)} = 0.5$.
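The EM update above is a one-line recursion; the sketch below (not from the notes) iterates it and checks the fixed point against the observed-data score equation, which reduces to $197\theta^2 - 15\theta - 68 = 0$ (a reduction one can derive directly from the observed-data likelihood).

```python
# EM iteration for the multinomial example, starting from theta^(1) = 0.5.
y1, y2, y3, y4 = 125, 18, 20, 34

def em_update(t):
    x2 = y1 * (t / 4) / (0.5 + t / 4)   # E-step: E[X_2 | Y, theta]
    return (x2 + y4) / (x2 + y2 + y3 + y4)   # M-step: closed-form maximizer

theta = 0.5
for _ in range(50):
    theta = em_update(theta)

# The EM fixed point is the same root found by Newton-Raphson.
assert abs(197 * theta**2 - 15 * theta - 68) < 1e-6
```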
[Table of EM iterates: $k$, $\theta^{(k+1)}$, the successive differences $\theta^{(k+1)} - \theta^{(k)}$, and the ratios $(\theta^{(k+1)} - \hat\theta_n)/(\theta^{(k)} - \hat\theta_n)$; numerical values omitted.]
Conclusions
- The EM converges, and the result agrees with what is obtained from the Newton-Raphson iteration.
- The EM convergence is linear, since $(\theta^{(k+1)} - \hat\theta_n)/(\theta^{(k)} - \hat\theta_n)$ approaches a constant near convergence.
- The convergence of the Newton-Raphson iteration is quadratic, in the sense that $(\theta^{(k+1)} - \hat\theta_n)/(\theta^{(k)} - \hat\theta_n)^2$ approaches a constant near convergence.
- Each EM step is much less complex than a Newton-Raphson step; this is the advantage of using the EM algorithm.
More examples: an exponential mixture model
Suppose $Y \sim P_\theta$, where $P_\theta$ has density
$$p_\theta(y) = \left\{p\lambda e^{-\lambda y} + (1-p)\mu e^{-\mu y}\right\} I(y > 0)$$
and $\theta = (p, \lambda, \mu) \in (0,1) \times (0,\infty) \times (0,\infty)$. Consider estimation of $\theta$ based on $Y_1, \ldots, Y_n$ i.i.d. $\sim p_\theta(y)$. Solving the likelihood equation by Newton-Raphson is computationally involved.
EM algorithm: the complete data is $X = (Y, \Delta) \sim p_\theta(x)$, where
$$p_\theta(x) = p_\theta(y, \delta) = (p\lambda e^{-\lambda y})^\delta\left((1-p)\mu e^{-\mu y}\right)^{1-\delta}.$$
This is natural from the following mechanism: $\Delta$ is a Bernoulli variable with $P(\Delta = 1) = p$, and we generate $Y$ from $\text{Exp}(\lambda)$ if $\Delta = 1$ and from $\text{Exp}(\mu)$ if $\Delta = 0$. Thus $\Delta$ is missing. The score equations for $\theta$ based on $X$ are
$$0 = \frac{\partial}{\partial p}\, l(X_1, \ldots, X_n) = \sum_{i=1}^n \left\{\frac{\Delta_i}{p} - \frac{1 - \Delta_i}{1 - p}\right\},$$
$$0 = \frac{\partial}{\partial\lambda}\, l(X_1, \ldots, X_n) = \sum_{i=1}^n \Delta_i\left(\frac{1}{\lambda} - Y_i\right),$$
$$0 = \frac{\partial}{\partial\mu}\, l(X_1, \ldots, X_n) = \sum_{i=1}^n (1 - \Delta_i)\left(\frac{1}{\mu} - Y_i\right).$$
The M-step solves the equations
$$0 = \sum_{i=1}^n E\left[\left\{\frac{\Delta_i}{p} - \frac{1-\Delta_i}{1-p}\right\} \,\Big|\, Y_1, \ldots, Y_n, p^{(k)}, \lambda^{(k)}, \mu^{(k)}\right] = \sum_{i=1}^n E\left[\left\{\frac{\Delta_i}{p} - \frac{1-\Delta_i}{1-p}\right\} \,\Big|\, Y_i, p^{(k)}, \lambda^{(k)}, \mu^{(k)}\right],$$
$$0 = \sum_{i=1}^n E\left[\Delta_i\left(\frac{1}{\lambda} - Y_i\right) \,\Big|\, Y_i, p^{(k)}, \lambda^{(k)}, \mu^{(k)}\right],$$
$$0 = \sum_{i=1}^n E\left[(1-\Delta_i)\left(\frac{1}{\mu} - Y_i\right) \,\Big|\, Y_i, p^{(k)}, \lambda^{(k)}, \mu^{(k)}\right],$$
where the conditioning reduces from $(Y_1, \ldots, Y_n)$ to $Y_i$ by independence.
This immediately gives
$$p^{(k+1)} = \frac{1}{n}\sum_{i=1}^n E[\Delta_i \mid Y_i, p^{(k)}, \lambda^{(k)}, \mu^{(k)}],$$
$$\lambda^{(k+1)} = \frac{\sum_{i=1}^n E[\Delta_i \mid Y_i, p^{(k)}, \lambda^{(k)}, \mu^{(k)}]}{\sum_{i=1}^n Y_i\, E[\Delta_i \mid Y_i, p^{(k)}, \lambda^{(k)}, \mu^{(k)}]}, \qquad \mu^{(k+1)} = \frac{\sum_{i=1}^n E[(1-\Delta_i) \mid Y_i, p^{(k)}, \lambda^{(k)}, \mu^{(k)}]}{\sum_{i=1}^n Y_i\, E[(1-\Delta_i) \mid Y_i, p^{(k)}, \lambda^{(k)}, \mu^{(k)}]}.$$
The conditional expectation is
$$E[\Delta \mid Y, \theta] = \frac{p\lambda e^{-\lambda Y}}{p\lambda e^{-\lambda Y} + (1-p)\mu e^{-\mu Y}}.$$
As seen above, the EM algorithm facilitates the computation.
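The closed-form updates make the algorithm a few lines of code. The sketch below (illustrative fixed data, not from the notes) also verifies Theorem 5.5 empirically: the observed-data log-likelihood never decreases along the EM iterations.

```python
import math

# EM for the two-component exponential mixture p*lam*e^{-lam*y} + (1-p)*mu*e^{-mu*y}.
ys = [0.11, 0.25, 0.07, 1.90, 2.40, 0.30, 3.10, 0.15, 1.70, 0.45]

def log_lik(p, lam, mu):
    return sum(math.log(p * lam * math.exp(-lam * y)
                        + (1 - p) * mu * math.exp(-mu * y)) for y in ys)

p, lam, mu = 0.5, 2.0, 0.5
liks = [log_lik(p, lam, mu)]
for _ in range(100):
    # E-step: posterior probability that each Y_i came from the Exp(lam) component.
    w = [p * lam * math.exp(-lam * y)
         / (p * lam * math.exp(-lam * y) + (1 - p) * mu * math.exp(-mu * y)) for y in ys]
    # M-step: the closed-form updates for (p, lam, mu) derived above.
    p = sum(w) / len(ys)
    lam = sum(w) / sum(wi * y for wi, y in zip(w, ys))
    mu = sum(1 - wi for wi in w) / sum((1 - wi) * y for wi, y in zip(w, ys))
    liks.append(log_lik(p, lam, mu))

# Theorem 5.5: the observed-data log-likelihood is nondecreasing along EM iterations.
assert all(b >= a - 1e-10 for a, b in zip(liks, liks[1:]))
```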
Information Calculation in EM
Notation
$\dot{l}_c$: the score function for $\theta$ in the full data;
$\dot{l}_{mis|obs}$: the score for $\theta$ in the conditional distribution of $Y_{mis}$ given $Y_{obs}$;
$\dot{l}_{obs}$: the score for $\theta$ in the distribution of $Y_{obs}$.
Then
$$\dot{l}_c = \dot{l}_{mis|obs} + \dot{l}_{obs}, \qquad Var(\dot{l}_c) = Var(E[\dot{l}_c \mid Y_{obs}]) + E[Var(\dot{l}_c \mid Y_{obs})].$$
Information in the EM algorithm
We obtain the following Louis formula:
$$I_c(\theta) = I_{obs}(\theta) + E[I_{mis|obs}(\theta, Y_{obs})].$$
Thus the complete information is the sum of the observed information and the missing information. One can even show that when the EM converges, the linear convergence rate, i.e., the limit of $(\theta^{(k+1)} - \hat\theta_n)/(\theta^{(k)} - \hat\theta_n)$, approximates $1 - I_{obs}(\hat\theta_n)/I_c(\hat\theta_n)$.
Nonparametric Maximum Likelihood Estimation
First example
Let $X_1, \ldots, X_n$ be i.i.d. random variables with common distribution $F$, where $F$ is any unknown distribution function. The likelihood function for $F$ is given by
$$L_n(F) = \prod_{i=1}^n f(X_i),$$
where $f(X_i)$ is the density function of $F$ with respect to some dominating measure. However, the maximum of $L_n(F)$ does not exist. We instead maximize the alternative function
$$\tilde{L}_n(F) = \prod_{i=1}^n F\{X_i\},$$
where $F\{X_i\}$ denotes the jump size $F(X_i) - F(X_i-)$.
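A quick numerical check (not in the notes): with distinct observations, the modified likelihood depends on $F$ only through the jump sizes $p_i = F\{X_i\}$ with $\sum_i p_i \le 1$, and uniform weights $p_i = 1/n$ (the empirical distribution) beat any perturbed weighting.

```python
import math

# Maximizing prod(p_i) subject to sum(p_i) = 1 gives p_i = 1/n: the empirical CDF.
n = 5
uniform = [1.0 / n] * n

def log_lik(ps):
    return sum(math.log(p) for p in ps)

perturbed = [0.3, 0.25, 0.15, 0.2, 0.1]   # another probability vector on the same points
assert abs(sum(perturbed) - 1.0) < 1e-12
assert log_lik(uniform) > log_lik(perturbed)
```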
Second example
Suppose $X_1, \ldots, X_n$ are i.i.d. $\sim F$ and $Y_1, \ldots, Y_n$ are i.i.d. $\sim G$. We observe i.i.d. pairs $(Z_1, \Delta_1), \ldots, (Z_n, \Delta_n)$, where $Z_i = \min(X_i, Y_i)$ and $\Delta_i = I(X_i \le Y_i)$. We can think of $X_i$ as a survival time and $Y_i$ as a censoring time. Then it is easy to calculate that the joint likelihood for $(Z_i, \Delta_i)$, $i = 1, \ldots, n$, is equal to
$$L_n(F, G) = \prod_{i=1}^n \left\{f(Z_i)(1 - G(Z_i))\right\}^{\Delta_i}\left\{(1 - F(Z_i))g(Z_i)\right\}^{1-\Delta_i}.$$
$L_n(F, G)$ does not have a maximum, so we consider the alternative function
$$\prod_{i=1}^n \left\{F\{Z_i\}(1 - G(Z_i))\right\}^{\Delta_i}\left\{(1 - F(Z_i))G\{Z_i\}\right\}^{1-\Delta_i}.$$
Third example
Suppose $T$ is a survival time and $Z$ is a covariate. Assume $T \mid Z$ has conditional hazard function
$$\lambda(t \mid Z) = \lambda(t)e^{\theta^T Z}.$$
Then the likelihood function from $n$ i.i.d. observations $(T_i, Z_i)$, $i = 1, \ldots, n$, is given by
$$L_n(\theta, \Lambda) = \prod_{i=1}^n \left\{\lambda(T_i)e^{\theta^T Z_i}\exp\{-\Lambda(T_i)e^{\theta^T Z_i}\}\, f(Z_i)\right\}.$$
Note $f(Z_i)$ is not informative about $\theta$ and $\lambda$, so we can discard it from the likelihood function. Again, we replace $\lambda\{T_i\}$ by $\Lambda\{T_i\}$ and obtain a modified function
$$\tilde{L}_n(\theta, \Lambda) = \prod_{i=1}^n \left\{\Lambda\{T_i\}e^{\theta^T Z_i}\exp\{-\Lambda(T_i)e^{\theta^T Z_i}\}\right\}.$$
Letting $p_i = \Lambda\{T_i\}$, we maximize
$$\prod_{i=1}^n p_i\, e^{\theta^T Z_i}\exp\left\{-\Big(\sum_{T_j \le T_i} p_j\Big)e^{\theta^T Z_i}\right\}$$
or its logarithm
$$\sum_{i=1}^n \left[\theta^T Z_i - e^{\theta^T Z_i}\sum_{T_j \le T_i} p_j + \log p_i\right].$$
Fourth example
Suppose $X_1, \ldots, X_n$ are i.i.d. $\sim F$ and $Y_1, \ldots, Y_n$ are i.i.d. $\sim G$. We only observe $(Y_i, \Delta_i)$, where $\Delta_i = I(X_i \le Y_i)$, for $i = 1, \ldots, n$. This is one type of interval-censored data (current status data). The likelihood for the observations is
$$\prod_{i=1}^n \left\{F(Y_i)^{\Delta_i}(1 - F(Y_i))^{1-\Delta_i} g(Y_i)\right\}.$$
To derive the NPMLE for $F$ and $G$, we instead maximize
$$\prod_{i=1}^n \left\{P_i^{\Delta_i}(1 - P_i)^{1-\Delta_i} q_i\right\},$$
subject to the constraints that $\sum_i q_i = 1$ and $0 \le P_i \le 1$ increases with $Y_i$.
Clearly, $\hat{q}_i = 1/n$ (supposing the $Y_i$ are all different). The constrained maximization over the $P_i$ turns out to be solved by the following steps:
(i) Plot the points $(i, \sum_{Y_j \le Y_i} \Delta_j)$, $i = 1, \ldots, n$. This is called the cumulative sum diagram.
(ii) Form $H^*(t)$, the greatest convex minorant of the cumulative sum diagram.
(iii) Let $\hat{P}_i$ be the left derivative of $H^*$ at $i$. Then $(\hat{P}_1, \ldots, \hat{P}_n)$ maximizes the objective function.
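The left derivatives of the greatest convex minorant can be computed with the pool-adjacent-violators algorithm (PAVA); the sketch below (hypothetical data, assuming the $Y_i$ are already sorted) is one standard way to implement step (iii).

```python
# Pool-adjacent-violators: the isotonic fit of the Delta_i equals the vector of
# left derivatives of the greatest convex minorant of the cumulative sum diagram.
def pava(values):
    # Each block carries [sum, count]; merge backwards while monotonicity is violated.
    blocks = []
    for v in values:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

# Delta_i listed in order of increasing Y_i (hypothetical current status data).
deltas = [0, 1, 0, 1, 1]
p_hat = pava(deltas)

assert p_hat == [0.0, 0.5, 0.5, 1.0, 1.0]             # the (1, 0) violation was pooled
assert all(a <= b for a, b in zip(p_hat, p_hat[1:]))  # nondecreasing in Y_i
```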
Summary of NPMLE
The NPMLE generalizes maximum likelihood estimation from parametric models to semiparametric or nonparametric models. We replace the functional parameter by an empirical function with jumps only at the observed data and maximize a modified likelihood function. Both the computation and the asymptotic properties of the NPMLE can be difficult, and vary across specific problems.
Alternative Efficient Estimation
One-step efficient estimation
Start from a strongly consistent estimator of $\theta$, denoted $\tilde\theta_n$, assuming that $\tilde\theta_n - \theta_0 = O_p(n^{-1/2})$. The one-step procedure is a single Newton-Raphson iteration in solving the likelihood score equation:
$$\hat\theta_n = \tilde\theta_n - \left\{\ddot{l}_n(\tilde\theta_n)\right\}^{-1}\dot{l}_n(\tilde\theta_n),$$
where $\dot{l}_n(\theta)$ is the score function and $\ddot{l}_n(\theta)$ is the derivative of $\dot{l}_n(\theta)$.
Result about one-step estimation
Theorem 5.6 Let $l_\theta(X)$ be the log-likelihood function of $\theta$. Assume that there exists a neighborhood of $\theta_0$ in which $|l^{(3)}_\theta(X)| \le F(X)$ with $E[F(X)] < \infty$. Then
$$\sqrt{n}(\hat\theta_n - \theta_0) \rightarrow_d N(0, I(\theta_0)^{-1}),$$
where $I(\theta_0)$ is the Fisher information.
Proof Since $\tilde\theta_n \rightarrow_{a.s.} \theta_0$, we perform a Taylor expansion on the right-hand side of the one-step equation and obtain
$$\hat\theta_n = \tilde\theta_n - \left\{\ddot{l}_n(\tilde\theta_n)\right\}^{-1}\left\{\dot{l}_n(\theta_0) + \ddot{l}_n(\theta^*)(\tilde\theta_n - \theta_0)\right\},$$
where $\theta^*$ is between $\tilde\theta_n$ and $\theta_0$. Thus
$$\hat\theta_n - \theta_0 = \left[I - \left\{\ddot{l}_n(\tilde\theta_n)\right\}^{-1}\ddot{l}_n(\theta^*)\right](\tilde\theta_n - \theta_0) - \left\{\ddot{l}_n(\tilde\theta_n)\right\}^{-1}\dot{l}_n(\theta_0).$$
On the other hand, by the condition that $|l^{(3)}_\theta(X)| \le F(X)$ with $E[F(X)] < \infty$,
$$\frac{1}{n}\ddot{l}_n(\theta^*) \rightarrow_{a.s.} E[\ddot{l}_{\theta_0}(X)], \qquad \frac{1}{n}\ddot{l}_n(\tilde\theta_n) \rightarrow_{a.s.} E[\ddot{l}_{\theta_0}(X)].$$
Hence
$$\hat\theta_n - \theta_0 = o_p(|\tilde\theta_n - \theta_0|) - \left\{E[\ddot{l}_{\theta_0}(X)] + o_p(1)\right\}^{-1}\frac{1}{n}\dot{l}_n(\theta_0),$$
and the conclusion follows from the central limit theorem applied to $\dot{l}_n(\theta_0)/\sqrt{n}$.
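A minimal numerical sketch (not from the notes): for an exponential sample, start from the $\sqrt{n}$-consistent but inefficient estimator $\tilde\lambda_n = \log 2/\text{median}$ and take one Newton step on the score $\dot{l}_n(\lambda) = n/\lambda - \sum_i Y_i$; the one-step estimate already lands much closer to the exact MLE $1/\bar{Y}_n$. The data are made up for illustration.

```python
import statistics

# One-step efficient estimation for the exponential rate lambda:
# score l_n'(lam) = n/lam - sum(Y), second derivative l_n''(lam) = -n/lam^2.
ys = [0.5, 1.2, 0.3, 2.0, 0.8, 1.5]
n = len(ys)
mle = n / sum(ys)                               # exact MLE: 1 / sample mean

init = 0.6931471805599453 / statistics.median(ys)   # log(2)/median: root-n consistent start
score = n / init - sum(ys)
hess = -n / init**2
one_step = init - score / hess                  # a single Newton-Raphson step

assert abs(one_step - mle) < abs(init - mle)    # one step already improves on the start
```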
A slightly different one-step estimation:
$$\hat\theta_n = \tilde\theta_n + I(\tilde\theta_n)^{-1}\,\frac{1}{n}\dot{l}_n(\tilde\theta_n).$$
Other efficient estimation: Bayesian estimation methods (posterior mode, minimax estimator, etc.).
Conclusions
- The maximum likelihood approach provides a natural and simple way of deriving an efficient estimator.
- Other approaches to efficient estimation are possible, such as one-step estimation, Bayesian estimation, etc.
- Generalization from parametric models to semiparametric or nonparametric models: how?
READING MATERIALS: Ferguson, Sections 16-20; Lehmann and Casella, Sections
More informationSTAT 830 Convergence in Distribution
STAT 830 Convergence in Distribution Richard Lockhart Simon Fraser University STAT 830 Fall 2011 Richard Lockhart (Simon Fraser University) STAT 830 Convergence in Distribution STAT 830 Fall 2011 1 / 31
More informationCS 688 Pattern Recognition Lecture 4. Linear Models for Classification
CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(
More informationPrinciple of Data Reduction
Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then
More informationLecture 2 ESTIMATING THE SURVIVAL FUNCTION. One-sample nonparametric methods
Lecture 2 ESTIMATING THE SURVIVAL FUNCTION One-sample nonparametric methods There are commonly three methods for estimating a survivorship function S(t) = P (T > t) without resorting to parametric models:
More informationOverview of Monte Carlo Simulation, Probability Review and Introduction to Matlab
Monte Carlo Simulation: IEOR E4703 Fall 2004 c 2004 by Martin Haugh Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab 1 Overview of Monte Carlo Simulation 1.1 Why use simulation?
More informationGambling and Data Compression
Gambling and Data Compression Gambling. Horse Race Definition The wealth relative S(X) = b(x)o(x) is the factor by which the gambler s wealth grows if horse X wins the race, where b(x) is the fraction
More informationSTATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
More informationWalrasian Demand. u(x) where B(p, w) = {x R n + : p x w}.
Walrasian Demand Econ 2100 Fall 2015 Lecture 5, September 16 Outline 1 Walrasian Demand 2 Properties of Walrasian Demand 3 An Optimization Recipe 4 First and Second Order Conditions Definition Walrasian
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
More informationNonlinear Optimization: Algorithms 3: Interior-point methods
Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris Jean-Philippe.Vert@mines.org Nonlinear optimization c 2006 Jean-Philippe Vert,
More informationStatistics 580 The EM Algorithm Introduction
Statistics 580 The EM Algorithm Introduction The EM algorithm is a very general iterative algorithm for parameter estimation by maximum likelihood when some of the random variables involved are not observed
More informationTHE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok
THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE Alexer Barvinok Papers are available at http://www.math.lsa.umich.edu/ barvinok/papers.html This is a joint work with J.A. Hartigan
More informationAssessing the Relative Power of Structural Break Tests Using a Framework Based on the Approximate Bahadur Slope
Assessing the Relative Power of Structural Break Tests Using a Framework Based on the Approximate Bahadur Slope Dukpa Kim Boston University Pierre Perron Boston University December 4, 2006 THE TESTING
More informationDistribution (Weibull) Fitting
Chapter 550 Distribution (Weibull) Fitting Introduction This procedure estimates the parameters of the exponential, extreme value, logistic, log-logistic, lognormal, normal, and Weibull probability distributions
More information10.2 Series and Convergence
10.2 Series and Convergence Write sums using sigma notation Find the partial sums of series and determine convergence or divergence of infinite series Find the N th partial sums of geometric series and
More informationContinued Fractions and the Euclidean Algorithm
Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction
More informationNotes for a graduate-level course in asymptotics for statisticians. David R. Hunter Penn State University
Notes for a graduate-level course in asymptotics for statisticians David R. Hunter Penn State University June 2014 Contents Preface 1 1 Mathematical and Statistical Preliminaries 3 1.1 Limits and Continuity..............................
More informationNotes on metric spaces
Notes on metric spaces 1 Introduction The purpose of these notes is to quickly review some of the basic concepts from Real Analysis, Metric Spaces and some related results that will be used in this course.
More informationLinda Staub & Alexandros Gekenidis
Seminar in Statistics: Survival Analysis Chapter 2 Kaplan-Meier Survival Curves and the Log- Rank Test Linda Staub & Alexandros Gekenidis March 7th, 2011 1 Review Outcome variable of interest: time until
More informationAdvanced statistical inference. Suhasini Subba Rao Email: suhasini.subbarao@stat.tamu.edu
Advanced statistical inference Suhasini Subba Rao Email: suhasini.subbarao@stat.tamu.edu August 1, 2012 2 Chapter 1 Basic Inference 1.1 A review of results in statistical inference In this section, we
More informationNonlinear Programming Methods.S2 Quadratic Programming
Nonlinear Programming Methods.S2 Quadratic Programming Operations Research Models and Methods Paul A. Jensen and Jonathan F. Bard A linearly constrained optimization problem with a quadratic objective
More informationFrom Binomial Trees to the Black-Scholes Option Pricing Formulas
Lecture 4 From Binomial Trees to the Black-Scholes Option Pricing Formulas In this lecture, we will extend the example in Lecture 2 to a general setting of binomial trees, as an important model for a single
More informationInner Product Spaces
Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and
More informationCITY UNIVERSITY LONDON. BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION
No: CITY UNIVERSITY LONDON BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION ENGINEERING MATHEMATICS 2 (resit) EX2005 Date: August
More informationDate: April 12, 2001. Contents
2 Lagrange Multipliers Date: April 12, 2001 Contents 2.1. Introduction to Lagrange Multipliers......... p. 2 2.2. Enhanced Fritz John Optimality Conditions...... p. 12 2.3. Informative Lagrange Multipliers...........
More information2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
More informationMATH 425, PRACTICE FINAL EXAM SOLUTIONS.
MATH 45, PRACTICE FINAL EXAM SOLUTIONS. Exercise. a Is the operator L defined on smooth functions of x, y by L u := u xx + cosu linear? b Does the answer change if we replace the operator L by the operator
More information1. Prove that the empty set is a subset of every set.
1. Prove that the empty set is a subset of every set. Basic Topology Written by Men-Gen Tsai email: b89902089@ntu.edu.tw Proof: For any element x of the empty set, x is also an element of every set since
More informationNon Parametric Inference
Maura Department of Economics and Finance Università Tor Vergata Outline 1 2 3 Inverse distribution function Theorem: Let U be a uniform random variable on (0, 1). Let X be a continuous random variable
More informationDetekce změn v autoregresních posloupnostech
Nové Hrady 2012 Outline 1 Introduction 2 3 4 Change point problem (retrospective) The data Y 1,..., Y n follow a statistical model, which may change once or several times during the observation period
More informationDifferentiation and Integration
This material is a supplement to Appendix G of Stewart. You should read the appendix, except the last section on complex exponentials, before this material. Differentiation and Integration Suppose we have
More informationIncreasing for all. Convex for all. ( ) Increasing for all (remember that the log function is only defined for ). ( ) Concave for all.
1. Differentiation The first derivative of a function measures by how much changes in reaction to an infinitesimal shift in its argument. The largest the derivative (in absolute value), the faster is evolving.
More informationCHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a
More informationOnline Learning, Stability, and Stochastic Gradient Descent
Online Learning, Stability, and Stochastic Gradient Descent arxiv:1105.4701v3 [cs.lg] 8 Sep 2011 September 9, 2011 Tomaso Poggio, Stephen Voinea, Lorenzo Rosasco CBCL, McGovern Institute, CSAIL, Brain
More information17.3.1 Follow the Perturbed Leader
CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.
More informationWhat is Statistics? Lecture 1. Introduction and probability review. Idea of parametric inference
0. 1. Introduction and probability review 1.1. What is Statistics? What is Statistics? Lecture 1. Introduction and probability review There are many definitions: I will use A set of principle and procedures
More informationM2S1 Lecture Notes. G. A. Young http://www2.imperial.ac.uk/ ayoung
M2S1 Lecture Notes G. A. Young http://www2.imperial.ac.uk/ ayoung September 2011 ii Contents 1 DEFINITIONS, TERMINOLOGY, NOTATION 1 1.1 EVENTS AND THE SAMPLE SPACE......................... 1 1.1.1 OPERATIONS
More informationMath 461 Fall 2006 Test 2 Solutions
Math 461 Fall 2006 Test 2 Solutions Total points: 100. Do all questions. Explain all answers. No notes, books, or electronic devices. 1. [105+5 points] Assume X Exponential(λ). Justify the following two
More informationMicroeconomic Theory: Basic Math Concepts
Microeconomic Theory: Basic Math Concepts Matt Van Essen University of Alabama Van Essen (U of A) Basic Math Concepts 1 / 66 Basic Math Concepts In this lecture we will review some basic mathematical concepts
More informationA Basic Introduction to Missing Data
John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item
More informationUniversal Algorithm for Trading in Stock Market Based on the Method of Calibration
Universal Algorithm for Trading in Stock Market Based on the Method of Calibration Vladimir V yugin Institute for Information Transmission Problems, Russian Academy of Sciences, Bol shoi Karetnyi per.
More informationProbabilistic Methods for Time-Series Analysis
Probabilistic Methods for Time-Series Analysis 2 Contents 1 Analysis of Changepoint Models 1 1.1 Introduction................................ 1 1.1.1 Model and Notation....................... 2 1.1.2 Example:
More informationTesting predictive performance of binary choice models 1
Testing predictive performance of binary choice models 1 Bas Donkers Econometric Institute and Department of Marketing Erasmus University Rotterdam Bertrand Melenberg Department of Econometrics Tilburg
More informationLecture 15 Introduction to Survival Analysis
Lecture 15 Introduction to Survival Analysis BIOST 515 February 26, 2004 BIOST 515, Lecture 15 Background In logistic regression, we were interested in studying how risk factors were associated with presence
More informationMetric Spaces Joseph Muscat 2003 (Last revised May 2009)
1 Distance J Muscat 1 Metric Spaces Joseph Muscat 2003 (Last revised May 2009) (A revised and expanded version of these notes are now published by Springer.) 1 Distance A metric space can be thought of
More informationALMOST COMMON PRIORS 1. INTRODUCTION
ALMOST COMMON PRIORS ZIV HELLMAN ABSTRACT. What happens when priors are not common? We introduce a measure for how far a type space is from having a common prior, which we term prior distance. If a type
More informationSection 6.1 Joint Distribution Functions
Section 6.1 Joint Distribution Functions We often care about more than one random variable at a time. DEFINITION: For any two random variables X and Y the joint cumulative probability distribution function
More informationConvex analysis and profit/cost/support functions
CALIFORNIA INSTITUTE OF TECHNOLOGY Division of the Humanities and Social Sciences Convex analysis and profit/cost/support functions KC Border October 2004 Revised January 2009 Let A be a subset of R m
More informationPractice with Proofs
Practice with Proofs October 6, 2014 Recall the following Definition 0.1. A function f is increasing if for every x, y in the domain of f, x < y = f(x) < f(y) 1. Prove that h(x) = x 3 is increasing, using
More informationCCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York
BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not
More informationAPPLIED MISSING DATA ANALYSIS
APPLIED MISSING DATA ANALYSIS Craig K. Enders Series Editor's Note by Todd D. little THE GUILFORD PRESS New York London Contents 1 An Introduction to Missing Data 1 1.1 Introduction 1 1.2 Chapter Overview
More informationGLM, insurance pricing & big data: paying attention to convergence issues.
GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.
More informationPermanents, Order Statistics, Outliers, and Robustness
Permanents, Order Statistics, Outliers, and Robustness N. BALAKRISHNAN Department of Mathematics and Statistics McMaster University Hamilton, Ontario, Canada L8S 4K bala@mcmaster.ca Received: November
More informationINDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition)
INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) Abstract Indirect inference is a simulation-based method for estimating the parameters of economic models. Its
More information