Maximum Likelihood Estimation

Quantitative Microeconomics
R. Mora, Department of Economics, Universidad Carlos III de Madrid

General Approaches to Parameter Estimation

There are several estimation criteria that produce estimators with good properties:

- Least Squares (OLS or GLS).
- Method of Moments (OLS, GLS, and IV are special cases): if $\theta_0 = g(E(Y))$, then $\hat\theta = g(E_N[y_i])$, where $E_N$ denotes the sample average.
- Maximum Likelihood (ML): it chooses the vector $\hat\theta$ under which the observed sample is most likely.

Basic Setup

Let $\{y_1, y_2, \dots, y_N\}$ be an iid sample from a population with density $f(y; \theta_0)$. We aim to estimate $\theta_0$.

Because of the iid assumption, the joint density of $\{y_1, y_2, \dots, y_N\}$ is simply the product of the individual densities:

$$f(y_1, y_2, \dots, y_N; \theta_0) = f(y_1; \theta_0)\, f(y_2; \theta_0) \cdots f(y_N; \theta_0)$$

The likelihood function is the function obtained for a given sample after replacing the true $\theta_0$ by an arbitrary $\theta$:

$$L(\theta) = f(y_1; \theta)\, f(y_2; \theta) \cdots f(y_N; \theta)$$

$L(\theta)$ is a random variable because it depends on the sample.

The Maximum Likelihood Estimator

The maximum likelihood estimator of $\theta_0$, $\hat\theta_{ML}$, is the value of $\theta$ that maximizes the likelihood function $L(\theta)$.

It is more convenient to work with the logarithm of the likelihood function,

$$l(\theta) = \sum_{i=1}^{N} \log f(y_i; \theta)$$

Since the logarithmic transform is monotonic, $\hat\theta_{ML}$ also maximizes $l(\theta)$.

Example: Bernoulli (1/3)

Assume that $Y$ is Bernoulli:

$$y_i = \begin{cases} 1 & \text{with probability } p_0 \\ 0 & \text{with probability } 1 - p_0 \end{cases}$$

The likelihood for observation $i$ is

$$f(y_i; p_0) = \begin{cases} p_0 & \text{if } y_i = 1 \\ 1 - p_0 & \text{if } y_i = 0 \end{cases}$$

Let $n_1$ be the number of observations equal to 1. Then, under iid sampling,

$$L(p) = p^{n_1}(1-p)^{n - n_1}$$

We obtain a different likelihood for each sample:
- With $\{0,1,0,0,0\}$: $L(p) = p\,(1-p)^4$
- With $\{1,0,0,1,1\}$: $L(p) = p^3(1-p)^2$

Example: Bernoulli (2/3)

[Plots of $L(p)$ against $p$ for the two samples $\{0,1,0,0,0\}$ and $\{1,0,0,1,1\}$.]

- With $\{0,1,0,0,0\}$: $\hat p_{ML} = 0.2$
- With $\{1,0,0,1,1\}$: $\hat p_{ML} = 0.6$

Example: Bernoulli (3/3)

The maximum likelihood estimator is the value that maximizes

$$L(p) = p^{n_1}(1-p)^{n - n_1}$$

The same $\hat p_{ML}$ maximizes the logarithm of the likelihood function,

$$l(p) = n_1 \log(p) + (n - n_1)\log(1-p)$$

Setting the first derivative to zero,

$$\frac{\partial l(p)}{\partial p} = 0 \;\Rightarrow\; \frac{n_1}{\hat p_{ML}} = \frac{n - n_1}{1 - \hat p_{ML}} \;\Rightarrow\; \hat p_{ML} = \frac{n_1}{n}$$

- With $\{0,1,0,0,0\}$: $\hat p_{ML} = 1/5 = 0.2$
- With $\{1,0,0,1,1\}$: $\hat p_{ML} = 3/5 = 0.6$
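As a quick numerical cross-check of this closed-form result, the following sketch (an illustration in Python with NumPy/SciPy, not part of the original slides) maximizes the Bernoulli log-likelihood for the sample $\{1,0,0,1,1\}$ and recovers $n_1/n$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sample {1,0,0,1,1} from the slides: n = 5 observations, n1 = 3 ones
y = np.array([1, 0, 0, 1, 1])
n, n1 = y.size, y.sum()

def neg_loglik(p):
    # l(p) = n1*log(p) + (n - n1)*log(1 - p); we minimize its negative
    return -(n1 * np.log(p) + (n - n1) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)    # numerical maximizer, approximately 0.6
print(n1 / n)   # closed-form MLE n1/n = 0.6
```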

Basic Setup: the Normal Linear Regression Model

Let $\{y_1, y_2, \dots, y_N\}$ be an iid sample from $y \mid x \sim N(\beta_0' x, \sigma_0^2)$. We aim to estimate $\theta_0 = (\beta_0, \sigma_0^2)$.

Because of the iid assumption, the joint distribution of $\{y_1, y_2, \dots, y_N\}$ is simply the product of the conditional densities:

$$f(y_1, \dots, y_N \mid x_1, \dots, x_N; \theta_0) = f(y_1 \mid x_1; \theta_0)\, f(y_2 \mid x_2; \theta_0) \cdots f(y_N \mid x_N; \theta_0)$$

Note that $y \mid x \sim N(\beta_0' x, \sigma_0^2)$ means that $\varepsilon \equiv y - \beta_0' x \sim N(0, \sigma_0^2)$. This implies that

$$f_{y \mid x}(y_i \mid x_i; \theta_0) = f_\varepsilon(y_i - \beta_0' x_i; \theta_0)$$

Density of the Error Term

We have $\varepsilon \sim N(0, \sigma_0^2)$, so what is its density $f_\varepsilon(z; \theta_0)$?

1. $\varepsilon \sim N(0, \sigma_0^2) \;\Rightarrow\; \varepsilon / \sigma_0 \sim N(0, 1)$
2. $CDF_\varepsilon(z) \equiv \Pr(\varepsilon \le z) = \Pr\!\left(\frac{\varepsilon}{\sigma_0} \le \frac{z}{\sigma_0}\right)$
3. Hence, $CDF_\varepsilon(z) = \Phi\!\left(\frac{z}{\sigma_0}\right)$
4. The density of a continuous random variable is the first derivative of its CDF:

$$f_\varepsilon(z; \theta_0) = \frac{1}{\sigma_0}\,\phi\!\left(\frac{z}{\sigma_0}\right)$$

Density of the Sample

Since

$$f_\varepsilon(z; \theta_0) = \frac{1}{\sigma_0}\,\phi\!\left(\frac{z}{\sigma_0}\right) \qquad \text{and} \qquad f_{y \mid x}(y_i \mid x_i; \theta_0) = f_\varepsilon(y_i - \beta_0' x_i; \theta_0),$$

and

$$f(y_1, \dots, y_N \mid x_1, \dots, x_N; \theta_0) = f(y_1 \mid x_1; \theta_0)\, f(y_2 \mid x_2; \theta_0) \cdots f(y_N \mid x_N; \theta_0),$$

then we have that

$$f(y_1, \dots, y_N \mid x_1, \dots, x_N; \theta_0) = \prod_{i=1}^{N} \frac{1}{\sigma_0}\,\phi\!\left(\frac{y_i - \beta_0' x_i}{\sigma_0}\right)$$

The Log-Likelihood Function

The likelihood function replaces the true parameter values with free arguments:

$$L(\beta, \sigma) = \prod_{i=1}^{N} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - \beta' x_i}{\sigma}\right)$$

Taking the log makes the problem easier:

$$\log L(\beta, \sigma) = \sum_{i=1}^{N} \left\{ \log\frac{1}{\sigma} + \log \phi\!\left(\frac{y_i - \beta' x_i}{\sigma}\right) \right\}$$

and given that $\phi\!\left(\frac{y_i - \beta' x_i}{\sigma}\right) = (2\pi)^{-\frac{1}{2}} \exp\!\left[-\frac{1}{2}\left(\frac{y_i - \beta' x_i}{\sigma}\right)^2\right]$, we have that

$$\log L(\beta, \sigma) = N \log\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\sum_{i=1}^{N} \left(\frac{y_i - \beta' x_i}{\sigma}\right)^2$$

The ML Estimator: First-Order Conditions

With respect to $\beta$:

$$\frac{1}{\hat\sigma^2}\sum_{i=1}^{N} x_i\,(y_i - \hat\beta' x_i) = 0,$$

which implies

$$\sum_{i=1}^{N} x_i\,(y_i - \hat\beta' x_i) = 0$$

With respect to $\sigma$, the first-order condition implies

$$\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat\beta' x_i)^2$$

The MLE of $\beta$ is exactly the same estimator as OLS; $\hat\sigma^2 = \frac{N-1}{N}\,s^2$ is biased, but the bias disappears as $N$ increases.
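To illustrate the equivalence with OLS, here is a minimal simulation sketch (hypothetical data, Python with NumPy/SciPy, not from the slides) that maximizes the normal log-likelihood numerically and compares the resulting $\hat\beta$ with the OLS coefficients:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 200
x = np.column_stack([np.ones(N), rng.normal(size=N)])      # intercept + one regressor
beta_true, sigma_true = np.array([1.0, 2.0]), 0.5
y = x @ beta_true + rng.normal(scale=sigma_true, size=N)

def neg_loglik(theta):
    beta, sigma = theta[:2], np.exp(theta[2])               # parameterize sigma > 0
    resid = y - x @ beta
    # negative of: -N*log(sqrt(2*pi*sigma^2)) - (1/2) * sum((resid/sigma)^2)
    return 0.5 * N * np.log(2 * np.pi * sigma**2) + 0.5 * np.sum(resid**2) / sigma**2

res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]
print(res.x[:2])                                            # ML estimate of beta
print(beta_ols)                                             # OLS estimate: numerically the same
print(np.exp(2 * res.x[2]), np.mean((y - x @ beta_ols)**2)) # ML sigma^2 equals (1/N)*SSR
```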

Computing the MLE

ML estimates are often easy to compute, as in the two previous examples. Sometimes, however, there is no algebraic solution to the maximization problem. It is then necessary to use some sort of numerical maximization procedure.

Numerical Maximization Procedures: Newton's Method

- Start with an initial value $\hat\theta_0$.
- At each iteration, update $\hat\theta_{j+1} = \hat\theta_j - H^{-1} g$, where $g$ is the first derivative of the log-likelihood (the gradient) and $H$ is the second derivative (the Hessian), both evaluated at $\hat\theta_j$.
- Check whether convergence has been reached.

The update answers the question: which step $\Delta\hat\theta$ increases the most the quadratic Taylor approximation of $L(\hat\theta + \Delta\hat\theta)$,

$$L(\hat\theta + \Delta\hat\theta) \approx L(\hat\theta) + g(\hat\theta)\,\Delta\hat\theta + \frac{1}{2} H(\hat\theta)\,\Delta\hat\theta^2\,?$$
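As a concrete sketch (hypothetical function name, not from the slides), the code below applies Newton's method to the Bernoulli log-likelihood, for which the gradient and Hessian are available in closed form:

```python
def newton_bernoulli(n1, n, p0=0.5, tol=1e-10, max_iter=50):
    """Newton iterations for l(p) = n1*log(p) + (n - n1)*log(1 - p)."""
    p = p0
    for _ in range(max_iter):
        g = n1 / p - (n - n1) / (1 - p)              # gradient (score)
        H = -n1 / p**2 - (n - n1) / (1 - p)**2       # Hessian (negative for 0 < p < 1)
        step = -g / H                                # theta_{j+1} = theta_j - H^{-1} g
        p += step
        if abs(step) < tol:                          # convergence check
            break
    return p

print(newton_bernoulli(3, 5))   # converges to n1/n = 0.6
```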

Quasi-Newton Methods

Newton's method will not work well when the Hessian is not negative definite. In such cases, one popular way to obtain the MLE is to replace the Hessian with a matrix that is always negative definite. These approaches are referred to as quasi-Newton algorithms. gretl uses one of them: the BFGS algorithm (Broyden, Fletcher, Goldfarb and Shanno).
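For illustration only (this is not the gretl implementation), SciPy's optimizer also offers BFGS. The sketch below reuses the Bernoulli example, supplying only the objective and its gradient and letting BFGS build its own curvature approximation; the logit reparameterization, added here to keep $p$ inside $(0,1)$, is my own assumption rather than part of the slides:

```python
import numpy as np
from scipy.optimize import minimize

n1, n = 3, 5   # Bernoulli sample {1,0,0,1,1} again

def neg_loglik(t):
    p = 1.0 / (1.0 + np.exp(-t[0]))        # logit reparameterization keeps 0 < p < 1
    return -(n1 * np.log(p) + (n - n1) * np.log(1 - p))

def neg_grad(t):
    p = 1.0 / (1.0 + np.exp(-t[0]))
    return np.array([-(n1 - n * p)])       # chain rule: dl/dt = n1 - n*p

res = minimize(neg_loglik, x0=np.array([0.0]), jac=neg_grad, method="BFGS")
print(1.0 / (1.0 + np.exp(-res.x[0])))     # p_hat, approximately 0.6
```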

Consistency

Assumptions:
1. Finite-sample identification: $l(\theta)$ takes different values for different $\theta$.
2. Sampling: a law of large numbers applies to $\frac{1}{n}\sum_i l_i(\theta)$.
3. Asymptotic identification: maximizing $l(\theta)$ provides a unique way to determine the parameter in the limit as the sample size tends to infinity.

Under these conditions, the ML estimator is consistent: $\operatorname{plim}\hat\theta_{ML} = \theta_0$.

Identification

These are the crucial assumptions needed to exploit the fact that the expected log-likelihood attains its maximum at the true value $\theta_0$.

If these conditions did not hold, there would be some value $\theta_1$ such that $\theta_0$ and $\theta_1$ generate an identical distribution of the observable data. Then we would not be able to distinguish between these two parameters even with an infinite amount of data. We then say that these parameters are observationally equivalent and that the model is not identified.

Asymptotic Normality

Assumptions:
1. Consistency.
2. $l(\theta)$ is differentiable and attains an interior maximum.
3. A CLT can be applied to the gradient.

Under these conditions, the ML estimator is asymptotically normal:

$$n^{1/2}(\hat\theta - \theta_0) \to N(0, \Sigma) \quad \text{as } n \to \infty,$$

where $\Sigma = \left(-\operatorname{plim}\frac{1}{n}\sum_i H_i\right)^{-1}$.
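A small Monte Carlo sketch (hypothetical, again using the Bernoulli MLE $\hat p = \bar y$, not from the slides) illustrates the result: the standardized estimator $\sqrt{n}(\hat p - p_0)$ has mean roughly 0 and variance roughly $p_0(1-p_0)$, the inverse information:

```python
import numpy as np

rng = np.random.default_rng(0)
p0, n, reps = 0.3, 500, 2000
p_hat = rng.binomial(n, p0, size=reps) / n     # 2000 replications of the MLE p_hat = mean(y)
z = np.sqrt(n) * (p_hat - p0)                  # standardized estimator
print(z.mean(), z.var())                       # roughly 0 and p0*(1 - p0) = 0.21
```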

Asymptotic Efficiency and Variance Estimation

If $l(\theta)$ is differentiable and attains an interior maximum, the MLE must be at least as asymptotically efficient as any other consistent estimator that is asymptotically unbiased.

Consistent estimators of the variance-covariance matrix $\Sigma$:
- Empirical Hessian: $\widehat{\operatorname{var}}_H(\hat\theta) = \left[-\frac{1}{n}\sum_i H_i(\hat\theta)\right]^{-1}$
- BHHH (outer product of the gradients): $\widehat{\operatorname{var}}_{BHHH}(\hat\theta) = \left[\frac{1}{n}\sum_i g_i(\hat\theta)\,g_i(\hat\theta)^T\right]^{-1}$
- The sandwich estimator: valid even if the model is misspecified.
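A minimal sketch (hypothetical, Bernoulli example again) of how such estimators are built from per-observation scores and Hessians; note the code computes the variance of $\hat\theta$ itself (the $\Sigma$ estimates divided by $n$), and for this model the Hessian-based, BHHH, and sandwich versions all coincide with $\hat p(1-\hat p)/n$ at the MLE:

```python
import numpy as np

y = np.array([1, 0, 0, 1, 1])
n = y.size
p_hat = y.mean()                                          # MLE

g = y / p_hat - (1 - y) / (1 - p_hat)                     # per-observation scores g_i(p_hat)
H = -y / p_hat**2 - (1 - y) / (1 - p_hat)**2              # per-observation Hessians H_i(p_hat)

var_hessian = 1.0 / (-H.sum())                            # [-sum_i H_i]^{-1}
var_bhhh = 1.0 / np.sum(g**2)                             # [sum_i g_i g_i']^{-1}
var_sandwich = var_hessian * np.sum(g**2) * var_hessian   # H^{-1} (sum_i g_i g_i') H^{-1}

print(var_hessian, var_bhhh, var_sandwich)                # all equal p_hat*(1 - p_hat)/n = 0.048
```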

Summary

ML estimates are the values that maximize the likelihood function. Under general assumptions, ML is consistent, asymptotically normal, and asymptotically efficient.