MSc. Econ: MATHEMATICAL STATISTICS, 1995

MAXIMUM-LIKELIHOOD ESTIMATION


The General Theory of M-L Estimation

In order to derive an M-L estimator, we are bound to make an assumption about the functional form of the distribution which generates the data. However, the assumption can often be varied without affecting the form of the M-L estimator; and the general theory of maximum-likelihood estimation can be developed without reference to a specific distribution. In fact, the M-L method is of such generality that it provides a model for most other methods of estimation, for the other methods tend to generate estimators which can be depicted as approximations to the maximum-likelihood estimators, if they are not actually identical to the latter.

In order to reveal the important characteristics of the likelihood estimators, we should investigate the properties of the log-likelihood function itself. Consider the case where θ is the sole parameter of a log-likelihood function log L(y; θ), wherein y = [y_1, ..., y_T] is a vector of sample elements. In seeking to estimate the parameter, we regard θ as an argument of the function, whilst the elements of y are considered to be fixed. However, in analysing the statistical properties of the function, we restore the random character to the sample elements. The randomness is conveyed to the maximising value θ̂, which thereby acquires a distribution.

A fundamental result is that, as the sample size increases, the likelihood function divided by the sample size tends to stabilise, in the sense that it converges in probability, at every point in its domain, to a constant function. In the process, the distribution of θ̂ becomes increasingly concentrated in the vicinity of the true parameter value θ₀. This accounts for the consistency of maximum-likelihood estimation.

To demonstrate the convergence of the log-likelihood function, we shall assume, as before, that the elements of y = [y_1, ..., y_T] form a random sample. Then

(1)    L(y; \theta) = \prod_{t=1}^{T} f(y_t; \theta),

and therefore

(2)    \frac{1}{T} \log L(y; \theta) = \frac{1}{T} \sum_{t=1}^{T} \log f(y_t; \theta).

For any value of θ, this represents a sum of independently and identically distributed random variables. Therefore the law of large numbers can be applied to show that

(3)    \operatorname{plim}_{T \to \infty} \frac{1}{T} \log L(y; \theta) = E\{\log f(y_t; \theta)\}.
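
As a numerical illustration of (1)-(3), the following minimal sketch (in Python, using only numpy) maximises the average log-likelihood over a grid of θ values and compares the result with the analytic maximiser. The exponential density f(y; θ) = θ exp(-θ y), the value θ₀ = 2, and the helper avg_log_lik are illustrative assumptions, not part of the notes.

    import numpy as np

    rng = np.random.default_rng(0)
    theta_0 = 2.0                       # assumed true parameter value
    T = 5000                            # sample size
    y = rng.exponential(scale=1.0 / theta_0, size=T)

    def avg_log_lik(theta, y):
        """(1/T) log L(y; theta) for the exponential density f(y; theta) = theta*exp(-theta*y)."""
        return np.mean(np.log(theta) - theta * y)

    grid = np.linspace(0.1, 5.0, 2000)              # candidate values of theta
    values = np.array([avg_log_lik(th, y) for th in grid])
    theta_hat = grid[np.argmax(values)]             # grid maximiser of (2)

    print("grid M-L estimate :", theta_hat)
    print("analytic estimate :", 1.0 / y.mean())    # closed-form maximiser for this density
    print("true value        :", theta_0)

As T grows, the average log-likelihood settles down towards its expectation, and both estimates concentrate around θ₀, in accordance with (3).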

The next step is to demonstrate that E{log L(y; θ₀)} ≥ E{log L(y; θ)}, which is to say that the expected log-likelihood function, to which the sample log-likelihood function converges, is maximised by the true parameter value θ₀.

The first derivative of the log-likelihood function is

(4)    \frac{d \log L(y;\theta)}{d\theta} = \frac{1}{L(y;\theta)} \frac{d L(y;\theta)}{d\theta}.

This is known as the score of the log-likelihood function at θ. Under conditions which allow the derivative and the integral to commute, the derivative of the expectation is the expectation of the derivative. Thus, from (4),

(5)    E\left\{ \frac{d \log L(y;\theta)}{d\theta} \right\} = \int_y \frac{1}{L(y;\theta)} \frac{d L(y;\theta)}{d\theta} L(y;\theta_0) \, dy,

where θ₀ is the true value of θ and L(y; θ₀) is the probability density function of y. When θ = θ₀, the expression on the RHS simplifies in consequence of the cancellation of L(y; θ) in the denominator with L(y; θ₀) in the numerator. Then we get

(6)    \int_y \frac{d L(y;\theta_0)}{d\theta} \, dy = \frac{d}{d\theta} \int_y L(y;\theta_0) \, dy = 0,

where the final equality follows from the fact that the integral is unity, which implies that its derivative is zero. Thus

(7)    E\left\{ \frac{d}{d\theta} \frac{\log L(y;\theta_0)}{T} \right\} = \frac{1}{T} E\left\{ \frac{d \log L(y;\theta_0)}{d\theta} \right\} = 0;

and this is a first-order condition which indicates that E{log L(y; θ)/T} is maximised at the true parameter value θ₀. Given that log L(y; θ)/T converges to E{log L(y; θ)/T}, it follows, by some simple analytic arguments, that the maximising value of the former must converge to the maximising value of the latter: which is to say that θ̂ must converge to θ₀.

Now let us differentiate (4) in respect of θ and take expectations. Provided that the order of these operations can be interchanged, then

(8)    \int_y \frac{d^2 L(y;\theta)}{d\theta^2} \, dy = \frac{d^2}{d\theta^2} \int_y L(y;\theta) \, dy = 0,

where the final equality follows in the same way as that of (6).
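
To make the first-order condition (7) concrete, here is a small Monte Carlo check (Python/numpy; the exponential density of the previous sketch, the value θ₀ = 2, and the helper score are again assumed examples). For f(y; θ) = θ exp(-θ y), the score of a single observation is d log f(y; θ)/dθ = 1/θ - y, and its sample average should be close to zero at θ = θ₀ but not at other values of θ.

    import numpy as np

    rng = np.random.default_rng(1)
    theta_0 = 2.0
    y = rng.exponential(scale=1.0 / theta_0, size=200_000)

    def score(theta, y):
        """Sample average of the per-observation score d log f(y; theta)/d theta."""
        return np.mean(1.0 / theta - y)

    print("mean score at theta_0      :", score(theta_0, y))        # approximately 0
    print("mean score at 0.5 * theta_0:", score(0.5 * theta_0, y))  # positive
    print("mean score at 2.0 * theta_0:", score(2.0 * theta_0, y))  # negative

The signs of the average score away from θ₀ indicate that the expected log-likelihood is rising to the left of θ₀ and falling to its right, i.e. that it is maximised at the true parameter value.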

The LHS can be expressed as

(9)    \int_y \frac{d^2 \log L(y;\theta)}{d\theta^2} L(y;\theta) \, dy + \int_y \frac{d \log L(y;\theta)}{d\theta} \frac{d L(y;\theta)}{d\theta} \, dy = 0

and, on substituting from (4) into the second term, this becomes

(10)   \int_y \frac{d^2 \log L(y;\theta)}{d\theta^2} L(y;\theta) \, dy + \int_y \left\{ \frac{d \log L(y;\theta)}{d\theta} \right\}^2 L(y;\theta) \, dy = 0.

Therefore, when θ = θ₀, we get

(11)   E\left\{ -\frac{d^2 \log L(y;\theta_0)}{d\theta^2} \right\} = E\left[ \left\{ \frac{d \log L(y;\theta_0)}{d\theta} \right\}^2 \right] = \Phi.

This measure is known as Fisher's Information. Since (7) indicates that the score d log L(y; θ₀)/dθ has an expected value of zero, it follows that Fisher's Information represents the variance of the score at θ₀.

Clearly, the information measure increases with the size of the sample. To obtain a measure of the information about θ which is contained, on average, in a single observation, we may define φ = Φ/T.

The importance of the information measure Φ is that its inverse provides an approximation to the variance of the maximum-likelihood estimator which becomes increasingly accurate as the sample size increases. Indeed, this is the explanation of the terminology. The famous Cramér-Rao theorem indicates that the inverse of the information measure provides a lower bound for the variance of any unbiased estimator of θ. The fact that the asymptotic variance of the maximum-likelihood estimator attains this bound, as we shall proceed to show, is the proof of the estimator's efficiency.

The Asymptotic Distribution of the M-L Estimator

The asymptotic distribution of the maximum-likelihood estimator is established under the assumption that the log-likelihood function obeys certain regularity conditions. Some of these conditions are not readily explicable without a context. Therefore, instead of itemising the conditions, we shall make an overall assumption which is appropriate to our own purposes but which is stronger than is strictly necessary. We shall imagine that log L(y; θ) is an analytic function which can be represented by a Taylor-series expansion about the point θ₀:

(12)   \log L(\theta) = \log L(\theta_0) + \frac{d \log L(\theta_0)}{d\theta}(\theta - \theta_0) + \frac{1}{2} \frac{d^2 \log L(\theta_0)}{d\theta^2}(\theta - \theta_0)^2 + \frac{1}{3!} \frac{d^3 \log L(\theta_0)}{d\theta^3}(\theta - \theta_0)^3 + \cdots.
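
The information identity (11) is easy to verify numerically for the assumed exponential example (a sketch, not part of the original notes). For f(y; θ) = θ exp(-θ y) we have d log f/dθ = 1/θ - y and d² log f/dθ² = -1/θ², so the information per observation should be φ = 1/θ₀² whichever way it is computed.

    import numpy as np

    rng = np.random.default_rng(2)
    theta_0 = 2.0
    y = rng.exponential(scale=1.0 / theta_0, size=500_000)

    # Variance of the per-observation score, E{(d log f/d theta)^2}, evaluated at theta_0.
    info_from_score = np.mean((1.0 / theta_0 - y) ** 2)

    # Minus the expected second derivative, -E{d^2 log f/d theta^2}, at theta_0.
    info_from_hessian = 1.0 / theta_0 ** 2       # the second derivative is constant here

    print("E{score^2}           :", info_from_score)    # both close to 1/theta_0^2 = 0.25
    print("-E{second derivative}:", info_from_hessian)

With T observations the full information is Φ = Tφ, and the Cramér-Rao bound says that no unbiased estimator of θ can have a variance smaller than Φ⁻¹.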

In pursuing the asymptotic distribution of the maximum-likelihood estimator, we can concentrate upon a quadratic approximation which is based on the first three terms of this expansion. The reason is that, as we have shown, the distribution of the estimator becomes increasingly concentrated in the vicinity of the true parameter value as the size of the sample increases. Therefore the quadratic approximation becomes increasingly accurate for the range of values of θ which we are liable to consider. It follows that, amongst the regularity conditions, there must be at least the provision that the derivatives of the function are finite-valued up to the third order.

The quadratic approximation to the function, taken at the point θ₀, is

(13)   \log L(\theta) = \log L(\theta_0) + \frac{d \log L(\theta_0)}{d\theta}(\theta - \theta_0) + \frac{1}{2} \frac{d^2 \log L(\theta_0)}{d\theta^2}(\theta - \theta_0)^2.

Its derivative with respect to θ is

(14)   \frac{d \log L(\theta)}{d\theta} = \frac{d \log L(\theta_0)}{d\theta} + \frac{d^2 \log L(\theta_0)}{d\theta^2}(\theta - \theta_0).

By setting θ = θ̂ and by using the fact that d log L(θ̂)/dθ = 0, which follows from the definition of the maximum-likelihood estimator, we find that

(15)   \sqrt{T}(\hat{\theta} - \theta_0) = \left\{ -\frac{1}{T} \frac{d^2 \log L(\theta_0)}{d\theta^2} \right\}^{-1} \frac{1}{\sqrt{T}} \frac{d \log L(\theta_0)}{d\theta}.

The argument which establishes the limiting distribution of √T(θ̂ - θ₀) has two strands. First, the law of large numbers is invoked in order to show that

(16)   -\frac{1}{T} \frac{d^2 \log L(y;\theta_0)}{d\theta^2} = -\frac{1}{T} \sum_t \frac{d^2 \log f(y_t;\theta_0)}{d\theta^2}

must converge to its expected value, which is the information measure φ = Φ/T. Next, the central limit theorem is invoked to show that

(17)   \frac{1}{\sqrt{T}} \frac{d \log L(y;\theta_0)}{d\theta} = \frac{1}{\sqrt{T}} \sum_t \frac{d \log f(y_t;\theta_0)}{d\theta}

has a limiting normal distribution which is N(0, φ). This result depends crucially on the fact that Φ = Tφ is the variance of d log L(y; θ₀)/dθ. Thus the limiting distribution of the quantity √T(θ̂ - θ₀) is the normal N(0, φ⁻¹) distribution, since this is the distribution of φ⁻¹ times an N(0, φ) variable.

Within this argument, the device of scaling θ̂ by √T has the purpose of preventing the variance from vanishing, and the distribution from collapsing, as the sample size increases indefinitely.
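
The limiting result for √T(θ̂ - θ₀) can be checked by simulation. This is again a hedged sketch using the assumed exponential example, for which θ̂ = 1/ȳ and φ⁻¹ = θ₀²; the sample size T = 400 and the number of replications are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(3)
    theta_0, T, replications = 2.0, 400, 20_000

    # One M-L estimate per replication: for the exponential model, theta_hat = 1 / sample mean.
    samples = rng.exponential(scale=1.0 / theta_0, size=(replications, T))
    theta_hat = 1.0 / samples.mean(axis=1)

    z = np.sqrt(T) * (theta_hat - theta_0)       # the scaled estimation error of (15)

    print("simulated variance of sqrt(T)(theta_hat - theta_0):", z.var())
    print("asymptotic variance 1/phi = theta_0^2             :", theta_0 ** 2)
    print("simulated mean (should be near 0)                 :", z.mean())

The simulated variance should be close to θ₀² = 4, and a histogram of z would lie close to the N(0, θ₀²) density, in line with the limiting distribution.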

Having completed the argument, we can remove the scale factor; and the conclusion which is to be drawn is the following:

(18)   Let θ̂ be the maximum-likelihood estimator obtained by solving the equation d log L(y; θ)/dθ = 0, and let θ₀ be the true value of the parameter. Then θ̂ is distributed approximately according to the distribution N(θ₀, Φ⁻¹), where Φ⁻¹ is the inverse of Fisher's measure of information.

In establishing these results, we have considered only the case where a single parameter is to be estimated. This has enabled us to proceed without the panoply of vectors and matrices. Nevertheless, nothing essential has been omitted from our arguments. In the case where θ is a vector of k elements, we define the information matrix to be the matrix whose elements are the variances and covariances of the elements of the score vector. Thus the generic element of the information matrix, in the ijth position, is

(19)   E\left\{ -\frac{\partial^2 \log L(\theta_0)}{\partial\theta_i \partial\theta_j} \right\} = E\left\{ \frac{\partial \log L(\theta_0)}{\partial\theta_i} \frac{\partial \log L(\theta_0)}{\partial\theta_j} \right\}.
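
As an illustration of (18) and (19) in a two-parameter setting, the following sketch estimates the information matrix from the outer products of the per-observation scores and uses its inverse to obtain approximate standard errors. The normal model with unknown mean and variance, the parameter values, and all variable names are illustrative assumptions rather than material from the notes.

    import numpy as np

    rng = np.random.default_rng(4)
    mu_0, var_0, T = 1.0, 2.0, 5_000
    y = rng.normal(loc=mu_0, scale=np.sqrt(var_0), size=T)

    # Maximum-likelihood estimates for the N(mu, sigma^2) model.
    mu_hat = y.mean()
    var_hat = np.mean((y - mu_hat) ** 2)

    # Per-observation scores with respect to (mu, sigma^2), evaluated at the estimates.
    s_mu = (y - mu_hat) / var_hat
    s_var = -0.5 / var_hat + 0.5 * (y - mu_hat) ** 2 / var_hat ** 2
    scores = np.column_stack([s_mu, s_var])

    # Sample analogue of the information matrix via the score outer products, as in (19);
    # its inverse approximates the covariance matrix of the estimator, as in (18).
    info = scores.T @ scores
    cov = np.linalg.inv(info)

    print("standard errors      :", np.sqrt(np.diag(cov)))
    print("asymptotic benchmarks:", np.sqrt([var_0 / T, 2 * var_0 ** 2 / T]))

The reported standard errors should be close to the textbook values √(σ²/T) and √(2σ⁴/T) for the mean and the variance respectively.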