Advanced statistical inference. Suhasini Subba Rao
Advanced statistical inference
Suhasini Subba Rao
August 1, 2012
Chapter 1

Basic Inference

1.1 A review of results in statistical inference

In this section we review some results that you came across in STAT611 or equivalent. We review the Cramer-Rao bound and some properties of the likelihood. In later sections we will use the likelihood as a means of parameter estimation (i.e. the maximum likelihood estimator, which you will have encountered in previous courses) and heuristically argue why the Fisher information, which gives the Cramer-Rao bound, is extremely important.

1.1.1 The likelihood function

Let $\{X_i\}$ be iid random variables with probability function (or probability density function) $f(x;\theta)$, where $f$ is known but the parameter $\theta$ is unknown. The likelihood function is defined as

$$L(X;\theta) = \prod_{i=1}^{T} f(X_i;\theta) \tag{1.1}$$

and the log-likelihood is

$$\mathcal{L}(X;\theta) = \log L(X;\theta) = \sum_{i=1}^{T}\log f(X_i;\theta). \tag{1.2}$$

Example

(i) Suppose that $\{X_t\}$ are iid normal random variables with mean $\mu$ and variance $\sigma^2$. The log-likelihood is proportional to

$$\mathcal{L}_T(X;\mu,\sigma^2) \propto -\frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(X_t-\mu)^2.$$
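As an illustration of (1.2), the normal log-likelihood can be evaluated numerically. The following sketch (Python, with a small hypothetical data set) checks that, for fixed variance, the log-likelihood as a function of $\mu$ is largest at the sample mean.

```python
import math

def normal_loglik(x, mu, sigma2):
    """Log-likelihood (1.2) of an iid N(mu, sigma2) sample."""
    T = len(x)
    return (-0.5 * T * math.log(2 * math.pi * sigma2)
            - sum((xt - mu) ** 2 for xt in x) / (2 * sigma2))

x = [1.2, 0.7, 1.9, 1.1, 0.4]   # hypothetical data
xbar = sum(x) / len(x)

# For fixed sigma2, the log-likelihood in mu is maximised at the sample mean:
assert normal_loglik(x, xbar, 1.0) > normal_loglik(x, xbar + 0.5, 1.0)
assert normal_loglik(x, xbar, 1.0) > normal_loglik(x, xbar - 0.5, 1.0)
```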
(ii) Suppose that $\{X_t\}$ are iid binomial random variables, $X_t \sim \text{Bin}(n,\pi)$. Then the log-likelihood is

$$\mathcal{L}_T(X;\pi) = \sum_{t=1}^{T}\left[\log\binom{n}{X_t} + X_t\log\frac{\pi}{1-\pi} + n\log(1-\pi)\right].$$

(iii) Suppose that $\{X_t\}$ are independent binomial random variables, $X_t \sim \text{Bin}(n_t,\pi_t)$, where the regressors $z_t$ influence the mean of $X_t$ through $\pi_t = g(\beta' z_t)$. Then the log-likelihood is

$$\mathcal{L}_T(X;\beta) = \sum_{t=1}^{T}\left[\log\binom{n_t}{X_t} + X_t\log\frac{g(\beta' z_t)}{1-g(\beta' z_t)} + n_t\log\bigl(1-g(\beta' z_t)\bigr)\right].$$

(iv) Suppose that $\{Y_t\}$ are independent exponential random variables with density $\theta^{-1}\exp(-y/\theta)$. The log-likelihood is

$$\mathcal{L}_T(Y;\theta) = -T\log\theta - \frac{1}{\theta}\sum_{t=1}^{T}Y_t.$$

(v) A generalisation of the exponential distribution which gives more freedom in the shape of the distribution is the Weibull. Suppose that $\{Y_t\}$ are independent Weibull random variables with density $\frac{\alpha y^{\alpha-1}}{\theta^{\alpha}}\exp\bigl(-(y/\theta)^{\alpha}\bigr)$, where $\theta,\alpha>0$ (in the case $\alpha=1$ we recover the regular exponential) and $y$ is defined over the positive real line. The log-likelihood is

$$\mathcal{L}_T(Y;\alpha,\theta) = \sum_{t=1}^{T}\left[\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right].$$

In the case that $\alpha$ is known but $\theta$ is unknown, the log-likelihood is proportional to

$$\mathcal{L}_T(Y;\theta) \propto \sum_{t=1}^{T}\left[-\alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right].$$

1.1.2 Bounds for the variance of an unbiased estimator

We require the following assumptions, often called the regularity assumptions. We state the assumptions and results for scalar $\theta$, but they can easily be extended to the case that $\theta$ is a vector.

Assumption 1.1.1 (Regularity conditions) Let $L_T$ be the likelihood with true parameter $\theta$.

(i) $\int \frac{\partial\log L_T(x;\theta)}{\partial\theta}\,L_T(x;\theta)\,dx = 0$; for iid data this is equivalent to $\int \frac{\partial\log f(x;\theta)}{\partial\theta}\,f(x;\theta)\,dx = 0$.
(ii) $\frac{\partial}{\partial\theta}\int L_T(x;\theta)\,dx = \int\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$.

(iii) $\frac{\partial}{\partial\theta}\int g(x)L_T(x;\theta)\,dx = \int g(x)\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx$, where $g$ is any function which is not a function of $\theta$ (for example an estimator of $\theta$).

(iv) $E\bigl(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\bigr)^2 > 0$.

Theorem (The Cramer-Rao bound) Let $\tilde{\theta}(X)$ be an unbiased estimator of $\theta$, and suppose the likelihood $L_T(X;\theta)$ satisfies the regularity conditions in Assumption 1.1.1. Then we have

$$\operatorname{var}\bigl(\tilde{\theta}(X)\bigr) \ge \left[E\left(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\right)^2\right]^{-1}.$$

PROOF. Recall that $\tilde{\theta}(X)$ is an unbiased estimator of $\theta$, therefore $\int \tilde{\theta}(x)L_T(x;\theta)\,dx = \theta$. Differentiating both sides with respect to $\theta$ gives

$$\int \tilde{\theta}(x)\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 1.$$

Since $\int \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$, we have

$$\int \bigl(\tilde{\theta}(x)-\theta\bigr)\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 1.$$

Multiplying and dividing by $L_T(x;\theta)$ gives

$$\int \bigl(\tilde{\theta}(x)-\theta\bigr)\frac{1}{L_T(x;\theta)}\frac{\partial L_T(x;\theta)}{\partial\theta}\,L_T(x;\theta)\,dx = 1,$$

hence, since $L_T(x;\theta)$ is the density of $X$, we have

$$E\left[\bigl(\tilde{\theta}(X)-\theta\bigr)\frac{\partial\log L_T(X;\theta)}{\partial\theta}\right] = 1.$$

Recalling that the Cauchy-Schwarz inequality is $E(UV) \le (EU^2)^{1/2}(EV^2)^{1/2}$, where equality arises only if $U = aV + b$ with $a$ and $b$ constants, and applying it to the above, we have

$$1 \le \operatorname{var}\bigl(\tilde{\theta}(X)\bigr)\,E\left(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\right)^2,$$

thus giving us the Cramer-Rao inequality. Finally we need to prove that

$$E\left(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\right)^2 = -E\left(\frac{\partial^2\log L_T(X;\theta)}{\partial\theta^2}\right).$$

To prove this result we use the fact that $L_T$ is a density, so that $\int L_T(x;\theta)\,dx = 1$.
Now differentiating the above with respect to $\theta$ gives $\int \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$. By using Assumption 1.1.1(ii) we have

$$\int \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0 \;\Rightarrow\; \int \frac{\partial\log L_T(x;\theta)}{\partial\theta}\,L_T(x;\theta)\,dx = 0.$$

Differentiating again with respect to $\theta$ and taking the derivative inside gives

$$\int \frac{\partial^2\log L_T(x;\theta)}{\partial\theta^2}\,L_T(x;\theta)\,dx + \int \frac{\partial\log L_T(x;\theta)}{\partial\theta}\,\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$$
$$\int \frac{\partial^2\log L_T(x;\theta)}{\partial\theta^2}\,L_T(x;\theta)\,dx + \int \frac{\partial\log L_T(x;\theta)}{\partial\theta}\,\frac{1}{L_T(x;\theta)}\frac{\partial L_T(x;\theta)}{\partial\theta}\,L_T(x;\theta)\,dx = 0$$
$$\int \frac{\partial^2\log L_T(x;\theta)}{\partial\theta^2}\,L_T(x;\theta)\,dx + \int \left(\frac{\partial\log L_T(x;\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\,dx = 0.$$

Thus

$$E\left(\frac{\partial^2\log L_T(X;\theta)}{\partial\theta^2}\right) = -E\left(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\right)^2,$$

which gives us the required result.

Corollary (Estimators which attain the C-R bound) Suppose Assumption 1.1.1 is satisfied. Then the estimator $\hat{\theta}(X)$ attains the C-R bound only if it can be written as

$$\hat{\theta}(X) = a(\theta) + b(\theta)\frac{\partial\log L_T(X;\theta)}{\partial\theta}$$

for some functions $a$ and $b$.

PROOF. The proof is clear: it follows from the case in which the Cauchy-Schwarz inequality in the derivation of the C-R bound is an actual equality.

We mention that there exist some well-known distributions which do not satisfy Assumption 1.1.1. These are non-regular distributions. A classical example of a distribution which violates this assumption is the uniform distribution $f(x;\theta) = 1/\theta$ for $x\in[0,\theta]$ and zero elsewhere. Other examples include distributions where the support of the distribution is a function of the parameter. The Cramer-Rao lower bound does not hold (indeed need not even exist) for such distributions, and below we will show why.

Example (The classical example of the uniform) Let us consider the uniform distribution, which has the density $f(x;\theta) = \theta^{-1}I_{[0,\theta]}(x)$. Given the iid uniform
random variables $\{X_t\}$, the likelihood (it is easier to study the likelihood rather than the log-likelihood) is

$$L_T(X_T;\theta) = \frac{1}{\theta^T}\prod_{t=1}^{T} I_{[0,\theta]}(X_t).$$

Since the support of the density involves the unknown parameter, the derivative of $\log L_T(X_T;\theta)$ is not well defined: what is the derivative of $\log I_{[0,\theta]}(X_t)$ with respect to $\theta$? Observe that $\log 0$ is not defined and the derivative at $\theta = X_t$ does not exist, so Assumption 1.1.1(ii) is not satisfied. This is a classical example of a density which does not satisfy the regularity conditions. This means that the inverse of the Fisher information does not give a lower bound for the variance of an estimator. In fact, using $L_T(X_T;\theta)$, the maximum likelihood estimator of $\theta$ is $\hat{\theta}_T = \max_{1\le t\le T} X_t$ (you can see this by making a plot of $L_T(X_T;\theta)$ against $\theta$). It is well known that the distribution of $\max_{1\le t\le T}X_t$ is

$$P\Bigl(\max_{1\le t\le T}X_t \le x\Bigr) = P(X_1\le x,\ldots,X_T\le x) = \prod_{t=1}^{T}P(X_t\le x) = \left(\frac{x}{\theta}\right)^{T},$$

and the density of $\max_{1\le t\le T}X_t$ is $f_{\hat{\theta}_T}(x) = Tx^{T-1}/\theta^{T}$ for $x\in[0,\theta]$.

Exercise: Find the variance of $\hat{\theta}_T$ defined above.

Often we want to estimate a function of $\theta$, say $\tau(\theta)$. The following corollary is a small generalisation of the Cramer-Rao bound.

Corollary Suppose the regularity conditions (Assumption 1.1.1) are satisfied and $T(X)$ is an unbiased estimator of $\tau(\theta)$. Then we have

$$\operatorname{var}\bigl(T(X)\bigr) \ge \tau'(\theta)^2\left[E\left(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\right)^2\right]^{-1}.$$

We now define the notion of sufficiency, which gives us the ingredients for constructing a good estimator (see also Davison).

Definition (Sufficiency) Suppose that $X = (X_1,\ldots,X_T)$ is a random vector. The statistic $s(X)$ is called a sufficient statistic for the parameter $\theta$ if the conditional distribution of $X$ given $s(X)$ is not a function of $\theta$.

Normally it is extremely hard to obtain the sufficient statistic from its definition. However, the factorisation theorem gives us a way of obtaining it.
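The distribution of the maximum can be checked by simulation. The sketch below (Python, with arbitrarily chosen $\theta$ and $T$) compares the empirical probability $P(\max_t X_t \le x_0)$ with the formula $(x_0/\theta)^T$ derived above.

```python
import random

random.seed(1)
theta, T, reps = 2.0, 5, 20000   # hypothetical parameter and sample size

def theta_hat():
    """MLE of theta for a Uniform[0, theta] sample: the sample maximum."""
    return max(random.uniform(0, theta) for _ in range(T))

x0 = 1.5
empirical = sum(theta_hat() <= x0 for _ in range(reps)) / reps
exact = (x0 / theta) ** T   # P(max <= x0) = (x0/theta)^T
assert abs(empirical - exact) < 0.02
```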
Theorem (The Factorisation Theorem) Suppose that the likelihood function can be factorised as $L_T(X;\theta) = h(X)g(s(X);\theta)$, where $h(X)$ is not a function of $\theta$. Then $s(X)$ is a sufficient statistic for $\theta$.
We see that a sufficient statistic contains all the information about the parameter $\theta$.

Theorem (Rao-Blackwell Theorem) Suppose $s(X)$ is a sufficient statistic and $\tilde{\theta}(X)$ is an unbiased estimator of $\theta$. Then if we define the new unbiased estimator $E[\tilde{\theta}(X)\mid s(X)]$, we have

$$\operatorname{var}\bigl(E[\tilde{\theta}(X)\mid s(X)]\bigr) \le \operatorname{var}\bigl(\tilde{\theta}(X)\bigr).$$

The Rao-Blackwell theorem tells us that estimators with the smallest variance must be a function of the sufficient statistic. Of course, this begs the question: is there a unique estimator with the minimum variance? For this we require completeness of the sufficient statistic; uniqueness immediately follows from completeness.

Definition (Completeness) Let $s(X)$ be a sufficient statistic for $\theta$. Then $s(X)$ is a complete sufficient statistic if, for any function $Z$, $E[Z(s(X))] = 0$ for all $\theta$ implies $Z(t) = 0$ for all $t$.

Theorem (Lehmann-Scheffe Theorem) Suppose that $s(X)$ is a complete sufficient statistic and $\tilde{\theta}(s(X))$ is an unbiased estimator of $\theta$. Then $\tilde{\theta}(s(X))$ is the unique minimum variance unbiased estimator of $\theta$.

The theorems above are theoretical, in the sense that under certain conditions they give a lower bound for the variance of a plausible estimator, and practical in the sense that they tell us that the best estimator should be a function of the sufficient statistic. The natural question to ask is how to construct such estimators. One of the most popular estimators in statistics is the maximum likelihood estimator (mle). The mle of $\theta$ is

$$\hat{\theta}_T = \arg\max_{\theta\in\Theta}\mathcal{L}_T(\theta),$$

where $\Theta$ is the parameter space containing all values of $\theta$ with $\int f(x;\theta)\,dx = 1$. There are two reasons the mle is so widely used: (i) it can be shown for a wide range of probability distributions (including, under certain conditions, the exponential family of distributions, defined below) that the mle is a function of the sufficient statistic, hence the mle is often the minimum variance unbiased estimator; (ii) asymptotically, at least, the mle under certain conditions attains the C-R bound.
Of course one can construct examples where the regularity conditions are not satisfied and the mle is not the optimal estimator; examples include estimation of the range of the uniform distribution, where an estimator can be constructed which has a smaller variance than the mle. But for the vast majority of distributions the mle is optimal. It is also worth mentioning that there can exist biased estimators which have a smaller mean squared error than the mle; this intriguing notion is called super-efficiency, which is beyond this course (see Stoica and Ottesten (1996) for a review).
1.1.3 Additional Notes

We will use various distributions in this course; it would be useful if you compiled a list of these distributions and became familiar with them.

Example (Useful transformations)

Question: The distribution function of the random variable $X_t$ is $F_t(x) = 1-\exp(-\lambda_t x)$.

(i) Give a transformation of $X_t$ such that the transformed variable is uniformly distributed on the interval $[0,1]$.

(ii) Suppose that I observe the independent (but not necessarily identically distributed) random variables $\{X_t\}$, and I want to check whether they have the distribution function $F_t(x) = 1-\exp(-\lambda_t x)$. Using (i), suggest a method for checking this.

Answer:

(i) It is well known that if the random variable $X_t$ has the continuous, invertible distribution function $F_t(x)$, then the transformed random variable $Y_t = F_t(X_t)$ is uniformly distributed on the interval $[0,1]$. To see this, note that the distribution of $Y_t$ can be evaluated as

$$P(Y_t\le y) = P\bigl(F_t(X_t)\le y\bigr) = P\bigl(X_t\le F_t^{-1}(y)\bigr) = F_t\bigl(F_t^{-1}(y)\bigr) = y, \qquad y\in[0,1].$$

Thus, to answer the question, we let $Y_t = 1-\exp(-\lambda_t X_t)$, which has a uniform distribution.

(ii) If we want to check whether $X_t$ follows the distribution $F_t(x) = 1-\exp(-\lambda_t x)$, we can make the transformation $Y_t = 1-\exp(-\lambda_t X_t)$ and use, for example, the Kolmogorov-Smirnov test to check whether $\{Y_t\}$ follows a uniform distribution.

Example

Question: Suppose that $Z$ is a Weibull random variable with density

$$f(x;\varphi,\alpha) = \frac{\alpha}{\varphi}\left(\frac{x}{\varphi}\right)^{\alpha-1}\exp\bigl(-(x/\varphi)^{\alpha}\bigr).$$

Show that $E(Z^r) = \varphi^r\,\Gamma\bigl(1+\frac{r}{\alpha}\bigr)$.

Hint: Use $\int_0^{\infty} x^{a}\exp(-x^{b})\,dx = \frac{1}{b}\Gamma\bigl(\frac{a+1}{b}\bigr)$ for $a,b>0$.

This result may be useful in some of the examples given in this course.
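The check in (ii) can be sketched numerically. The following Python simulation (with hypothetical rates $\lambda_t$) applies the probability integral transform and computes the one-sample Kolmogorov-Smirnov statistic directly, rather than calling a testing library.

```python
import math
import random

random.seed(0)
# Hypothetical rates lambda_t: each observation has its own exponential distribution.
rates = [0.5 + 0.1 * t for t in range(200)]
xs = [random.expovariate(lam) for lam in rates]

# Probability integral transform: Y_t = F_t(X_t) = 1 - exp(-lambda_t * X_t) ~ U[0,1].
ys = sorted(1 - math.exp(-lam * x) for lam, x in zip(rates, xs))
n = len(ys)

# Kolmogorov-Smirnov statistic against the uniform distribution function F(y) = y.
ks = max(max((i + 1) / n - y, y - i / n) for i, y in enumerate(ys))
assert ks < 1.5 * 1.36 / math.sqrt(n)   # comfortably below the 5% critical value
```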
Chapter 2

The Bayesian Cramer-Rao Inequality

2.1 The Bayesian Cramer-Rao inequality

The classical Cramér-Rao inequality is useful for assessing the quality of a given estimator, but from the derivation we can clearly see that it only holds if the estimator is unbiased; no such inequality can be derived for estimators which are biased. This can be a problem, for example, in nonparametric regression, where estimators in general will be biased. How does one assess the estimator in such cases? To answer this question we consider an inequality which is similar to the Cramer-Rao inequality but does not require that the estimator be unbiased, so long as we place a prior on the parameter space. This inequality is known as the Bayesian Cramer-Rao or van Trees inequality.

Suppose $\{X_i\}_{i=1}^{T}$ are random variables with distribution $L_T(X;\theta)$, and let $\tilde{\theta}(X)$ be an estimator of $\theta$. We now "Bayesianise" the set-up by placing a prior distribution on the parameter space $\Theta$; the density of this prior we denote as $\lambda$. Let $E[g(X)\mid\theta] = \int g(x)L_T(x;\theta)\,dx$ and let $E_\lambda$ denote the expectation over the prior density $\lambda$. For example,

$$E_\lambda E\bigl[\tilde{\theta}(X)\mid\theta\bigr] = \int_a^b\int_{\mathbb{R}^T}\tilde{\theta}(x)\,L_T(x;\theta)\lambda(\theta)\,dx\,d\theta.$$

Assumption 2.1.1 $\theta$ is defined over the compact interval $[a,b]$ and $\lambda(x)\to 0$ as $x\to a$ and as $x\to b$, so $\lambda(a)=\lambda(b)=0$.

Theorem Suppose Assumptions 1.1.1 and 2.1.1 hold, and let $\tilde{\theta}(X)$ be an estimator of $\theta$. Then we have

$$E_\lambda E\bigl[(\tilde{\theta}(X)-\theta)^2\mid\theta\bigr] \ge \bigl[E_\lambda I(\theta) + I(\lambda)\bigr]^{-1},$$
where

$$E_\lambda I(\theta) = \int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log L_T(x;\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta \quad\text{and}\quad I(\lambda) = \int_a^b\left(\frac{\partial\log\lambda(\theta)}{\partial\theta}\right)^2\lambda(\theta)\,d\theta.$$

PROOF. We first note that since $\lambda(a) = \lambda(b) = 0$, we have

$$\int_a^b \frac{\partial\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\,d\theta = \Bigl[L_T(x;\theta)\lambda(\theta)\Bigr]_a^b = 0.$$

Therefore, using the above,

$$\int_{\mathbb{R}^T}\tilde{\theta}(x)\int_a^b \frac{\partial\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\,d\theta\,dx = 0. \tag{2.1}$$

Now let us consider $\int_{\mathbb{R}^T}\int_a^b \theta\,\frac{\partial(L_T(x;\theta)\lambda(\theta))}{\partial\theta}\,d\theta\,dx$. By integration by parts we have

$$\int_{\mathbb{R}^T}\int_a^b \theta\,\frac{\partial\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\,d\theta\,dx = \int_{\mathbb{R}^T}\Bigl[\theta L_T(x;\theta)\lambda(\theta)\Bigr]_a^b\,dx - \int_{\mathbb{R}^T}\int_a^b L_T(x;\theta)\lambda(\theta)\,d\theta\,dx = -1. \tag{2.2}$$

Subtracting (2.2) from (2.1) we have

$$\int_{\mathbb{R}^T}\int_a^b \bigl(\tilde{\theta}(x)-\theta\bigr)\frac{\partial\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\,d\theta\,dx = 1.$$

Multiplying and dividing by $L_T(x;\theta)\lambda(\theta)$ gives

$$\int_a^b\int_{\mathbb{R}^T}\bigl(\tilde{\theta}(x)-\theta\bigr)\frac{\partial\log\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\,L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = 1.$$

Now, by using the Cauchy-Schwarz inequality, we have

$$1 \le \underbrace{\int_a^b\int_{\mathbb{R}^T}\bigl(\tilde{\theta}(x)-\theta\bigr)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta}_{E_\lambda E[(\tilde{\theta}(X)-\theta)^2\mid\theta]} \;\cdot\; \int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta.$$

Rearranging the above gives

$$E_\lambda E\bigl[(\tilde{\theta}(X)-\theta)^2\mid\theta\bigr] \ge \left[\int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta\right]^{-1}.$$
Finally we want to show that the denominator on the right-hand side above satisfies

$$\int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = E_\lambda I(\theta) + I(\lambda).$$

Using basic algebra we have

$$\int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log L_T(x;\theta)}{\partial\theta} + \frac{\partial\log\lambda(\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta$$
$$= \underbrace{\int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log L_T(x;\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta}_{E_\lambda I(\theta)} \;+\; 2\int_a^b\int_{\mathbb{R}^T}\frac{\partial\log L_T(x;\theta)}{\partial\theta}\,\frac{\partial\log\lambda(\theta)}{\partial\theta}\,L_T(x;\theta)\lambda(\theta)\,dx\,d\theta \;+\; \int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log\lambda(\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta.$$

For the cross term we note that

$$\int_a^b\int_{\mathbb{R}^T}\frac{\partial\log L_T(x;\theta)}{\partial\theta}\,\frac{\partial\log\lambda(\theta)}{\partial\theta}\,L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = \int_a^b \frac{\partial\log\lambda(\theta)}{\partial\theta}\,\lambda(\theta)\underbrace{\int_{\mathbb{R}^T}\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx}_{=0}\,d\theta = 0,$$

and for the last term, since $\int_{\mathbb{R}^T} L_T(x;\theta)\,dx = 1$,

$$\int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log\lambda(\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = \int_a^b\left(\frac{\partial\log\lambda(\theta)}{\partial\theta}\right)^2\lambda(\theta)\,d\theta = I(\lambda).$$

Therefore the denominator equals $E_\lambda I(\theta) + I(\lambda)$, as required.

We will consider applications of the Bayesian Cramer-Rao bound later, for obtaining lower bounds for nonparametric density estimators.
Chapter 3

The Exponential Family

3.1 The exponential family of distributions

(See also Section 5.2 of Davison.)

It is possible to derive the properties (e.g. mean, variance and maximum likelihood estimators, to be defined properly later on) for every distribution of interest. However, this can be cumbersome, the algebra can be tedious, and we may not see the big picture. Instead, we now consider an umbrella family of distributions which includes several well-known distributions. We will derive a general expression for the mean and variance of such distributions, which will be useful when we consider generalised linear models later in this course, and use these results to show that the maximum likelihood estimator is a function of the sufficient statistic, and thus is the best unbiased estimator under the assumption of completeness. In other words, we show that for this family of distributions the maximum likelihood estimator, which we have encountered many times previously, is indeed the best parameter estimator in terms of minimum variance.

Suppose that the distribution of the random variable $X_t$ can be written in the form

$$f(y;\omega) = \exp\bigl(s(y)\eta(\omega) - b(\omega) + c(y)\bigr). \tag{3.1}$$

If the distribution of $X_t$ (either the probability distribution function for discrete random variables or the probability density function for continuous random variables) has the above representation, then $X_t$ is said to belong to the exponential family of distributions. A large number of well-known distribution functions belong to this family. Hence, by understanding the properties of the exponential family, we can draw conclusions about a large number of distribution functions.

Example

(a) The exponential distribution: $X\sim\text{Exp}(\lambda)$, hence the pdf is $f(y;\lambda) = \lambda\exp(-\lambda y)$, which can be written as

$$\log f(y;\lambda) = -y\lambda + \log\lambda.$$
Therefore $s(y) = y$, $\eta(\lambda) = -\lambda$, $b(\lambda) = -\log\lambda$ and $c(y) = 0$.

(b) The binomial distribution $P(X=y) = \binom{n}{y}\pi^y(1-\pi)^{n-y}$ can be rewritten as

$$\log P(y;\pi) = y\log\frac{\pi}{1-\pi} + n\log(1-\pi) + \log\binom{n}{y}.$$

Therefore $s(y)=y$, $\eta(\pi) = \log\frac{\pi}{1-\pi}$, $b(\pi) = -n\log(1-\pi)$ and $c(y) = \log\binom{n}{y}$.

It should be mentioned that it is straightforward to generalise the exponential family to the case that the parameter is a vector of dimension greater than one. Suppose that $\theta$ is a $p$-dimensional vector. The order-$p$ exponential family consists of the distributions which satisfy

$$f(y;\omega) = \exp\bigl(s(y)'\theta(\omega) - b(\omega) + c(y)\bigr),$$

where $s(y) = (s_1(y),\ldots,s_p(y))'$ with $\{s_i\}$ linearly independent, and $\theta(\omega) = (\theta_1(\omega),\ldots,\theta_p(\omega))'$.

The natural exponential family

If we let $\theta = \eta(\omega)$, where $\eta$ is an invertible function (hence there is a one-to-one correspondence between the space containing $\omega$ and the space containing $\theta$), then we can rewrite (3.1) as

$$f(y;\theta) = \exp\bigl(s(y)\theta - \kappa(\theta) + c(y)\bigr),$$

where $\kappa(\theta) = b(\eta^{-1}(\theta))$. The natural exponential family is the case $s(y) = y$. By transformation we now give examples of distributions in natural form.

(i) The exponential distribution is already in natural exponential form (with natural parameter $\theta = -\lambda$).

(ii) For the binomial distribution we let $\theta = \log\frac{\pi}{1-\pi}$; since $\log\frac{\pi}{1-\pi}$ is invertible, this gives

$$\log f(y;\theta) = y\theta - n\log\bigl(1+\exp(\theta)\bigr) + \log\binom{n}{y}.$$

Hence the parameter of interest, $\pi$, has been transformed; often (later in the course) we fit a model to $\theta$ and transform back to obtain an estimator of $\pi$.

Some properties of the natural exponential family

Distributions which have a natural exponential representation have interesting properties which we now discuss.

Lemma Suppose that $X$ is a random variable which has a natural exponential representation. Then the moment generating function of $X$ is

$$E\bigl(\exp(Xt)\bigr) = \exp\bigl(\kappa(t+\theta)-\kappa(\theta)\bigr).$$

Furthermore, $E(X) = \kappa'(\theta)$ and $\operatorname{var}(X) = \kappa''(\theta)$.
PROOF. Let us suppose that $t$ is sufficiently small such that $f(y;\theta+t)$ is a distribution. The mgf is

$$M_X(t) = E\bigl(\exp(tY)\bigr) = \int \exp(ty)\exp\bigl(\theta y - \kappa(\theta) + c(y)\bigr)\,dy = \exp\bigl(\kappa(\theta+t)-\kappa(\theta)\bigr)\int \exp\bigl((\theta+t)y - \kappa(\theta+t) + c(y)\bigr)\,dy = \exp\bigl(\kappa(\theta+t)-\kappa(\theta)\bigr),$$

since $\int \exp\bigl((\theta+t)y - \kappa(\theta+t) + c(y)\bigr)\,dy = \int f(y;\theta+t)\,dy = 1$. To obtain the moments we recall that $M_X'(0) = E(X)$ and $\operatorname{var}(X) = M_X''(0) - M_X'(0)^2$. Therefore

$$M_X'(t) = \kappa'(\theta+t)\exp\bigl(\kappa(\theta+t)-\kappa(\theta)\bigr), \qquad M_X''(t) = \bigl[\kappa''(\theta+t) + \kappa'(\theta+t)^2\bigr]\exp\bigl(\kappa(\theta+t)-\kappa(\theta)\bigr).$$

Hence $M_X'(0) = \kappa'(\theta)$ and $M_X''(0) = \kappa''(\theta) + \kappa'(\theta)^2$, which gives the result.

Remark The mean and variance of the natural exponential family make obtaining the mle quite simple. We derive this later, but we first observe that since $E(X) = \kappa'(\theta)$, the mean of $X$ is a function of $\theta$, hence we can write $\mu(\theta) = \kappa'(\theta)$. Moreover, since $\operatorname{var}(X) = \kappa''(\theta) > 0$, the derivative of $\mu$, $\mu'(\theta) = \kappa''(\theta)$, is strictly positive. In other words, $\mu(\theta) = \kappa'(\theta)$ is an increasing function of $\theta$. Thus $\mu(\theta)$ is an invertible function: given $\mu(\theta)$, we can uniquely determine $\theta$. This observation will prove useful later when obtaining the mle of $\theta$.
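The identities $E(X) = \kappa'(\theta)$ and $\operatorname{var}(X) = \kappa''(\theta)$ are easy to verify numerically. The sketch below does this for the binomial distribution in natural form, differentiating $\kappa(\theta) = n\log(1+e^{\theta})$ by finite differences and comparing with the familiar binomial mean $n\pi$ and variance $n\pi(1-\pi)$.

```python
import math

# Binomial(n, pi) in natural form: theta = log(pi/(1-pi)), kappa(theta) = n*log(1+e^theta).
n, pi = 10, 0.3
theta = math.log(pi / (1 - pi))

def kappa(t):
    return n * math.log(1 + math.exp(t))

h = 1e-4
kappa1 = (kappa(theta + h) - kappa(theta - h)) / (2 * h)                    # kappa'(theta)
kappa2 = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h**2    # kappa''(theta)

assert abs(kappa1 - n * pi) < 1e-5               # E(X) = kappa'(theta) = n*pi
assert abs(kappa2 - n * pi * (1 - pi)) < 1e-4    # var(X) = kappa''(theta) = n*pi*(1-pi)
```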
Maximum likelihood estimation for the exponential family

Suppose that $\{X_t\}$ are iid random variables which have a natural exponential representation. Then the log-likelihood is

$$\mathcal{L}_T(X;\theta) = \theta\sum_{t=1}^{T}X_t - T\kappa(\theta) + \sum_{t=1}^{T}c(X_t).$$

Hence by using the factorisation theorem we see that the sufficient statistic for $\theta$ is $s(X) = \sum_{t=1}^{T}X_t$. Supposing that the regularity conditions are satisfied, the minimum variance unbiased estimator of $\theta$ should be a function of $s(X)$. We now obtain the maximum likelihood estimator of $\theta$, and derive conditions under which the mle is a function of $s(X)$ (hence, by the Rao-Blackwell theorem and the Lehmann-Scheffe theorem, it is the best estimator). The mle of $\theta$ is

$$\hat{\theta}_T = \arg\max_{\theta\in\Theta}\left\{\theta\sum_{t=1}^{T}X_t - T\kappa(\theta) + \sum_{t=1}^{T}c(X_t)\right\}.$$

The natural way to obtain $\hat{\theta}_T$ is to find the solution of $\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta} = 0$. However, whether $\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_T} = 0$ holds depends on a few conditions. Before we derive these conditions, we first consider the solution of the derivative of $\mathcal{L}_T(X;\theta)$. Differentiating gives

$$\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta} = \sum_{t=1}^{T}X_t - T\kappa'(\theta).$$

Therefore, since $\mu(\theta) = \kappa'(\theta)$ is an invertible function, $\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta} = 0$ when

$$\hat{\theta}_T = \mu^{-1}\left(\frac{1}{T}\sum_{t=1}^{T}X_t\right).$$

Of course, we need to know under what conditions this stationary point is the maximiser over $\Theta$. This really depends on the parameter space $\Theta$.

Definition Let $\Theta$ be the parameter space of $\theta$ and $\mathcal{Y}$ the space of outcomes of the random variable $X$. Let $M = \{\mu = \mu(\theta);\theta\in\Theta\}$ denote the mean space, and let $\bar{\mathcal{Y}}_T = \{\bar{y} = \frac{1}{T}\sum_{t=1}^{T}x_t;\; x_t\in\mathcal{Y}\}$ denote the sample mean space.

Lemma Suppose that $\{X_t\}$ are iid random variables which have a natural exponential representation. If $\bar{\mathcal{Y}}_T \subseteq M$, then

$$\mu^{-1}\left(\frac{1}{T}\sum_{t=1}^{T}X_t\right) = \arg\max_{\theta\in\Theta}\left\{\theta\sum_{t=1}^{T}X_t - T\kappa(\theta) + \sum_{t=1}^{T}c(X_t)\right\}.$$

PROOF. The proof is straightforward, since the first derivative is zero when $\theta = \mu^{-1}(\frac{1}{T}\sum_{t=1}^{T}X_t)$, and this stationary point is the maximiser whenever the sample mean lies in the mean space, i.e. $\bar{\mathcal{Y}}_T \subseteq M$.

Remark (Minimum variance unbiased estimators) Suppose $X_t$ has a distribution in the natural exponential family, the conditions of the above lemma are satisfied, and $s(X)$ is the complete sufficient statistic for $\theta$. Moreover, if $\mu^{-1}(\frac{1}{T}\sum_{t=1}^{T}X_t)$ is an unbiased estimator of $\theta$, then it is the minimum variance unbiased estimator of $\theta$. In general, however, this will not be the case. But by using Slutsky's theorem it can be shown that $\mu^{-1}(\frac{1}{T}\sum_{t=1}^{T}X_t) \xrightarrow{P} \theta$.
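As a concrete instance of $\hat{\theta}_T = \mu^{-1}(\frac{1}{T}\sum_t X_t)$, the sketch below uses the Poisson distribution (also a natural exponential family, with $\kappa(\theta)=e^{\theta}$ and natural parameter $\theta = \log\lambda$; this distribution is not derived above, so take the parametrisation as an assumption) and checks that $\hat{\theta}_T = \log\bar{X}$ is close to $\log\lambda$ for a large simulated sample.

```python
import math
import random

random.seed(2)
# Poisson(lam) in natural form: theta = log(lam), kappa(theta) = e^theta,
# so mu(theta) = kappa'(theta) = e^theta and theta_hat = mu^{-1}(xbar) = log(xbar).
lam, T = 4.0, 5000

def poisson_draw(lam):
    """Knuth's multiplication method, adequate for small lam."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

x = [poisson_draw(lam) for _ in range(T)]
theta_hat = math.log(sum(x) / T)   # mle of the natural parameter
assert abs(theta_hat - math.log(lam)) < 0.05
```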
Remark (Estimating $\omega$) Often we are interested in estimating $\omega$, where $\theta = \eta(\omega)$. Since

$$\frac{\partial\mathcal{L}(X;\theta(\omega))}{\partial\omega} = \frac{\partial\theta(\omega)}{\partial\omega}\left[\sum_{t=1}^{T}X_t - T\kappa'\bigl(\theta(\omega)\bigr)\right],$$

if all conditions regarding the parameter and sample mean spaces are satisfied, then the mle of $\omega$ is

$$\hat{\omega}_T = \eta^{-1}\left(\mu^{-1}\left(\frac{1}{T}\sum_{t=1}^{T}X_t\right)\right).$$

It should be noted that one great advantage of the exponential family of distributions is that the mle is easy to obtain, with explicit expressions! Many of the results above can be generalised to the setting where $\{X_t\}$ are independent but not necessarily identically distributed and there exist regressors $z_t$ which are known to influence the mean of $X_t$. We will revisit this problem when we consider generalised linear models.
Chapter 4

The Maximum Likelihood Estimator

4.1 The maximum likelihood estimator

As illustrated for the exponential family of distributions discussed above, the maximum likelihood estimator of $\theta_0$ (the true parameter) is defined as

$$\hat{\theta}_T = \arg\max_{\theta\in\Theta}\mathcal{L}_T(X;\theta) = \arg\max_{\theta\in\Theta}\mathcal{L}_T(\theta).$$

Often we find that $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_T} = 0$, hence the solution can be obtained by solving the derivative of the log-likelihood (often called the score function). However, if $\theta_0$ lies on or close to the boundary of the parameter space, this will not necessarily be true. Below we consider the sampling properties of $\hat{\theta}_T$ when the true parameter $\theta_0$ lies in the interior of the parameter space $\Theta$.

We note that the maximiser of the likelihood is invariant to invertible transformations of the data. For example, if $X$ has the density $f(\cdot;\theta)$ and we define the transformed random variable $Z = g(X)$, where the function $g$ has an inverse (it is a 1-1 transformation), then it is easy to show that the density of $Z$ is $f\bigl(g^{-1}(z);\theta\bigr)\bigl|\frac{\partial g^{-1}(z)}{\partial z}\bigr|$. Therefore the likelihood of $\{Z_t = g(X_t)\}$ is

$$\prod_{t=1}^{T}f\bigl(g^{-1}(Z_t);\theta\bigr)\left|\frac{\partial g^{-1}(z)}{\partial z}\right|_{z=Z_t} = \prod_{t=1}^{T}f(X_t;\theta)\left|\frac{\partial g^{-1}(z)}{\partial z}\right|_{z=Z_t}.$$

Hence it is proportional (in $\theta$) to the likelihood of $\{X_t\}$, and the maximiser of the likelihood of $\{Z_t = g(X_t)\}$ is the same as the maximiser of the likelihood of $\{X_t\}$.
4.1.1 Evaluating the MLE

Examples

Example $\{X_t\}$ are iid random variables which follow a normal (Gaussian) distribution $N(\mu,\sigma^2)$. The log-likelihood is proportional to

$$\mathcal{L}_T(X;\mu,\sigma^2) \propto -T\log\sigma - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(X_t-\mu)^2.$$

Maximising the above with respect to $\mu$ and $\sigma^2$ gives $\hat{\mu}_T = \bar{X}$ and $\hat{\sigma}^2_T = \frac{1}{T}\sum_{t=1}^{T}(X_t-\bar{X})^2$.

Example Question: $\{Y_t\}$ are iid random variables which follow a Weibull distribution, which has the density $\frac{\alpha y^{\alpha-1}}{\theta^{\alpha}}\exp\bigl(-(y/\theta)^{\alpha}\bigr)$, $\theta,\alpha>0$. Suppose that $\alpha$ is known but $\theta$ is unknown and we need to estimate it. What is the maximum likelihood estimator of $\theta$?

Solution: The log-likelihood of interest is proportional to

$$\mathcal{L}_T(Y;\theta) = \sum_{t=1}^{T}\left[\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right] \propto \sum_{t=1}^{T}\left[-\alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right].$$

The derivative of the log-likelihood with respect to $\theta$ is

$$\frac{\partial\mathcal{L}_T}{\partial\theta} = -\frac{T\alpha}{\theta} + \frac{\alpha}{\theta^{\alpha+1}}\sum_{t=1}^{T}Y_t^{\alpha} = 0.$$

Solving the above gives $\hat{\theta}_T = \bigl(\frac{1}{T}\sum_{t=1}^{T}Y_t^{\alpha}\bigr)^{1/\alpha}$.

Example Notice that if $\alpha$ is given, an explicit solution for the maximiser of the likelihood in the above example can be obtained. Consider instead maximising the likelihood with respect to both $\alpha$ and $\theta$, i.e.

$$\arg\max_{\theta,\alpha}\sum_{t=1}^{T}\left[\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right].$$
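The closed-form Weibull estimator with known $\alpha$ can be sanity-checked by simulation. This sketch (with arbitrarily chosen parameter values) draws Weibull data by inverse transform and applies $\hat{\theta}_T = (\frac{1}{T}\sum_t Y_t^{\alpha})^{1/\alpha}$.

```python
import math
import random

random.seed(3)
alpha, theta, T = 2.0, 1.5, 5000   # hypothetical parameter values

# Inverse-transform Weibull draws: if U ~ U(0,1) then theta*(-log U)^(1/alpha) ~ Weibull.
y = [theta * (-math.log(random.random())) ** (1 / alpha) for _ in range(T)]

# MLE with alpha known: theta_hat = ( (1/T) * sum Y_t^alpha )^(1/alpha)
theta_hat = (sum(yt ** alpha for yt in y) / T) ** (1 / alpha)
assert abs(theta_hat - theta) < 0.1
```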
The derivatives of the log-likelihood are

$$\frac{\partial\mathcal{L}_T}{\partial\theta} = -\frac{T\alpha}{\theta} + \frac{\alpha}{\theta^{\alpha+1}}\sum_{t=1}^{T}Y_t^{\alpha} = 0$$
$$\frac{\partial\mathcal{L}_T}{\partial\alpha} = \frac{T}{\alpha} + \sum_{t=1}^{T}\log Y_t - T\log\theta - \sum_{t=1}^{T}\left(\frac{Y_t}{\theta}\right)^{\alpha}\log\left(\frac{Y_t}{\theta}\right) = 0.$$

It is clear that an explicit expression for the solution of the above does not exist, and we need to find alternative methods for finding a solution. Below we shall describe numerical routines which can be used in the maximisation. In special cases one can use other methods, such as the profile likelihood (we cover this later on).

Numerical routines

In an ideal world, to maximise a likelihood we would consider the derivative of the likelihood, solve $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_T} = 0$, and an explicit expression would exist for the solution. In reality this rarely happens, as we illustrated in the section above. Usually we will be unable to obtain an explicit expression for the mle, and the maximisation has to be done using alternative, numerical methods. Typically it is relatively straightforward to maximise the likelihood of random variables which belong to the exponential family (numerical algorithms sometimes have to be used, but they tend to be fast and attain the global maximum of the likelihood, not just a local maximum). However, the story becomes more complicated if we consider mixtures of exponential family distributions: these do not belong to the exponential family, and can be difficult to maximise using conventional numerical routines. We give an example of such a distribution here.

Let us suppose that $\{X_t\}$ are iid random variables which follow the classical normal mixture distribution

$$f(y;\theta) = pf_1(y;\theta_1) + (1-p)f_2(y;\theta_2),$$

where $f_1$ is the density of the normal with mean $\mu_1$ and variance $\sigma_1^2$, and $f_2$ is the density of the normal with mean $\mu_2$ and variance $\sigma_2^2$. The log-likelihood is

$$\mathcal{L}_T(X;\theta) = \sum_{t=1}^{T}\log\left[\frac{p}{\sqrt{2\pi\sigma_1^2}}\exp\left(-\frac{(X_t-\mu_1)^2}{2\sigma_1^2}\right) + \frac{1-p}{\sqrt{2\pi\sigma_2^2}}\exp\left(-\frac{(X_t-\mu_2)^2}{2\sigma_2^2}\right)\right].$$

Studying the above, it is clear that there is no explicit solution for the maximiser, hence one needs to use a numerical algorithm to maximise this likelihood.
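Although the mixture log-likelihood has no closed-form maximiser, it is cheap to evaluate, which is all a numerical routine needs. A minimal sketch (with hypothetical data showing two clusters) evaluates it at two parameter settings and confirms that components placed near the clusters score higher.

```python
import math

def mixture_loglik(x, p, mu1, s1sq, mu2, s2sq):
    """Two-component normal mixture log-likelihood (no closed-form maximiser)."""
    def phi(v, mu, s2):
        return math.exp(-(v - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return sum(math.log(p * phi(v, mu1, s1sq) + (1 - p) * phi(v, mu2, s2sq)) for v in x)

x = [-1.0, -0.8, 0.1, 2.9, 3.2, 3.1]   # hypothetical data with two visible clusters

# Components near the two clusters score higher than components far away:
good = mixture_loglik(x, 0.5, -0.6, 1.0, 3.0, 1.0)
bad = mixture_loglik(x, 0.5, 10.0, 1.0, -10.0, 1.0)
assert good > bad
```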
We discuss a few such methods below.
The Newton-Raphson routine

The Newton-Raphson routine is the standard method to numerically maximise the likelihood; this can often be done automatically in R by using the R functions optim or nlm. To apply Newton-Raphson, we have to assume that the derivative of the likelihood exists (this is not always the case: think about $\ell_1$-norm based estimators!) and that the maximum lies inside the parameter space, so that $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_T} = 0$. We choose an initial value $\theta_1$ and apply the routine

$$\theta_n = \theta_{n-1} - \left(\frac{\partial^2\mathcal{L}_T(\theta_{n-1})}{\partial\theta^2}\right)^{-1}\frac{\partial\mathcal{L}_T(\theta_{n-1})}{\partial\theta}.$$

Where this routine comes from will be clear from the Taylor expansion of $\frac{\partial\mathcal{L}_T(\theta_{n-1})}{\partial\theta}$ about $\theta_0$ (see Section 4.1.3). If the likelihood has just one global maximum and no local maxima (hence it is concave), then it is quite easy to maximise. If, on the other hand, the likelihood has a few local maxima and the initial value $\theta_1$ is not chosen close enough to the true maximum, then the routine may converge to a local maximum (not good!). In this case it may be a good idea to run the routine several times with several different initial values $\theta_1^{(i)}$, $i\ge 1$; for each convergence value $\hat{\theta}_T^{(i)}$, evaluate the likelihood $\mathcal{L}_T(\hat{\theta}_T^{(i)})$ and select the value which gives the largest likelihood. It is best to avoid these problems by starting with an informed choice of initial value. Implementing a Newton-Raphson routine without any thought can lead to estimators which take an incredibly long time to converge. If one carefully considers the likelihood, one can often shorten the convergence time by rewriting the likelihood and using faster methods (often based on the Newton-Raphson).

Iterative least squares

This is a method that we shall describe later when we consider generalised linear models. As the name suggests, the algorithm has to be iterated; at each step weighted least squares is implemented (see later in the course).

The EM-algorithm

This works by the introduction of dummy variables, which leads to a new "unobserved" likelihood which can easily be maximised.
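The Newton-Raphson update above can be sketched in a few lines. To keep the example verifiable, we apply it to the exponential log-likelihood $\mathcal{L}(\theta) = -T\log\theta - \frac{1}{\theta}\sum_t x_t$, whose maximiser is known in closed form (the sample mean), and check that the iterations converge to it; the data values are hypothetical.

```python
def newton_raphson(score, hessian, theta0, tol=1e-10, max_iter=100):
    """One-parameter Newton-Raphson: theta_n = theta_{n-1} - score/hessian."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Exponential log-likelihood L(theta) = -T*log(theta) - sum(x)/theta:
x = [0.5, 1.2, 0.8, 2.0, 0.3, 1.1]   # hypothetical data
T, s = len(x), sum(x)
score = lambda th: -T / th + s / th ** 2            # first derivative of L
hessian = lambda th: T / th ** 2 - 2 * s / th ** 3  # second derivative of L

assert abs(newton_raphson(score, hessian, theta0=1.0) - s / T) < 1e-8
```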
In fact, one of the simplest methods of maximising the likelihood of mixture distributions is to use the EM-algorithm. We cover this later in the course; see Example 4.23 on page 117 in Davison.

The likelihood for dependent data

We mention that the likelihood for dependent data can also be constructed, though often the estimation and the asymptotic properties can be a lot harder to derive. Using Bayes' rule (i.e.
$P(A_1,A_2,\ldots,A_T) = P(A_1)\prod_{i=2}^{T}P(A_i\mid A_{i-1},\ldots,A_1)$) we have

$$L_T(X;\theta) = f(X_1;\theta)\prod_{t=2}^{T}f(X_t\mid X_{t-1},\ldots,X_1;\theta).$$

Under certain conditions on $\{X_t\}$ the structure of $\prod_{t=2}^{T}f(X_t\mid X_{t-1},\ldots,X_1;\theta)$ can be simplified. For example, if $\{X_t\}$ were Markovian, then $X_t$ conditioned on the past depends only on the most recent past observation, i.e. $f(X_t\mid X_{t-1},\ldots,X_1;\theta) = f(X_t\mid X_{t-1};\theta)$; in this case the above likelihood reduces to

$$L_T(X;\theta) = f(X_1;\theta)\prod_{t=2}^{T}f(X_t\mid X_{t-1};\theta). \tag{4.1}$$

Example A lot of the material we cover in this class will be for independent observations; however, likelihood methods also work for dependent observations. Consider the AR(1) time series

$$X_t = aX_{t-1} + \varepsilon_t,$$

where $\varepsilon_t$ are iid random variables with mean zero. We will assume that $|a|<1$. We see from the above that the observation $X_{t-1}$ has a linear influence on the next observation, and that the process is Markovian: given $X_{t-1}$, the random variable $X_{t-2}$ has no influence on $X_t$ (to see this, consider the distribution function $P(X_t\le x\mid X_{t-1},X_{t-2})$). Therefore, by using (4.1), the likelihood of $\{X_t\}$ is

$$L_T(X;a) = f(X_1;a)\prod_{t=2}^{T}f_\varepsilon(X_t - aX_{t-1}), \tag{4.2}$$

where $f_\varepsilon$ is the density of $\varepsilon$ and $f(X_1;a)$ is the marginal density of $X_1$. This means the likelihood of $\{X_t\}$ depends only on $f_\varepsilon$ and the marginal density of $X_1$. We use $\hat{a}_T = \arg\max_a L_T(X;a)$ as the mle of $a$. Often we ignore the term $f(X_1;a)$, because it is often hard to obtain (try to work it out; it is relatively easy in the Gaussian case), and consider instead what is called the conditional likelihood

$$Q_T(X;a) = \prod_{t=2}^{T}f_\varepsilon(X_t - aX_{t-1}),$$

with $\tilde{a}_T = \arg\max_a Q_T(X;a)$ as the quasi-mle of $a$.

Exercise: What is the conditional likelihood proportional to in the case that $\{\varepsilon_t\}$ are Gaussian random variables with mean zero?

It should be mentioned that often the conditional likelihood is derived as if the errors $\{\varepsilon_t\}$ were Gaussian, even if they are not. This is often called the quasi- or pseudo-likelihood.
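A short simulation illustrates the conditional likelihood in practice. In the Gaussian case, maximising $Q_T(X;a)$ reduces to least squares in $a$ (which partly answers the exercise above); the sketch below simulates an AR(1) with a hypothetical coefficient and checks that the resulting quasi-mle is close to the truth.

```python
import random

random.seed(4)
a, T = 0.6, 5000   # hypothetical AR(1) coefficient and sample size

# Simulate an AR(1) with iid Gaussian errors.
x = [0.0]
for _ in range(T):
    x.append(a * x[-1] + random.gauss(0.0, 1.0))

# With Gaussian errors, maximising the conditional likelihood Q_T(X; a)
# is least squares: a_hat = sum X_t X_{t-1} / sum X_{t-1}^2.
num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
a_hat = num / den
assert abs(a_hat - a) < 0.05
```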
4.1.2 A quick review of the central limit theorem

In this section we will not endeavour to prove the central limit theorem, whose proof is usually based on showing that the characteristic function (a close cousin of the moment generating function) of the average converges to the characteristic function of the normal distribution. However, we will recall the general statement of the CLT and generalisations of it. The purpose of this section is not to lumber you with unnecessary mathematics, but to help you understand when an estimator is close to normal or not.

Lemma (The famous CLT) Let us suppose that $\{X_t\}$ are iid random variables, with $\mu = E(X_t) < \infty$ and $\sigma^2 = \operatorname{var}(X_t) < \infty$. Define $\bar{X} = \frac{1}{T}\sum_{t=1}^{T}X_t$. Then we have

$$\sqrt{T}(\bar{X}-\mu) \xrightarrow{D} N(0,\sigma^2), \quad\text{or alternatively}\quad \bar{X}-\mu \overset{D}{\approx} N\left(0,\frac{\sigma^2}{T}\right).$$

What this means is that if we have a large enough sample size and plot the histogram of several replications of the average, this should be close to normal.

Remark

(i) The above lemma appears to be restricted to just averages. However, it can be used in several different contexts: averages arise in several different situations, not just as the average of the observations. By judicious algebraic manipulations, one can show that several estimators can be rewritten as an average, or approximately as an average. At first appearance, the mle of the Weibull parameters given earlier does not look like an average; however, when we consider general maximum likelihood estimators below, we will show that they can be rewritten approximately as an average, hence the CLT applies to them too.

(ii) The CLT can be extended in several ways: (a) to random variables whose variances are not all the same (i.e. independent but not identically distributed random variables); (b) to dependent random variables (so long as the dependency decays in some way); (c) to not just averages but weighted averages too (so long as the weights behave in a certain way; the weights should be distributed well over all the random variables). I.e.
suppose that {X_t} are iid random variables. Then it is clear that the sum Σ_{t=1}^{10} X_t will never be normal unless the X_t themselves are normal — observe that 10 is fixed! — but it seems plausible that (1/√n) Σ_{t=1}^n sin(2πt/12) X_t is asymptotically normal, despite this not being the sum of iid random variables.
There exist several theorems which one can use to prove normality. But the take-home message is: look at your estimator and ask whether asymptotic normality looks plausible — you can even check it through simulations.

Example (some problem cases). One should think a little before blindly applying the CLT. Suppose that the iid random variables {X_t} follow a t-distribution with 2 degrees of freedom, i.e. the density function is

f(x) = (Γ(3/2)/√(2π)) (1 + x²/2)^{−3/2}.

Let X̄ = (1/n) Σ_{t=1}^n X_t denote the sample mean. It is well known that the mean of the t-distribution with two degrees of freedom exists, but the variance does not (the distribution is too thick-tailed). Thus the assumptions required for the CLT to hold are violated and X̄ is not asymptotically normal (in fact it follows a stable-law distribution). Intuitively this is clear: recall that the chance of outliers for a t-distribution with a small number of degrees of freedom is large. This prevents even averages from being well behaved — there is a non-negligible chance that an average is also too large or too small.

To see why the variance is infinite, study the form of the t-distribution with two degrees of freedom. For the variance to be finite, the tails of the distribution should converge to zero fast enough; in other words, the probability of outliers should not be too large. The tails of this t-distribution behave like f(x) ≈ C x^{−3} for large x (make a plot in Maple to check), thus the second moment satisfies

E(X²) ≥ ∫_M^∞ C x^{−3} x² dx = ∫_M^∞ C x^{−1} dx = ∞

for some C and M, which is clearly not finite! This argument can be made precise.

The Taylor series expansion — the statistician's tool

The Taylor series is used all over the place in statistics and you should be completely fluent with using it. It can be used to prove consistency of an estimator, to establish normality (based on the assumption that averages converge to a normal distribution), to obtain the limiting variance of an estimator, and so on. We start by demonstrating its use for the log-likelihood.
We recall that the mean value theorem in the univariate case states that

f(x) = f(x_0) + (x − x_0) f′(x̄_1),
f(x) = f(x_0) + (x − x_0) f′(x_0) + ((x − x_0)²/2) f″(x̄_2),

where x̄_1 and x̄_2 both lie between x and x_0. In the case that f is a multivariate function, we have

f(x) = f(x_0) + (x − x_0)′ ∇f(x)|_{x = x̄_1},
f(x) = f(x_0) + (x − x_0)′ ∇f(x)|_{x = x_0} + (1/2)(x − x_0)′ ∇²f(x)|_{x = x̄_2} (x − x_0),
where x̄_1 and x̄_2 both lie between x and x_0.

In the case that f(x) is a vector, the mean value theorem does not directly work: strictly speaking we cannot say that f(x) = f(x_0) + (x − x_0)′ ∇f(x)|_{x = x̄_1}, where a single x̄_1 lies between x and x_0. However, it is quite straightforward to overcome this inconvenience. The mean value theorem does hold pointwise, for every element of the vector f(x) = (f_1(x), ..., f_d(x)); that is, for every 1 ≤ i ≤ d we have

f_i(x) = f_i(x_0) + (x − x_0)′ ∇f_i(x)|_{x = x̄_i},

where x̄_i lies between x and x_0. Thus, if ∇f_i(x)|_{x = x̄_i} ≈ ∇f_i(x)|_{x = x_0}, we do have f(x) ≈ f(x_0) + (x − x_0)′ ∇f(x)|_{x = x_0}. We use this below.

Application 1 (an expression for L_T(θ̂_T) − L_T(θ_0) in terms of θ̂_T − θ_0). The expansion of L_T(θ̂_T) about θ_0 (the true parameter) gives

L_T(θ_0) − L_T(θ̂_T) = (∂L_T(θ)/∂θ)|_{θ̂_T} (θ_0 − θ̂_T) + (1/2)(θ_0 − θ̂_T)′ (∂²L_T(θ)/∂θ²)|_{θ̄_T} (θ_0 − θ̂_T),

where θ̄_T lies between θ_0 and θ̂_T. If θ̂_T lies in the interior of the parameter space (this is an extremely important assumption here), then (∂L_T(θ)/∂θ)|_{θ̂_T} = 0. Moreover, if it can be shown that θ̂_T →P θ_0 (we show this in the section below), then under certain conditions on L_T (such as the existence of the third derivative, etc.) it can be shown that

(∂²L_T(θ)/∂θ²)|_{θ̄_T} ≈ E[(∂²L_T(θ)/∂θ²)|_{θ_0}] = −I(θ_0).

Hence the above is roughly

2( L_T(θ̂_T) − L_T(θ_0) ) ≈ (θ̂_T − θ_0)′ I(θ_0) (θ̂_T − θ_0).

Note that in many of the derivations below we will use (∂²L_T(θ)/∂θ²)|_{θ̄_T} ≈ E[(∂²L_T(θ)/∂θ²)|_{θ_0}] = −I(θ_0). But it should be noted that this is only true if (i) θ̂_T →P θ_0 and (ii) (1/T)(∂²L_T(θ)/∂θ²) converges uniformly to its expectation. We consider below another closely related application.
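The quadratic approximation in Application 1 can be checked numerically in a model where everything is explicit. The following Python sketch is not from the notes: it uses the iid exponential model (mean θ, MLE θ̂ = X̄, Fisher information I(θ) = T/θ²); the numerical values θ_0 = 2 and T = 2000 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# iid Exp(theta) sample (mean theta); the MLE is the sample mean and the
# total Fisher information is I(theta) = T / theta^2.
theta0, T = 2.0, 2000
x = rng.exponential(scale=theta0, size=T)

def loglik(theta):
    # log-likelihood L_T(theta) = -T log(theta) - sum(X_t)/theta
    return -T * np.log(theta) - x.sum() / theta

theta_hat = x.mean()
lhs = 2 * (loglik(theta_hat) - loglik(theta0))            # exact LR quantity
rhs = T * (theta_hat - theta0) ** 2 / theta0 ** 2          # quadratic approx.
print(round(lhs, 3), round(rhs, 3))
```

For large T the two numbers are close, as the Taylor expansion predicts; the gap is of relative order |θ̂ − θ_0|/θ_0.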
Application 2 (an expression for θ̂_T − θ_0 in terms of (∂L_T(θ)/∂θ)|_{θ_0}). The expansion of the p-dimensional vector (∂L_T(θ)/∂θ)|_{θ̂_T} pointwise about θ_0 (the true parameter) gives, for each component 1 ≤ i ≤ p,

(∂L_T(θ)/∂θ_i)|_{θ̂_T} = (∂L_T(θ)/∂θ_i)|_{θ_0} + (∂²L_T(θ)/∂θ∂θ_i)|_{θ̄_{i,T}}′ (θ̂_T − θ_0).

Since (∂L_T(θ)/∂θ)|_{θ̂_T} = 0, by using the same argument as in Application 1 we have

(∂L_T(θ)/∂θ)|_{θ_0} ≈ I(θ_0)(θ̂_T − θ_0).

We mention that U_T(θ_0) = (∂L_T(θ)/∂θ)|_{θ_0} is often called the score (or U-statistic), and we see that the asymptotic sampling properties of U_T determine the sampling properties of θ̂_T − θ_0.

Example (the Weibull). Evaluate the second derivative of the likelihood given in Example 4.1.3 and take its expectation, I(θ,α) = −E(∇²L_T), where ∇² denotes the matrix of second derivatives with respect to the parameters α and θ. Exercise: evaluate I(θ,α). Application 2 implies that the maximum likelihood estimators θ̂_T and α̂_T (recall that no explicit expression for them exists) can be written as

( θ̂_T − θ, α̂_T − α )′ ≈ I(θ,α)^{−1} Σ_{t=1}^T ( −α/θ + α Y_t^α/θ^{α+1},  1/α + log Y_t − log θ − (Y_t/θ)^α log(Y_t/θ) )′.

Sampling properties of the maximum likelihood estimator (see also Davison, p. 118)

These proofs will not be examined, but you should have some idea why the theorem below is true.

We have shown that under certain conditions the maximum likelihood estimator can be the minimum variance unbiased estimator (for example, in the case of the exponential family of distributions). However, for finite samples the MLE may not attain the Cramer-Rao lower bound; hence for finite samples var(θ̂_T) > I(θ)^{−1}. It can be shown, though, that asymptotically the variance of the MLE attains the Cramer-Rao bound: for large samples, the variance of the MLE is close to it. We will prove the result in the case that L_T is the log-likelihood of independent, identically distributed random variables. The proof can be generalised to the case of non-identically distributed random variables. We first state sufficient conditions for this to be true.

Assumption 4.1.1 (Regularity Conditions 2). Let {X_t} be iid random variables with density f(x;θ).
(i) Suppose the conditions in Assumption (Regularity Conditions 1) hold.

(ii) (Almost sure uniform convergence; this part is optional.) We have

sup_{θ ∈ Θ} | (1/T) L_T(X;θ) − E[(1/T) L_T(X;θ)] | →a.s. 0.

We mention that directly verifying uniform convergence can be difficult. However, it can be established by showing that the parameter space is compact, together with pointwise convergence of the likelihood to its expectation and almost sure equicontinuity (in probability).

(iii) (Model identifiability.) For every θ ∈ Θ there does not exist another θ̃ ∈ Θ such that f(x;θ) = f(x;θ̃) for all x.

(iv) The parameter space Θ is compact (and finite dimensional).

(v) sup_{θ} E| (1/T) L_T(X;θ) | < ∞.

We require Assumption 4.1.1(ii),(iii) to show consistency and Assumption 4.1.1(i)–(v) to show asymptotic normality.

Theorem. Suppose Assumption 4.1.1(ii),(iii) holds. Let θ_0 be the true parameter and θ̂_T the MLE. Then we have θ̂_T →a.s. θ_0 (consistency).

PROOF. To prove the result we first need to show that the expected log-likelihood is maximised at the true parameter and that this maximum is unique. In other words, we need to show that

E[(1/T) L_T(X;θ)] − E[(1/T) L_T(X;θ_0)] ≤ 0 for all θ ∈ Θ.

To do this, we have

E[(1/T) L_T(X;θ)] − E[(1/T) L_T(X;θ_0)] = ∫ log( f(x;θ)/f(x;θ_0) ) f(x;θ_0) dx = E[ log( f(X;θ)/f(X;θ_0) ) ].

Now by using Jensen's inequality we have

E[ log( f(X;θ)/f(X;θ_0) ) ] ≤ log E[ f(X;θ)/f(X;θ_0) ] = log ∫ f(x;θ) dx = 0.

Thus E[(1/T) L_T(X;θ)] − E[(1/T) L_T(X;θ_0)] ≤ 0. To prove that equality holds only when θ = θ_0, we use the identifiability assumption, Assumption 4.1.1(iii), which states that f(x;θ) = f(x;θ_0) for all x only when θ = θ_0, and no other parameter value gives equality.
Hence E[(1/T) L_T(X;θ)] is uniquely maximised at θ_0.

Finally, we need to show that θ̂_T →a.s. θ_0. By Assumption 4.1.1(ii) (and the LLN) we have, for all θ ∈ Θ, that (1/T) L_T(X;θ) →a.s. ℓ(θ) := E[(1/T) L_T(X;θ)]. Consider the decomposition

E[(1/T) L_T(X;θ_0)] − E[(1/T) L_T(X;θ̂_T)]
= { E[(1/T) L_T(X;θ_0)] − (1/T) L_T(X;θ_0) } + { (1/T) L_T(X;θ_0) − (1/T) L_T(X;θ̂_T) } + { (1/T) L_T(X;θ̂_T) − E[(1/T) L_T(X;θ̂_T)] }.

The left-hand side is non-negative (θ_0 maximises the expected log-likelihood), and the middle term on the right-hand side is non-positive (θ̂_T maximises L_T). Therefore

0 ≤ E[(1/T) L_T(X;θ_0)] − E[(1/T) L_T(X;θ̂_T)] ≤ 2 sup_{θ ∈ Θ} | E[(1/T) L_T(X;θ)] − (1/T) L_T(X;θ) | →a.s. 0,

by Assumption 4.1.1(ii). Since ℓ(θ) has a unique maximum at θ_0, this implies θ̂_T →a.s. θ_0. Hence we have shown consistency of the MLE. We now need to show asymptotic normality.

Theorem. Suppose Assumption 4.1.1 is satisfied.

(i) The score statistic satisfies

(1/√T) (∂L_T(X;θ)/∂θ)|_{θ_0} →D N( 0, E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ] ).   (4.5)

(ii) The MLE satisfies

√T (θ̂_T − θ_0) →D N( 0, E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ]^{−1} ).
(iii) The log-likelihood ratio satisfies

2( L_T(X;θ̂_T) − L_T(X;θ_0) ) →D χ²_p.

PROOF. First we prove (i). We recall that because {X_t} are iid random variables,

(1/√T)(∂L_T(X;θ)/∂θ)|_{θ_0} = (1/√T) Σ_{t=1}^T (∂log f(X_t;θ)/∂θ)|_{θ_0}.

Hence (∂L_T(X;θ)/∂θ)|_{θ_0} is a sum of iid random variables with mean zero and variance var( (∂log f(X_t;θ)/∂θ)|_{θ_0} ). Therefore, by the CLT for iid random variables, we have (4.5).

We use (i) and the Taylor (mean value) theorem to prove (ii). We first note that by the mean value theorem we have

0 = (1/√T)(∂L_T(X;θ)/∂θ)|_{θ̂_T} = (1/√T)(∂L_T(X;θ)/∂θ)|_{θ_0} + √T(θ̂_T − θ_0) (1/T)(∂²L_T(X;θ)/∂θ²)|_{θ̄_T}.   (4.6)

Now it can be shown (because Θ has compact support, θ̂_T − θ_0 →a.s. 0 and the expectation of the third derivative of L_T is bounded) that

(1/T)(∂²L_T(X;θ)/∂θ²)|_{θ̄_T} →P (1/T) E[(∂²L_T(X;θ)/∂θ²)|_{θ_0}] = E[(∂²log f(X;θ)/∂θ²)|_{θ_0}].   (4.7)

Substituting (4.7) into (4.6) gives

√T(θ̂_T − θ_0) = −( (1/T)(∂²L_T(X;θ)/∂θ²)|_{θ̄_T} )^{−1} (1/√T)(∂L_T(X;θ)/∂θ)|_{θ_0}
= −E[(∂²log f(X;θ)/∂θ²)|_{θ_0}]^{−1} (1/√T)(∂L_T(X;θ)/∂θ)|_{θ_0} + o_p(1).

We mention that the proof above is for univariate (∂²L_T(X;θ)/∂θ²)|_{θ̄_T}, but by redoing the above steps pointwise it easily generalises to the multivariate case. Hence, by substituting (4.5) into the above, we have (ii). It is straightforward to prove (iii) by using

2( L_T(X;θ̂_T) − L_T(X;θ_0) ) ≈ (θ̂_T − θ_0)′ I(θ_0) (θ̂_T − θ_0),

together with (i) and the result that if X ~ N(0,Σ), then AX ~ N(0, A′ΣA).

Example (the Weibull). By the earlier example we have

( θ̂_T − θ, α̂_T − α )′ ≈ I(θ,α)^{−1} Σ_{t=1}^T ( −α/θ + α Y_t^α/θ^{α+1},  1/α + log Y_t − log θ − (Y_t/θ)^α log(Y_t/θ) )′.
Now we observe that the right-hand side consists of a sum of iid random variables (this can be viewed as an average). Since the variance of this sum exists (you can show that it is I(θ,α)), the CLT can be applied and we have

( θ̂_T − θ, α̂_T − α )′ →D N( 0, I(θ,α)^{−1} ).

Remark. (i) We recall that for iid random variables the Fisher information for sample size T is

I(θ) = E[ ((∂log L_T(X;θ)/∂θ)|_{θ_0})² ] = T E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ].

Hence, comparing with the above theorem, we see that for iid random variables, so long as the regularity conditions are satisfied, the MLE asymptotically attains the Cramer-Rao bound, even if for finite samples this is not true. Moreover, since

θ̂_T − θ_0 ≈ I(θ_0)^{−1} (∂L_T(θ)/∂θ)|_{θ_0}

and var( (∂L_T(θ)/∂θ)|_{θ_0} ) = I(θ_0) = O(T), it can be seen that θ̂_T − θ_0 = O_p(T^{−1/2}).

(ii) Under suitable conditions a similar result holds true for data which is not iid.

In summary, the MLE under certain regularity conditions tends to have the smallest variance, and for large samples the variance is close to the lower bound, which is the Cramer-Rao bound. In the case that Assumption 4.1.1 is satisfied, the MLE is said to be asymptotically efficient. This means that for finite samples the MLE may not attain the Cramer-Rao bound, but asymptotically it will.

(iii) A simple application of the above theorem is the derivation of the distribution of I(θ_0)^{1/2}(θ̂_T − θ_0). It is clear that by using the theorem we have

I(θ_0)^{1/2}(θ̂_T − θ_0) →D N(0, I_p),

where I_p is the identity matrix, and hence

(θ̂_T − θ_0)′ I(θ_0) (θ̂_T − θ_0) →D χ²_p.

(iv) Note that these results apply when θ_0 lies inside the parameter space Θ; as θ_0 gets closer to the boundary of the parameter space, they break down.
Remark (generalised estimating equations). Closely related to the MLE are generalised estimating equations (GEE), which are related to the score statistic. These are estimators not based on maximising the likelihood; instead they are obtained by equating a score-like statistic (the derivative of a likelihood) to zero and solving for the unknown parameters. Often they are equivalent to the MLE, but they can be adapted to be useful in their own right, and some adaptations will not be the derivative of any likelihood.

The Fisher information (see also Section 4.3, Davison)

Let us return to the Fisher information. We recall that under certain regularity conditions an unbiased estimator θ̃(X) of a parameter θ_0 satisfies var(θ̃(X)) ≥ I(θ_0)^{−1}, where

I(θ) = E[ (∂L_T(θ)/∂θ)² ] = −E[ ∂²L_T(θ)/∂θ² ]

is the Fisher information. Furthermore, under suitable regularity conditions, the MLE will asymptotically attain this bound. It is reasonable to ask how one can interpret this bound.

(i) Situation 1: I(θ_0) = −E[(∂²L_T(θ)/∂θ²)|_{θ_0}] is large (hence the variance of the MLE will be small). This means that the gradient of ∂L_T(θ)/∂θ is steep. Hence, even for small deviations from θ_0, ∂L_T(θ)/∂θ is likely to be far from zero. This means the MLE θ̂_T is likely to lie in a close neighbourhood of θ_0.

(ii) Situation 2: I(θ_0) = −E[(∂²L_T(θ)/∂θ²)|_{θ_0}] is small (hence the variance of the MLE will be large). In this case the gradient of the likelihood ∂L_T(θ)/∂θ is flatter, and hence ∂L_T(θ)/∂θ ≈ 0 over a large neighbourhood of the true parameter θ_0. Therefore the MLE θ̂_T can lie in a large neighbourhood of θ_0.

This is one explanation as to why I(θ) is called the Fisher information: it contains information on how close any estimator of θ can be. Look at the censoring example, Example 4.20, page 112, Davison.
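The claim that the Fisher information controls how concentrated the MLE is can be checked by Monte Carlo. The sketch below is not from the notes: it uses the iid exponential model, for which the MLE is the sample mean and the total information is I(θ) = T/θ², so var(θ̂) should be close to θ²/T. The values θ_0 = 2, T = 200 and the replication count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)

# Monte Carlo check that var(theta_hat) is close to the inverse Fisher
# information.  For an Exp(theta) sample of size T the MLE is the sample
# mean, and I(theta) = T / theta^2, so var(theta_hat) ~ theta^2 / T.
theta0, T, reps = 2.0, 200, 20000
theta_hats = rng.exponential(scale=theta0, size=(reps, T)).mean(axis=1)
print(round(theta_hats.var(), 4), round(theta0 ** 2 / T, 4))
```

The two printed numbers nearly agree; shrinking θ_0 (which increases I(θ)) shrinks the empirical variance in exactly the way Situation 1 describes.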
Chapter 5

Confidence Intervals

5.1 Confidence Intervals and testing

We first summarise the results of the previous section which will be useful in this section. For convenience, we will assume that the likelihood is that of iid random variables whose density is f(x;θ_0) (it is relatively simple to see how this generalises to general likelihoods, of not necessarily iid random variables). Let us suppose that θ_0 is the true parameter that we wish to estimate. Based on the asymptotic normality theorem of the previous chapter we have

√T(θ̂_T − θ_0) →D N( 0, E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ]^{−1} ),   (5.1)

(1/√T)(∂L_T/∂θ)|_{θ = θ_0} →D N( 0, E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ] ),   (5.2)

and

2( L_T(θ̂_T) − L_T(θ_0) ) →D χ²_p,   (5.3)

where p is the number of parameters in the vector θ. Using any of (5.1), (5.2) and (5.3) we can construct a 95% CI for θ_0.

Constructing confidence intervals using the likelihood (see also Section 4.5, Davison)

One of the main reasons that we show asymptotic normality of an estimator (it is usually not possible to derive normality for finite samples) is to construct confidence intervals (CIs) and to test.
In the case that θ_0 is a scalar (a vector of dimension one), it is easy to use (5.1) to obtain

√T E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ]^{1/2} (θ̂_T − θ_0) →D N(0,1).   (5.4)

Based on the above, the 95% CI for θ_0 is

[ θ̂_T − (1/√T) E[((∂log f(X;θ)/∂θ)|_{θ_0})²]^{−1/2} z_{α/2},  θ̂_T + (1/√T) E[((∂log f(X;θ)/∂θ)|_{θ_0})²]^{−1/2} z_{α/2} ].

The above, of course, requires an estimate of the standardised Fisher information E[((∂log f(X;θ)/∂θ)|_{θ_0})²] = −E[(∂²log f(X;θ)/∂θ²)|_{θ_0}]. Usually we evaluate the second derivative of the log-likelihood (1/T) L_T(θ) and replace θ with the estimator θ̂_T.

Exercise: Use (5.2) to construct a CI for θ_0 based on the score.

The CI constructed above works well if θ is a scalar. But beyond dimension one, constructing a CI based on (5.1) and the p-dimensional normal is extremely difficult. More precisely, if θ_0 is a p-dimensional vector, then the analogous version of (5.4) is

√T E[ (∇log f(X;θ)|_{θ_0})(∇log f(X;θ)|_{θ_0})′ ]^{1/2} (θ̂_T − θ_0) →D N(0, I_p),

and using this it is difficult to obtain a CI for θ_0. One way to construct the CI is to "square" θ̂_T − θ_0 and use

T (θ̂_T − θ_0)′ E[ (∇log f(X;θ)|_{θ_0})(∇log f(X;θ)|_{θ_0})′ ] (θ̂_T − θ_0) →D χ²_p.   (5.5)

Based on the above, a 95% CI is

{ θ : T(θ̂_T − θ)′ E[ (∇log f(X;θ)|_{θ_0})(∇log f(X;θ)|_{θ_0})′ ] (θ̂_T − θ) ≤ χ²_p(0.95) }.   (5.6)

Note that, as in the scalar case, this leads to the interval with the smallest length. A disadvantage of (5.6) is that we have to (a) estimate the information matrix and (b) find all θ such that the above holds; this can be quite unwieldy.

An alternative method, which is asymptotically equivalent to the above but removes the need to estimate the information matrix, is to use (5.3). By (5.3), a 100(1−α)% CI for θ_0 is

{ θ : 2( L_T(θ̂_T) − L_T(θ) ) ≤ χ²_p(1−α) }.   (5.7)

The above is not easy to calculate, but it is feasible.
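The two constructions above can be compared numerically in a model where the log-likelihood is explicit. The following Python sketch is not from the notes: it computes both a Wald-type interval of the form (5.6) (scalar case) and a likelihood-ratio interval of the form (5.7) for the mean of an exponential sample; the values θ_0 = 2, T = 50 and the grid are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# 95% CIs for the mean theta of an Exp(theta) sample, two ways:
#  (a) Wald: theta_hat +/- 1.96 * theta_hat / sqrt(T), using the
#      estimated per-observation Fisher information 1/theta_hat^2;
#  (b) likelihood ratio: all theta with 2(L(theta_hat)-L(theta)) <= 3.84.
theta0, T = 2.0, 50
x = rng.exponential(scale=theta0, size=T)
theta_hat = x.mean()

def loglik(theta):
    return -T * np.log(theta) - x.sum() / theta

# (a) Wald interval (symmetric about the MLE by construction)
half = 1.96 * theta_hat / np.sqrt(T)
wald = (theta_hat - half, theta_hat + half)

# (b) likelihood-ratio interval, found on a fine grid of theta values
grid = np.linspace(0.4 * theta_hat, 3.0 * theta_hat, 100000)
inside = grid[2 * (loglik(theta_hat) - loglik(grid)) <= 3.84]
lr = (inside[0], inside[-1])
print("Wald:", wald)
print("LR:  ", lr)   # asymmetric: extends further to the right
```

For the exponential the likelihood falls off faster below the MLE than above it, so the likelihood-ratio interval is right-skewed about θ̂ while the Wald interval is forced to be symmetric.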
Example. In the case that θ_0 is a scalar, the 95% CI based on (5.7) is

{ θ : L_T(θ) ≥ L_T(θ̂_T) − (1/2) χ²_1(0.95) }.

Both 95% CIs, (5.6) and (5.7), will be very close for relatively large sample sizes. However, one advantage of using (5.7) instead of (5.6) is that it is easier to evaluate — there is no need to obtain the second derivative of the likelihood, etc. Another feature which differentiates the CIs in (5.6) and (5.7) is that the CI based on (5.6) is symmetric about θ̂_T (recall that (X̄ − 1.96σ/√T, X̄ + 1.96σ/√T) is symmetric about X̄), whereas for finite sample sizes the CI for θ_0 based on (5.7) need not be symmetric; since this asymmetry reflects the shape of the likelihood, it is a positive advantage of (5.7) over (5.6). A disadvantage of using (5.7) instead of (5.6) is that the CI based on (5.7) may sometimes consist of more than one interval.

As you can see, if the dimension of θ is large it is quite difficult to evaluate the CI (try it for the simple case that the dimension is two!). Indeed, for dimensions greater than three it is extremely hard. However, in most cases we are only interested in constructing CIs for certain parameters of interest; the other unknown parameters are simply nuisance parameters, and CIs for them are not of interest. For example, for the normal distribution we may only be interested in a CI for the mean but not the variance. It is clear that directly using the log-likelihood ratio to construct CIs (and also to test) would mean also constructing CIs for the nuisance parameters. Therefore, in Chapter 6 we construct a variant of the likelihood, called the profile likelihood, which allows us to deal with nuisance parameters in a more efficient way.

Testing using the likelihood

Let us suppose we wish to test the hypothesis H_0: θ = θ_0 against the alternative H_A: θ ≠ θ_0. We can use any of the results (5.1), (5.2) and (5.3) to do the test; they will lead to slightly different p-values, but asymptotically they are all equivalent, because they are all based on essentially the same derivation.
We now list the three tests that one can use.

The Wald test. The Wald statistic is based on (5.1). We recall from (5.1) that if the null is true, then we have

√T(θ̂_T − θ_0) →D N( 0, E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ]^{−1} ).

Thus we can use as the test statistic

T_1 = √T E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ]^{1/2} (θ̂_T − θ_0) →D N(0,1).

Let us now consider how the test statistic behaves under the alternative H_A: θ = θ_1. If the null is not true, then we have

θ̂_T − θ_0 = (θ̂_T − θ_1) + (θ_1 − θ_0) ≈ I(θ_1)^{−1} Σ_t (∂log f(X_t;θ)/∂θ)|_{θ_1} + (θ_1 − θ_0).

Thus the distribution of the test statistic T_1 becomes centred about √T E[((∂log f(X;θ)/∂θ)|_{θ_0})²]^{1/2} (θ_1 − θ_0). Hence, the larger the sample size, the more likely we are to reject the null.

Remark (types of alternatives). In the case that the alternative is fixed, it is clear that the power of the test goes to 100%. Therefore, to see the effectiveness of the test, one often lets the alternative get closer to the null as T → ∞. For example:

- Suppose θ_1 = θ_0 + δ/T. Then the centre of T_1 is √T E[((∂log f(X;θ)/∂θ)|_{θ_0})²]^{1/2} δ/T → 0. Thus the alternative is too close to the null for us to discriminate between the two.

- Suppose θ_1 = θ_0 + δ/√T. Then the centre of T_1 is E[((∂log f(X;θ)/∂θ)|_{θ_0})²]^{1/2} δ. Therefore the test does have power, but it is not 100%.

In the case that the dimension of θ is greater than one, we instead use the test statistic

T_1 = T(θ̂_T − θ_0)′ E[ (∇log f(X;θ)|_{θ_0})(∇log f(X;θ)|_{θ_0})′ ] (θ̂_T − θ_0),

noting that under the null its distribution is chi-squared with p degrees of freedom.

The score test. The score test is based on the score. We recall from (5.2) that under the null the distribution of the score is

(1/√T)(∂L_T/∂θ)|_{θ = θ_0} →D N( 0, E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ] ).

Thus we use as the test statistic

T_2 = (1/√T) E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ]^{−1/2} (∂L_T/∂θ)|_{θ = θ_0} →D N(0,1).

An advantage of this test is that the maximum likelihood estimator does not have to be calculated, under either the null or the alternative.

Exercise: What does the test statistic look like under the alternative?
The log-likelihood ratio test. Probably one of the most popular tests is the log-likelihood ratio test. This test is based on (5.3), and the test statistic is

T_3 = 2( L_T(θ̂_T) − L_T(θ_0) ) →D χ²_p.

An advantage of this test statistic is that it is pivotal, in the sense that the Fisher information etc. does not have to be calculated; only the maximum likelihood estimator is needed.

Exercise: What does the test statistic look like under the alternative?

Applications of the log-likelihood ratio to the multinomial distribution

Example (the multinomial distribution). This is a generalisation of the binomial distribution. In this case, at any given trial, m different events can arise (in the binomial case m = 2). Let Z_i denote the outcome of the i-th trial and assume P(Z_i = k) = π_k, where π_1 + ... + π_m = 1. Suppose n trials are conducted, and let Y_1 denote the number of times event 1 arises, Y_2 the number of times event 2 arises, and so on. Then it is straightforward to show that

P(Y_1 = k_1, ..., Y_m = k_m) = ( n choose k_1, ..., k_m ) ∏_{i=1}^m π_i^{k_i}.

If we do not impose any constraints on the probabilities {π_i}, it is straightforward (and very intuitive) to derive the MLE of {π_i} given {Y_i}. Noting that π_m = 1 − Σ_{i=1}^{m−1} π_i, the log-likelihood of the multinomial is proportional to

L_T(π) = Σ_{i=1}^{m−1} y_i log π_i + y_m log( 1 − Σ_{i=1}^{m−1} π_i ).

Differentiating the above with respect to π_i and solving gives the MLE π̂_i = Y_i/n, which is what we would have expected! We observe that though there are m probabilities, due to the constraint π_m = 1 − Σ_{i=1}^{m−1} π_i we only have to estimate m−1 of them.

We mention that the same estimators can also be obtained by using Lagrange multipliers, that is, maximising L_T(π) subject to the parameter constraint Σ_{i=1}^m π_i = 1. To enforce this constraint, we add an additional term to L_T(π) and include the dummy variable λ. That is, we define the constrained likelihood

L_T(π,λ) = Σ_{i=1}^m y_i log π_i + λ( Σ_{i=1}^m π_i − 1 ).
Now, if we maximise L_T(π,λ) with respect to {π_i}_{i=1}^m and λ, we obtain the estimators π̂_i = Y_i/n, the same as the maximisers of L_T(π).

To derive the limiting distribution, we note that the second derivative is, for 1 ≤ i, j ≤ m−1,

∂²L_T(π)/∂π_i∂π_j = −(y_i/π_i²) 1(i = j) − y_m / (1 − Σ_{r=1}^{m−1} π_r)².

Hence, taking expectations, the information matrix is the (m−1)×(m−1) matrix

I(π) = n [ diag(1/π_1, ..., 1/π_{m−1}) + (1/π_m) 11′ ],

i.e. the matrix with entries 1/π_i + 1/π_m on the diagonal and 1/π_m off the diagonal, multiplied by n. Provided no π_i equals 0 or 1 (which would drop the dimension below m and make I(π) singular), the asymptotic distribution of the MLE is normal with variance I(π)^{−1}.

Sometimes the probabilities {π_i} are not free but are determined by a parameter θ, where θ is an r-dimensional vector with r < m, i.e. π_i = π_i(θ). In this case the likelihood of the multinomial is proportional to

L_T(θ) = Σ_{i=1}^{m−1} y_i log π_i(θ) + y_m log( 1 − Σ_{i=1}^{m−1} π_i(θ) ).

Differentiating the above with respect to θ and solving gives the MLE.

Pearson's goodness of fit test. We now derive Pearson's goodness of fit test using the log-likelihood ratio, though Pearson did not use this method to derive his test. Suppose the null is H_0: π_1 = π̃_1, ..., π_m = π̃_m, where {π̃_i} are some pre-set probabilities, and H_A: the probabilities are not the given probabilities. Hence we are testing a restricted model (where we do not have to estimate anything) against the full model (where we estimate the probabilities using π̂_i = Y_i/n). The log-likelihood ratio in this case is

W = 2( max_π L_T(π) − L_T(π̃) ).

Under the null, W →D χ²_{m−1}, because we have to estimate m−1 parameters under the full model. We now derive an expression for W and show that the Pearson
statistic is an approximation of it:

W = 2[ Σ_{i=1}^m Y_i log(Y_i/n) − Σ_{i=1}^m Y_i log π̃_i ] = 2 Σ_{i=1}^m Y_i log( Y_i/(n π̃_i) ).

Recall that Y_i is often called the observed count (O_i = Y_i) and n π̃_i the expected count under the null (E_i = n π̃_i). Then

W = 2 Σ_{i=1}^m O_i log(O_i/E_i) →D χ²_{m−1}.

By using, for x close to a, the Taylor expansion of x log(x/a) about x = a,

x log(x/a) ≈ (x − a) + (1/2)(x − a)²/a.

We let O_i = x and E_i = a; then, assuming the null is true (so E_i ≈ O_i), we have

W = 2 Σ_{i=1}^m Y_i log( Y_i/(n π̃_i) ) ≈ 2 Σ_{i=1}^m [ (O_i − E_i) + (1/2)(O_i − E_i)²/E_i ].

Now we note that Σ_{i=1}^m E_i = Σ_{i=1}^m O_i = n, so the first sum vanishes and the above reduces to

W ≈ Σ_{i=1}^m (O_i − E_i)²/E_i →D χ²_{m−1}.

We recall that the right-hand side is the Pearson test statistic. Hence this is one method for deriving the Pearson chi-squared test for goodness of fit. By using a similar argument, we can also obtain the test statistic of the chi-squared test for independence (and an explanation for its rather strange number of degrees of freedom!).
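The closeness of the log-likelihood ratio statistic W and its Pearson approximation can be seen directly by simulation. The sketch below is not from the notes; the probabilities, n = 10000 and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate a multinomial under the null and compare the log-likelihood
# ratio statistic W with the Pearson statistic X^2; both are
# approximately chi^2_{m-1}, and for large n they nearly coincide.
pi0 = np.array([0.2, 0.3, 0.1, 0.4])
n = 10000
obs = rng.multinomial(n, pi0)      # observed counts O_i
exp_ = n * pi0                     # expected counts E_i under the null
W = 2 * np.sum(obs * np.log(obs / exp_))
X2 = np.sum((obs - exp_) ** 2 / exp_)
print(round(W, 3), round(X2, 3))
```

The gap between the two statistics is of order max_i |O_i − E_i|/E_i relative to their common size, exactly the error of the Taylor expansion used above.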
Chapter 6

The Profile Likelihood

6.1 The Profile Likelihood (see also Section 4.5.2, Davison)

The method of profiling

Let us suppose that the unknown parameter θ can be partitioned as θ = (ψ, λ), where ψ is the p-dimensional parameter of interest (e.g. the mean) and λ is the q-dimensional nuisance parameter (e.g. the variance). We will need to estimate both ψ and λ, but our interest is in testing only the parameter ψ, without any information on λ, and in constructing confidence intervals for ψ without constructing unnecessary confidence intervals for λ (confidence intervals for a large number of parameters are wider than those for a few parameters). To achieve this one often uses the profile likelihood. To motivate the profile likelihood, we first describe a method to estimate ψ and λ in two stages and consider some examples.

Let us suppose that {X_t} are iid random variables with density f(x; ψ, λ), where our objective is to estimate ψ and λ. In this case the log-likelihood is

L_T(ψ,λ) = Σ_{t=1}^T log f(X_t; ψ, λ).

To estimate ψ and λ one can use (ψ̂_T, λ̂_T) = argmax_{ψ,λ} L_T(ψ,λ). However, this can be quite difficult and can lead to expressions which are hard to maximise. Instead, let us consider a different method, which may sometimes be easier to evaluate. Suppose, for now, that ψ is known; then we rewrite the likelihood as L_T(ψ,λ) = L_ψ(λ), to emphasise that ψ is fixed while λ varies. To estimate λ we maximise L_ψ(λ) with respect to λ, i.e.

λ̂_ψ = argmax_λ L_ψ(λ).
In reality ψ is unknown; hence for each ψ we can evaluate λ̂_ψ. Note that for each ψ we have a new curve L_ψ(λ) over λ. Now, to estimate ψ, we evaluate the maximum of L_ψ(λ) over λ, and choose the ψ which maximises over all these curves. In other words, we evaluate

ψ̂_T = argmax_ψ L_ψ(λ̂_ψ) = argmax_ψ L_T(ψ, λ̂_ψ).

A bit of logical deduction shows that ψ̂_T and λ̂_{ψ̂_T} are the maximum likelihood estimators, (ψ̂_T, λ̂_T) = argmax_{ψ,λ} L_T(ψ,λ). We note that we have profiled out the nuisance parameter λ, and the likelihood L_ψ(λ̂_ψ) = L_T(ψ, λ̂_ψ) is completely in terms of the parameter of interest ψ. The advantage of this is best illustrated through some examples.

Example. Let us suppose that {Y_t} are iid random variables from a Weibull distribution with density f(y; α, θ) = (α y^{α−1}/θ^α) exp(−(y/θ)^α). We know from Example 4.1.2 that if α were known, an explicit expression for the MLE of θ could be derived:

θ̂_α = argmax_θ L_α(θ) = argmax_θ Σ_{t=1}^T [ log α + (α−1) log Y_t − α log θ − (Y_t/θ)^α ] = ( (1/T) Σ_{t=1}^T Y_t^α )^{1/α},

where L_α(θ) = Σ_{t=1}^T [ log α + (α−1) log Y_t − α log θ − (Y_t/θ)^α ]. Thus for a given α, the maximum likelihood estimator of θ can be derived. The maximum likelihood estimator of α is then

α̂_T = argmax_α L_T(α, θ̂_α) = argmax_α { T log α + (α−1) Σ_{t=1}^T log Y_t − T log( (1/T) Σ_{t=1}^T Y_t^α ) − T },

where we have used α log θ̂_α = log((1/T) Σ_t Y_t^α) and Σ_t (Y_t/θ̂_α)^α = T. Therefore, the maximum likelihood estimator of θ is ( (1/T) Σ_t Y_t^{α̂_T} )^{1/α̂_T}. We observe that evaluating α̂_T can be tricky, but no worse than maximising the likelihood L_T(α,θ) over α and θ jointly.

As we mentioned above, we often do not have any interest in the nuisance parameter λ and are only interested in testing and constructing CIs for ψ. In this case, we are interested in the limiting distribution of the MLE ψ̂_T. This can easily be derived by observing that

√T( ψ̂_T − ψ, λ̂_T − λ )′ →D N( 0, ( I_ψψ  I_ψλ ; I_λψ  I_λλ )^{−1} ),
where

( I_ψψ  I_ψλ ; I_λψ  I_λλ ) = ( −E[∂²log f(X_t;ψ,λ)/∂ψ²]   −E[∂²log f(X_t;ψ,λ)/∂ψ∂λ] ; −E[∂²log f(X_t;ψ,λ)/∂λ∂ψ]   −E[∂²log f(X_t;ψ,λ)/∂λ²] ).   (6.1)

To derive an exact expression for the limiting variance of √T(ψ̂_T − ψ), we note that the inverse of a block matrix is

( A  B ; C  D )^{−1} = ( (A − B D^{−1} C)^{−1}   −A^{−1} B (D − C A^{−1} B)^{−1} ; −D^{−1} C (A − B D^{−1} C)^{−1}   (D − C A^{−1} B)^{−1} ).

Thus the above implies that

√T(ψ̂_T − ψ) →D N( 0, (I_ψψ − I_ψλ I_λλ^{−1} I_λψ)^{−1} ).

Thus, if ψ is a scalar, we can easily use the above to construct confidence intervals for ψ.

Exercise: How would you estimate I_ψψ − I_ψλ I_λλ^{−1} I_λψ?

The score and the log-likelihood ratio for the profile likelihood

To ease notation, let us suppose that ψ_0 and λ_0 are the true parameters in the distribution. The above gives us the limiting distribution of ψ̂_T − ψ_0, which allows us to test ψ; however, the test ignores any dependence that may exist with the nuisance parameter estimator λ̂_T. An alternative test, which circumvents this issue, is a log-likelihood ratio test of the type

2( max_{ψ,λ} L_T(ψ,λ) − max_λ L_T(ψ_0,λ) ).   (6.2)

However, deriving the limiting distribution of this statistic is a little more complicated than for the log-likelihood ratio test without nuisance parameters, because a direct Taylor expansion does not work. We observe that

2( max_{ψ,λ} L_T(ψ,λ) − max_λ L_T(ψ_0,λ) ) = 2( max_{ψ,λ} L_T(ψ,λ) − L_T(ψ_0,λ_0) ) − 2( max_λ L_T(ψ_0,λ) − L_T(ψ_0,λ_0) ),

and we will show below that by using a few Taylor expansions we can derive the limiting distribution of (6.2). In the theorem below we derive the distribution of the (profile) score and the nested log-likelihood ratio. Please note: you do not have to learn this proof.

Theorem 6.1.1. Suppose Assumption 4.1.1 holds, and suppose that ψ_0, λ_0 are the true parameters. Then we have

(∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} ≈ (∂L_T(ψ,λ)/∂ψ)|_{ψ_0,λ_0} − I_ψλ I_λλ^{−1} (∂L_T(ψ,λ)/∂λ)|_{ψ_0,λ_0},   (6.3)
and

(1/√T)(∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} →D N( 0, I_ψψ − I_ψλ I_λλ^{−1} I_λψ ),   (6.4)

2( L_T(ψ̂_T, λ̂_T) − L_T(ψ_0, λ̂_{ψ_0}) ) →D χ²_p,   (6.5)

where I is defined as in (6.1).

PROOF. We first prove (6.3), which is the basis of the proofs of (6.4) and (6.5); in the remark below we try to interpret (6.3). To avoid notational difficulties, by considering the elements of the vectors (∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} and (∂L_T(ψ,λ)/∂λ)|_{λ_0,ψ_0} as discussed earlier, we will suppose that these are univariate random variables.

Our objective is to find an expression for (∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} in terms of (∂L_T(ψ,λ)/∂λ)|_{λ_0,ψ_0} and (∂L_T(ψ,λ)/∂ψ)|_{λ_0,ψ_0}, which will allow us to obtain its variance and asymptotic distribution easily. Now, making a Taylor expansion of (∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} about (∂L_T(ψ,λ)/∂ψ)|_{λ_0,ψ_0} gives

(∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} ≈ (∂L_T(ψ,λ)/∂ψ)|_{λ_0,ψ_0} + (∂²L_T(ψ,λ)/∂λ∂ψ)|_{λ_0,ψ_0} (λ̂_{ψ_0} − λ_0).

Notice that we have used ≈ instead of = because we replace the second derivative at an intermediate point by its value at the true parameters. Now, if the sample size is large enough, then we can say that

(∂²L_T(ψ,λ)/∂λ∂ψ)|_{λ_0,ψ_0} ≈ E[(∂²L_T(ψ,λ)/∂λ∂ψ)|_{λ_0,ψ_0}].

To see why this is true, consider the case of iid random variables; then

(1/T)(∂²L_T(ψ,λ)/∂λ∂ψ) = (1/T) Σ_{t=1}^T ∂²log f(X_t;ψ,λ)/∂λ∂ψ ≈ E[∂²log f(X_t;ψ,λ)/∂λ∂ψ].

Therefore we have

(∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} ≈ (∂L_T(ψ,λ)/∂ψ)|_{λ_0,ψ_0} − T I_ψλ (λ̂_{ψ_0} − λ_0).   (6.6)

Hence we have the first part of the decomposition of (∂L_T(ψ_0,λ)/∂ψ)|_{λ̂_{ψ_0}} into quantities whose distribution is known; now we need to decompose λ̂_{ψ_0} − λ_0 into known distributions. We first recall that since λ̂_{ψ_0} = argmax_λ L_T(ψ_0,λ), then

(∂L_T(ψ_0,λ)/∂λ)|_{λ̂_{ψ_0}} = 0,

as long as the parameter space is large enough and the maximum is not on the boundary. Therefore, making a Taylor expansion of (∂L_T(ψ_0,λ)/∂λ)|_{λ̂_{ψ_0}} about (∂L_T(ψ_0,λ)/∂λ)|_{λ_0,ψ_0} gives

(∂L_T(ψ_0,λ)/∂λ)|_{λ̂_{ψ_0}} ≈ (∂L_T(ψ_0,λ)/∂λ)|_{λ_0,ψ_0} + (∂²L_T(ψ_0,λ)/∂λ²)|_{λ_0,ψ_0} (λ̂_{ψ_0} − λ_0).
Again using the same trick as in (6.6), we have

0 = (∂L_T(ψ_0,λ)/∂λ)|_{λ̂_{ψ_0}} ≈ (∂L_T(ψ_0,λ)/∂λ)|_{λ_0,ψ_0} − T I_λλ (λ̂_{ψ_0} − λ_0).

Therefore

λ̂_{ψ_0} − λ_0 ≈ (T I_λλ)^{−1} (∂L_T(ψ_0,λ)/∂λ)|_{λ_0,ψ_0}.   (6.7)

Substituting (6.7) into (6.6) gives

(∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} ≈ (∂L_T(ψ,λ)/∂ψ)|_{λ_0,ψ_0} − I_ψλ I_λλ^{−1} (∂L_T(ψ_0,λ)/∂λ)|_{ψ_0,λ_0},

which is (6.3).

To prove (6.4), i.e. obtain the asymptotic distribution and limiting variance of (∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0}, we recall that the regular score function satisfies

(1/√T)( (∂L_T(ψ,λ)/∂ψ)|_{ψ_0,λ_0}, (∂L_T(ψ,λ)/∂λ)|_{ψ_0,λ_0} )′ →D N(0, I(θ_0)).

Substituting this into (6.3) immediately gives (6.4): the limiting variance is I_ψψ − 2 I_ψλ I_λλ^{−1} I_λψ + I_ψλ I_λλ^{−1} I_λλ I_λλ^{−1} I_λψ = I_ψψ − I_ψλ I_λλ^{−1} I_λψ.

Finally, to prove (6.5) we use the following decomposition, Taylor expansions, and the trick in (6.6):

2( L_T(ψ̂_T, λ̂_T) − L_T(ψ_0, λ̂_{ψ_0}) )
= 2( L_T(ψ̂_T, λ̂_T) − L_T(ψ_0,λ_0) ) − 2( L_T(ψ_0, λ̂_{ψ_0}) − L_T(ψ_0,λ_0) )
≈ T(θ̂_T − θ_0)′ I(θ_0)(θ̂_T − θ_0) − T(λ̂_{ψ_0} − λ_0)′ I_λλ (λ̂_{ψ_0} − λ_0),   (6.8)

where θ̂_T = (ψ̂_T, λ̂_T) is the MLE. Now we want to rewrite λ̂_{ψ_0} − λ_0 in terms of θ̂_T − θ_0. We start by recalling that from (6.7) we have

λ̂_{ψ_0} − λ_0 ≈ (T I_λλ)^{−1} (∂L_T(ψ_0,λ)/∂λ)|_{λ_0,ψ_0}.

Now we rewrite (∂L_T(ψ_0,λ)/∂λ)|_{λ_0,ψ_0} in terms of θ̂_T − θ_0 by using

0 = (∂L_T(θ)/∂θ)|_{θ̂_T} ≈ (∂L_T(θ)/∂θ)|_{θ_0} − T I(θ_0)(θ̂_T − θ_0),  so that  (∂L_T(θ)/∂θ)|_{θ_0} ≈ T I(θ_0)(θ̂_T − θ_0).

Therefore, concentrating on the subvector (∂L_T(θ)/∂λ)|_{ψ_0,λ_0}, we see that

(∂L_T(θ)/∂λ)|_{ψ_0,λ_0} ≈ T[ I_λψ (ψ̂ − ψ_0) + I_λλ (λ̂ − λ_0) ].   (6.9)
48 Substituting 6.9 into 6.7 gives ˆλ ψ0 λ 0 I 1 λλ I λψˆψ ψ 0 +ˆλ λ 0. Finally substituting the above into 6.8 and making lots of cancellations we have { } 2 L T ˆψ T,ˆλ T L T ψ 0,ˆλ ψ0 Tˆψ ψ 0 I ψψ I ψλ I 1 λ,λ I λ,ψˆψ ψ 0. Finally, since TˆθT θ 0 D N0,Iθ 1, by using inversion formulas for block matrices we have that Tˆψ ψ 0 I ψλ I 1 λ,λ I λ,ψ 1, which gives the desired result. D N0,I ψψ Remark i We first make the rather interesting observation. The limiting variance of L Tψ,λ ψ ψ0,λ 0 is I ψψ, whereas the the limiting variance of L Tψ,λ ψ ˆλψ0,ψ 0 is I ψψ I ψλ I 1 λ,λ I λ,ψ and the limiting variance of Tˆψ ψ 0 is I ψψ I ψλ I 1 λ,λ I λ,ψ 1. ii Look again at the expression L T ψ,λ ψ ˆλψ0,ψ 0 L Tψ,λ ψ λ0,ψ 0 I ψλ I 1 λλ L T ψ 0,λ λ0,ψ λ It is useful to understand where it came from. Consider the problem of linear regression. Suppose X and Y are random variables and we want to construct the best linear predictor of Y given X. We know that the best linear predictor is ŶX = EXY/EY 2 X and the residual and mean squared error is EXY Y ŶX = Y EY 2 X and E Y EXY 2 EY 2 X = EY 2 EXYEY 2 1 EXY. Compare this expression with We see that in some sense L Tψ,λ ψ ˆλψ0,ψ 0 can be treated as the residual error of the projection of L Tψ,λ ψ λ0,ψ 0 onto L Tψ 0,λ λ lambda0,ψ 0. This is quite surprising! We now aim to use the above result. It is immediately clear that 6.5 can be used for both constructing likelihoods and testing. For example, to construct a 95% CI for ψ we can use the mle ˆθ T = ˆψ T,ˆλ T and the profile likelihood and use the 95% CI { { ψ;2 L T ˆψ T,ˆλ T L T ψ,ˆλ ψ } } χ 2 p0.95. As you can see by profiling out the parameter λ, we have avoided the need to also construct a CI for λ too. This has many advantages, from a practical perspective it reduced the dimension of the parameters. 48
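The profile-likelihood confidence interval described above is easy to compute numerically. The following sketch is illustrative only (a normal sample with hypothetical mean 2 and standard deviation 1.5, profiling out the nuisance variance): it traces 2{L_T(ψ̂_T,λ̂_T) − L_T(ψ, λ̂_ψ)} over a grid of ψ values and keeps those below the χ²₁ 95% quantile.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # hypothetical data
T = len(x)

def profile_loglik(mu):
    # profile out the nuisance variance: sigma2_hat(mu) = mean((x - mu)^2)
    s2 = np.mean((x - mu) ** 2)
    return -0.5 * T * np.log(s2) - 0.5 * T

mu_hat = x.mean()                    # maximiser of the profile likelihood
crit = chi2.ppf(0.95, df=1)          # p = 1 parameter of interest
grid = np.linspace(mu_hat - 1.0, mu_hat + 1.0, 2001)
inside = [m for m in grid
          if 2 * (profile_loglik(mu_hat) - profile_loglik(m)) <= crit]
ci = (min(inside), max(inside))
print(ci)
```

Note how the nuisance parameter never appears in the interval construction: it is re-maximised at every grid point through the closed form for σ̂²(μ).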
The log-likelihood ratio test in the presence of nuisance parameters

An application of Theorem 6.1.1 is nested hypothesis testing, as stated at the beginning of this section. (6.5) can be used to test H_0: ψ = ψ_0 against H_A: ψ ≠ ψ_0, since

2{ max_(ψ,λ) L_T(ψ,λ) − max_λ L_T(ψ_0,λ) } →D χ²_p.

Example (χ²-test for independence). It is worth noting that using the profile likelihood one can derive the chi-squared test for independence in much the same way that the Pearson goodness-of-fit test was derived using the log-likelihood ratio test. Do this as an exercise (see Davison, Example 4.37, page 135).

The score test in the presence of nuisance parameters

We recall that we used Theorem 6.1.1 to obtain the distribution of 2{max_(ψ,λ) L_T(ψ,λ) − max_λ L_T(ψ_0,λ)} under the null; we now motivate an alternative test of the same hypothesis which uses the same theorem. We recall that under the null H_0: ψ = ψ_0 the derivative ∂L_T(ψ,λ)/∂λ |_(λ̂_ψ0,ψ0) = 0, but the same is not true of ∂L_T(ψ,λ)/∂ψ |_(λ̂_ψ0,ψ0). However, if the null is true we would expect λ̂_ψ0 to be close to the true λ_0, and ∂L_T(ψ,λ)/∂ψ |_(λ̂_ψ0,ψ0) to be close to zero. Indeed, this is what we showed in (6.4), where under the null

(1/√T) ∂L_T(ψ,λ)/∂ψ |_(λ̂_ψ0,ψ0) →D N(0, I_ψψ − I_ψλ I_λλ^{-1} I_λψ),   (6.11)

where λ̂_ψ0 = argmax_λ L_T(ψ_0,λ).

Therefore (6.11) suggests an alternative test of H_0: ψ = ψ_0 against H_A: ψ ≠ ψ_0: we can use (1/√T) ∂L_T(ψ,λ)/∂ψ |_(λ̂_ψ0,ψ0) as the test statistic. This is called the score or LM test. The log-likelihood ratio test and the score test are asymptotically equivalent. There are advantages and disadvantages to both:

(i) An advantage of the log-likelihood ratio test is that we do not need to calculate the information matrix.

(ii) An advantage of the score test is that we do not have to evaluate the maximum likelihood estimates under the alternative model.

Examples

Example: an application of profiling to frequency estimation

Question
50 Suppose that the observations {X t ;t = 1,...,T} satisfy the following nonlinear regression model X t = Acosωt+Bsinωt+ε t where {ε t } are iid standard normal random variables and 0 < ω < π. The parameters A,B, and ω are real and unknown. Some useful identities are given at the end of the question. i Ignoring constants, obtain the log-likelihood of {X t }. Denote this likelihood as ii Let L T A,B,ω. T S T A,B,ω = Xt 2 2 Show that 2L T A,B,ω+S T A,B,ω = A2 B X t Acosωt+Bsinωt + 2 TA2 +B 2. cos2ω+ab sin2ω. Thus show that L T A,B,ω+ 1 2 S TA,B,ω = O1 ie. the difference does not grow with T. Since L T A,B,ω and 1 2 S TA,B,ω are asymptotically equivalent, for the rest of this question, use 1 2 S TA,B,ω instead of the likelihood L T A,B,ω. iii Obtain the profile likelihood of ω. hint: ProfileouttheparametersAandB,toshowthat ˆω T = argmax ω T X texpitω 2. Suggest, a graphical method for evaluating ˆω T? iv By using the identity expiωt = exp 1 2 it+1ωsin1 2 TΩ sin 1 2 Ω 0 < Ω < 2π T Ω = 0 or 2π show that for 0 < Ω < 2π we have tcosωt = OT t 2 cosωt = OT 2 tsinωt = OT t 2 sinωt = OT 2. 50
51 v By using the results in part iv show that the Fisher Information of L T A,B,ω denoted as IA, B, ω is asymptotically equivalent to 2IA,B,ω = E 2 S T = ω 2 T T B +OT T 0 2 T2 2 A+OT T 2 2 B +OT T2 2 A+OT T 3 3 A2 +B 2 +OT 2. vi Derive the asymptotic variance of maximum likelihood estimator, ˆω T, derived in part iv. Comment on the rate of convergence of ˆω T. Useful information: In this question the following quantities may be useful: expiωt = exp 1 2 it+1ωsin1 2 TΩ sin 1 2 Ω 0 < Ω < 2π T Ω = 0 or 2π the trignometric identities: sin2ω = 2sinΩcosΩ, cos2ω = 2cos 2 Ω 1 = 1 2sin 2 Ω, expiω = cosω+isinω and t = TT +1 2 t 2 = TT +12T Solution i Since {ε t } are standard normal iid random variables the likelihood is L T A,B,ω = 1 2 X t Acosωt Bsinωt 2. 51
ii It is straightforward to show that

−2 L_T(A,B,ω) = Σ_t X_t² − 2 Σ_t X_t (A cos ωt + B sin ωt) + Σ_t (A cos ωt + B sin ωt)²
= Σ_t X_t² − 2 Σ_t X_t (A cos ωt + B sin ωt) + A² Σ_t cos²ωt + B² Σ_t sin²ωt + 2AB Σ_t sin ωt cos ωt
= Σ_t X_t² − 2 Σ_t X_t (A cos ωt + B sin ωt) + (A²/2) Σ_t (1 + cos 2ωt) + (B²/2) Σ_t (1 − cos 2ωt) + AB Σ_t sin 2ωt
= S_T(A,B,ω) + ((A² − B²)/2) Σ_t cos 2ωt + AB Σ_t sin 2ωt.

Now by using (6.13) we have −2 L_T(A,B,ω) = S_T(A,B,ω) + O(1), as required.

iii To obtain the profile likelihood, let us suppose that ω is known. Then the estimators of A and B minimising (1/2) S_T are

Â_T(ω) = (2/T) Σ_t X_t cos ωt,  B̂_T(ω) = (2/T) Σ_t X_t sin ωt.

Thus the profile criterion, using the approximation S_T, is

−(1/2) S_p(ω) = −(1/2) Σ_t X_t² + 2 Σ_t X_t ( Â_T(ω) cos ωt + B̂_T(ω) sin ωt )/2 − (T/4)( Â_T(ω)² + B̂_T(ω)² )
= −(1/2) Σ_t X_t² + (T/4) [ Â_T(ω)² + B̂_T(ω)² ].
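The profile criterion just derived is, up to constants, the periodogram, which also gives the graphical method asked for in part (iii): plot I_T(ω) = |Σ_t X_t e^{itω}|²/T over a frequency grid and read off the peak. A simulation sketch, with hypothetical values A = 1, B = 0.5 and ω = 1.3 (not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
A, B, omega = 1.0, 0.5, 1.3                # hypothetical true values
t = np.arange(1, T + 1)
X = A * np.cos(omega * t) + B * np.sin(omega * t) + rng.standard_normal(T)

# periodogram I_T(w) = |sum_t X_t exp(i t w)|^2 / T on a frequency grid
grid = np.linspace(0.01, np.pi - 0.01, 2000)
I_T = np.abs(np.exp(1j * np.outer(grid, t)) @ X) ** 2 / T
omega_hat = grid[np.argmax(I_T)]           # the profile-likelihood estimator
print(omega_hat)
```

In practice the grid search is refined near the peak; the fast rate of convergence derived in part (vi) means even a coarse grid localises ω̂_T well.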
53 Thus the ω which maximises 1 2 S pω is the parameter that maximises ÂTω 2 + ˆB T ω 2. Since ÂTω 2 + ˆB T ω 2 = 1 2T T X texpitω, we have as required. ˆω T = argmax 1/2S pω = argmax ÂT ω 2 + ˆB T ω 2 ω ω = argmax T X t expitω 2, ω iv Differentiating both sides of6.12 with respect to Ω and considering the real and imaginary terms gives T tcosωt = OT T tsinωt = OT. Differentiating both sides of 6.12 twice wrt to Ω gives the second term. v DifferentiatingS T A,B,ω = T X2 t 2 T X t Acosωt+Bsinωt TA2 +B 2 twice wrt to A,B and ω gives and 2 S T A 2 = T, 2 S T B 2 = T, S T T A = 2 X t cosωt+at S T T B = 2 X t sinωt+bt S T T ω = 2 AX t tsinωt 2 2 S T A B = 0, 2 S T T ω A = 2 X t tsinωt 2 S T T ω B = 2 X t tcosωt BX t tcosωt. 2 S T T ω 2 = 2 t 2 X t Acosωt+Bsinωt. Now taking expectations of the above and using v we have E 2 S T T ω A = 2 tsinωt Acosωt+Bsinωt = 2B = B tsin 2 ωt+2 At sinωt cosωt t1 cos2ωt+a tsin2ωt = 53 TT +1 B +OT = B T OT.
54 Using a similar argument we can show that E 2 S T ω B = AT2 2 +OT and E 2 S T T ω 2 = 2 2 t Acosωt+Bsinωt 2 = A 2 +B 2 TT +12T +1 6 Since E 2 L T 1 2 E 2 S T, this gives the required result. +OT 2 = A 2 +B 2 T 3 /3+OT 2. vi Noting that the asymptotic variance for the profile likelihood estimator ˆω T by subsituting vi into the above we have I ω,ω I ω,ab I 1 A,B I BA,ω 1, A 2 +B T 3 +OT A 2 +B 2 T 3 Thus we observe that the asymptotic variance of ˆω T is OT 3. Typically estimators have a variance of order OT 1, so we see that the estimator ˆω T variance which converges to zero, much faster. Thus the estimator is extremely good compared with the majority of parameter estimators. Example: An application of profiling in survival analysis Question This question also uses some methods from Survival Analysis which is covered later in this course - see Sections 13.1 and Let T i denote the survival time of an electrical component. It is known that the regressors x i influence the survival time T i. To model the influence the regressors have on the survival time the Cox-proportional hazard model is used with the exponential distribution as the baseline distribution and ψx i ;β = expβx i as the link function. More precisely the survival function of T i is F i t = F 0 t ψxi;β, where F 0 t = exp t/θ. Not all the survival times of the electrical components are observed, and there can arise censoring. Hence we observe Y i = mint i,c i, where c i is the censoring time and δ i, where δ i is the indicator variable, where δ i = 0 denotes censoring of the ith component and δ i = 1 denotes that it is not censored. The parameters β and θ are unknown. 54
i Derive the log-likelihood of {Y_i, δ_i}.

ii Compute the profile likelihood of the regression parameters β, profiling out the baseline parameter θ.

Solution

i The survival function and the density are

f_i(t) = ψ(x_i;β) [ F̄_0(t) ]^{ψ(x_i;β) − 1} f_0(t)  and  F̄_i(t) = F̄_0(t)^{ψ(x_i;β)}.

Hence for this example we have

log f_i(t) = log ψ(x_i;β) − (ψ(x_i;β) − 1) t/θ − log θ − t/θ = log ψ(x_i;β) − log θ − ψ(x_i;β) t/θ,
log F̄_i(t) = −ψ(x_i;β) t/θ.

Therefore, writing Y_i for the observed time, the log-likelihood is

L_n(β,θ) = Σ_i δ_i { log ψ(x_i;β) + log f_0(Y_i) + (ψ(x_i;β) − 1) log F̄_0(Y_i) } + Σ_i (1 − δ_i) ψ(x_i;β) log F̄_0(Y_i)
= Σ_i δ_i { log ψ(x_i;β) − log θ } − Σ_i ψ(x_i;β) Y_i/θ.

ii Keeping β fixed, differentiating the above with respect to θ and equating to zero gives

∂L_n/∂θ = −Σ_i δ_i/θ + Σ_i ψ(x_i;β) Y_i/θ² = 0, so θ̂(β) = Σ_i ψ(x_i;β) Y_i / Σ_i δ_i.

Hence the profile likelihood is

l_P(β) = Σ_i δ_i { log ψ(x_i;β) − log θ̂(β) } − Σ_i ψ(x_i;β) Y_i / θ̂(β).

Hence to obtain an estimator of β we maximise the above with respect to β.
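The two-step scheme in part (ii), plug the closed-form θ̂(β) back in and maximise over β, can be sketched numerically. Everything below is illustrative only (hypothetical values β = 0.8, θ = 2, and an assumed exponential censoring mechanism), not part of the question:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
n, beta_true, theta_true = 400, 0.8, 2.0   # hypothetical values
x = rng.normal(size=n)
# F-bar_i(t) = exp(-t * psi_i / theta) with psi_i = exp(beta * x_i),
# so the survival time T_i is exponential with mean theta / psi_i
Tsurv = rng.exponential(theta_true / np.exp(beta_true * x))
c = rng.exponential(3.0, size=n)           # censoring times (assumption)
Y = np.minimum(Tsurv, c)
delta = (Tsurv <= c).astype(float)

def neg_profile(beta):
    psi = np.exp(beta * x)                               # psi(x_i; beta)
    theta_hat = np.sum(psi * Y) / np.sum(delta)          # profiled-out theta(beta)
    return -(np.sum(delta * (beta * x - np.log(theta_hat)))
             - np.sum(psi * Y) / theta_hat)

beta_hat = minimize_scalar(neg_profile, bounds=(-3.0, 3.0), method="bounded").x
print(beta_hat)
```

The one-dimensional search over β is all that is needed; the baseline parameter is re-estimated in closed form at every candidate β.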
56 An application of profiling in semi-parametric regression We now consider how the profile likelihood we use inverted commas here because we do not use the likelihood, but least squares instead can be used in semi-parametric regression. Recently this type of method has been used widely in various semi-parametric models. This section needs a little knowledge of nonparametric regression, which is considered later in this course. Suppose we observe Y t,u t,x t where Y t = βx t +φu t +ε t, Y t,x t,u t are iid random variables and φ is an unknown function. To estimate β, we first profile out φ, which we estimate as if β were known. In other other words, we suppose that β is known and let Y t β = Y t βx t. We then estimate φ using the classic local least estimator, in other words the φ which minimises the criterion ˆφ β u = argmin W b u U t Y t β a 2 t = W bu U t Y t β a t t W bu U t = tw bu U t Y t t t W β W bu U t X t bu U t t W bu U t := G b u βh b u, 6.14 where G b u = tw bu U t Y t t W bu U t and H b u = t W bu U t X t t W bu U t. Thus, given β the estimator of φ and the residuals ε t are ˆφ β u = G b u βh b u and Y t βx t ˆφ β U t. Given the estimated residuals Y t βx t ˆφ β U t we can now use least squares to estimate coefficient β, where L T β = t Yt βx t ˆφ β U t 2 = t Yt βx t G b U t +βh b U t 2 = t Yt G b U t β[x t H b U t ] 2. Therefore, the least squares estimator of β is t ˆβ b,t = [Y t G b U t ][X t H b U t ] t [X t H b U t ] 2. Using β b,t we can then estimate We observe how we have the used the principle of profiling to estimate the unknown parameters. There is a large literature on this, including 56
57 Wahba, Speckman, Carroll, Fan etc. In particular it has been shown that under some conditions on b as T, the estimator ˆβ b,t has the usual T rate of convergence. U t = t T It should be mentioned that using random regressors U t are not necessary. It could be that on a grid. In this case ˆφ β u = argmin a t W b u t T Y tβ a 2 = t W bu t T Y tβ t W bu t T = t W b u t T Y t β t W b u U t X t := G b u βh b u, 6.15 where G b u = t W b u t T Y t and H b u = t W b u t T X t. Using the above estimator of φ we continue as before. 57
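The profiling steps in (6.14) and the resulting least-squares estimator β̂_{b,T} can be sketched in a few lines. The simulation below is illustrative only (hypothetical function φ(u) = cos 2πu, Gaussian kernel, bandwidth b = 0.1; none of these choices come from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
T, beta_true, b = 1000, 1.5, 0.1           # hypothetical values; b is the bandwidth
U = rng.uniform(size=T)
X = rng.normal(size=T) + np.sin(2 * np.pi * U)   # regressor correlated with U
phi = np.cos(2 * np.pi * U)                       # hypothetical phi(u)
Y = beta_true * X + phi + 0.5 * rng.standard_normal(T)

def smooth(Z):
    # Nadaraya-Watson smoother of Z on U, evaluated at each U_t (Gaussian kernel)
    W = np.exp(-0.5 * ((U[:, None] - U[None, :]) / b) ** 2)
    return (W @ Z) / W.sum(axis=1)

G, H = smooth(Y), smooth(X)                # G_b(U_t) and H_b(U_t) as in (6.14)
beta_hat = np.sum((Y - G) * (X - H)) / np.sum((X - H) ** 2)
print(beta_hat)
```

The estimator regresses the smoothing residuals of Y on those of X, which is exactly the projection interpretation of profiling: the nonparametric component is swept out before β is estimated.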
Chapter 7

The Delta Method (very short)

7.1 The delta method and the construction of CIs

Let us suppose that the estimator θ̂_T has the following limiting distribution:

√T(θ̂_T − θ_0) →D N(0, Σ),   (7.1)

where, for example, Σ is the inverse Fisher information. Often we want to obtain the limiting distribution of a function of this estimator. In other words, if θ̂_T is a good estimator of θ_0 it is reasonable to suppose that g(θ̂_T) is a good estimator of g(θ_0). Then the natural question to ask is what the limiting distribution of g(θ̂_T) is. As is almost always the case, we can obtain the limiting distribution using a Taylor expansion.

Lemma 7.1.1 Suppose that the derivative of g exists, is continuous and is non-zero at θ_0. Then we have

√T( g(θ̂_T) − g(θ_0) ) →D N( 0, g′(θ_0)² Σ ).

PROOF. To prove the result, we first note that since g′ is continuous and θ̂_T →P θ_0, we have by the continuous mapping theorem that g′(θ̂_T) →P g′(θ_0). In this case we can make a Taylor expansion of g(θ̂_T) about θ_0 to obtain

√T( g(θ̂_T) − g(θ_0) ) = g′(θ_0) √T(θ̂_T − θ_0) + o_p(1).

Now by using (7.1) we obtain the result. □

We can use the above result to construct CIs for g(θ_0). We mention that the above result can be extended to the multivariate case.
Example

Question: Let us suppose that θ̂_T is an estimator of θ_0, and √T(θ̂_T − θ_0) →D N(0, V).

i Give an estimator of θ_0³.

ii Using a Taylor expansion of g(θ) about θ_0, obtain the asymptotic distribution of the above estimator.

Solution

i θ̂_T³.

ii By the Taylor expansion we have

g(θ̂_T) ≈ g(θ_0) + (θ̂_T − θ_0) g′(θ_0), where θ̂_T − θ_0 ≈ N(0, V/T).

Thus √T( g(θ̂_T) − g(θ_0) ) →D N(0, [g′(θ_0)]² V). Hence for this example we have

√T( θ̂_T³ − θ_0³ ) →D N(0, 9 θ_0⁴ V).
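A quick Monte Carlo check of this answer (with illustrative values θ₀ = 2 and V = 1, not from the notes): draw θ̂_T directly from its limiting normal distribution and compare the variance of √T(θ̂_T³ − θ₀³) with the delta-method prediction 9θ₀⁴V.

```python
import numpy as np

rng = np.random.default_rng(4)
theta0, V, T, reps = 2.0, 1.0, 500, 20_000   # hypothetical values
# draw theta_hat from its limiting law: sqrt(T)(theta_hat - theta0) ~ N(0, V)
theta_hat = rng.normal(theta0, np.sqrt(V / T), size=reps)
Z = np.sqrt(T) * (theta_hat ** 3 - theta0 ** 3)
print(Z.var())           # delta method predicts 9 * theta0**4 * V = 144
```

The small discrepancy from 144 comes from the higher-order Taylor terms, which are O(T^{-1/2}) and vanish as T grows.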
Chapter 8

Non-regular models

8.1 Non-regular models

Estimating the mean on the boundary

There are situations where the parameter to be estimated lies exactly on the boundary of the parameter space. In such cases the limiting distribution of the parameter estimator may not be normal with variance the inverse Fisher information. Even in the case that the parameter is very close to the boundary, very large sample sizes are required for normality to hold. In this case alternative techniques are required. I illustrate one such method for the example below, though it is worth noting that it may be hard to use such methods in more complex situations.

Suppose that X_i ~ N(μ, 1), where the mean μ is unknown. Suppose in addition that it is known that the mean is non-negative; hence the parameter space of the mean is Θ = [0, ∞). In this case X̄ can no longer always be the MLE, because there will be some instances where X̄ < 0, and it makes no sense to estimate μ with a negative value when X̄ is negative. Let us look again at the likelihood on this restricted space:

μ̂_T = argmax_{μ ∈ Θ} L_T(μ) = argmax_{μ ∈ Θ} −(1/2) Σ_t (X_t − μ)².

By the concavity of L_T(μ) with respect to μ we see that the MLE is

μ̂_T = X̄ if X̄ ≥ 0, and μ̂_T = 0 if X̄ < 0.

Hence on this restricted space ∂L_T(μ)/∂μ |_(μ̂_T) need not equal zero, and the usual Taylor expansion method cannot be used to derive normality. Indeed, we will show that the limit is not normal.
We recall that √T(X̄ − μ) →D N(0, 1), or equivalently (1/√T) ∂L_T(μ)/∂μ = √T(X̄ − μ) →D N(0, I(μ)) with I(μ) = 1. Hence if the true parameter is μ_0 = 0, then approximately half the time X̄ will be less than zero and the other half it will be greater than zero. This means that half the time μ̂_T = 0 and the other half it will be greater than zero. Therefore the distribution function of μ̂_T is

P(√T μ̂_T ≤ x) = 0 for x < 0, = 1/2 for x = 0, and = 1/2 + P(0 < √T X̄ ≤ x) for x > 0.

Now we may want to test the hypothesis H_0: μ = 0 against the hypothesis H_A: μ > 0. We would use the log-likelihood ratio W = 2{ L_T(μ̂_T) − L_T(0) }, but now it is unlikely to be a chi-squared, so we need to derive its distribution. It can be argued that under the null, half the time the constrained and unconstrained maximisers coincide, hence

2{ L_T(μ̂_T) − L_T(0) } →D (1/2) δ_0 + (1/2) χ²_1,

a 50:50 mixture of a point mass at zero and a χ²_1. I am not a big fan of this argument; I prefer to use the following. Since L_T(μ̂_T) = −(1/2) Σ_t (X_t − μ̂_T)², we have

2{ L_T(μ̂_T) − L_T(0) } = 2T μ̂_T X̄ − T μ̂_T² = 0 if X̄ ≤ 0 (P(X̄ ≤ 0) = 1/2), and = T X̄² if X̄ > 0 (P(X̄ > 0) = 1/2).

Hence we have

P( 2{ L_T(μ̂_T) − L_T(0) } ≤ x ) = 0 for x < 0, = 1/2 for x = 0, and = 1/2 + (1/2) P(χ²_1 ≤ x) for x > 0,

since T X̄² is asymptotically χ²_1. Therefore, suppose I wanted to test the hypothesis H_0: μ = 0 against H_A: μ > 0; then I would use the above log-likelihood ratio test. In other words, evaluate W = 2{ L_T(μ̂_T) − L_T(0) } and find the p-value p = 1 − [1/2 + (1/2) P(χ²_1 ≤ W)] (for W > 0); depending on it, we can see whether we are able to reject the null.
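The half-point-mass/half-χ²₁ behaviour is easy to see in a simulation. The sketch below draws X̄ under the null μ₀ = 0 and evaluates the likelihood-ratio statistic W (the sample size T = 100 and the seed are arbitrary choices):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
T, reps = 100, 50_000
xbar = rng.normal(0.0, 1.0 / np.sqrt(T), size=reps)   # X-bar under H0: mu0 = 0
W = np.where(xbar > 0, T * xbar ** 2, 0.0)            # W = 2{L_T(mu_hat) - L_T(0)}

print(np.mean(W == 0))                     # about 1/2: mu_hat sits on the boundary
# P(W <= x) = 1/2 + (1/2) P(chi2_1 <= x) for x > 0; check at the 95% chi2 point:
print(np.mean(W <= chi2.ppf(0.95, df=1)))  # about 0.975
```

A practical consequence visible here: using the usual χ²₁ critical value would make the boundary test conservative, since the correct 5%-level critical value is the 90% point of χ²₁.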
Example

Question: The survival times for disease A follow an exponential distribution, where the density has the form f(x) = λ^{-1} exp(−x/λ). Suppose that it is known that at least one third of all people who have disease A survive for more than 2 years.

i Based on the above information, derive a lower bound for λ.

ii Suppose that it is known that λ ≥ λ_0. What is the maximum likelihood estimator of λ?

iii Derive the sampling properties of the maximum likelihood estimator of λ, for the cases λ = λ_0 and λ > λ_0.

Solution

i P(X > x) = exp(−x/λ). Hence we have that P(X > 2) = exp(−2/λ) ≥ 1/3, thus λ ≥ 2/log 3.

ii The log-likelihood is L_T(λ) = −Σ_t X_t/λ − T log λ. Thus λ̂_T = argmax_{λ ∈ [λ_0,∞)} L_T(λ); noting that if the parameter space were unconstrained the maximum would arise at X̄ = (1/T) Σ_t X_t, in the constrained space

λ̂_T = λ_0 if X̄ ≤ λ_0, and λ̂_T = X̄ if X̄ > λ_0.

iii If λ > λ_0, then the true parameter does not lie on the boundary of the parameter space and for a large enough sample we have √T(λ̂_T − λ) →D N(0, var X_t). On the other hand, if λ = λ_0 then we are on the boundary. We know that √T(X̄ − λ_0) →D N(0, var X_t). From this result we see that, for large samples, there is about a 50% chance that X̄ < λ_0 and a 50% chance that X̄ ≥ λ_0. Based on this, it can be argued that the limiting distribution of λ̂_T is a 50:50 mixture: √T(λ̂_T − λ_0) = 0 with probability 1/2 (when λ̂_T = λ_0), and behaves like the positive half of a N(0, var X_t) when λ̂_T > λ_0.

Example (Example 4.39, page 140, in Davison (2002)) In this example Davison reparameterises the t-distribution. It is well known that if the number of degrees of freedom of a t-distribution is one, it is the Cauchy distribution, which has extremely thick tails such that the mean does not exist. At the other extreme, if we let the number of degrees of freedom tend
64 to, then the limit is a normal distribution where all moments exist. In this example, the t-distribution is reparameterised as fy;µ,σ 2,ψ = Γ[ 1+ψ 1 ] 2 ψ 1/21+ σ 2 π 1/2 Γ 1 2π ψy ψ 1 +1/2 µ2 σ 2 It can be shown that lim ψ 1 fy;µ,σ 2,ψ is a t-distribution with one-degree of freedom and at the other end of the spectrum lim ψ 0 fy;µ,σ 2,ψ is a normal distribution. Thus 0 < ψ 1, and the above generalisation allows for fractional orders of the t-distribution. In this example it is assumed that the random variables {X t } have the density fy;µ,σ 2,ψ, and our objective is to estimate ψ, when ψ 0, this the true parameter is on the boundary of the parameter space 0,1] it is just outside it!. Using similar, arguments to those given above, Davison shows that the limiting distribution of the MLE estimator is close to a mixture of distributions as in the above example Regularity conditions which are not satisfied The uniform distribution The standard example where the regularity conditions mainly Assumption 1.1.1ii are not satisfied is the uniform distribution fx;θ = We can see that the likelihood in this case is L T X;θ = { 1 θ 0 x θ 0 otherwise T θ 1 I0 < X t < θ. In this case the the derivative of L T X;θ is not well defined, hence we cannot solve for the derivative. Instead, to obtain the mle we try to reason what the maximum is. We should plot L T X;θ against θ and place X i on the θ axis. We can see that if θ < X i, then L T is zero. Let X i denote the ordered data X 1 X 2,... X T. We see that for θ = X T, we have L T X;θ = X T T, then beyond this point L T X;θ decays ie. L T X;θ = θ T for θ X T. Hence the maximum of the likelihood is ˆθ T = max 1 t T X t. To investigate the limiting behaviour, we need to consider the likelihood. However, since X t are iid we need only consider the density fx;θ. We observe that Assumption 1.1.1ii, is not satisfied, since we cannot exchange the integral and derivative d θ 1 dθ 0 θ dx θ 1dI0 x θ dx, dθ 64
65 hence the Cramer-Rao bound no longer necessarily holds etc. And the limit distribution does not necessarily converge to a normal with the inverse of the Fisher information. In fact you cannot use the standard methods of differentiating the likelihood to obtain the sampling properties of the estimator because the derivative at the true value is not well defined. But you can calculate the limiting distribution of ˆθ T = max 1 t T X t try it. The shifted exponential Let us consider the shifted exponential distribution fx;θ,φ = 1 θ exp x φ x φ,θ,φ > 0. θ We first observe when φ = 0 we have the usual exponential function, φ is simply a shift parameter. It is clear that since the support of the distribution function involves the parameter φ that the regularity condition Assumption 1.1.1ii will not be satisfied try it and see. This means the Cramer-Rao bound does not exist in this case and the distribution of the mle estimators of the parameters will not be normal with the inverse of the Fisher information as its variance. The likelihood for this example is L T X;θ,φ = 1 θ T T exp X t φ Iφ X t. θ We see that we cannot obtain the maximum of L T X;θ,φ by differentiating. Instead let us consider what happens to L T X;θ,φ for different values of φ. We see that for φ > X t for any t, the likelihood is zero. But at φ = X 1 smallest value, the likelihood is 1 T θ T exp X t X 1 θ. But for φ < X 1, L T X;θ,φ starts to decrease because X t φ > X t X 1, hence the likelihood decreases. Thus the MLE for φ is ˆφ T = X 1, notice that this estimator is completely independent of θ. To obtain the mle of θ, differentiate L TX;θ,φ dθ ˆφT =X and equate to zero. 1 We obtain ˆθ T = X ˆφ T. This makes sense because we recall that when φ = 0, then the MLE of θ is ˆθ T = X. We now obtain the distribution of ˆφ T φ = X 1 φ. 
To make the calculation easier we observe that X_t can be rewritten as X_t = φ + E_t, where {E_t} are random variables which have the exponential distribution f(x;θ,0) = θ^{-1} exp(−x/θ). Therefore φ̂_T − φ = min_t E_t, and its distribution function is

P(φ̂_T − φ ≤ x) = P(min_t E_t ≤ x) = 1 − P(min_t E_t > x) = 1 − [exp(−x/θ)]^T.

Therefore the density of φ̂_T − φ is (T/θ) exp(−Tx/θ); in other words, it is exponential with parameter T/θ. Hence the mean of φ̂_T − φ is θ/T (notice it goes to zero as T → ∞) and the variance is
θ²/T². Standardising, we see that the distribution of T(φ̂_T − φ) is exponential with parameter θ^{-1} (since the minimum of T iid exponentials with parameter θ^{-1} is exponential with parameter Tθ^{-1}). Hence we observe that φ̂_T is a biased estimator of φ, but the bias decreases as T → ∞. Moreover, the variance is quite amazing: unlike standard estimators, whose variance decreases at the rate 1/T, the variance of φ̂_T decreases at the rate 1/T². See Davison (2002), page 145, Example 4.43, for more details.

Example Let us suppose that {X_t} are iid exponentially distributed random variables with density f(x) = λ^{-1} exp(−x/λ). Suppose that we only observe X_t if X_t > c; otherwise X_t is not observed.

i Show that the sample mean X̄ = (1/T) Σ_t X_t is a biased estimator of λ.

ii Suppose that λ and c are unknown. Derive the log-likelihood of the observed {X_t} and the maximum likelihood estimators of λ and c.

Solution

i The observations are biased, since

E X̄ = E[X_t | X_t > c] = ∫ x f(x) I(x ≥ c) dx / P(X > c) = [ ∫_c^∞ x f(x) dx ] / e^{−c/λ} = λ (c/λ + 1) e^{−c/λ} / e^{−c/λ} = λ + c.

Note that this is quite logical.

ii We observe that the density of X_t given X_t > c is f(x | X_t > c) = f(x) I(x ≥ c)/P(X > c) = λ^{-1} exp(−(x − c)/λ) I(x ≥ c); this is a shifted exponential. Based on this, the log-likelihood is

Σ_t { log f(X_t) + log I(X_t ≥ c) − log P(X > c) } = Σ_t { −log λ − (1/λ)(X_t − c) + log I(X_t ≥ c) }.

Hence we want to find the λ and c which maximise the above. Here we can use the idea of profiling to estimate the parameters — it does not matter which parameter we profile out.
Suppose we fix λ and maximise the above with respect to c; in this case it is easier to maximise the actual likelihood

L_λ(c) = Π_t (1/λ) exp( −(X_t − c)/λ ) I(X_t ≥ c).

By sketching L_λ(c) as a function of c, we see that it is increasing in c up to min_t X_t and zero beyond it; thus the estimator of c conditional on λ is ĉ = min_t X_t. Now we can estimate λ. Putting ĉ back into the log-likelihood gives the profile likelihood

Σ_t { −log λ − (1/λ)(X_t − ĉ) + log I(X_t ≥ ĉ) }.

Differentiating the above with respect to λ and equating to zero gives Σ_t (X_t − ĉ) = λT. Thus

ĉ = min_t X_t and λ̂_T = (1/T) Σ_t (X_t − ĉ)

are the MLE estimators of c and λ respectively.
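The non-standard behaviour in this section, in particular the O(T⁻²) variance of the shifted-exponential MLE φ̂_T = X_(1), can be checked by simulation. A sketch with hypothetical values θ = 2 and φ = 1 (any positive values would do):

```python
import numpy as np

rng = np.random.default_rng(6)
theta, phi = 2.0, 1.0                      # hypothetical true values

def mle_shift_errors(T, reps=20_000):
    X = phi + rng.exponential(theta, size=(reps, T))
    return X.min(axis=1) - phi             # phi_hat - phi = min_t E_t ~ Exp(T/theta)

e50, e200 = mle_shift_errors(50), mle_shift_errors(200)
print(e50.mean(), e200.mean())             # about theta/T: 0.04 and 0.01
print(e50.var() / e200.var())              # about (200/50)^2 = 16: the 1/T^2 rate
```

Quadrupling the sample size cuts the bias by a factor of 4 but the variance by a factor of 16, in contrast with the usual 1/T rate for regular estimators.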
Chapter 9

Misspecification and the Kullback-Leibler Criterion

9.1 Assessing model fit and the Kullback-Leibler criterion

The Kullback-Leibler criterion is one method for measuring how close two densities are; alternatively, it is a means of measuring how close a conjectured density is to the true density of the observations (which in reality is never observed). Rather than define it here, it will arise naturally from the discussion below on model misspecification.

Model misspecification

Until now we have assumed that the model we are fitting to the data is the correct model and our objective is to estimate the parameter θ. In reality the model we are fitting will often not be the correct model (which is usually unknown). In this situation a natural question to ask is: what are we estimating?

Let us suppose that {X_t} are iid random variables which have the density g(x). Suppose we are fitting the family of densities {f(x;θ); θ ∈ Θ} to the data and are trying to estimate θ. The log-likelihood is L_T(θ) = Σ_t log f(X_t;θ). However, its limit will now be different due to the misspecification; using the LLN (law of large numbers) we have

(1/T) L_T(θ) →a.s. E[ log f(X_t;θ) ] = ∫ log f(x;θ) g(x) dx.   (9.1)
Therefore it is clear that θ̂_T = argmax L_T(θ) is an estimator of

θ_g = argmax_θ ∫ log f(x;θ) g(x) dx.

Hence θ̂_T is not an estimator of any true parameter; it is an estimator of the parameter which best fits the model within the specified family of models. In reality we will often be estimating the best-fitting parameter θ_g. Of course, one would like to know the limiting distribution of θ̂_T − θ_g (it will not be the same as in the correctly specified case). To obtain the limiting distribution we again use a Taylor expansion of L_T(θ) and the approximation

(1/√T) ∂L_T(θ)/∂θ |_(θ_g) ≈ (1/√T) ∂L_T(θ)/∂θ |_(θ̂_T) + i(θ_g) √T(θ̂_T − θ_g),   (9.2)

where i(θ_g) = −E[ ∂²log f(X;θ)/∂θ² |_(θ_g) ]. Now, for us to use the usual asymptotic normality theory we require the following assumption, which is not required in the correctly specified case.

Theorem 9.1.1 Suppose that {X_t} are iid random variables,

∫ ∂log f(x;θ)/∂θ |_(θ_g) g(x) dx = 0,   (9.3)

and the usual regularity conditions are satisfied (exchanging derivative and integral is allowed and the third-order derivative exists). Then we have

(1/√T) ∂L_T(θ)/∂θ |_(θ_g) →D N(0, j(θ_g)),   (9.4)

and

√T(θ̂_T − θ_g) →D N( 0, i(θ_g)^{-1} j(θ_g) i(θ_g)^{-1} ),   (9.5)

where

i(θ_g) = −E[ ∂²log f(X;θ)/∂θ² |_(θ_g) ] = −∫ ∂²log f(x;θ)/∂θ² |_(θ_g) g(x) dx,
j(θ_g) = E[ ( ∂log f(X;θ)/∂θ |_(θ_g) )² ] = ∫ ( ∂log f(x;θ)/∂θ |_(θ_g) )² g(x) dx.

PROOF. If (9.3) is satisfied, then for large enough T we have ∂L_T(θ)/∂θ |_(θ̂_T) = 0, hence from (9.2)

(1/√T) ∂L_T(θ)/∂θ |_(θ_g) ≈ i(θ_g) √T(θ̂_T − θ_g)   (9.6)
⇒ √T(θ̂_T − θ_g) ≈ i(θ_g)^{-1} (1/√T) ∂L_T(θ)/∂θ |_(θ_g).   (9.7)

Hence asymptotic normality of √T(θ̂_T − θ_g) follows from asymptotic normality of (1/√T) ∂L_T(θ)/∂θ |_(θ_g). Under assumption (9.3) the terms { ∂log f(X_t;θ)/∂θ |_(θ_g) } are zero-mean iid random variables; therefore by using the CLT we have

(1/√T) ∂L_T(θ)/∂θ |_(θ_g) →D N(0, j(θ_g)).   (9.8)

Now substituting (9.8) into (9.6) we have

√T(θ̂_T − θ_g) →D N( 0, i(θ_g)^{-1} j(θ_g) i(θ_g)^{-1} ).   (9.9)

This gives the desired result. □

The main thing to observe is that, unlike the case when we have correctly specified the model, it is not true in general that i(θ_g) = j(θ_g). Hence, whereas in the correctly specified case we have √T(θ̂_T − θ_0) →D N(0, i(θ_0)^{-1}), in the misspecified case it is √T(θ̂_T − θ_g) →D N(0, i(θ_g)^{-1} j(θ_g) i(θ_g)^{-1}).

Recall that in the correctly specified case we estimated the information criterion using either

ĵ(θ_0) = (1/T) Σ_t ( ∂log f(X_t;θ)/∂θ |_(θ̂_T) )²  or  î(θ_0) = −(1/T) Σ_t ∂²log f(X_t;θ)/∂θ² |_(θ̂_T).

Both are estimators of the information matrix i(θ_0), since E[ −∂²log f(X_t;θ)/∂θ² |_(θ_0) ] = E[ ( ∂log f(X_t;θ)/∂θ |_(θ_0) )² ]. In the misspecified case we need to use both of the above to obtain estimators of i(θ_g) and j(θ_g); in other words,

î(θ_g) = −(1/T) Σ_t ∂²log f(X_t;θ)/∂θ² |_(θ̂_T)  and  ĵ(θ_g) = (1/T) Σ_t ( ∂log f(X_t;θ)/∂θ |_(θ̂_T) )²

are estimators of i(θ_g) and j(θ_g) respectively. Hence, using these and Theorem 9.1.1, we can construct CIs for θ_g. We can also construct the log-likelihood ratio statistic, but its distribution is not a standard chi-squared distribution (it is a generalised chi-squared distribution).

Example Let us suppose that {X_t} are independent random variables which satisfy the model X_t = g(t/T) + ε_t, where {ε_t} are iid random variables which follow a t-distribution with
72 6-degrees of freedom ie. the variance exists. Thus, as T gets large we observe a corrupted version of g on a finer grid. Since, the function g is unknown, a line if fitted to the data and it is assumed that the noise is Gaussian. In other words, the estimated the slope â T which maximised the criterion L T a = 1 2σ 2 Xt a t 2. T Question: i What is â T estimating? ii What is the limiting distribution of â T? i Rewriting L T a we observe that 1 T L Ta = = P 1 2σ 2 T 1 2σ 2 T 1 2σ 2 T 1 0 t g T +ε t a t 2 T t g T a t T 2σ 2 T gu au ε 2 t + 2 2σ 2 T t g T a t εt T Thus we observe â T is an estimator of the line which best fits the curve g according to the l 2 -distance a g = argmin If you draw a picture, this seems logical. 1 0 gu au 2du. ii Now we derive the distribution of Tâ T a g. We assume and it can be shown that all the regularity conditions are satisfied. Thus we procede to derive the derivatives of the likelihoods. 1 T L T a ag = 1 a Tσ 2 Xt a g t t T T 1 T 2 L T a a 2 ag = 1 Tσ 2 t 2. T Thus taking the mean and expectation and using the definition of the Reimann integral we have 1 T Ia g = 1 T var L T a 1 ag = a Tσ 4 1 T Ja g = 1 T E 2 L T a 1 a 2 ag = Tσ 2 varx t t T t T u 2 du = 1 3σ 2 u 2 du = 1 3σ 2. We observe that in this case despite the mean and the distribution being misspecified we have that 1 T Ia g = 1 T Ja g. Altogether, this gives the limiting distribution T ât a g D N0,3σ 2. 72
We observe that had we fitted a double Laplacian to the data, which has the distribution f(x) = (1/2b) exp(−|x − μ_t|/b), the limit of the estimator would be different, and the limiting distribution would also be different.

The Kullback-Leibler information criterion

The discussion above, in particular (9.1), leads to the definition of the Kullback-Leibler criterion. We recall that the parameter which best fits the model using the maximum likelihood is an estimator of

θ_g = argmax_θ ∫ log f(x;θ) g(x) dx.

θ_g can be viewed as the parameter which best fits the distribution out of the possible distributions in the family. Of course, the word "best" is not particularly precise: it is best according to the criterion ∫ log f(x;θ) g(x) dx. To determine how well this fits the distribution, we compare it to the limit of the likelihood under the correct distribution, which is

∫ log g(x) g(x) dx   (the limit of the likelihood under the correct distribution).

In other words, the closer the difference

∫ log f(x;θ_g) g(x) dx − ∫ log g(x) g(x) dx = ∫ log[ f(x;θ_g)/g(x) ] g(x) dx

is to zero, the better the parameter θ_g fits the distribution g, using this criterion. We recall that, by Jensen's inequality,

∫ log[ f(x;θ)/g(x) ] g(x) dx = E[ log( f(X_t;θ)/g(X_t) ) ] ≤ log E[ f(X_t;θ)/g(X_t) ] = log ∫ f(x;θ) dx = 0,

where equality arises only if f(x;θ) = g(x). Therefore an alternative, but equivalent, interpretation of θ_g is the parameter which maximises

D(g, f_θ) = ∫ log f(x;θ) g(x) dx − ∫ log g(x) g(x) dx = ∫ log[ f(x;θ)/g(x) ] g(x) dx,

i.e. θ_g = argmax_{θ ∈ Θ} D(g, f_θ). (With this sign convention D(g, f_θ) ≤ 0, and −D(g, f_θ) is the usual Kullback-Leibler divergence, so maximising D is the same as minimising the divergence.) We note that D(g, f_θ) is not strictly a distance, since D(g, f_θ) ≠ D(f_θ, g), though it can be symmetrised. D(g, f_θ) is called the Kullback-Leibler criterion. It can be considered a measure of fit between the two distributions: the closer it is to zero, the better the fit. The Kullback-Leibler criterion arises all over the place. We will use it in the section below on model selection.
We observe that θ_g = argmax_{θ ∈ Θ} D(g, f_θ); hence f(x;θ_g) is the best-fitting distribution using the K-L criterion. This does not mean it is the best-fitting distribution according to another criterion: indeed, if we used a different distance measure, we are likely to obtain a different best-fitting distribution. There are many different information criteria. The motivation for the K-L criterion comes from the likelihood. However, in the model misspecification set-up there are alternative methods, beyond likelihood methods, for finding the best-fitting distribution (alternative methods may be more robust — for example, the Renyi information criterion).

Examples

Example An example of misspecification is when we fit the exponential distribution {f(x;θ) = θ^{-1} exp(−x/θ); θ > 0} to observations which come from the Weibull distribution. For example, suppose

g(x) = (α/φ)(x/φ)^{α−1} exp(−(x/φ)^α);  α, φ > 0, x > 0.

We use the likelihood to fit the exponential distribution:

(1/T) L_T(θ) = −(1/T) Σ_t [ log θ + X_t/θ ] →a.s. −log θ − E[X_t]/θ = −log θ − ∫ (x/θ) g(x) dx.

Let θ̂_T = argmax L_T(θ). Tedious algebra shows that θ̂_T is an estimator of

θ_g = argmax_θ { −log θ − E[X_t]/θ } = E[X_t] = φ Γ(1 + α^{-1}).   (9.10)

Therefore by using Theorem 9.1.1 we have

√T( θ̂_T − φ Γ(1 + α^{-1}) ) →D N( 0, i(θ_g)^{-1} j(θ_g) i(θ_g)^{-1} ),

where, since ∂log f(x;θ)/∂θ = −θ^{-1} + xθ^{-2} and ∂²log f(x;θ)/∂θ² = θ^{-2} − 2xθ^{-3},

i(θ_g) = −E[ θ^{-2} − 2Xθ^{-3} ] |_(θ = EX) = 1/[EX]²,
j(θ_g) = E[ ( −θ^{-1} + Xθ^{-2} )² ] |_(θ = EX) = E[X²]/[EX]⁴ − 1/[EX]².

We note that for the Weibull distribution EX = φΓ(1 + α^{-1}) and EX² = φ²Γ(1 + 2α^{-1}), hence the above reduces to

i(θ_g) = 1/[ φΓ(1 + α^{-1}) ]²  and  j(θ_g) = [ Γ(1 + 2α^{-1})/Γ(1 + α^{-1})² − 1 ] / [ φΓ(1 + α^{-1}) ]².   (9.11)
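The sandwich variance in (9.11) can be verified by simulation: the exponential MLE is just the sample mean, so T·var(θ̂_T) should approach i(θ_g)^{-1} j(θ_g) i(θ_g)^{-1} rather than the naive i(θ_g)^{-1}. A sketch with hypothetical values α = 0.5 and φ = 1:

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(7)
alpha, phi = 0.5, 1.0                      # hypothetical Weibull parameters
T, reps = 500, 5_000

theta_g = phi * gamma(1 + 1 / alpha)       # best-fitting exponential mean (= 2 here)
X = phi * rng.weibull(alpha, size=(reps, T))
theta_hat = X.mean(axis=1)                 # exponential MLE = sample mean

EX2 = phi ** 2 * gamma(1 + 2 / alpha)
i_g = 1 / theta_g ** 2                     # as in (9.11)
j_g = EX2 / theta_g ** 4 - 1 / theta_g ** 2
print(T * theta_hat.var())                 # empirical variance of sqrt(T)(theta_hat - theta_g)
print(j_g / i_g ** 2)                      # sandwich i^{-1} j i^{-1}; naive i^{-1} would be 4
```

For these values the naive inverse-information variance is 4 while the sandwich is 20, so a confidence interval that ignores the misspecification would be far too narrow.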
We also observe that to check how well the best fitting exponential fits the Weibull distribution for different values of φ and α, we can use the Kullback-Leibler information criterion. That is, with θ_g = φΓ(1+α^{−1}), evaluate

D(g, f_{θ_g}) = −∫ log( θ_g^{−1} exp(−x/θ_g) / [ (α/φ)(x/φ)^{α−1} exp(−(x/φ)^α) ] ) (α/φ)(x/φ)^{α−1} exp(−(x/φ)^α) dx.   (9.12)

We note by using (9.10) that D(g, f_{θ_g}) should be close to zero when α = 1, since then the Weibull is close to an exponential, and we conjecture that this difference should grow the further α is from one. See also Davison (2002), page 147.

Example Question: Suppose {X_t}_{t=1}^T are independent, identically distributed normal random variables with distribution N(µ,σ²), where µ > 0. Suppose that µ and σ² are unknown. A non-central t-distribution with 11 degrees of freedom,

f(x;a) = C(11)( 1 + (x−a)²/11 )^{−(11+1)/2},

where C(ν) is a finite constant which only depends on the degrees of freedom, is mistakenly fitted to the observations. [8]

i) Suppose we construct the likelihood using the t-distribution with 11 degrees of freedom to estimate a. In reality, what is this MLE actually estimating?

ii) Denote the above ML estimator as â_T. Assuming that standard regularity conditions are satisfied, what is the approximate distribution of â_T?

Solution

i) The MLE seeks to estimate the maximum of E[log f(X;a)] with respect to a. Thus for this example â_T is estimating

a_g = argmax_a E[ −6 log( 1 + (X−a)²/11 ) ] = argmin_a ∫ log( 1 + (x−a)²/11 ) (1/σ) φ( (x−µ)/σ ) dx.

ii) Let a_g be defined as above. Then we have

√T( â_T − a_g ) →^D N( 0, J(a_g)^{−1} I(a_g) J(a_g)^{−1} ),

where

I(a_g) = 36 E[ ( (d/da) log(1 + (X−a)²/11) )² ]|_{a=a_g},
J(a_g) = −6 E[ (d²/da²) log(1 + (X−a)²/11) ]|_{a=a_g}.
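A small simulation can illustrate the limit derived in the exponential/Weibull example above: the misspecified exponential MLE (the sample mean) converges to θ_g = φΓ(1+α^{−1}). This is a minimal sketch; the sample size and the parameter values are arbitrary choices.

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(0)
phi, alpha = 2.0, 1.5                    # Weibull scale and shape (arbitrary)
x = phi * rng.weibull(alpha, size=200_000)

# The MLE of the misspecified exponential model f(x; theta) = theta^{-1} exp(-x/theta)
# is the sample mean; its large-sample limit is theta_g = E(X) = phi * Gamma(1 + 1/alpha).
theta_hat = x.mean()
theta_g = phi * gamma(1 + 1 / alpha)
print(theta_hat, theta_g)
```

With 200,000 draws the sample mean should sit within a few standard errors of θ_g.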
Example Question: The random variable X has a Poisson distribution with P(X = k) = θ^k exp(−θ)/k!. Suppose our hapless researcher fits the geometric distribution π(1−π)^k (k = 0, 1, 2, ...) to the data by using the log-likelihood.

i) What quantity is the misspecified maximum likelihood estimator actually estimating?

ii) How well does the best fitting geometric distribution approximate the Poisson distribution?

iii) Given the data, describe a method the researcher can use to check whether the geometric distribution is an appropriate choice of distribution.

Solution

i) The expectation of the geometric log-likelihood with respect to the Poisson measure is

E[ log( π(1−π)^X ) ] = Σ_{k=0}^∞ ( θ^k e^{−θ}/k! )( log π + k log(1−π) ) = log π + E(X) log(1−π) = log π + θ log(1−π).

Hence if we maximise the likelihood with respect to π we are estimating the maximum of the above, which is π_P = 1/(1+θ).

ii) To measure how well the best fitting geometric distribution fits the correct Poisson distribution we use the K-L divergence criterion K(P(θ), G(1/(1+θ))), which is defined as

K(P(θ), G(1/(1+θ))) = E[ log( θ^X exp(−θ)/X! ) ] − E[ log( (1/(1+θ)) (θ/(1+θ))^X ) ].

Now we observe that

E[ log( (1/(1+θ)) (θ/(1+θ))^X ) ] = −log(1+θ) + E(X) log θ − E(X) log(1+θ) = θ log θ − (θ+1) log(1+θ),

and

E[ log( θ^X exp(−θ)/X! ) ] = E(X) log θ − θ − E( log X! ) = θ log θ − θ − Σ_{k=0}^∞ log(k!) θ^k exp(−θ)/k!.
Thus the K-L distance is

K(P(θ), G(1/(1+θ))) = [ θ log θ − θ − Σ_{k=0}^∞ log(k!) θ^k exp(−θ)/k! ] − [ θ log θ − (θ+1) log(1+θ) ]
= (θ+1) log(1+θ) − θ − Σ_{k=0}^∞ log(k!) θ^k exp(−θ)/k!.

iii) Pearson's goodness of fit test can be used.

Example Question: Let us suppose that the random variable X is a mixture of Weibull distributions

f(x;θ) = p (α_1/φ_1)(x/φ_1)^{α_1−1} exp(−(x/φ_1)^{α_1}) + (1−p)(α_2/φ_2)(x/φ_2)^{α_2−1} exp(−(x/φ_2)^{α_2}).

i) Derive the mean and variance of X.

ii) Obtain the exponential distribution which best fits the above mixture Weibull according to the Kullback-Leibler criterion (recall that the exponential is g(x;λ) = (1/λ) exp(−x/λ)).

Solution

i) Define the r.v. δ ∈ {0,1}, where P(δ = 1) = p and P(δ = 0) = 1−p. Therefore, to evaluate the expectation of X we observe that

E(X) = E{E(X|δ)} = pφ_1Γ(1+1/α_1) + (1−p)φ_2Γ(1+1/α_2).

To obtain the variance use either the expression var(X) = E{E(X²|δ)} − [E(E(X|δ))]² or var(X) = E{var(X|δ)} + var[E(X|δ)], to obtain

var(X) = pφ_1²Γ(1+2/α_1) + (1−p)φ_2²Γ(1+2/α_2) − { pφ_1Γ(1+1/α_1) + (1−p)φ_2Γ(1+1/α_2) }².

ii) The Kullback-Leibler criterion is defined as E_f{ log( g(X;λ)/f(X;θ) ) }. The best fitting exponential distribution is given by the λ which maximises

λ_g = argmax_λ E_f{ log( g(X;λ)/f(X;θ) ) } = argmax_λ E_f{ log g(X;λ) };
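The quantities in the Poisson/geometric example are easy to check numerically. The sketch below (with an arbitrary θ) computes π_P = 1/(1+θ) and evaluates the K-L divergence between the Poisson and the best fitting geometric by truncating the sum at a large k.

```python
import numpy as np
from math import lgamma, log

theta = 3.0                           # Poisson mean (arbitrary choice)
pi_g = 1.0 / (1.0 + theta)            # limit of the misspecified geometric MLE

ks = np.arange(0, 200)                # truncation; Poisson(3) mass beyond k = 200 is negligible
log_pois = ks * log(theta) - theta - np.array([lgamma(k + 1.0) for k in ks])
log_geom = log(pi_g) + ks * log(1.0 - pi_g)   # geometric pi(1-pi)^k on k = 0, 1, ...
p = np.exp(log_pois)
kl = float(np.sum(p * (log_pois - log_geom)))
print(pi_g, kl)
```

The divergence is strictly positive, reflecting that no geometric distribution matches a Poisson exactly.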
note that the above is simply the expectation of the misspecified log-likelihood. Thus

E_f{ log g(X;λ) } = E_f{ −X/λ − log λ } = −E_f(X)/λ − log λ.

By differentiating the above with respect to λ we see that it is maximised when

λ = E_f(X) = pφ_1Γ(1+1/α_1) + (1−p)φ_2Γ(1+1/α_2).

Thus the best fitting exponential has its parameter defined by the mean of the distribution:

λ = E(X) = E{E(X|δ)} = pφ_1Γ(1+1/α_1) + (1−p)φ_2Γ(1+1/α_2).
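Since the best fitting exponential scale is simply the mixture mean, this is easy to verify by Monte Carlo; the parameter values below are arbitrary.

```python
import numpy as np
from math import gamma

p, phi1, alpha1, phi2, alpha2 = 0.3, 1.0, 0.8, 4.0, 2.0   # arbitrary mixture parameters

# Best fitting exponential scale under the K-L criterion: the mixture mean.
lam = p * phi1 * gamma(1 + 1 / alpha1) + (1 - p) * phi2 * gamma(1 + 1 / alpha2)

# Monte Carlo check: the exponential MLE (the sample mean) applied to draws
# from the mixture should be close to lam.
rng = np.random.default_rng(1)
n = 100_000
comp = rng.random(n) < p
x = np.where(comp, phi1 * rng.weibull(alpha1, n), phi2 * rng.weibull(alpha2, n))
print(lam, x.mean())
```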
Chapter 10

Model Selection

10.1 Model selection

See also Section 4.7 of Davison (2002).

Over the past 30 years several different methods have been developed for selecting the best model out of a class of candidate models. The typical problem is that we observe the response variable Y_t which satisfies the model

Y_t = Σ_{j=1}^p a_j x_{t,j} + ε_t.

The natural question to ask is how many regressors should be included in the model. Without checking, we are prone to overfitting the model. One method is to use a log-likelihood ratio test, but this does not allow us to compare the fit of various candidate models to the true distribution, i.e. to say one model is better than another. Without a good checking system, one could easily end up with an over-parametrised model. For example, when fitting a Weibull distribution to observations which come from an exponential distribution, it is unnecessary to fit the extra shape parameter.

There are various ways to approach this problem. One of the classical methods is to use an information criterion, for example the AIC, which penalises the number of parameters. There are different methods for motivating an information criterion; here we motivate it through the Kullback-Leibler criterion (which is also a little ad hoc). The basic idea is that the criterion can be split into two parts: the first part measures the model fit (or bias in the model), while the second part measures the variance which is due to the inclusion of several parameters in the model. To simplify the approach we will assume that {X_t} are iid random variables (though this assumption is not necessary; in fact the AIC was first derived for order selection in time series!).
80 Suppose that X t has the distribution gx, but we want to select the best distribution in the family {fx;θ;θ Θ}. Let Iθ g = E 2 logfx;θ 2 θg logfx;θ 2 Jθ g = E θ=θg = 2 logfx;θ = 2 gxdx θg logfx;θ 2 gxdx θg. Given the observations {X t } we would usually use the MLE to obtain the best estimator, ie. ˆθ T X = argmax L TX;θ = argmax θ Θ θ Θ logfx t ;θ, we have included X in the above to show that the mle depends on it. Hence based on these observations the best fitting model is fx t ;ˆθ T X. To measure how well it fits it is natural to compare it to the true density gx using the K-L criterion Dg,fˆθT X = log fx; ˆθ T X gxdx. gx The problem with the above is that a we cannot evaluate it - because g is unknown b it depends on the sample X. However the second part of Dg,fˆθT X is simply loggxgx, and is the same for all candidate distributions, hence it is just a constant and can be ignored. Therefore rather than consider Dg,fˆθT X we can consider Dg,fˆθT X = logfx; ˆθ T Xgxdx instead. We consider the negative value to stick with convention. However, we observe that Dg,fˆθT X still depends on the sample X. Therefore, a more sensible criterion is to consider the expectation of the above over all random samples X E X Dg,fˆθT X = E X E Y logfy;ˆθ T X. Our objective is to estimate E X EY logfy;ˆθ T X. A crude estimator of E X Dg,fˆθT X would be to use 1 T logfx t ;ˆθ T X However, there are several problems with this crude estimator of E X EY logfy;ˆθ T X. Though it is not immediately obvious E X EY logfy;ˆθ T X penalises overfitted models. 80
81 To understand why, recall that a Taylor expansion of E X EY logfy;ˆθ T X about θ g would give E X EY logfy;ˆθ T X E Y logfy;θ g E X ˆθ T X θ g Iθ g ˆθ T X θ g. The second term on the right of the above grows as the number of parameters grow recall it has a χ 2 -distribution where the number of degrees of freedom is equal to the number of parameters. Hence E X EY logfy;ˆθ T X penalises unnecessary parameters which is an advantage. However, the crude estimator in 10.1 does not, in fact it decreases as the number of parameters increase regardless of whether they usefully fit the model. We now look for an approximation which corrects for this bias. We recall that ˆθ T X is an estimator of θ g hence we start by replacing E X Dg,fˆθT X with E X Dg,fθg to give E X Dg,fˆθT X = E X Dg,fθg + E X Dg,fˆθT X E X Dg,fθg. We consider the difference E X Dg,fˆθT X E X Dg,fθg later, and start by focussing on E X Dg,fθg. Since E X Dg,fθg is unknown we replace it by its average E X Dg,fθg 1 T logfx t ;θ g. Hence we have E X Dg,fˆθT X 1 T logfx t ;θ g + E X Dg,fˆθT X E X Dg,fθg. Of course, θ g is unknown so this is replaced by ˆθ T X to give E X Dg,fˆθT X 1 T 1 + T = 1 T logfx t ;ˆθ T X+ 1 T logfx t ;ˆθ T X 1 T E X Dg,fˆθT X E X Dg,fθg logfx t ;θ g logfx t ;ˆθ T X+I 1 +I Since 1 T T logfx t;ˆθ T X is known, we now bound I 1 and I 2. We mention that the terms I 1 and I 2 are both positive. This is because θ g = argmax Dg,f θ = argmin Dg,fθ and ˆθ T = argmax T logfx t;θ. 81
82 We now bound the two differences above. We first note that by using Taylor expansions and the assumptions that E logfx;θ θ=θg = 0 we have I 1 = E X Dg,fˆθT X Dg,f θg 1 { } = E X E Y logfy t ;ˆθ T X logfy t ;θ g T = 1 T E XE Y L T Y,ˆθ T X L T Y,θ g = 1 T E LT Y,θ XE Y θg ˆθ T X θ g + 1 2T E XE Y ˆθ T X θ g 2 L T Y,θ = 1 T E LT Y,θ XE Y θg ˆθ T X θ g + 1 2T E YE X ˆθ T X θ g 2 L T Y,θ = 1 2T E YE X ˆθ T X θ g 2 L T Y,θ θx ˆθ T X θ g. Now we note that 1 T 2 L T Y,θ 2 θx Iθ g, which gives us I 1 = E X Dg,fˆθT X Dg,f θg 1 2 E X ˆθ T X θ g Iθ g ˆθ T X θ g θx ˆθ T X θ g θx ˆθ T X θ g We now obtain an estimator of I 2 in To do this we make the usual Taylor expansion noting that L Tθ θ=ˆθt = 0 I 2 = 1 T logfx t ;θ g 1 T logfx t ;ˆθ T X ˆθ T X θ g Iθ g ˆθ T X θ g To obtain the final approximations for 10.3 and 10.4 we use 9.9 where TˆθT θ g D N 0,Iθ g 1 Jθ g Iθ g 1. Now by using the above and the relationship that if Z N0,Σ then EZ AZ = trace { AΣ }. Therefore by using the above we have 1 I 2 = T logfx t ;θ g 1 T 1 T ˆθ T X θ g Iθ g ˆθ T X θ g 1 Iθ 2T trace g Iθ g 1 Jθ g Iθ g 1 82 logfx t ;ˆθ T X
and

I_1 = E_X( D̃(g, f_{θ̂_T(X)}) − D̃(g, f_{θ_g}) ) ≈ (1/2T) trace( I(θ_g) I(θ_g)^{−1} J(θ_g) I(θ_g)^{−1} ).   (10.5)

Simplifying the above and substituting into (10.2) gives

E_X( D̃(g, f_{θ̂_T(X)}) ) ≈ −(1/T) Σ_{t=1}^T log f(X_t; θ̂_T(X)) + (1/T) trace( J(θ_g) I(θ_g)^{−1} )
= −(1/T) L_T(X; θ̂_T(X)) + (1/T) trace( J(θ_g) I(θ_g)^{−1} ).

Hence to measure the divergence between f(x;θ) and g we can use the approximation

E_X( D̃(g, f_{θ̂_T(X)}) ) ≈ −(1/T) L_T(X; θ̂_T(X)) + (1/T) trace( J(θ_g) I(θ_g)^{−1} ).   (10.6)

We apply the above to the setting of model selection. The idea is that we have a set of candidate models we want to fit to the data, and we want to select the best model. Suppose there are N different candidate families of models. Let {f_p(x;θ_p); θ_p ∈ Θ_p} denote the pth family. Let L_{p,T}(X;θ_p) = Σ_{t=1}^T log f_p(X_t;θ_p) denote the likelihood associated with the pth family, and let θ̂_{p,T} = argmax_{θ_p∈Θ_p} L_{p,T}(X;θ_p) denote the maximum likelihood estimator for the pth family.

In an ideal world we would compare the different families by selecting the family of distributions {f_p(x;θ_p); θ_p ∈ Θ_p} which minimises the criterion E_X( D̃(g, f_{p,θ̂_{p,T}(X)}) ). However, we do not know this quantity, hence we consider the estimator of it given in (10.6). This requires estimators of J(θ_{p,g}) and I(θ_{p,g}); these can easily be obtained from the data, and we denote them by Ĵ_p and Î_p. We then choose the family of distributions which minimises

min_{1≤p≤N} ( −(1/T) L_{p,T}(X; θ̂_{p,T}) + (1/T) trace( Ĵ_p Î_p^{−1} ) ).

In other words, the order we select is p̂, where

p̂ = argmin_{1≤p≤N} ( −(1/T) L_{p,T}(X; θ̂_{p,T}) + (1/T) trace( Ĵ_p Î_p^{−1} ) ).
Often (but not always) in model selection we assume that the true distribution is nested in many of the candidate models. For example, the true model Y_t = α_0 + α_1 x_{t,1} + ε_t belongs to the set of families defined by Y_{t,p} = α_0 + Σ_{i=1}^p α_i x_{t,i} + ε_t. In this case {α_0 + Σ_{i=1}^p α_i x_{t,i} + ε_t; α ∈ R^{p+1}} denotes the pth family of models. Since the true model is nested in most of the candidate models we are in the correctly specified case. Hence we have J(θ_g) = I(θ_g), in which case trace( J(θ_g) I(θ_g)^{−1} ) = trace( I(θ_g) I(θ_g)^{−1} ) = p. In this case (10.6) reduces to selecting the family which minimises

min_{1≤p≤N} ( −(1/T) L_{p,T}(X; θ̂_{p,T}) + p/T ).

We observe that this penalises the number of parameters. The criterion above is called the AIC (Akaike Information Criterion):

AIC(p) = −(1/T) L_{p,T}(X; θ̂_{p,T}) + p/T.

This is one of the first information criteria. There is a bewildering array of other criteria, including the BIC etc., but most are similar in principle and usually take the form

−(1/T) L_{p,T}(X; θ̂_{p,T}) + pen_T(p),

where pen_T(p) denotes a penalty term (there are many, including the Bayes Information Criterion etc.).

Remark Usually the AIC is defined as AIC(p) = −2 L_{p,T}(X; θ̂_{p,T}) + 2p; it is more a matter of preference whether we include the factor 2T or not.

We observe that as the sample size grows, the weight of the penalisation relative to the likelihood declines (since L_{p,T}(X; θ̂_{p,T}) = O(T)). This fact can mean that the AIC can be problematic. Because it does not increase the weight on the parameters as the sample size grows, the AIC can easily overfit and select a model with a larger number of parameters than is necessary. This idea can be formalised, and it can be shown that the AIC is an inconsistent estimator of the true model order (this means that as the sample size grows, it does not select the true model with probability tending to one - see the lemma below).
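For Gaussian linear regression the profile log-likelihood gives −2 log L = T log(RSS/T) + const, so the AIC is easy to compute. A minimal sketch on simulated data (the true model is linear, i.e. polynomial order one; all values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 500
x = rng.uniform(-1, 1, T)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, T)      # true model has polynomial order 1

def aic(order):
    # Profile Gaussian log-likelihood: -2 log L = T log(RSS/T) + const, so
    # AIC(order) = T log(RSS/T) + 2 * (number of parameters).
    X = np.vander(x, order + 1)                   # polynomial regressors up to `order`
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return T * np.log(rss / T) + 2 * (order + 2)  # coefficients plus the variance

aics = {order: aic(order) for order in range(6)}
best = min(aics, key=aics.get)
print(aics, best)
```

With a strong slope and moderate noise the AIC should clearly reject the constant model, though (as the lemma below shows) it may select an order larger than one.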
Another information criterion is the BIC (this can be obtained using a different reasoning), and is defined as

BIC(p) = −2 L_{p,T}(X; θ̂_{p,T}) + p log T.

The AIC does not place much weight on the number of parameters, whereas the BIC does place a large weight on the parameters. It can be shown that the BIC is a consistent estimator of the model order, so long as the true model is in the class of candidate models. However, it does have a tendency to underfit (selecting a model with too few parameters). On the other hand, in the case that the true model does not belong to any of the families, the AIC can be a more suitable criterion than other criteria.

Lemma (Inconsistency of the AIC) Suppose that we are in the correctly specified case and θ_p is the true model, hence the true model has order p. Then for any q > 0 we have that

lim_{T→∞} P( argmin_{1≤n≤p+q} AIC(n) > p ) > 0,

moreover

lim_{T→∞} P( argmin_{1≤n≤p+q} AIC(n) = p ) < 1.

In other words, the AIC will with positive probability choose a larger order model, and is more likely to select large models as the order q increases.

PROOF. To prove the result we note that the (p+q)-order model will be selected over the p-order model by the AIC if −L_{p+q,T} + (p+q) < −L_{p,T} + p, in other words we select p+q if

L_{p+q,T} − L_{p,T} > q.

Hence

P( argmin_{1≤n≤p+q} AIC(n) > p ) ≥ P( 2(L_{p+q,T} − L_{p,T}) > 2q ).

But we recall that L_{p+q,T} and L_{p,T} are both log-likelihoods, and under the null that the pth order model is the true model we have 2(L_{p+q,T} − L_{p,T}) →^D χ²_q. Since E(χ²_q) = q, we have for any q > 0 that

P( argmin_{1≤n≤p+q} AIC(n) > p ) ≥ P( 2(L_{p+q,T} − L_{p,T}) > 2q ) → P( χ²_q > 2q ) > 0.

Hence with a positive probability the AIC will choose the larger model.
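The overfitting probability P(χ²_q > 2q) in the proof can be evaluated exactly. For an even number of degrees of freedom the χ² survival function has a closed form (via the Poisson cdf), so a short stdlib-only sketch suffices:

```python
from math import exp

def chi2_sf_even(x, df):
    # Survival function of chi^2 with an even number of degrees of freedom:
    # P(chi^2_{2m} > x) = exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
    m = df // 2
    term, total = 1.0, 0.0
    for j in range(m):
        total += term
        term *= (x / 2.0) / (j + 1)
    return exp(-x / 2.0) * total

# P(chi^2_q > 2q): the limiting probability that the AIC prefers q extra parameters.
for q in (2, 4, 10):
    print(q, chi2_sf_even(2 * q, q))
```

The probabilities decrease in q but never reach zero, which is exactly the inconsistency the lemma describes.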
86 This means as the sample size T grows, with a positive probability we will not necessarily select the correct order p, hence the AIC is inconsistent and lim P arg min AICn = p 1. T 1 n p+q Example: Logistic and model selection This example considers model selection for logistic regression, which is covered later in this course. Example Example: Suppose that {Y i } are independent binomial random variables where Y i Bn i,p i. The regressors x 1,i,...,x k,i are believed to influence the probability p i through the logistic link function where p < q. log p i 1 p i = β 0 +β 1 x 1,i +β p x p,i +β p+1 x p+1,i +...+β q x q,i, a Suppose that we wish to test the hypothesis against the alternative H 0 : log H 0 : log p i 1 p i = β 0 +β 1 x 1,i +β p x p,i p i 1 p i = β 0 +β 1 x 1,i +β p x p,i +β p+1 x p+1,i +...+β q x q,i. State the log-likelihood ratio test statistic that one would use to test this hypothesis. If the null is true, state the limiting distribution of the test statistic. b Define the model selection criterion where C is a finite constant, L T,d β d = M n d = 2L n ˆβ d 2Cd Y i β d x id n i log1+expβ d x id+ x id = x 1,i,...,x d,i and ˆβ d = argmax βd L T,d β d. We use ˆd = argmax d M n d as an estimator of the order of the model. Suppose that H 0 defined in part 2a is true, use your answer in 2a to explain whether ni Y i, the model selection criterion M n d consistently estimates the order of model. 86
Solution:

a) The likelihood for both hypotheses is

L_{T,d}(β_d) = Σ_i ( Y_i β_d' x_{id} − n_i log(1 + exp(β_d' x_{id})) + log binom(n_i, Y_i) ).

Thus the log-likelihood ratio test statistic is

T_T = 2( L_{T,q}(β̂_q) − L_{T,p}(β̂_p) ) = 2 Σ_i ( Y_i (β̂_A − β̂_0)' x_i − n_i [ log(1 + exp(β̂_A' x_i)) − log(1 + exp(β̂_0' x_i)) ] ),

where β̂_0 and β̂_A are the maximum likelihood estimators under the null and alternative respectively. If the null is true, then T_T →^D χ²_{q−p} as T → ∞.

b) Under the null we have that T_T = 2( L_{T,q}(β̂_q) − L_{T,p}(β̂_p) ) →^D χ²_{q−p}. By definition, if d̂ = argmax_d M_n(d), then the criterion selects order q over the true order p when M_n(q) − M_n(p) > 0, i.e. when

2( L_{T,q}(β̂_q) − L_{T,p}(β̂_p) ) > 2C(q − p).

Now the LLRT result states that under the null 2( L_{T,q}(β̂_q) − L_{T,p}(β̂_p) ) →^D χ²_{q−p}, thus roughly speaking we can say that

P( 2( L_{T,q}(β̂_q) − L_{T,p}(β̂_p) ) > 2C(q−p) ) → P( χ²_{q−p} > 2C(q−p) ).

As the above is a positive probability, this means that the model selection criterion will select model q over the true smaller model with a positive probability. This argument holds for all q > p; thus the model selection criterion M_n(d) does not consistently estimate d.
where the number of possible regressors {x_{t,j}} is extremely large. In this case, evaluating the MLE for all the p different candidate models, and then making a comparison, can take a huge amount of computational time. In the past 10 years there has been a lot of work on alternative methods of model selection. One such method is called the LASSO: rather than estimating each model individually, parameter estimation is done on the large model using a penalised version of the (negative log-)likelihood

−L_T(θ) + λ Σ_{i=1}^p |θ_i|.

The hope is that by including the penalty λ Σ_{i=1}^p |θ_i| in the likelihood, many of the coefficients of the regressors will be set to zero (or near zero). Since the introduction of the LASSO in 1996, many variants of the LASSO have been proposed, and the LASSO has been applied to several different situations.
Chapter 11

Bootstrap methods

11.1 Asymptotic normality of the sample mean and Edgeworth expansions

So far we have concentrated on showing asymptotic normality for anything under the sun. But this result, as the name suggests, is asymptotic. For finite samples, indeed small samples, this approximation can be quite poor. We recall that if the original data appears to have the following features:

i) Skewness (non-zero third order cumulant).

ii) Thick tails (described by the kurtosis κ_4/σ⁴, where κ_4 is the fourth order cumulant of the random variable),

then the normality result only sets in with very large sample sizes. Heuristically, this is because these features of the population distribution will influence the sampling distribution; the further the population distribution is from normal, the further the sampling distribution will be too.

The above is rather heuristic; it is natural to ask how the above affects the rate of convergence of the distribution of T^{−1/2} Σ_{t=1}^T X_t to normality. To answer this question we recall how we usually prove normality of the sample average T^{−1/2} Σ_{t=1}^T X_t.

Remark (A quick review of cumulants) Suppose the random variable Y has distribution function F (density f); the characteristic function of Y is defined as the Fourier transform

χ_Y(t) = ∫ exp(itx) dF(x) = ∫ exp(itx) f(x) dx.

The rth order cumulant of a random variable is the coefficient of (it)^r/r! in the series expansion of log χ_Y(t). Since cumulants are derived from the logarithm of the characteristic function, they
can be represented in terms of its moments, e.g. κ_1 = E(X), κ_2 = E(X²) − E(X)². For the normal distribution all the cumulants greater than order 2 are zero (this is unique to the normal distribution, and is usually how we prove normality of an estimator).

The joint cumulants (similar to joint moments) of a multivariate random variable can be derived in a similar way: κ_r(Y_1,...,Y_r) is the joint cumulant of Y_1,...,Y_r and is the coefficient of t_1···t_r in the expansion of the log of the characteristic function of the joint distribution of Y_1,...,Y_r. More generally, if (Y_1,...,Y_r) = (X_1,...,X_1,...,X_n,...,X_n), where X_s is repeated r_s times and r_1 + ... + r_n = r, then κ_r(Y_1,...,Y_r) is the coefficient of Π_{s=1}^n t_s^{r_s}/r_s! in the expansion of the log of the characteristic function of X_1,...,X_n. It is interesting to note that if at least one of the random variables in (Y_1,...,Y_r) is independent of the rest, then κ_r(Y_1,...,Y_r) = 0.

The usual method to show normality is to represent the characteristic function of T^{−1/2} Σ_{t=1}^T X_t as a function of the cumulants, and show that all cumulants above the second order cumulant (which is the variance) converge to zero. More precisely, the characteristic function (which is the Fourier transform of the density) of a random variable Y, with mean zero and variance one, is approximately

χ_Y(t) = exp( −t²/2 + (it)³κ_3(Y)/3! + (it)⁴κ_4(Y)/4! + ... ) = exp(−t²/2) exp( (it)³κ_3(Y)/3! + (it)⁴κ_4(Y)/4! + ... ),   (11.1)

where κ_r(Y) denotes the rth cumulant of Y (when this approximation is valid is beyond this course - see the recommended books for details). We recall that for standard normal random variables the characteristic function is χ_Y(t) = exp(−t²/2).

Now let us consider what this means for the distribution of T^{−1/2} Σ_{t=1}^T X_t. To make notation easier, we will standardise, and consider the distribution of S_T = T^{1/2}(X̄ − µ)/σ, where X_t has mean µ and variance σ².
Now by expanding the cumulants (which is like expanding the variance of sums of random variables) we have

κ_r(S_T) = T^{−r/2} Σ_{s_1,...,s_r=1}^T κ_r( (X_{s_1}−µ)/σ, ..., (X_{s_r}−µ)/σ ) = T^{−r/2} Σ_{s=1}^T κ_r( (X_s−µ)/σ ) = T^{−r/2+1} κ_r( (X−µ)/σ ),   (11.2)
since {X_t} are iid random variables. We note that for r > 2, κ_r((X−µ)/σ) = κ_r(X)/σ^r, and we denote by κ_r the rth order cumulant of the standardised random variable (X−µ)/σ.

Now we obtain the characteristic function of the sample mean S_T. By substituting (11.2) into (11.1) we have

χ_{S_T}(t) = exp(−t²/2) exp( (it)³ T^{−1/2} κ_3/3! + (it)⁴ T^{−1} κ_4/4! + ... ).   (11.3)

Since T^{−r/2+1} κ_r → 0 as T → ∞, we have that χ_{S_T}(t) → exp(−t²/2), which is the characteristic function of the normal distribution. It can be shown that if the characteristic function converges to the characteristic function of a normal, then the distribution must converge to the normal. Hence we have the CLT - though we note that we do not require all moments to exist; in fact it is sufficient that only the second moment exists. However, for this heuristic discussion we shall assume that at least four moments exist.

The above is a heuristic proof of the CLT, but already we get a feeling of how fast this convergence should be. The leading term in the above expansion is (it)³ T^{−1/2} κ_3/3!, which suggests that the error in the normal approximation should be of order T^{−1/2}, i.e. P(S_T ≤ x) − Φ(x) = O(T^{−1/2}), where Φ is the cumulative distribution function of the standard normal. Indeed this is the case, and it can be shown by using an Edgeworth expansion. The Edgeworth expansion is effectively an expansion of the distribution G_T(x) = P(S_T ≤ x) in terms of the normal distribution and higher order terms. To see how it arises, use the series expansion of exp( (it)³T^{−1/2}κ_3/3! + (it)⁴T^{−1}κ_4/4! + ... ) to rewrite χ_{S_T}(t) as

χ_{S_T}(t) = exp(−t²/2) exp( (it)³T^{−1/2}κ_3/3! + (it)⁴T^{−1}κ_4/4! + ... ) = exp(−t²/2)( 1 + T^{−1/2} r_1(it) + T^{−1} r_2(it) + ... ),

where

r_1(s) = κ_3 s³/6,  r_2(s) = κ_4 s⁴/24 + κ_3² s⁶/72.

Now, we recall that the characteristic function is the Fourier transform of the distribution function; hence inverting the Fourier transform of the above we have

G_T(x) = P(S_T ≤ x) = Φ(x) + T^{−1/2} p_1(x)φ(x) + T^{−1} p_2(x)φ(x) + T^{−3/2} p_3(x)φ(x) + ...,   (11.4)

where Φ is the distribution function of the standard normal, φ(x) is the standard normal density and

p_1(x) = −κ_3(x²−1)/6 and p_2(x) = −x( κ_4(x²−3)/24 + κ_3²(x⁴−10x²+15)/72 ).
Hence, as expected, we have

P(S_T ≤ x) − Φ(x) = T^{−1/2} p_1(x)φ(x) + T^{−1} p_2(x)φ(x) + ....

Therefore, the error in this approximation is of order O(T^{−1/2}) if there is a skew in the distribution of X_t, and of order O(T^{−1}) if there isn't a skew. The technical details can be found in Hall (1992), Chapter 2.

This means that when we construct confidence intervals there will be some errors. For example, if we construct a 95% CI for the mean using the normal approximation, it may in reality be less than a 95% CI. Now, in the same way that the errors in the probabilities under the normal approximation can be calculated by using an Edgeworth expansion, so can the errors in the quantiles which give the CI, by using what is known as a Cornish-Fisher expansion. To understand this, let us recall what the CI for the mean µ actually means. We recall that if we want to construct a 95% CI we try to find the 2.5% and 97.5% quantiles, ξ_{0.025} and ξ_{0.975}, such that

P( ξ_{0.025} ≤ √T(X̄ − µ)/σ ≤ ξ_{0.975} ) = 0.95,

where ξ_α corresponds to the α quantile of the distribution G_T, which is the distribution of √T(X̄ − µ)/σ. Since √T(X̄ − µ)/σ is asymptotically normally distributed, we approximate ξ_{0.025} and ξ_{0.975} with z_{0.025} and z_{0.975} (which are −1.96 and 1.96); i.e. we approximate the true CI for the mean,

[ X̄ + ξ_{0.025} σ/√T, X̄ + ξ_{0.975} σ/√T ],

with its normal approximation

[ X̄ + z_{0.025} σ/√T, X̄ + z_{0.975} σ/√T ].

Hence it is interesting to see (a) how close z_α and ξ_α are and (b) what the difference between the two CIs is. The Edgeworth expansion can be inverted to go from probabilities to quantiles, and it can be shown (see Hall (1992), Chapter 2.5) that

ξ_α = z_α + T^{−1/2} p̃_1(z_α) + T^{−1} p̃_2(z_α) + ...,

where

p̃_1(z) = −p_1(z),  p̃_2(z) = p_1(z)p_1'(z) − z p_1(z)²/2 − p_2(z).
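The size of the leading O(T^{−1/2}) term above can be sketched numerically. For a standardised exponential, κ_3 = 2, so the magnitude of the first correction at x = 1.96 is |p_1(x)|φ(x)/√T:

```python
from math import sqrt, exp, pi

kappa3 = 2.0                                    # third cumulant of a standardised exponential
x = 1.96
phi_x = exp(-x * x / 2.0) / sqrt(2.0 * pi)      # standard normal density at x
p1_mag = kappa3 * (x * x - 1.0) / 6.0           # |p_1(x)| = |kappa_3| (x^2 - 1) / 6

errs = [p1_mag * phi_x / sqrt(T) for T in (10, 50, 200, 1000)]
print(errs)   # the error of the normal approximation shrinks like T^{-1/2}
```

Even at T = 50 the leading error at the 97.5% point is close to one percentage point, which explains why normal-approximation CIs can undercover for skewed data.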
11.2 The Bootstrap and why it works

For a review of many applications of the bootstrap see Efron and Tibshirani (1993). For the theory behind the bootstrap see the books by Hall (1992), van der Vaart (2000), Lahiri (2003) and Politis and Romano.

11.2.1 The Bootstrap methodology

The heuristics above give us an explanation as to why the asymptotic normality approximation may not be particularly good for small samples. The bootstrap is a form of sampling from the data which tries to capture features in the distribution which the over-simplified normal approximation cannot. Resampling methods have been in the statistical literature for over 50 years. However, it was Efron who proposed the bootstrap as it is today, and really brought to attention its importance in solving various statistical problems. The bootstrap is a tool which allows us to obtain better finite sample approximations of the distributions of estimators. The bootstrap is used all over the place: to estimate the variance, correct bias, construct CIs etc. There are many, many different types of bootstrap. Here we describe two simple versions of the bootstrap for constructing CIs. They can be roughly described as the nonparametric bootstrap and the parametric bootstrap (in my opinion the nonparametric bootstrap is more flexible).

The nonparametric bootstrap confidence interval for the mean

We will assume that {X_t} are iid random variables with mean µ, variance σ² and that the fourth moment exists. To simplify the explanation we will assume the variance of {X_t} is known. All the sampling properties of the bootstrap procedure that we describe also hold when the variance is unknown (in which case we need to use what is called the studentised bootstrap; however, more sophisticated techniques and greater care have to be used to prove the results). Let us consider the sample mean X̄ = (1/T) Σ_{t=1}^T X_t.
As we mentioned above, asymptotically the distribution of √T(X̄ − µ)/σ is normal, and the asymptotic (1−α)100% confidence interval for the mean µ is

[ X̄ + z_{α/2} σ/√T, X̄ + z_{1−α/2} σ/√T ].

But we want to obtain a better approximation of the true confidence interval

[ X̄ + ξ_{α/2} σ/√T, X̄ + ξ_{1−α/2} σ/√T ],

where ξ_α is the α quantile of the distribution G_T, which is the actual distribution of √T(X̄ − µ)/σ. However, we can obtain an estimator of G_T. We recall that we observe the iid
random variables {X_t}_{t=1}^T, where the distribution function of X_t is F. In the nonparametric bootstrap we do not know the distribution F, but we can estimate it with the empirical distribution function

F_T(x) = (1/T) Σ_{t=1}^T I(X_t ≤ x);

we note that though F_T(x) is random (since it depends on the sample), it is a proper distribution function, with mean X̄ and variance σ̂² = (1/T) Σ_{t=1}^T (X_t − X̄)².

Now we recall that G_T(x) is basically the distribution of T^{−1/2} Σ_{t=1}^T (X_t − µ)/σ, where the X_t are independent draws from the unknown distribution F. Hence, if we want to estimate G_T(x) and do not have F to sample from, it is natural to sample from the distribution that we do have available, which is F_T. We use the following algorithm:

i) We sample T independent times from F_T to obtain the bootstrap sample X*_{T,1} = (X*_{1,1},...,X*_{T,1}). Using this we obtain the bootstrap estimator of the mean, X̄*_{T,1} = (1/T) Σ_{t=1}^T X*_{t,1}. As this is a sample from the empirical distribution function F_T, the mean of X̄*_{T,1} is the mean of F_T, which is X̄ (recall that the mean of X̄ is µ, the mean of the distribution F). We note this is equivalent to drawing from {X_1,...,X_T} T times with replacement.

ii) We do this multiple times. In fact one can draw T^T different samples. For each bootstrap sample we calculate the sample mean, so that we have {X̄*_{T,1},..., X̄*_{T,n}}, where n = T^T. Based on this we can construct the bootstrap estimator of the distribution G_T(x), which is

Ĝ_T(x) = (1/T^T) Σ_{k=1}^{T^T} I( √T( X̄*_{T,k} − X̄ )/σ̂ ≤ x );

we use X̄ and σ̂/√T in the definition of Ĝ_T because these are the mean and standard deviation of the bootstrap sample mean based on sampling from F_T. Now, if F_T(x) were the true distribution of X_t, then Ĝ_T(x) = G_T(x). Of course it is not, so Ĝ_T(x) is only an estimator of G_T. In reality it may not be possible to obtain all T^T samples (this is a lot!), but we sample enough times, and in a good way, to obtain a good enough approximation of Ĝ_T(x). We will assume that we can obtain Ĝ_T(x).
iii) Since Ĝ_T(x) is an estimator of G_T we can obtain an estimator of the quantiles, and thus use it to obtain an estimator of the CIs (and hope that it is more accurate than the standard normal approximation). Let ξ̂_α be such that Ĝ_T(ξ̂_α) = α.
iv) The 95% bootstrap CI for the mean µ is [ X̄ + ξ̂_{0.025} σ/√T, X̄ + ξ̂_{0.975} σ/√T ].

The parametric bootstrap confidence interval of an estimator

Let us suppose {X_t} are iid random variables with distribution f(·;θ_0), where the parameter θ_0 is unknown. Suppose we use the MLE to estimate the parameter θ_0, which we denote as θ̂_T. We know that if all the regularity conditions are satisfied then we have √T(θ̂_T − θ_0) →^D N(0, I(θ_0)^{−1}), where

I(θ_0) = ∫ ( (d/dθ) log f(x;θ)|_{θ=θ_0} )² f(x;θ_0) dx.

Of course this is an asymptotic result. If the sample size is small, we may want to obtain a better finite sample approximation of the distribution of √T(θ̂_T − θ_0), to construct better CIs. Let G_T denote the distribution of √T(θ̂_T − θ_0).

i) We sample T independent times from the distribution f(x; θ̂_T), and for each bootstrap sample X*_{T,1} = (X*_{1,1},...,X*_{T,1}) we construct the bootstrap MLE θ̂*_{T,1}. We do this many times; we denote the kth bootstrap estimator as θ̂*_{T,k}.

ii) Unlike the nonparametric bootstrap, there is an infinite number of draws one can make; hence we cannot construct an estimator of G_T using all possible draws from f(x; θ̂_T). But one can construct an estimate of the finite sample distribution of √T(θ̂_T − θ_0) using a large number n of draws. Let

Ĝ_T(x) = (1/n) Σ_{k=1}^n I( √T( θ̂*_{T,k} − θ̂_T ) ≤ x ).

iii) Let ξ̂_α be such that Ĝ_T(ξ̂_α) = α. The 95% bootstrap CI for θ_0 is

[ θ̂_T + T^{−1/2} ξ̂_{0.025}, θ̂_T + T^{−1/2} ξ̂_{0.975} ].

An alternative way to construct the CI is to use the likelihood ratio test. We recall that if f(·;θ_0) is the true distribution then 2( L_T(θ̂_T) − L_T(θ_0) ) →^D χ²_p, and we can use this result to construct the 100(1−α)% CI (see the section on confidence intervals). But if the sample size is small and we believe that the normality result is a poor approximation, then the chi-squared result would also be a poor approximation; we can use a bootstrap method instead. In this case, for every bootstrap estimator θ̂*_{k,T}, we plug it into the log-likelihood

L_T(θ̂*_{k,T}) = Σ_{t=1}^T log f(X_t; θ̂*_{k,T}).
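Steps (i)-(iii) of the parametric bootstrap above can be sketched for the exponential scale parameter, whose MLE is the sample mean (the sample size, number of resamples and true value below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
theta0 = 2.0
x = rng.exponential(theta0, size=40)
T = x.size
theta_hat = x.mean()                  # exponential MLE

B = 5000
boot = np.empty(B)
for b in range(B):
    xs = rng.exponential(theta_hat, size=T)         # sample from f(x; theta_hat)
    boot[b] = np.sqrt(T) * (xs.mean() - theta_hat)  # bootstrap version of sqrt(T)(theta_hat - theta0)

xi_lo, xi_hi = np.quantile(boot, [0.025, 0.975])    # estimates of xi_{0.025}, xi_{0.975}
ci = (theta_hat + xi_lo / np.sqrt(T), theta_hat + xi_hi / np.sqrt(T))
print(ci)
```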
We can then construct an estimator of the distribution function of 2(L_T(θ̂_T) − L_T(θ_0)), which we denote by H_T, as

Ĥ_T(x) = (1/n) Σ_{k=1}^n I( 2(L_T(θ̂_T) − L_T(θ̂_{k,T})) ≤ x ).

Let ξ̂_α be such that Ĥ_T(ξ̂_α) = α. The 100(1−α)% CI for θ based on the log-likelihood ratio is

{ θ : L_T(θ) ≥ L_T(θ̂_T) − ξ̂_{1−α}/2 }.

In reality the parametric bootstrap is not used as much as the nonparametric bootstrap. The main reason is that in the misspecified case the CIs produced have no meaning: they are incorrect, and will not even converge to the CIs produced by the normal approximation with the misspecification-robust variance I(θ_g)^{−1} J(θ_g) I(θ_g)^{−1}.

Using Edgeworth expansions to show why the nonparametric bootstrap works

On first reading the bootstrap may seem a little like magic, but really it is not. We recall that G_T is the distribution of the standardised sample mean √T(X̄ − µ)/σ based on sampling from F. Since in reality F is unobserved and can only be estimated by the empirical distribution function F̂_T, it does not seem unnatural that Ĝ_T can be used as an estimator of G_T. We first state a consistency result; the proof can be found in various places, see for example Hall (1992) or van der Vaart (1998). There are different ways this can be proven; in the more complex setting where we are not estimating the mean, the Mallows distance (a measure of the distance between distributions) may be the most appropriate tool for proving the result.

Theorem (Consistency). Suppose that E(X_t^4) < ∞. Then

√T(X̄*_T − X̄)/σ̂ →D N(0,1),

where X̄*_T denotes the bootstrap sample mean, noting that √T(X̄ − µ)/σ →D N(0,1).

The value of the above result is that it shows the bootstrap distribution Ĝ_T converges to the standard normal, just as G_T converges to the standard normal. Hence we do not lose by using the bootstrap approximation for the CIs. We now show what we can gain by using the bootstrap. Let us recall (11.4):

G_T(x) = P(S_T ≤ x) = Φ(x) + T^{−1/2} p_1(x)φ(x) + T^{−1} p_2(x)φ(x) + T^{−3/2} p_3(x)φ(x) + ...,
97 where Φ is the distribution of the standard normal, φx is the standard normal density and p 1 x = 1 6 κ 3x 2 1 and p 2 x = x{ 1 24 κ 4x κ2 3x 4 10x 2 +15}. We now rewrite the above results in terms of the underlying distribution of the random variables. Let us suppose the distribution of the iid random variables {X t } is F. Then rewriting the above we have G T x = PS T x F = Φx+ 1 T 1/2p 1x Fφx+ 1 T p 2x Fφx+..., 11.6 where p 1 x F,p 2 x F etc. are the polynomials, whose coefficients are determined by the cumulants p 1 x F = 1 X µf 6 κ 3 F x 2 1 σf { 1 X µf p 2 x F = x 24 κ 4 F x X µf σf 72 κ 2 } 3 F x 4 10x 2 +15, σf µf = E F X, σf 2 = E F X 2 E F X 2, X µf κ 3 F σf and E F X = xdfx. κ 3 X F = E F X 3 E F X 3 κ 4 X F = E F X 4 3E F X 2 2 3EXEX 3 +E F X 4 X µf = σf 3/2 κ 3 X F κ 4 F = σf 2 κ 4 X F σf This leads us to something rather fascinating. We recall that the bootstrap distribution Ĝ T x is an approximation of the finite sample distribution G T. G T is determined by the measure F and the bootstrap distribution is based entirely on the random measure ˆF T. Hence conditioning on the distribution ˆF T by using 11.6, we have the Edgeworth expansion of the random measure ĜTx conditioned on ˆF T which is Ĝ T x = PS T x ˆF T = Φx+ 1 T 1/2ˆp 1x ˆF T φx+ 1 T ˆp 2x ˆF T φx+..., where p 1 x ˆF T,p 2 x ˆF T, are random and given by p 1 x ˆF T = 1 6 σˆf T 3/2 κ 3 X ˆFT x 2 1 = 1 6 ˆσ 3/2ˆκ 3 x 2 1 { 1 p 2 x ˆF T = x 4 x 24ˆσ 2ˆκ ˆσ 3ˆκ 1 } 2 3x 4 10x 2 +15, since the mean of ˆF T is X the variance of ˆF T is ˆσ 2, the rth order cumulant of ˆF T is the empirical cumulant ˆκ r. 97
98 Remark Since ˆF T x is the distribution function of a discrete random variables, which gives the weight 1/T to the event X t and zero otherwise we see that X r = 1 X r EˆFT n t, hence we obtain that the mean with respect to ˆF T is X etc. Hence comparing the above with 11.6 we have r where G T x = PS T x F = Φx+ 1 T 1/2p 1x Fφx+ 1 T p 2x Fφx+..., Ĝ T x = PS T x ˆF T = Φx+ 1 T 1/2ˆp 1x ˆF T φx+ 1 T ˆp 2x ˆF T φx+..., ˆp 1 x ˆF T = 1 6 ˆσ 3/2ˆκ 3 x 2 1 { 1 ˆp 2 x ˆF T = x 4 x 24ˆσ 2ˆκ ˆσ 3ˆκ 1 } 2 3x 4 10x p 1 x F = 1 6 σ 3/2 κ 3 x 2 1 { 1 p 2 x F = x 24 σ 2 κ 4 x } 72 σ 3 κ 2 3x 4 10x Therefore taking differences gives G T x ĜTx = 1 T 1/2 p 1 x F ˆp 1 x ˆF T φx+ 1 T p 2 x F ˆp 2 x ˆF T φx Now this is whether the bootstrap distribution becomes very useful, we recall see that p 1 x F contains the third order cumulant, whereas ˆp 1 x ˆF T contains it s estimator. We recall that X µ = O p T 1/2, ˆσ σ = O p T 1/2, the same is true for ˆκ 3 and ˆκ 4, that is ˆκ 3 κ 3 = O p T 1/2, κ 4 ˆκ 4 = O p T 1/2. Substituting this into 11.7 leads to G T x ĜTx = O p 1 T. Let us now compare this result with the normal approximation in This gives us G T x Φx = O p 1 T 1/2 Hence we observe that the bootstrap distribution ĜTx leads to a better approximation of the finite sample distribution than the normal approximation. Now by using the Cornish-Fisher expansions one can show that ˆξ α ξ α = O p T 1 98
compared with ξ_α − z_α = O(T^{−1/2}) for the normal quantile. Hence the confidence intervals constructed using the bootstrap are more accurate than the CIs based on the normal approximation.

Remark i In the case that σ is unknown, a similar result applies, but the calculations become more complicated. However, it is always better to try to transform the estimator into a quantity which is asymptotically pivotal. We recall that a statistic is asymptotically pivotal if its limiting distribution does not depend on the parameters. Asymptotically, the distribution of √T(X̄ − µ) depends on the variance σ². If we bootstrap √T(X̄ − µ) instead of √T(X̄ − µ)/σ and the variance σ² is unknown, we may not gain in terms of approximation accuracy.

ii Please observe that the calculations above are heuristic, since we have not given conditions under which the expansions hold.

iii Similar arguments to those given above can also be applied to bootstrapping other parameters θ besides the mean. They can also be generalised to dependent data and to far more complicated situations.
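As a complement to remark (i), the following sketch (our own illustration, not from the notes) bootstraps the studentised, asymptotically pivotal statistic √T(X̄* − X̄)/σ̂* nonparametrically; the data and all variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(2.0, size=40)   # skewed data, where the normal approximation is crude
T, xbar, s = len(x), x.mean(), x.std(ddof=1)

n_boot = 5000
boot = np.empty(n_boot)
for k in range(n_boot):
    xs = rng.choice(x, size=T, replace=True)                    # resample from F_T
    boot[k] = np.sqrt(T) * (xs.mean() - xbar) / xs.std(ddof=1)  # studentised => pivotal

xi_lo, xi_hi = np.quantile(boot, [0.025, 0.975])
# bootstrap CI for the mean, of the same form as in the text
ci = (xbar + xi_lo * s / np.sqrt(T), xbar + xi_hi * s / np.sqrt(T))
```

Because the statistic is studentised inside each resample, its bootstrap law does not hinge on the unknown σ², which is what makes the higher-order refinement of the Edgeworth argument available.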
Chapter 12

A short description of the empirical distribution

12.1 The empirical distribution and the nonparametric likelihood

See Owen (2001), Empirical Likelihood, for more details. We note that the empirical distribution function is often called the nonparametric maximum likelihood estimator. We now show that if the random variable T takes only discrete values, i.e. values in {t_s}, then the empirical distribution is the maximum likelihood estimator. We then extend this argument to the case of continuous random variables.

Lemma. Suppose {T_i}_{i=1}^n are iid random variables which can only take the discrete values {t_s}_{s=1}^m. Let n_s denote the number of occurrences of t_s. Then the likelihood of {T_i} is

L_n(π_1,...,π_m) = (n choose n_1,...,n_m) ∏_{s=1}^m π_s^{n_s},

so the log-likelihood is

ℒ_n(π_1,...,π_m) = log(n choose n_1,...,n_m) + Σ_{s=1}^m n_s log π_s.

The maximum likelihood estimators of {π_s} subject to the constraint Σ_{s=1}^m π_s = 1 are π̂_s = n_s/n. Equivalently, the distribution function which maximises the above likelihood is the empirical distribution function

F̂_n(x) = (1/n) Σ_{i=1}^n I(T_i ≤ x).

PROOF. It is clear that if P(T = t_s) = π_s, then up to a constant the log-likelihood of {T_i} is

Σ_{i=1}^n Σ_{s=1}^m I(T_i = t_s) log π_s = Σ_{s=1}^m n_s log π_s,
which gives the required likelihood. We need to maximise the above with respect to the parameters {π_s}. However, since {π_s} are probabilities, Σ_{s=1}^m π_s = 1, so the maximisation needs to be done under this constraint. There are various ways this can be done; one method is to substitute the constraint into the last probability (as was done in an earlier section), but a more general method is to use Lagrange multipliers. That is, we maximise the log-likelihood ℒ_n(π_1,...,π_m) subject to the constraint Σ_{s=1}^m π_s = 1 by adding an additional term involving the dummy variable λ, defining the constrained likelihood

ℒ_n(π_1,...,π_m,λ) = Σ_{s=1}^m n_s log π_s + λ(1 − Σ_{s=1}^m π_s).

Now, maximising the above with respect to π_1,...,π_m,λ, we obtain the maximum likelihood estimators π̂_s = n_s/n. We observe that these are the probabilities associated with the empirical distribution function F̂_n(x) = (1/n) Σ_{i=1}^n I(T_i ≤ x). We mention that if we do not observe the event t_i, then the mle of the probability of its occurrence is zero, which is obvious.

We now show a very similar result for continuous random variables. To show this result, consider a candidate distribution G(x) and define the probability P(T = x) = G(x) − G(x−), where G(x−) = P(X < x); for discrete random variables P(T = x) is a genuine probability, while for continuous random variables P(T = x) = g(x)δx, which is not so well defined. Let us suppose we observe {T_i}. Using this notation, it is clear that the likelihood of {T_i} with respect to the candidate distribution G is

L_n(G) = ∏_{i=1}^n ( G(T_i) − G(T_i−) ).

Let z_1 < z_2 < ... < z_m be the distinct values which the sample {T_i}_{i=1}^n takes (clearly m ≤ n) and let n_s denote the number of occurrences of z_s. We now show that the likelihood L_n(G) is maximised when P(T = z_s) = n_s/n (often n_s = 1), hence F̂_n(x) = (1/n) Σ_{i=1}^n I(T_i ≤ x).
To show this, we will show that for any candidate distribution function G we have log L_n(G) − log L_n(F̂_n) ≤ 0, which implies L_n(G) ≤ L_n(F̂_n). Define p_s = G(z_s) − G(z_s−) and p̂_s = n_s/n. Therefore

log L_n(G) − log L_n(F̂_n) = Σ_{s=1}^m n_s log(p_s/p̂_s) = n Σ_{s=1}^m p̂_s log(p_s/p̂_s).
Now we observe that Σ_{s=1}^m p̂_s log(p_s/p̂_s) = E[log(p_S/p̂_S)], where the random variable S is such that P(S = z_s) = p̂_s. Hence using Jensen's inequality we have

E[log(p_S/p̂_S)] ≤ log E[p_S/p̂_S] = log Σ_{s=1}^m p̂_s (p_s/p̂_s) = log Σ_{s=1}^m p_s ≤ 0.

Therefore we have

log L_n(G) − log L_n(F̂_n) ≤ 0,

where equality only arises when p_s = n_s/n for all s. Thus the empirical distribution maximises the likelihood. Hence the empirical distribution is often said to be the maximum of the nonparametric likelihood. We note that the above argument is essentially the nonnegativity of the relative entropy (Kullback-Leibler divergence), an important result in information theory.
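The lemma and the Jensen argument can be checked numerically. This quick sketch is our own addition with made-up data: it confirms that no other probability vector on the observed support beats the empirical probabilities n_s/n.

```python
import numpy as np

rng = np.random.default_rng(2)
t = rng.integers(0, 4, size=30)         # iid draws on the support {0,1,2,3}
vals, n_s = np.unique(t, return_counts=True)
p_hat = n_s / n_s.sum()                 # empirical probabilities n_s / n

def loglik(p):
    # multinomial log-likelihood sum_s n_s log(pi_s), constant term dropped
    return float(np.sum(n_s * np.log(p)))

# the empirical probabilities dominate random candidate probability vectors
candidates = [rng.dirichlet(np.ones(len(vals))) for _ in range(200)]
best_other = max(loglik(q) for q in candidates)
```

Strict concavity of the log-likelihood on the simplex means the empirical vector is the unique maximiser, so any other candidate comes out strictly worse.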
Chapter 13

Survival Analysis

13.1 An introduction to survival analysis

See also Section 5.4 in Davison (2002). There are also a lot of excellent books on survival analysis.

What is survival data?

So-called survival data occur in several different applications. Usually a set of individuals is observed and we record the failure time or lifetime of each individual. We note that an individual does not necessarily need to be a person, but can be an electrical component etc. Examples include:

- Lifetime of a machine component.
- Time until a patient's cure, remission, or passing.
- Time for a subject to perform a task.
- Duration of an economic cycle.

Also, it may not be time we are interested in, but:

- Length of the run of a particle.
- Amount of usage of a machine, e.g. amount of petrol used, etc.

In the case that we do not observe any regressors (explanatory variables) which influence the survival time, such as the gender or age of a patient, we can model the survival times as iid random variables. If the survival times are believed to have the density f(x;θ_0), where f(x;θ)
is known but θ_0 is unknown, then maximum likelihood can be used to estimate θ_0. The standard results discussed in Section 4.1 can easily be applied to this type of data.

The survival, hazard and cumulative hazard functions

Let T denote the survival time of an individual, which has density f. The density f and the distribution function F(x) = ∫_0^x f(u)du are not particularly informative about the chance of survival at a given time point. Instead, the survival, hazard and cumulative hazard functions, which are functions of the density and the distribution function, are used.

The survival function. This is F̄(x) = 1 − F(x). It is straightforward to see that F̄(x) = P(T > x); therefore F̄(x) is the probability of surviving beyond x. It is clear why F̄(x) is called the survival function.

The hazard function. The hazard function is defined as

h(x) = lim_{δx→0} P(x ≤ T < x+δx | T ≥ x)/δx = lim_{δx→0} (F(x+δx) − F(x)) / (δx F̄(x)) = f(x)/F̄(x) = − d log F̄(x)/dx.

We can see from the definition that the hazard function is the chance of failure (though it is a normalised density, not a probability) at time x, given that the individual has survived until time x. The hazard function is similar to the density in the sense that it is a positive function. However, it does not integrate to one; indeed, it is not integrable.

The cumulative hazard function. This is defined as

H(x) = ∫_0^x h(u)du.

It is straightforward to see that

H(x) = ∫_0^x (− d log F̄(u)/du) du = − log F̄(x).

This is just the analogue of the distribution function; however, we observe that unlike the distribution function, H(x) is unbounded. It is straightforward to show that

f(x) = h(x) exp(−H(x)) and F̄(x) = exp(−H(x)).
It is useful to know that given any one of f(x), F̄(x), h(x) and H(x), we can obtain the other functions. Hence there is a one-to-one correspondence between all these functions.

Example (The exponential distribution). Suppose that f(x) = (1/θ) exp(−x/θ). Then the distribution function is F(x) = 1 − exp(−x/θ), and

F̄(x) = exp(−x/θ), h(x) = 1/θ and H(x) = x/θ.

The exponential distribution is widely used. However, it is not very flexible. We observe that the hazard function is constant over time. This is the well-known memoryless property of the exponential distribution. In terms of modelling, it means that the chance of failure in the next instant does not depend on how old the individual is; the exponential distribution cannot model ageing.

(The Weibull distribution.) We recall that this is a generalisation of the exponential distribution, where

f(x) = (α/θ)(x/θ)^{α−1} exp(−(x/θ)^α); α, θ > 0, x > 0.

For the Weibull distribution

F(x) = 1 − exp(−(x/θ)^α), F̄(x) = exp(−(x/θ)^α), h(x) = (α/θ)(x/θ)^{α−1}, H(x) = (x/θ)^α.

Compared to the exponential distribution, the Weibull has a lot more flexibility. Depending on the value of α, the hazard function h(x) can either increase over time or decay over time.

(The shortest lifetime.) Suppose that Y_1,...,Y_k are independent lifetimes and we are interested in the shortest survival time (for example, this could be the shortest survival time of k sibling mice given some medication). Let g_i, Ḡ_i, H_i and h_i denote the density, survival function, cumulative hazard and hazard function of Y_i (we are not assuming iid), and let T = min(Y_1,...,Y_k). Then the survival function of T is

F̄(x) = P(T > x) = ∏_{i=1}^k P(Y_i > x) = ∏_{i=1}^k Ḡ_i(x).
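The one-to-one correspondence between f, F̄, h and H can be verified numerically for the Weibull example above. This small check is our own addition; the parameter values are arbitrary.

```python
import numpy as np

alpha, theta = 1.7, 2.0   # Weibull shape and scale (arbitrary example values)
x = np.linspace(0.1, 6.0, 200)

f = (alpha / theta) * (x / theta) ** (alpha - 1) * np.exp(-(x / theta) ** alpha)
S = np.exp(-(x / theta) ** alpha)                  # survival function F-bar(x)
h = (alpha / theta) * (x / theta) ** (alpha - 1)   # hazard f(x) / F-bar(x)
H = (x / theta) ** alpha                           # cumulative hazard -log F-bar(x)
```

On this grid the identities f = h·exp(−H), F̄ = exp(−H), h = f/F̄ and H = −log F̄ all hold to machine precision, as the algebra above says they must.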
108 Since the cumulative hazard function satisfies Hx = log Fx, the cumulative hazard function of T is k Hx = logg i x = k H i x and the hazard function is hx = k d logg i x dx = k h i x Remark Discrete Data Let us suppose that the survival time are not continuous random variables, but discrete random variables. In other words, T can take any of the values {t i } where 0 t 1 < t 2 <... Examples include the first time an individual visits a hospital after an operation, in this case it is unlikely that the exact time of visit is known, but the date of visit may be recorded. Let PT = t i = p i, using this we can define the survival function, hazard and cumulative hazard function. Survival function The survival function is F i = PT t i = PT = t j = j=i p j. j=i Hazard function The hazard function is h i = Pt i 1 T < t i T t i 1 = PT = t i PT T i 1 p i = = F i 1 F i = 1 F i F i 1 F i 1 F i 1 Now by using the above we have the following useful representation F i = i j=2 F i F i 1 = i i 1 hj = 1 hj, 13.2 j=2 j=1 since h 1 = 0 and F 1 = 1. Cumulative hazard function The cumulative hazard function is H i = i j=1 h j. These expression will be very useful when we consider nonparametric estimators of the survival function F. 108
Censoring and the maximum likelihood

One main feature of survival data which distinguishes it from all the data types that we have considered so far is that it is often incomplete. This means that there are situations where the random survival time is not completely observed (this is often called incomplete data). Usually the incompleteness takes the form of censoring, and this is the type of incompleteness we will consider here. There are many types of censoring; the type we will consider here is right censoring. This means we may not observe the time of failure, and only have the knowledge that the individual survived beyond a certain time point. For example, an individual may, for reasons independent of its survival time, choose to leave the study; in this case, we would only know that the individual survived beyond a certain time point. This is called right censoring. (Left censoring arises when the start or "birth" of an individual is unknown, so it is known when an individual passes away, but the individual's year of birth is unknown; we will not consider this problem here.)

Let us suppose that T_i is the survival time, but this may not be observed, and we observe instead Y_i = min(T_i, c_i), where c_i is the potential censoring time. We do know whether the data has been censored: together with Y_i we observe the indicator variable

δ_i = 1 if T_i ≤ c_i (uncensored), δ_i = 0 if T_i > c_i (censored).

Hence, in survival analysis we typically observe {(Y_i, δ_i)}_{i=1}^n. We use the observations {(Y_i, δ_i)}_{i=1}^n to make inference about the unknown parameters in the model. Let us suppose that T_i has the density f(x;θ_0), where f is known but θ_0 is unknown.

Naive approaches to likelihood construction

There are two naive approaches for estimating θ_0. One method is to ignore the fact that the observations are censored, and to use the times of censoring as if they were failure times.
Hence define the likelihood

L_{1,n}(θ) = Σ_{i=1}^n log f(Y_i;θ),

and use as the parameter estimator θ̂_{1,n} = arg max_{θ∈Θ} L_{1,n}(θ). The fundamental problem with this approach is that it will be biased. To see this, consider the expectation of n^{−1} L_{1,n}(θ), which is

E[log f(Y_i;θ)] = ∫_0^c log f(x;θ) f(x;θ_0) dx + F̄(c;θ_0) log f(c;θ).
We can see that θ_0 does not necessarily maximise the above (differentiate with respect to θ and see whether it does). For example, suppose f is the exponential density, so the parameter we estimate is the mean θ_0. If we treat the censored times as failure times, then the estimated mean will be less than the true mean θ_0. Indeed, if the censored observations form a large proportion of the total number of observations, the estimator will be asymptotically biased, since it will consistently underestimate the mean. Hence this approach should be avoided, since the resulting estimator is biased.

Another method is to construct the likelihood function by filtering out the censored data. In other words, use the log-likelihood function

L_{2,n}(θ) = Σ_{i=1}^n δ_i log f(Y_i;θ),

and let θ̂_{2,n} = arg max_{θ∈Θ} L_{2,n}(θ) be an estimator of θ. It can be shown that if a fixed censor value is used, i.e. Y_i = min(T_i, c), then this estimator is not a consistent estimator of θ_0; it is also biased. As above, consider the expectation of n^{−1} L_{2,n}(θ), which is

E[δ_i log f(Y_i;θ)] = ∫_0^c log f(x;θ) f(x;θ_0) dx.

Thus, we see that θ_0 does not maximise the above.

The likelihood under censoring

The likelihood under censoring can be constructed using either the density and distribution functions or the hazard and cumulative hazard functions. Both are equivalent; which you choose depends on your inclination. The log-likelihood will be a mixture of probabilities and densities, depending on whether the observation was censored or not. We will suppose that we observe (Y_i, δ_i), where Y_i = min(T_i, c_i) and δ_i is the indicator variable. In this section we treat the c_i as if they were deterministic; we consider the case that they are random later.
We first observe that if δ_i = 1, then the density of the individual observation Y_i given δ_i = 1 is

P(Y_i ∈ dx | δ_i = 1) = P(T_i ∈ dx | T_i ≤ c_i) = f(x;θ) dx / F(c_i;θ).

On the other hand, if δ_i = 0, the conditional likelihood of the individual observation Y_i given δ_i = 0 is simply one, since if δ_i = 0 then Y_i = c_i (it is given). Of course, it is clear that P(δ_i = 1) = F(c_i;θ) and P(δ_i = 0) = 1 − F(c_i;θ). Thus, altogether, the joint density of (Y_i, δ_i) is

( f(x;θ)/F(c_i;θ) )^{δ_i} F(c_i;θ)^{δ_i} (1 − F(c_i;θ))^{1−δ_i} = f(x;θ)^{δ_i} (1 − F(c_i;θ))^{1−δ_i}.
Therefore, by using f(Y_i;θ) = h(Y_i;θ)F̄(Y_i;θ) and H(Y_i;θ) = − log F̄(Y_i;θ), the joint log-likelihood of {(Y_i, δ_i)}_{i=1}^n is

L_n(θ) = Σ_{i=1}^n [ δ_i log f(Y_i;θ) + (1−δ_i) log(1 − F(Y_i;θ)) ]
       = Σ_{i=1}^n δ_i [ log h(T_i;θ) − H(T_i;θ) ] − Σ_{i=1}^n (1−δ_i) H(c_i;θ).   (13.4)

Hence we use as the maximum likelihood estimator θ̂_n = arg max L_n(θ).

Example (The exponential distribution). Suppose that the density of T_i is f(x;θ) = θ^{−1} exp(−x/θ); then by using (13.4) the log-likelihood is

L_n(θ) = Σ_{i=1}^n [ δ_i ( − log θ − θ^{−1} Y_i ) − (1−δ_i) θ^{−1} Y_i ].

By differentiating the above, it is straightforward to show that the maximum likelihood estimator is

θ̂_n = ( Σ_i δ_i T_i + Σ_i (1−δ_i) c_i ) / Σ_i δ_i.

However, it is not clear what the limit of θ̂_n will be, whether it is biased etc. But this can be resolved if we suppose the sampling is random (see below).

Types of censoring and consistency of the mle

Often it can be shown that under certain censoring regimes the estimator converges to the true parameter and is asymptotically normal. More precisely,

√n(θ̂_n − θ_0) →D N(0, I(θ_0)^{−1}),

where

I(θ) = − E[ (1/n) ( Σ_i δ_i ∂² log f(Y_i;θ)/∂θ² + Σ_i (1−δ_i) ∂² log F̄(c_i;θ)/∂θ² ) ].

We discuss the behaviour of the likelihood estimator defined in (13.4) for different censoring regimes.

Non-random censoring. Let us suppose that Y_i = min(T_i, c), where c is some deterministic censoring point (for example, the number of years cancer patients are observed). We first show that the expectation of the likelihood is maximised at the true parameter; this
112 under certain conditions means that the mle defined in 13.4 will converge to the true parameter. Taking expectation of L n θ gives E L n θ = TE δ i logft i ;θ+1 δ i logft i ;θ = T c 0 logfx;θfx;θ 0 dx+fc;θ 0 logfc theta. To show that the above is maximum at θ assuming no restrictionson the parameter space we differentiate E L n θ with respect to θ and show that it is zero at θ 0. The derivative at θ 0 is E L n θ θ=θ0 = c 0 = 1 Fc;θ fx;θdx θ=θ0 + Fc;θ θ=θ0 θ=θ0 + Fc;θ θ=θ0 = 0. θ This proves that the expectation of the likelihood is maximum at zero. Now assuming that asymptotic normality can be shown the Fisher information after rescaling with n is c Iθ = 2 2 fx;θ 0 logfx;θdx+fc;θ 0 logfc;θ. 0 We observe that when c = 0 thus all the times are censored, the the Fisher information is zero, thus the asymptotic variance of the mle estimator, ˆθ n is not finite, which makes sense. It is worth noting that under this censoring regime the estimator is consistent, but the variance of the estimator will be larger than when there is no censoring just compare the Fisher informations for both cases. Random censoring In the above we have treated the censoring times as fixed. However, they can also be treated as if they were random. In other words, the censoring times {c i = C i } are random. Usually it is assumed that {C i } are iid random variables which are independent of the survival times if there is dependence, constructing the likelihood would beverydifficult, becausethejointdistributionoft i,c i wouldberequired. Furthermore, it is assumed that the distribution of C does not depend on the unknown parameter θ. Let k and K denote the density and distribution function of {C i }. Then by using the arguments given in 13.3 and?? the likelihood of the joint distribution of {Y i,δ i } n is L n,r θ = n δ i [ logfyi ;θ+log1 KY i ] +1 δ i [ log 1 FC i ;θ +logkc i ]. 112
If the censoring density k(y) does not depend on θ, then the maximum likelihood estimator of θ_0 is identical to the maximum likelihood estimator using the non-random likelihood (or, equivalently, the conditional likelihood). In other words,

θ̂_n = arg max L_n(θ) = arg max L_{n,R}(θ).

Hence the estimators using the two likelihoods are the same; the only difference is the limiting distribution of θ̂_n. We now examine what θ̂_n is actually estimating in the case of random censoring. To ease notation, let us suppose that the censoring times follow an exponential distribution, k(x) = β exp(−βx) and K(x) = 1 − exp(−βx). To see whether θ̂_n is biased, we evaluate the derivative of the likelihood. As both the full likelihood and the conditional likelihood yield the same estimators, we consider the expectation of the conditional log-likelihood. This is

E[L_n(θ)] = n E[ δ_i log f(T_i;θ) ] + n E[ (1−δ_i) log(1 − F(C_i;θ)) ]
          = n E[ log f(T_i;θ) E(δ_i | T_i) ] + n E[ log(1 − F(C_i;θ)) E(1−δ_i | C_i) ].

We observe that E(δ_i | T_i) = P(C_i > T_i | T_i) = exp(−βT_i) and E(1−δ_i | C_i) = P(T_i > C_i | C_i) = 1 − F(C_i;θ_0). Therefore

E[L_n(θ)] = n ∫_0^∞ exp(−βx) log f(x;θ) f(x;θ_0) dx + n ∫_0^∞ (1 − F(c;θ_0)) β exp(−βc) log(1 − F(c;θ)) dc.

Thus, in order for θ̂_n to consistently estimate θ_0, the derivative of the above should be zero at θ_0. On careful examination of the above (differentiating with respect to θ and equating to zero), it can be seen that this will not always be the case. Thus, in the case of random censoring the mle may be biased. Furthermore, the information matrix of θ̂_{n,R} will depend on the censoring distributions k and K. In reality, the censoring distribution will be unknown, and it will be difficult to obtain the information matrix under random censoring. In such cases, it may be better to assume non-random censoring instead.

Example: In the case that T_i is exponential (see the example above), the MLE is

θ̂_n = ( Σ_i δ_i T_i + Σ_i (1−δ_i) C_i ) / Σ_i δ_i.

Now suppose that C_i is random; then it is possible to calculate the limit of the above. Since the numerator and denominator are random, it is not easy to calculate the expectation.
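The limit (derived via Slutsky's theorem in the next paragraph) can be computed explicitly when the censoring times are also exponential. The simulation below is our own sketch: it computes the mle from the example above and compares it with E[min(T_i, C_i)]/P(T_i < C_i), which for this particular exponential pair simplifies back to θ_0, so the mle happens to remain consistent here.

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, beta = 2.0, 1.0   # T ~ Exp(mean theta0); C ~ Exp(rate beta), independent of T
n = 200_000
T = rng.exponential(theta0, size=n)
C = rng.exponential(1.0 / beta, size=n)
y = np.minimum(T, C)      # observed time Y_i = min(T_i, C_i)
delta = (T <= C)          # indicator of an uncensored observation

theta_hat = y.sum() / delta.sum()   # the censored-likelihood mle

# Slutsky limit E[min(T,C)] / P(T < C); here min(T,C) ~ Exp(rate 1/theta0 + beta)
rate = 1.0 / theta0 + beta
limit = (1.0 / rate) / ((1.0 / theta0) / rate)   # simplifies to theta0 for this pair
```

Changing the censoring law away from this convenient pair changes E[min(T,C)]/P(T < C), which is exactly the "check under what conditions θ̂_n is consistent" exercise in the text.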
114 However under certain conditions the denominator does not converge to zero we have by Slutsky s theorem that ˆθ n P n Eδ it i +1 δ i C i n Eδ i = EminT i,c i. PT i < C i If the distribution of C i and T i is known, the above can be calculated exercise. Hence, we are able to calculate the limit of the mle. Check under what conditions ˆθ n is a consistent estimator of θ 0. Type I and Type II censoring Type I sampling In this case, there is an upper bound on the observation time. In other words, if T i c we observe the survival time but if T i > c we do not observe the survival time. This situation can arise, for example, when a study audit ends and there are still individuals who are alive. This is a special case of non-random sampling with c i = c. Type II sampling We observe the first r failure times, T 1,...,T r, but do not observe the n r failure times, whose survival time is greater than T r we have used the ordering notation T 1 T 2... T n. The likelihood for censored discrete data Recall the discrete survival data considered in Remark , where the failures can occur at {t s } where 0 t 1 < t 2 <... We will suppose that the censoring of an individual can occur only at the times {t s }. We will suppose that the survival time probabilities satisfy PT = t s = p s θ, where the parameter θ is unknown but the function p s is known, and we want to estimate θ. Examples, include the geometric distribution which can be considered as the discrete time version of the exponential distribution and the Poisson. As in the continuous case let Y i denote the failure time or the time of censoring of the ith individual and let δ i denote whether the ith individual is censored or not. Hence, we observe {Y i,δ i }. To simplify the exposition let us define d s = number of failures at time t s q s = number censored at time t s. Hence, since the data is discrete observing {Y i,δ i } is equivalent to observing {d s,q s } ie. 
the number of failures and censors at time t s, in terms of likelihood construction. Using {d s,q s } and Remark we now construct the likelihood. We shall start with the usual not log likelihood. Let PT = t s θ = p s θ and PT t s θ = F s θ. Using this notation observe that 114
115 the probability of d s,q s is p s θ ds PT t s qs = p s θ ds F s θ qs, hence the likelihood is n L n θ = Yi θ p δi F Yi θ 1 δi = p s θ ds F s θ qs s=1 = s=1p s θ ds [ p j θ] qs. j=s For most parametric inference the above likelihood is relatively straightforward to maximise. However, in the case that our objective is to do nonparametric estimation where we do not assume a parametric model and directly estimate the probabilities without restricting them to a parametric family, then rewriting the likelihood in terms of the hazard function greatly simplies matters. By using some algebraic manipulations and Remark we now rewrite likelihood in terms of the hazard functions. Using that p s θ = h s θf s 1 θ see equation 13.1 we have L n θ = p s θ ds F s θ qs F s 1 θ ds = s θ s=1 s=1h ds F s θ qs+d s+1. Now, substituting F s θ = s j=1 1 h jθ see equation 13.2 into the above gives there are some typos below - but it is hard to correct! L n θ = s h s θ ds 1 h j θ qs+d s+1 s=1 j=1 m s = h s θ d s+1 1 h j θ qs+d s+1 = s=1 s=1j=1 h s θ ds 1 h s θ m=s qm+d m+1 = s=1 s=1 s=1 h s θ ds 1 h s θ m=s qm+d m+1. To simplify notation define N s to be the number of individuals who are alive just before time t s ie. thenumberofy i t s. ItisclearthatN s = m=s q m+d m. Therefore m=s q m+d m+1 = N s d s above likelihood can be rewritten as L n θ = s=1h s θ ds 1 h s θ Ns ds. We mention that N s is often called the risk set or the number of individuals alive just prior to t s. The log-likelihood is L n θ = s=1 d s logh s θ+n s d s log1 h s θ. Thus for the discrete time case the mle of θ is the parameter which maximises the above likelihood. 115
Nonparametric estimators of the hazard function - the Kaplan-Meier estimator

Let us suppose that {T_i} are iid random variables with distribution function F and survival function F̄. However, we do not know the class of functions from which F or F̄ may come. Instead, we want to estimate F̄ nonparametrically, in order to obtain a good idea of the shape of the survival function. Once we have some idea of its shape, we can conjecture the parametric family which may best fit it.

If the survival times have not been censored, the best nonparametric estimator of the cumulative distribution function F is the empirical distribution function

F̂_n(x) = (1/n) Σ_{i=1}^n I(T_i ≤ x).

In Section 12.1 we showed that the empirical distribution function maximises the nonparametric likelihood. Since F̂_n(x) is an estimator of the distribution function F, it is clear that an estimator of the survival function F̄(x) is

1 − F̂_n(x) = (1/n) Σ_{i=1}^n I(T_i > x).

In the case that the survival data is censored and we observe {(Y_i, δ_i)}, then F̂_n(x) is not a valid estimator of the survival function. An alternative estimator is the Kaplan-Meier estimator, which is a nonparametric estimator of the survival function F̄ that takes censoring into account. We will now derive the Kaplan-Meier estimator for discrete data. The derivation for continuous random variables is more complicated than in the discrete case, but it is similar. It should be mentioned that despite the derivation appearing complex, the actual estimator is extremely simple.

We show that the maximum likelihood estimator of the hazard h_s = P(T = t_s)/P(T ≥ t_s) is

ĥ_s = d_s / N_s,

where d_s is the number of failures at time t_s and N_s is the number of survivors just before time t_s. In many respects, this is a rather intuitive estimator of the hazard function. For example, if there is no censoring then it can be shown that the maximum likelihood estimator of the hazard is

ĥ_s = d_s / Σ_{m≥s} d_m = (number of failures at time t_s) / (number who survive just before time t_s),

which is a very natural estimator.
We now show that the maximum likelihood estimator in the case of censoring is similar. In Section 12.1 we showed that the empirical distribution is the maximimum of the likelihood for non-censored data. We now show that the Kaplan-Meier 116
117 estimator is the maximum likelihood estimator when the data is censored. We recall in Section that the discrete log-likelihood for censored data is L n θ = s=1 = s=1 d s logp s θ ds +q s log[ j=s p j θ] d s logh s θ+n s d s log1 h s θ wherept = t s = p s θ,d s arethenumberoffailuresattimet s,q s arethenumberofindividuals censored at time t s and N s = m=s q m +d m. Now the above likelihood is constructed under the assumption that the distribution has a parametric form and the only unknown is θ. Let us suppose that the probabilities p s do not have a parametric form. In this case the likelihood is L n p 1,p 2,... = s=1 d s logp s +q s log[ j=s p j ] subject to the condition that p j = 1. However, it is quite difficult to directly maximise the above. Instead we use the likelihood rewritten in terms of the hazard function L n h 1,h 2,... = d s logh s +N s d s log1 h s, s=1, and maximise this. The derivative of the above with respect to h s is L n h s = d s h s N s d s 1 h s. Hence by setting the above to zero and solving for h s gives ĥ s = d s N s. If we recall that d s = number of failures at time t s and N s = number of alive just before time t s. Hence the non-parametric estimator of the hazard function is rather logical since the hazard function is the chance of failure at time t, given that no failure has yet occured, ie. ht i = Pt i 1 < T t i T t i 1. Now recalling 13.2 and substituting ĥs into 13.2 gives the survival function estimator ˆF s = s 1 ĥ j. j=1 Rewriting the above, we have the Kaplan-Meier estimator ˆFt s = s j=1 1 d j N j. 117
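The discrete Kaplan-Meier estimator just derived, F̄̂(t_s) = ∏_{j≤s}(1 − d_j/N_j), is simple to compute. The sketch below is our own illustration with made-up data; the helper name `kaplan_meier` is not from the notes.

```python
import numpy as np

def kaplan_meier(y, delta):
    """Kaplan-Meier estimate of the survival function:
    S(t) = prod over failure times t_j <= t of (1 - d_j / N_j)."""
    y, delta = np.asarray(y), np.asarray(delta)
    times = np.unique(y[delta == 1])         # distinct failure times
    surv, s = [], 1.0
    for t in times:
        N = np.sum(y >= t)                   # N_j: number at risk just before t
        d = np.sum((y == t) & (delta == 1))  # d_j: number of failures at t
        s *= 1.0 - d / N
        surv.append(s)
    return times, np.array(surv)

# toy censored sample: delta = 1 marks a failure, 0 a censored time
y     = [1, 2, 2, 3, 4, 5, 5, 6]
delta = [1, 1, 0, 1, 0, 1, 1, 0]
t, S = kaplan_meier(y, delta)
```

Note how the censored observations at times 2, 4 and 6 contribute to the risk sets N_j without ever triggering a factor (1 − d_j/N_j); with no censoring at all, the product collapses to 1 − F̂_n(t), as claimed in the text.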
For continuous random variables d_j ∈ {0,1}, as it is unlikely that two or more survival times are identical, and the Kaplan-Meier estimator can be extended to give

F̂̄(t) = Π_{j: Y_j ≤ t} (1 − d_j/N_j).

We observe that in the case that the survival data are not censored, N_j = Σ_{s≥j} d_s, and the Kaplan-Meier estimator reduces to

F̂̄(t) = Π_{j: Y_j ≤ t} (1 − d_j / Σ_{s≥j} d_s).

It can be shown that this is equal to 1 − F̂_n(t), where F̂_n(t) is the empirical distribution function based on {T_i}_{i=1}^n. Of course, given an estimator it is useful to approximate its variance; some useful approximations are given in Davison (2002).

Examples

Problem: survival analysis, fixed censoring and the Fisher information

Example (Question) Let us suppose that {T_i} are the survival times of lightbulbs. We will assume that {T_i} are iid random variables with density f(·;θ_0) and survival function F̄(·;θ_0), where θ_0 is unknown. The survival times are censored: for a known c > 0 we observe Y_i = min(T_i, c) and δ_i, where δ_i = 1 if Y_i = T_i and zero otherwise.

(a) (i) State the log-likelihood of {(Y_i, δ_i)}.
(ii) We denote the above log-likelihood as L_T(θ). Show that

−E[ ∂²L_T(θ)/∂θ² ]|_{θ=θ_0} = E[ (∂L_T(θ)/∂θ)² ]|_{θ=θ_0},

stating any important assumptions that you may use.

(b) Let us suppose that the above survival times satisfy a Weibull distribution,

f(x;φ,α) = (α/φ)(x/φ)^{α−1} exp(−(x/φ)^α),

and as in part (a) we observe Y_i = min(T_i, c) and δ_i, where c > 0.

(i) Using your answer in part (a)(i), give the log-likelihood of {(Y_i, δ_i)} for this particular distribution (we denote this as L_T(α,φ)) and derive the profile likelihood of α (profile out the nuisance parameter φ). Suppose you wish to test H_0: α = 1 against H_A: α ≠ 1 using the log-likelihood ratio test; what is the limiting distribution of the test statistic under the null?
(ii) Let (φ̂_T, α̂_T) = argmax L_T(α,φ) (the maximum likelihood estimators from the censored likelihood). Do the estimators φ̂_T and α̂_T converge to the true parameters φ and α? (You can assume that φ̂_T and α̂_T converge to some parameters; your objective is to find whether these parameters are φ and α.)

(iii) Obtain the expected Fisher information matrix of the maximum likelihood estimators.

(iv) Using your answer in part (b)(iii), derive the limiting variance of the maximum likelihood estimator α̂_T.

Solution

(a)(i) The log-likelihood is

L_T(θ) = Σ_{l=1}^n δ_l log f(Y_l;θ) + Σ_{l=1}^n (1 − δ_l) log F̄(c;θ).

(a)(ii) Evaluating the first and second derivatives of L_T(θ) with respect to θ gives

∂L_T/∂θ = Σ_l δ_l (1/f) ∂f/∂θ + Σ_l (1 − δ_l) (1/F̄(c;θ)) ∂F̄(c;θ)/∂θ,

∂²L_T/∂θ² = Σ_l δ_l [ (1/f) ∂²f/∂θ² − (1/f²)(∂f/∂θ)² ] + Σ_l (1 − δ_l) [ (1/F̄(c;θ)) ∂²F̄(c;θ)/∂θ² − (1/F̄(c;θ)²)(∂F̄(c;θ)/∂θ)² ].

To evaluate the expectation of the above we use E[δ_l g(Y_l)] = ∫_0^c g(x) f(x;θ) dx and E[1 − δ_l] = F̄(c;θ), which give

E[∂²L_T/∂θ²] = n { ∫_0^c [ ∂²f/∂θ² − (1/f)(∂f/∂θ)² ] dx + ∂²F̄(c;θ)/∂θ² − (1/F̄(c;θ))(∂F̄(c;θ)/∂θ)² }.

However, since ∫_0^c f(x;θ) dx + F̄(c;θ) = 1 for all θ, exchanging derivative and integral (a regularity assumption) and differentiating twice gives

∫_0^c ∂²f/∂θ² dx + ∂²F̄(c;θ)/∂θ² = 0.

Altogether this gives

E[∂²L_T/∂θ²] = −n { ∫_0^c (1/f)(∂f/∂θ)² dx + (1/F̄(c;θ))(∂F̄(c;θ)/∂θ)² }.   (13.6)
Now we need to compare the above with E[(∂L_T/∂θ)²]. Since δ_l² = δ_l, (1−δ_l)² = 1−δ_l, δ_l(1−δ_l) = 0, and each summand has mean zero (so the cross terms over different l vanish), expanding gives

E[(∂L_T/∂θ)²] = Σ_l E[ δ_l (1/f²)(∂f/∂θ)² ] + Σ_l E[ (1−δ_l) (1/F̄(c;θ)²)(∂F̄(c;θ)/∂θ)² ]
             = n { ∫_0^c (1/f)(∂f/∂θ)² dx + (1/F̄(c;θ))(∂F̄(c;θ)/∂θ)² }.

Comparing the above with (13.6) gives the required result.

(b)(i) The log-likelihood is

L_T(α,φ) = Σ_l δ_l { log α − log φ + (α−1)(log Y_l − log φ) − (Y_l/φ)^α } − Σ_l (1−δ_l)(c/φ)^α.

We now profile out the nuisance parameter φ:

∂L_T/∂φ = Σ_l δ_l { −α/φ + α Y_l^α / φ^{α+1} } + Σ_l (1−δ_l) α c^α / φ^{α+1}.

Thus, keeping α fixed, the MLE of φ for that fixed α is

φ̂_α = [ Σ_l ( δ_l Y_l^α + (1−δ_l) c^α ) / Σ_l δ_l ]^{1/α}.

Thus the profile likelihood for α is

L_T(α, φ̂_α) = Σ_l δ_l { log α − log φ̂_α + (α−1)(log Y_l − log φ̂_α) − (Y_l/φ̂_α)^α } − Σ_l (1−δ_l)(c/φ̂_α)^α.

To test H_0: α = 1 against H_A: α ≠ 1 (a test of whether the distribution is exponential), use the log-likelihood ratio test; under the null hypothesis we have

2{ max_α L_T(α, φ̂_α) − max_φ L_T(1, φ) } →^d χ²_1.

(b)(ii) It is very hard to prove the result for exactly this example. However, if we show the result under a general set of assumptions, and show that the Weibull satisfies these, then we have proven the result. It is stated in the question that the MLEs converge to some constants/parameters (we do not need to prove this); the objective is to find what these parameters are. The expected (normalised) censored log-likelihood is

E{ δ_i log f(T_i;θ) + (1−δ_i) log F̄(c;θ) }.
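The profile-likelihood calculation above can be checked numerically. The sketch below simulates censored Weibull data and maximises L_T(α, φ̂_α) over a grid of α values; the simulation settings (φ = 1, α = 1.5, c = 2, grid resolution) are our own illustrative choices:

```python
import math
import random

def profile_phi(alpha, ys, deltas, c):
    """phi_hat(alpha) = [sum(delta*Y^a + (1-delta)*c^a) / sum(delta)]^(1/a)."""
    num = sum(d * y**alpha + (1 - d) * c**alpha for y, d in zip(ys, deltas))
    return (num / sum(deltas)) ** (1.0 / alpha)

def profile_loglik(alpha, ys, deltas, c):
    """Censored Weibull log-likelihood with phi profiled out."""
    phi = profile_phi(alpha, ys, deltas, c)
    ll = 0.0
    for y, d in zip(ys, deltas):
        if d:
            ll += (math.log(alpha) - math.log(phi)
                   + (alpha - 1) * (math.log(y) - math.log(phi))
                   - (y / phi) ** alpha)
        else:
            ll -= (c / phi) ** alpha
    return ll

random.seed(1)
c, alpha0, phi0 = 2.0, 1.5, 1.0
# random.weibullvariate takes (scale, shape)
ts = [random.weibullvariate(phi0, alpha0) for _ in range(2000)]
ys = [min(t, c) for t in ts]
deltas = [1 if t < c else 0 for t in ts]

grid = [a / 100 for a in range(50, 301)]        # alpha in [0.5, 3.0]
alpha_hat = max(grid, key=lambda a: profile_loglik(a, ys, deltas, c))
phi_hat = profile_phi(alpha_hat, ys, deltas, c)
print(alpha_hat, phi_hat)   # should land near the true (1.5, 1.0)
```

A grid search is crude but makes the profiling transparent; in practice a one-dimensional optimiser over α would be used instead.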
The parameter which maximises this is

λ* = argmax_θ E{ δ_i log f(T_i;θ) + (1−δ_i) log F̄(c;θ) }.

It was shown earlier that this parameter is the true parameter of the distribution. Since the Weibull distribution satisfies all the assumptions (in particular, we can exchange integral and derivative), the above result implies (α̂_T, φ̂_T) → (α, φ), the true parameters. Please refer to the notes for the details.

(iii) Differentiate L_T(α,φ) twice with respect to (α,φ) and take expectations (it is not possible to obtain an explicit expression):

I(α,φ) = [ −E ∂²L_T/∂α²   −E ∂²L_T/∂α∂φ ; −E ∂²L_T/∂α∂φ   −E ∂²L_T/∂φ² ] = [ I_αα  I_αφ ; I_αφ  I_φφ ].

(iv) The limiting variance (that is, the variance of the limiting normal distribution) of α̂_T is

I_φφ / ( I_αα I_φφ − I_αφ² ),

using the 2×2 inverse [a b; c d]^{−1} = (ad − bc)^{−1} [d −b; −c a].

Problem: Survival times and random censoring

Example (Question) Let us suppose that T and C are exponentially distributed random variables, where the density of T is (1/λ)exp(−t/λ) and the density of C is (1/µ)exp(−c/µ).

(i) Evaluate the probability P(T < C + x), where x is some finite constant.

(ii) Let us suppose that {T_i} and {C_i} are iid survival and censoring times (T_i and C_i independent of each other), where the densities of T_i and C_i are f_T(t;λ) = (1/λ)exp(−t/λ) and f_C(c;µ) = (1/µ)exp(−c/µ) respectively. Let Y_i = min(T_i, C_i) and δ_i = 1 if Y_i = T_i and zero otherwise. Suppose λ and µ are unknown. We use the following likelihood to estimate λ:

L_n(λ) = Σ_i δ_i log f_T(Y_i;λ) + Σ_i (1−δ_i) log F̄_T(Y_i;λ),

where F̄_T is the survival function. Let λ̂_n = argmax L_n(λ). Show that λ̂_n is an asymptotically biased estimator of λ (you can assume that λ̂_n converges to some constant).
122 iii Based on your results in parts i and ii construct estimators of λ and µ which are asymptotically unbiased. Solutions i PT > x = exp x/λ and PC > c = exp c/µ, thus PT < C +x = PT < c+xf C cdc = ii Differentiating the likelihood 0 1 exp c+x λ 1 µ exp c µ == 1 exp x/λ λ λ+µ. L n λ λ = δ i logf T T i ;λ λ + 1 δ i logf TC i ;λ λ, substituting fx;λ = λ 1 exp x/λ and Fx;λ = exp x/λ into the above and equating to zero gives the solution ˆλ T = δ it i + i 1 δ ic i i δ. i Now we evaluate the expectaton of the numerator and the denominator. Eδ i T i = ET i IT i < C i = ET i EIC i > T i T i = ET i exp T i /µ = texp t/µ 1 λ exp t/λdλ = 1 texp t µ+λ λ µλ = µ+λ2 µ 2 λ 3. Similarly we can show that E1 δ i C i = µ+λ2 λ 2 µ 2. Finally, we evaluate the denominator Eδ i = PT < C = 1 lambda µ+λ = µ µ+λ. Therefore my Slutsky s theorem we have µ+λ 2 P + µ+λ2 λ ˆλ T 2 µ 2 µ 2 λ 3 µ µ+λ = µ+λ4 µ 3 λ 3. Clearly if µ, this is a biased estimator of λ contrast this with the case that the censoring times are fixed at a common value c. 122
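The closed form in part (i), P(T < C + x) = 1 − (λ/(λ+µ)) e^{−x/λ} with λ and µ the means of T and C, is easy to confirm by simulation. A quick sketch (parameter values are arbitrary; note that `random.expovariate` takes a rate, i.e. 1/mean):

```python
import math
import random

random.seed(0)
lam, mu, x = 2.0, 3.0, 1.0        # means of T and C, and the constant x
n = 200_000

hits = sum(random.expovariate(1 / lam) < random.expovariate(1 / mu) + x
           for _ in range(n))
mc = hits / n
closed_form = 1 - (lam / (lam + mu)) * math.exp(-x / lam)
print(mc, closed_form)            # the two values should agree to ~2-3 decimals
```

With 200,000 replications the Monte Carlo standard error is about 0.001, so agreement to two decimal places is expected.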
(iii) By part (i) we observe that the sample proportion of the events {T_i < C_i} satisfies

p̂_T = (1/n) Σ_i δ_i →^P µ/(µ+λ),

and by part (ii) λ̂_T converges in probability to a known function of λ and µ. Thus, by using p̂_T and λ̂_T, we can solve for λ and µ to obtain asymptotically unbiased estimators of these parameters. An alternative method is to construct the likelihood of µ based on left censoring. This will give a biased estimator of µ; however, this together with the biased estimator of λ in part (ii) can be solved to give consistent estimators of µ and λ.

Problem: survival times and fixed censoring

Example (Question) Let us suppose that {T_i}_{i=1}^n are survival times which are assumed to be iid (independent, identically distributed) random variables following an exponential distribution with density f(x;λ) = (1/λ)exp(−x/λ), where the parameter λ is unknown. The survival times may be censored at a known time c > 0: we observe Y_i = min(T_i, c) and the dummy variable δ_i = 1 if Y_i = T_i (no censoring) and δ_i = 0 if Y_i = c (the survival time is censored).

(a) State the censored log-likelihood for this data set, and show that the estimator of λ is

λ̂_n = [ Σ_{i=1}^n δ_i T_i + Σ_{i=1}^n (1−δ_i) c ] / Σ_{i=1}^n δ_i.

(b) By using the above, show that when c > 0, λ̂_n is a consistent estimator of the parameter λ.

(c) Derive the expected information for this estimator and comment on how the information behaves for various values of c.

Solution

(a) Since P(Y_i ≥ c) = exp(−c/λ), the censored log-likelihood is

L_n(λ) = Σ_{i=1}^n [ −δ_i log λ − δ_i Y_i/λ − (1−δ_i) c/λ ].

Thus differentiating the above with respect to λ and equating to zero gives the MLE

λ̂_n = [ Σ_{i=1}^n δ_i T_i + Σ_{i=1}^n (1−δ_i) c ] / Σ_{i=1}^n δ_i.
(b) To show that the above estimator is consistent, we use Slutsky's lemma to obtain

λ̂_n →^P [ E(δT) + E((1−δ)c) ] / E(δ).

To show that λ = [E(δT) + E((1−δ)c)]/E(δ), we calculate each of the expectations:

E(δT) = ∫_0^c y (1/λ) e^{−y/λ} dy = −c e^{−c/λ} + λ(1 − e^{−c/λ}),
E((1−δ)c) = c P(T > c) = c e^{−c/λ},
E(δ) = P(T ≤ c) = 1 − e^{−c/λ}.

Substituting the above into the ratio gives λ̂_n →^P λ as n → ∞.

(c) To obtain the expected information we differentiate the log-likelihood twice and take expectations, which gives

I(λ) = n E(δ_i)/λ² = n(1 − e^{−c/λ})/λ².

Note that it can be shown that for the censored likelihood E[(∂L_n/∂λ)²] = −E[∂²L_n/∂λ²]. We observe that the larger c is, the larger the information, and thus the smaller the limiting variance.
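Both the consistency claim and the information I(λ) = n(1 − e^{−c/λ})/λ² can be checked by simulation. A sketch with our own parameter choices (λ = 2, c = 1.5):

```python
import math
import random

def censored_exp_mle(ts, c):
    """MLE (sum delta*T + sum (1-delta)*c) / sum delta for fixed censoring at c."""
    ys = [min(t, c) for t in ts]
    deltas = [1 if t <= c else 0 for t in ts]
    return sum(ys) / sum(deltas)   # sum(ys) equals sum(delta*T) + sum((1-delta)*c)

random.seed(0)
lam, c, n = 2.0, 1.5, 100_000
ts = [random.expovariate(1 / lam) for _ in range(n)]   # mean-lam exponentials

lam_hat = censored_exp_mle(ts, c)
info = n * (1 - math.exp(-c / lam)) / lam**2           # expected information I(lambda)
print(lam_hat, 1 / info)   # estimate near 2.0; 1/I approximates its variance
```

With these settings roughly half the observations are censored, yet λ̂_n remains consistent; shrinking c toward 0 shrinks I(λ) and inflates the variance, matching the comment in part (c).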
Chapter 14

The Expectation-Maximisation Algorithm

14.1 The EM algorithm - a method for maximising the likelihood

Let us suppose that we observe Y = {Y_i}_{i=1}^n. The joint density of Y is f(Y;θ_0), where θ_0 is an unknown parameter. Our objective is to estimate θ_0. The log-likelihood of Y is

L_n(Y;θ) = log f(Y;θ).

Observe that we have not specified that {Y_i} are iid random variables. This is because the procedure that we will describe below is extremely general, and the observations do not need to be either independent or identically distributed (indeed, a very interesting extension of this procedure is to time series with missing data, first proposed in Shumway and Stoffer (1982) and Engle and Watson).

Our objective is to estimate θ_0 in the situation where either evaluating the log-likelihood L_n or maximising L_n is difficult, hence an alternative means of maximising L_n is required. Often there may exist unobserved data U = {U_i}_{i=1}^m for which the likelihood of (Y,U) can be easily evaluated. It is through these unobserved data that we find an alternative method for maximising L_n.

Example Let us suppose that {T_i}_{i=1}^{n+m} are iid survival times with density f(x;θ_0). Some of these times are censored and we observe {Y_i}_{i=1}^{n+m}, where Y_i = min(T_i, c). To simplify notation we will suppose that Y_i = T_i for 1 ≤ i ≤ n, hence for 1 ≤ i ≤ n the survival time is observed,
but Y_i = c for n+1 ≤ i ≤ n+m. Using the earlier results, the log-likelihood of Y is

L_n(Y;θ) = Σ_{i=1}^n log f(Y_i;θ) + Σ_{i=n+1}^{n+m} log F̄(Y_i;θ).

The observations {Y_i}_{i=n+1}^{n+m} can be treated as if they were missing. Define the complete observations U = {T_i}_{i=n+1}^{n+m}; hence U contains the unobserved survival times. Then the likelihood of (Y,U) is

L_n(Y,U;θ) = Σ_{i=1}^{n+m} log f(T_i;θ).

Usually it is a lot easier to maximise L_n(Y,U) than L_n(Y).

We now formally describe the EM-algorithm. As mentioned in the discussion above, it is easier to deal with the joint likelihood of (Y,U) than with the likelihood of Y itself, hence let us consider this likelihood in detail. Let us suppose that the joint log-likelihood of (Y,U) is

L_n(Y,U;θ) = log f(Y,U;θ).

This likelihood is often called the complete likelihood; we will assume that if U were known, then this likelihood would be easy to obtain and differentiate. We will also assume that the density f(U|Y;θ) is known and is easy to evaluate. By using Bayes' theorem it is straightforward to show that

log f(Y,U;θ) = log f(Y;θ) + log f(U|Y;θ),   (14.1)

that is, L_n(Y,U;θ) = L_n(Y;θ) + log f(U|Y;θ).

Of course, in reality log f(Y,U;θ) is unknown, because U is unobserved. However, let us consider the expected value of log f(Y,U;θ) given what we observe, Y. That is,

Q(θ_0,θ) = E[ log f(Y,U;θ) | Y, θ_0 ] = ∫ log f(Y,u;θ) f(u|Y,θ_0) du,   (14.2)

where f(u|Y,θ_0) is the conditional density of U given Y and the unknown parameter θ_0. Hence if f(u|Y,θ_0) were known, then Q(θ_0,θ) could be evaluated.

Remark It is worth noting that Q(θ_0,θ) = E[log f(Y,U;θ)|Y,θ_0] can be viewed as the best predictor of the complete likelihood (involving both observed and unobserved data) given what is observed, Y. We recall that the conditional expectation is the best predictor of U in terms of mean squared error, that is, the function of Y which minimises the mean squared error: E(U|Y) = argmin_g E(U − g(Y))².
The EM algorithm is based on iterating Q in such a way that at each step we obtain a θ which gives a larger value of Q (and, as we will show later, a larger L_n(Y;θ)). We describe the EM-algorithm below.

The EM-algorithm:

(i) Define an initial value θ_1 ∈ Θ. Let θ* = θ_1.

(ii) (The expectation step, at the (k+1)-th step.) For the fixed θ*, evaluate

Q(θ*,θ) = E[ log f(Y,U;θ) | Y, θ* ] = ∫ log f(Y,u;θ) f(u|Y,θ*) du

for all θ ∈ Θ.

(iii) (The maximisation step.) Evaluate θ_{k+1} = argmax_{θ∈Θ} Q(θ*,θ). We note that the maximisation can be done by finding the solution of

E[ ∂ log f(Y,U;θ)/∂θ | Y, θ* ] = 0.

(iv) If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Otherwise set θ* = θ_{k+1}, and repeat steps (ii) and (iii).

We use θ̂_n as an estimator of θ_0. To understand why this iteration is connected to maximising L_n(Y;θ) and, under certain conditions, gives a good estimator of θ_0 (in the sense that θ̂_n is close to the parameter which maximises L_n), let us return to (14.1). Taking the expectation of log f(Y,U;θ) conditioned on Y (at parameter value θ*) we have

Q(θ*,θ) = E[ log f(Y,U;θ) | Y, θ* ] = log f(Y;θ) + E[ log f(U|Y;θ) | Y, θ* ].

Observe that this is like (14.2), but the distribution used in the expectation is f(u|Y,θ*) instead of f(u|Y,θ_0). Define

D(θ*,θ) = E[ log f(U|Y;θ) | Y, θ* ] = ∫ log f(u|Y;θ) f(u|Y,θ*) du.

Hence we have

Q(θ*,θ) = L_n(θ) + D(θ*,θ).   (14.3)

Now we recall that at the (k+1)-th iteration of the EM-algorithm, θ_{k+1} maximises Q(θ_k,θ) over all θ ∈ Θ, hence Q(θ_k,θ_{k+1}) ≥ Q(θ_k,θ_k). In the lemma below we show that L_n(θ_{k+1}) ≥ L_n(θ_k), hence at each iteration of the EM-algorithm we obtain a θ_{k+1} which does not decrease the likelihood relative to the previous iteration.
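Steps (i)-(iv) translate into a short generic loop. The sketch below is our own, not the notes': here the E-step returns the conditional expectations ("weights") the M-step needs, and as a toy use we estimate only the mixing weight p of a two-component normal mixture whose component densities are fully known, so the M-step has a closed form (the mean weight maximises Q in p):

```python
import math
import random

def em(theta0, e_step, m_step, tol=1e-10, max_iter=1000):
    """Generic EM loop following steps (i)-(iv): alternate E and M steps
    until successive parameter values are sufficiently close."""
    theta = theta0
    for _ in range(max_iter):
        stats = e_step(theta)       # expectation step: E[...|Y, theta*]
        theta_new = m_step(stats)   # maximisation step: argmax of Q
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

def phi(y, m):
    """Density of a normal with mean m and variance 1."""
    return math.exp(-0.5 * (y - m) ** 2) / math.sqrt(2 * math.pi)

# Toy data from 0.3 * N(0,1) + 0.7 * N(4,1); only p is unknown.
random.seed(2)
ys = [random.gauss(0, 1) if random.random() < 0.3 else random.gauss(4, 1)
      for _ in range(5000)]

e_step = lambda p: [p * phi(y, 0) / (p * phi(y, 0) + (1 - p) * phi(y, 4))
                    for y in ys]            # posterior weight of component 1
m_step = lambda w: sum(w) / len(w)          # Q is maximised at the mean weight
p_hat = em(0.5, e_step, m_step)
print(p_hat)   # close to the true mixing weight 0.3
```

Because the components are well separated, very little information is "missing" and the loop converges in a handful of iterations, anticipating the convergence-rate discussion later in the chapter.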
Lemma We have L_n(θ_{k+1}) ≥ L_n(θ_k). Moreover, under certain conditions θ_k converges to the maximum likelihood estimator argmax_θ L_n(Y;θ) (we do not prove this part of the result here).

PROOF. From (14.3) it is clear that

Q(θ_k,θ_{k+1}) − Q(θ_k,θ_k) = [ L_n(θ_{k+1}) − L_n(θ_k) ] + [ D(θ_k,θ_{k+1}) − D(θ_k,θ_k) ].   (14.4)

We will now show that D(θ_k,θ_{k+1}) − D(θ_k,θ_k) ≤ 0; the result follows from this. We observe that

D(θ_k,θ_{k+1}) − D(θ_k,θ_k) = ∫ log [ f(u|Y,θ_{k+1}) / f(u|Y,θ_k) ] f(u|Y,θ_k) du.

Now, by using Jensen's inequality (which we have used several times previously), we have

D(θ_k,θ_{k+1}) − D(θ_k,θ_k) ≤ log ∫ f(u|Y,θ_{k+1}) du = 0.

Therefore D(θ_k,θ_{k+1}) − D(θ_k,θ_k) ≤ 0, and by (14.4)

L_n(θ_{k+1}) − L_n(θ_k) ≥ Q(θ_k,θ_{k+1}) − Q(θ_k,θ_k) ≥ 0,

and we obtain the desired result L_n(θ_{k+1}) ≥ L_n(θ_k). □

Remark (The Fisher information) The Fisher information of the observed likelihood L_n(Y;θ) is

I_n(θ_0) = −E[ ∂² log f(Y;θ)/∂θ² ]|_{θ=θ_0}.

As in Section 4.1, I_n(θ_0)^{−1} is the asymptotic variance of the limiting distribution of θ̂_n. To understand how much is lost by not having a complete set of observations, we now rewrite the Fisher information in terms of the complete data and the missing data. By using (14.1), I_n(θ_0) can be rewritten as

I_n(θ_0) = −E[ ∂² log f(Y,U;θ)/∂θ² ]|_{θ=θ_0} + E[ ∂² log f(U|Y;θ)/∂θ² ]|_{θ=θ_0} = I_n^C(θ_0) − I_n^M(θ_0).

In the case that θ is univariate, it is clear that I_n^C(θ_0) ≥ I_n^M(θ_0). Hence, as one would expect, the complete data set (Y,U) contains more information about the unknown parameter than Y alone.
If U is fully determined by Y, then it can be shown that I_n^M(θ_0) = 0, and no information has been lost.

From a practical point of view, one is interested in how many iterations of the EM-algorithm are required to obtain an estimator sufficiently close to the MLE. Let

J_n^C(θ_0) = −E[ ∂² log f(Y,U;θ)/∂θ² |_{θ=θ_0} | Y, θ_0 ],
J_n^M(θ_0) = −E[ ∂² log f(U|Y;θ)/∂θ² |_{θ=θ_0} | Y, θ_0 ].

By differentiating (14.1) twice with respect to the parameter θ we have

J_n(θ_0) = −∂² log f(Y;θ)/∂θ² |_{θ=θ_0} = J_n^C(θ_0) − J_n^M(θ_0).

Now, it can be shown that the rate of convergence of the algorithm depends on the ratio J_n^C(θ_0)^{−1} J_n^M(θ_0). The closer the largest eigenvalue of J_n^C(θ_0)^{−1} J_n^M(θ_0) is to one, the slower the rate of convergence, and a large number of iterations is required. On the other hand, if the largest eigenvalue of J_n^C(θ_0)^{−1} J_n^M(θ_0) is close to zero, then the rate of convergence is fast (a small number of iterations suffices for convergence to the MLE).

14.2.1 Censored data

Let us return to the example at the start of this section, and construct the EM-algorithm for censored data. We recall that the log-likelihoods for censored data and complete data are

L_n(Y;θ) = Σ_{i=1}^n log f(Y_i;θ) + Σ_{i=n+1}^{n+m} log F̄(Y_i;θ)

and

L_n(Y,U;θ) = Σ_{i=1}^n log f(Y_i;θ) + Σ_{i=n+1}^{n+m} log f(T_i;θ).

To implement the EM-algorithm we need to evaluate the expectation step Q(θ*,θ). It is easy to see that

Q(θ*,θ) = E[ L_n(Y,U;θ) | Y, θ* ] = Σ_{i=1}^n log f(Y_i;θ) + Σ_{i=n+1}^{n+m} E[ log f(T_i;θ) | Y, θ* ].

To obtain E[log f(T_i;θ)|Y,θ*] for n+1 ≤ i ≤ n+m, we note that

E[ log f(T_i;θ) | Y, θ* ] = E[ log f(T_i;θ) | T_i ≥ c, θ* ] = (1/F̄(c;θ*)) ∫_c^∞ [ log f(u;θ) ] f(u;θ*) du.
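For the censored-data EM above, the exponential case is fully explicit: if f(x;θ) = θ^{−1}e^{−x/θ}, then by memorylessness E[T_i | T_i ≥ c] = c + θ, so the E-step imputes c + θ* for each censored time and the M-step is the complete-data mean. The sketch below (our own simulation settings) also illustrates the rate discussion: the EM map here is affine with slope m/(n+m), the missing fraction of the data, which is exactly the ratio J^M/J^C for this model.

```python
import random

random.seed(3)
theta0, c = 2.0, 1.5
ts = [random.expovariate(1 / theta0) for _ in range(4000)]
obs = [t for t in ts if t <= c]          # fully observed survival times
n = len(obs)
m = len(ts) - n                          # number censored at c

theta = 1.0                              # deliberately poor initial value
for _ in range(200):
    # E-step: E[T | T >= c, theta*] = c + theta* for the exponential
    imputed_total = sum(obs) + m * (c + theta)
    # M-step: complete-data MLE of the exponential mean
    theta_new = imputed_total / (n + m)
    if abs(theta_new - theta) < 1e-12:
        break
    theta = theta_new

direct_mle = (sum(obs) + m * c) / n      # closed-form censored MLE
print(theta, direct_mle)                 # the EM limit equals the direct MLE
```

Solving the fixed-point equation θ* = (Σ_obs Y_i + m(c + θ*))/(n+m) by hand recovers θ* = (Σ_obs Y_i + mc)/n, so the iteration converges to the direct censored MLE, as the lemma guarantees.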
Therefore we have

Q(θ*,θ) = Σ_{i=1}^n log f(Y_i;θ) + (m/F̄(c;θ*)) ∫_c^∞ [ log f(u;θ) ] f(u;θ*) du.

We also note that the derivative of Q(θ*,θ) with respect to θ is

∂Q(θ*,θ)/∂θ = Σ_{i=1}^n (1/f(Y_i;θ)) ∂f(Y_i;θ)/∂θ + (m/F̄(c;θ*)) ∫_c^∞ (1/f(u;θ)) (∂f(u;θ)/∂θ) f(u;θ*) du.

Hence for this example, the EM-algorithm is:

(i) Define an initial value θ_1 ∈ Θ. Let θ* = θ_1.

(ii) (The expectation step.) For the fixed θ*, evaluate ∂Q(θ*,θ)/∂θ as above.

(iii) (The maximisation step.) Solve ∂Q(θ*,θ)/∂θ = 0; let θ_{k+1} be such that ∂Q(θ*,θ)/∂θ |_{θ=θ_{k+1}} = 0.

(iv) If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Otherwise set θ* = θ_{k+1}, and repeat steps (ii) and (iii).

14.2.2 Mixture distributions

We now consider a useful application of the EM-algorithm: the estimation of parameters in mixture distributions. Let us suppose that {Y_i}_{i=1}^n are iid random variables with density

f(y;θ) = p f_1(y;θ_1) + (1−p) f_2(y;θ_2),

where θ = (p, θ_1, θ_2) are unknown parameters. For the purpose of identifiability we will suppose that θ_1 ≠ θ_2, p ≠ 1 and p ≠ 0. The log-likelihood of {Y_i} is

L_n(Y;θ) = Σ_{i=1}^n log[ p f_1(Y_i;θ_1) + (1−p) f_2(Y_i;θ_2) ].   (14.5)

Now, maximising the above directly can be extremely difficult. As an illustration, consider the example below.
Example Let us suppose that f_1(y;θ_1) and f_2(y;θ_2) are normal densities. Then the log-likelihood is

L_n(Y;θ) = Σ_{i=1}^n log[ p (2πσ_1²)^{−1/2} exp(−(Y_i−µ_1)²/(2σ_1²)) + (1−p) (2πσ_2²)^{−1/2} exp(−(Y_i−µ_2)²/(2σ_2²)) ].

We observe that this is extremely difficult to maximise. On the other hand, if the Y_i were simply normally distributed then the log-likelihood would be extremely simple:

L_n(Y;θ) ∝ −(1/2) Σ_{i=1}^n [ log σ_1² + (Y_i−µ_1)²/σ_1² ].

In other words, the simplicity of maximising the log-likelihood of the exponential family of distributions (see Section 3.1) is lost for mixtures of distributions.

We now use the EM-algorithm as an indirect but simple method of maximising (14.5). In this example, it is not clear what observations are missing. However, let us consider one possible interpretation of the mixture distribution. Define the random variables (δ_i, Y_i), where δ_i ∈ {1,2} with

P(δ_i = 1) = p,  P(δ_i = 2) = 1−p,

and the conditional densities f(Y_i = y | δ_i = 1) = f_1(y;θ_1) and f(Y_i = y | δ_i = 2) = f_2(y;θ_2). Therefore, it is clear from the above that the density of Y_i is

f(y;θ) = p f_1(y;θ_1) + (1−p) f_2(y;θ_2).

Hence, one interpretation of the mixture model is that there is a hidden, unobserved random variable which determines the state (or distribution) of Y_i. A simple example: Y_i is the height of an individual and δ_i is the gender; however, δ_i is unobserved and only the height is observed. Often a mixture distribution has a physical interpretation, similar to the height example, but sometimes it can be used to parametrically model a wide class of densities.

Based on the discussion above, U = {δ_i} can be treated as the missing observations. The likelihood of (Y_i, δ_i) is

[ p_1 f_1(Y_i;θ_1) ]^{I(δ_i=1)} [ p_2 f_2(Y_i;θ_2) ]^{I(δ_i=2)} = p_{δ_i} f_{δ_i}(Y_i;θ_{δ_i}),

where we set p_1 = p and p_2 = 1−p. Therefore the log-likelihood of {(Y_i, δ_i)} is

L_n(Y,U;θ) = Σ_{i=1}^n [ log p_{δ_i} + log f_{δ_i}(Y_i;θ_{δ_i}) ].
We now need to evaluate

Q(θ*,θ) = E[ L_n(Y,U;θ) | Y, θ* ] = Σ_{i=1}^n { E[ log p_{δ_i} | Y_i, θ* ] + E[ log f_{δ_i}(Y_i;θ_{δ_i}) | Y_i, θ* ] }.

We see that the above expectation is taken with respect to the distribution of δ_i conditioned on Y_i and the parameter θ*. By using conditioning arguments it is easy to see that

P(δ_i = 1 | Y_i = y, θ*) = P(δ_i = 1, Y_i = y; θ*)/P(Y_i = y; θ*) = p* f_1(y;θ_1*) / [ p* f_1(y;θ_1*) + (1−p*) f_2(y;θ_2*) ] =: w_1(θ*; y),
P(δ_i = 2 | Y_i = y, θ*) = (1−p*) f_2(y;θ_2*) / [ p* f_1(y;θ_1*) + (1−p*) f_2(y;θ_2*) ] =: w_2(θ*; y) = 1 − w_1(θ*; y).

Therefore

Q(θ*,θ) = Σ_{i=1}^n [ log p + log f_1(Y_i;θ_1) ] w_1(θ*; Y_i) + Σ_{i=1}^n [ log(1−p) + log f_2(Y_i;θ_2) ] w_2(θ*; Y_i).

Now, maximising the above with respect to p, θ_1 and θ_2 will in general be much easier than maximising L_n(Y;θ). For this example the EM algorithm is:

(i) Define an initial value θ_1 ∈ Θ. Let θ* = θ_1.

(ii) (The expectation step.) For the fixed θ*, evaluate Q(θ*,θ) above.

(iii) (The maximisation step.) Evaluate θ_{k+1} = argmax_{θ∈Θ} Q(θ*,θ) by differentiating Q(θ*,θ) with respect to θ and equating to zero. Since the parameters p and (θ_1, θ_2) are in separate subfunctions, they can be maximised separately.

(iv) If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Otherwise set θ* = θ_{k+1}, and repeat steps (ii) and (iii).

Exercise: Derive the EM algorithm in the case that f_1 and f_2 are normal densities.

It is straightforward to see that the arguments above can be generalised to the case that the density of Y_i is a mixture of r different densities. However, we observe that the selection of r can be quite ad hoc. There are methods for choosing r; these include the reversible jump MCMC methods proposed by Peter Green.
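One possible solution to the exercise above: when f_1 and f_2 are normal, the M-step has closed forms, namely weighted means and variances with the weights w_1(θ*; Y_i). The sketch below is our own implementation (the initial values and simulation settings are arbitrary choices, not from the notes):

```python
import math
import random

def norm_pdf(y, mu, var):
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gaussian_mixture_em(ys, p, mu1, var1, mu2, var2, iters=300):
    """EM for p*N(mu1,var1) + (1-p)*N(mu2,var2)."""
    for _ in range(iters):
        # E-step: responsibilities w_i = P(delta_i = 1 | Y_i, current theta)
        w = [p * norm_pdf(y, mu1, var1) /
             (p * norm_pdf(y, mu1, var1) + (1 - p) * norm_pdf(y, mu2, var2))
             for y in ys]
        # M-step: weighted means and variances maximise Q in closed form
        s1 = sum(w)
        s2 = len(ys) - s1
        mu1 = sum(wi * y for wi, y in zip(w, ys)) / s1
        mu2 = sum((1 - wi) * y for wi, y in zip(w, ys)) / s2
        var1 = sum(wi * (y - mu1) ** 2 for wi, y in zip(w, ys)) / s1
        var2 = sum((1 - wi) * (y - mu2) ** 2 for wi, y in zip(w, ys)) / s2
        p = s1 / len(ys)
    return p, mu1, var1, mu2, var2

random.seed(4)
ys = [random.gauss(0, 1) if random.random() < 0.4 else random.gauss(5, 2)
      for _ in range(4000)]
p, mu1, var1, mu2, var2 = gaussian_mixture_em(ys, 0.5, -1.0, 1.0, 4.0, 1.0)
print(p, mu1, mu2)   # roughly 0.4, 0 and 5 for this simulation
```

As the notes warn, EM for mixtures can converge to a local maximum, so in practice one reruns the algorithm from several initial values and keeps the run with the largest observed likelihood.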
133 Example Question: Suppose that the regressors x t are believed to influence the response variable Y t. The distribution of Y t is PY t = y = p λy t1 exp λ t1y y! where λ t1 = expβ 1 x t and λ t2 = expβ 2 x t. +1 p λy t2 exp λ t2y, y! i State minimum conditions on the parameters, for the above model to be identifiable? ii Carefully explain giving details of Qθ,θ and the EM stages how the EM-algorithm can be used to obtain estimators of β 1,β 2 and p. iii Derive the derivative of Qθ,θ, and explain how the derivative may be useful in the maximisation stage of the EM-algorithm. iv Given an initial value, will the EM-algorithm always find the maximum of the likelihood? Explain how one can check whether the parameter which maximises the EM-algorithm, maximises the likelihood. Solution i 0 < p < 1 and β 1 β 2 these are minimum assumptions, there could be more which is hard to account for given the regressors x t. ii We first observe that PY t = y is a mixture of two Poisson distributions where each has the canonical link function. Define the unobserved variables, {U t }, which are iid and where PU t = 1 = p and PU t = 2 = 1 p and PY = y U i = 1 = λy t1 exp λ t1y y! and PY = y U i = 2 = λy t2 exp λ t2y y!. Therefore, we have logfy t,u t,θ = Y t β u t x t expβ u t x t +logy t!+logp, where θ = β 1,β 2,p. Thus, ElogfY t,u t,θ Y t,θ is ElogfY t,u t,θ Y t,θ = Y t β 1x t expβ 1x t +logy t!+logp πθ,y t + Y t β 2x t expβ 2x t +logy t!+logp 1 πθ,y t. where PU i Y t,θ is evaluated as PU i = 1 Y t,θ = πθ,y t = pf 1 Y t,θ pf 1 Y t,θ +1 pf 2 Y t,θ, 133
134 with f 1 Y t,θ = expβ 1 x ty t exp Y t expβ 1 x t Y t! Thus Qθ,θ is Qθ,θ = f 1 Y t,θ = expβ 1 x ty t exp Y t expβ 1 x t. Y t! Y t β 1x t expβ 1x t +logy t!+logp πθ,y t + Y t β 2x t expβ 2x t +logy t!+log1 p 1 πθ,y t. Using the above, the EM algorithm is the following: a Start with an initial value which is an estimator of β 1,β 2 and p, denote this as θ. b For every θ evaluate Qθ,θ. c Evaluate argmax θ Qθ,θ. Denote the maximum as θ and return to step b. d Keep iterating until the maximums are sufficiently close. iii The derivative of Qθ,θ is Qθ,θ β 1 = Qθ,θ β 2 = Qθ,θ p = Y t expβ 1x t x t πθ,y t Y t expβ 2x t x t 1 πθ,y t 1 p πθ,y t 1 1 p 1 πθ,y t. Thus maximisation of Qθ,θ can be achieved by solving for the above equations using iterative weighted least squares. iv Depending on the initial value, the EM-algorithm may only locate a local maximum. To check whether we have found the global maximum, we can start the EM-algorithm with several different initial values and check where they converge. Example Question 2 Let us suppose that F 1 t and F 2 t are two survival functions. Let x denote a univariate regressor. [25] i Show that Ft;x = pf 1 t expβ 1x +1 pf 2 t expβ 2x is a valid survival function and obtain the corresponding density function. 134
135 ii Suppose that T i are survival times and x i is a univariate regressor which exerts an influence an T i. Let Y i = mint i,c, where c is a common censoring time. {T i } are independent random variables with survival function Ft;x i = pf 1 t expβ 1x i +1 pf 2 t expβ 2x i, where both F 1 and F 2 are known, but p, β 1 and β 2 are unknown. State the censored likelihood and show that the EM-algorithm together with iterative least squares in the maximisation step can be used to maximise this likelihood sufficient details need to be given such that your algorithm can be easily coded. Solution i Since F 1 and F 2 are monotonically decreasing positive functions where F 1 0 = F 2 0 = 1 and F 1 = F 2 = 0, then it immediately follows that Ft,x = pf 1 t eβ 1 x +1 pf 2 t eβ 2 x is the same use that df 1t dt = f 1 t, thus Ft;x is a survival function. Ft, x = pe β1x f 1 tf 1 t eβ 1 x 1 1 pe β2x f 2 tf 2 t eβ 2 x 1 t ft;x = pe β1x f 1 tf 1 t eβ 1 x 1 +1 pe β2x f 2 tf 2 t eβ 2 x 1 ii The censored log likelihood is L n β 1,β 2,p = [δ i logfy i ;β 1,β 2,p+1 δ i logfy i ;β 1,β 2,p]. Clearly, directly maximizing the above is extremely difficult. Thus we look for an alternative method via the EM algorithm. Define the unobserved variable I i = { 1 with PIi = 1 = p = p 1 2 with PI i = 2 = 1 p = p 2. Then the joint density of ϕ i,δ i,i i is } { } δ i {logp Ii +β Ii x+logf Ii Y i +e β I x i 1logF Ii Y i +1 δ i logp Ii +e β I x i logf Ii Y i. Thus the complete log likelihood is L T Y,δ,I i ;β 1,β 2,p = {δ i [logp Ii +β Ii x+logf Ii Y i +e β I x i 1logF Ii Y i +1 δ i [logp Ii +e β I i x logf Ii Y i ]} 135
136 Now we need to calculate PI i Y i,δ i. We have ω δ i = PI i = 1 Y i,δ i = 1,p α,β α 1,β α 2 = p α e βα 1 x f 1 Y i F 1 Y i eβα 1 x 1 p α e βα 1 x f 1 Y i F 1 Y i eβα 1 x 1 +1 p α e βα 2 x f 2 Y i F 2 Y i eβα 2 x 1 ω δ i=0 i = PI i = 1 Y i,δ i = 0,p α,β α 1,β α 2 = p α F 1 Y i eβα 1 x p α F 1 Y i eβα 1 x +1 p α F 2 Y i eβα 2 x Therefore the complete likelihood conditioned on what we observe is Qθ,θ α = {δ i ω δ i i [logp+β 1x i +logf 1 Y i +e β 1x i 1logF 1 Y i ] + 1 δ i ω 1 δ i i + [ ] logp+e β 1x i logf 1 Y i } {δ i 1 ω δ i i [log1 p+β 2x i +logf 2 Y i +e β 2x i 1logF 2 Y i ] + 1 δ i 1 ω 1 δ i i [log1 p+e β 2x i logf 2 Y i ]} The conditional likelihood, above, looks unwieldy. However, the parameter estimates can to be separated. First, differentiating with respect to p gives Q T p = δ i ω δ 1 i i p + T ω 1 δ i i 1 δ i 1 T p δ i 1 ω δ i i 1 T 1 p 1 ω 1 δ 1 i i 1 δ i 1 p. Equating the above to zero we have the estimator ˆp = a a+b, where a = b = δ i ω δ i i + ω 1 δ i i 1 δ i T δ i 1 ω δ i i 1 ω 1 δ i i 1 δ i. Now we consider the estimates of β 1 and β 2 at the i th iteration step. Differentiating Q wrt 136
137 to β 1 and β 2 gives Q β 1 = Q β 2 = 2 Q β Q β 2 2 = = {δ i ω δ i i [ ] 1+e β 1x i logf 1 Y i +1 δ i ω 1 δ i i e β 1x i logf 1 Y i }x i = 0 ] {δ i 1 ω δ i i [1+e β 2x i logf 2 Y i δ i ω δ i i eβ 1x i logf 1 Y i +1 δ i ω 1 δ i i e β 1x i logf 1 Y i x 2 i +1 δ i 1 ω 1 δ i i e β 2x i logf 2 Y i }x i = 0 δ i 1 ω δ i i eβ 2x i logf 2 Y i +1 δ i 1 ω 1 δ i i e β 2x i logf 2 Y i x 2 i. Thus to estimate β 1,β 2 at the j th iteration we use [ β j 1 β j 2 ] = [ β j 1 1 β j 1 2 ] + 2 Q β Q β β j 1 [ Q β 1 Q β 2 ] β j 1 Thus β j 1 = β j Q β Q β 1 β j 1. And similarly for β j 2. Now we can rewrite Q 2 1 Q β1 2 β 1 β j 1 as X ω j 1 1 X 1 X S j 1 1 where ω j 1 X = x 1,x 2,...,x T, 1 = diag[ω j 1 S j 1 1 = ω j 1 1i S j 1 ij S j S j 1 1T = δ i ω s i eβj 1 i i 11,...,ω j 1 1T ],,with = δ i ω δ i i [1+eβj 1 logf 1 Y i +1 δ i ω 1 δ i i e βj 1 i x i logf 1 Y i 1 x i logf 1 Y i ]+1 δ i ω 1 δ i i e βj 1 1 x i logf 1 Y i ] Thus altogether in the EM-algorithm we have: Start with initial value β 0 1,β0 2,p0 Step 1 Set β 1,r 1,β 2,r 1,p r 1 = β1,β 2,p. Evaluate ω δ i i and ω 1 δ i i these probabilies/weights stay the same throughout the iterative least squares. 137
138 Step 2 Maximize Qθ,θ by using the algorithm p r = ar a r+b r where a r,b r are defined previously. Now evaluate β j 1 = β j 1 1 +X ω j 1 1 X 1 X S j01 1 same for β j 2, β j 2 +X ω j 1 2 X 1 X 1 S j 1 2 iterate until convergence. Step 3 Let β 1r,β 2r,p r be the limit of the iterative least squares, go back to step 1 until convergence. Example Question Let us suppose that X and Z are independent positive random variables with densities f X and f Z respectively. i Derive the density function of 1/X. ii Show that the density of XZ is or equivalently c 1 f Z cyf X c 1 dc. b Consider the linear regression model 1 x f Z y x f Xxdx 14.7 Y i = α x i +σ i ε i where ε i follows a standard normal distribution mean zero and variance 1 and σ 2 i follows a Gamma distribution fσ 2 ;λ = σ2κ 1 λ κ exp λσ 2, σ 2 0, Γκ with κ > 0. Let us suppose that α and λ are unknown parameters but κ is a known parameter. i Give an expression of the log-likelihood of Y i and explain why it is difficult to compute the maximum likelihood estimate? ii As an alternative to directly maximising the likelihood, the EM algorithm can be used instead. Derive the EM-algorithm for this case. In your derivation explain what quantities will have to be evaluated numerically. 138
139 Solution 3 a i P1/X c = PX > 1/c = 1 F X 1/c F X distribution function of X. Therefore the density of 1/X is f 1/X c = 1/c 2 f X 1/c. ii We first note that PXZ y X = x = PZ y/x. Therefore the density of XZ X is f XZ X y = x 1 f Z y x. Using this we obtain the density of XZ f XZ y = = PXZ = y X = xf X x = f XZ X y xf X xdx 1 x f Z y x f Xxdx 14.8 Or equivalently we can condition on 1 X to obtain f XZ y = PXZ = y 1 X = xf 1/Xc = = c 1 f Z cyf X c 1 dc. cf Z cy 1 c 2f Xc 1 dc Note that with a change of variables c = 1/x we can show that both integrals are equivalent. b i We recall that Y i = α x i +σ i ε i. Therefore the log-likelihood of Y i is L n α,λ = = n logf σε Y i α x i n 1 log x f σ Y i α x i ;λf ε xdx, x where we use 14.8 to obtain the density of f σε, f σ ;λ is the density of a square root Gamma random variable and f ε is the density of a normal. It is clear either it is very hard or impossible to obtain an explicit expression for f σε. ii Let U = σ 2 1,...,σ2 n denote the unobserved variances which are unobserved and Y = Y 1,...,Y n which is observed. The complete unobserved log-likelihood of U,Y is L T Y,U;α,λ = n logσi [ σi 2 Yi α ] 2 x i +κ 1logσ 2 i +κlogλ λσi 2 Of course, the above can not be evaluated, since U is unobserved. Instead we evaluate the condition expectation of the above with respect to what is observed. Thus the 139
140 conditioned likelihood with respect to Y and the parameters α,λ is Qα,λ = E L T Y,U;α,λ Y,α,λ = n { E logσi 2 σ 2 i ε 2 i = Y i α x i 2,λ 1 [Yi +E σi 2 σ2 iε 2 i = Y i α x i 2,λ α ] 2 x i +κ 1E logσi 2 σ2 iε 2 i = Y i α x i 2,λ +κlogλ } λeσi 2 σ2 iε 2 i = Y i α x i 2,λ. We note that the above is true because conditioning on Y i and α, means that σ 2 i ε2 i = Y i α x i 2 is observed. Thus by evaluating Q at each stage we can implement the EM algorithm: The EMalgorithm i Define an initial value θ 1 Θ. Let θ = θ 1. ii The expectation step The k+1-step, For a fixed θ evaluate logfy,u;θ Qθ,θ = E logfy,u;θ Y,θ = fu Y,θ du, for all θ Θ. iii The maximisation step Evaluate θ k+1 = argmax θ Θ Qθ,θ. We note that the maximisation can be done by finding the solution of logfy,u;θ E Y,θ = 0. iv If θ k and θ k+1 are sufficiently close to each other stop the algorithm and set ˆθ n = θ k+1. Else set θ = θ k+1, go back and repeat steps ii and iii again. The useful feature of this EM-algorithm is that if the weights 1 E σi 2 σiε 2 2 i = Y i α x i 2,λ Eσi σ 2 iε 2 2 i = Y i α x i 2,λ. are known. Then we donot need to numerically maximise Qα,λ at each stage. This 140
141 is because the derivative of Qα,λ leads to an explicit solution for α and λ: Qα, λ α Qα, λ α n 1 [Yi = 2 E σ 2 σiε 2 2 i = Y i α x i 2,λ α ] x i xi = 0 i n κ = λ Eσ2 i σiε 2 2 i = Y i α x i 2,λ = 0. It is straightfoward to see that the above can easily be solved for α and λ. Of course we need to evaluate the weights 1 E σiε 2 2 i = Y i α x i 2,λ σ 2 i Eσi σ 2 iε 2 2 i = Y i α x i 2,λ. This is done numerically, by noting that for a general g the conditional expectation is Egσ 2 σ 2 ε 2 = y = gσ 2 f σ 2 σ 2 ε 2σ2 ydσ 2. Thus to obtain the density of f σ 2 σ 2 ε 2 we note that Pσ2 < s σ 2 ε 2 = y = Py < ε 2 s = Pε 2 y/s = 1 Pε 2 y/s. Hence the density of σ 2 given σ 2 ε 2 = y is f σ 2 σ 2 ε 2s ε2 = 1 f s 2 ε 2y/s, where f ε 2 is a chi-squared distribution with one degree of freedom. Hence Egσ 2 σ 2 ε 2 = y = gσ 2 1 σ 2f ε 2y/σ2 dσ 2. Using the above we can numerically evaluate the above conditional expectations and thus Qα, λ. We keep iterating until we get convergence Hidden Markov Models Finally, we consider applications of the the EM-algorithm to parameter estimation in Hidden Markov Models HMM. This is a model where the EM-algorithm pretty much surpasses any other likelihood maximisation methodology. It is worth mentioning that the EM-algorithm in this setting is often called the Baum-Welch algorithm. Hidden Markov models are a generalisation of mixture distributions, however unlike mixture distibutions it is difficult to derive an explicit expression for the likelihood of a Hidden Markov Models. HMM are a general class of models which are widely used in several applications including speech recongition, and can easily be generalised to the Bayesian set-up. A nice description of them can be found on Wikipedia. 141
142 In this section we will only briefly cover how the EM-algorithm can be used for HMM. We do not attempt to address any of the issues surrounding how the maximisation is done; interested readers should refer to the extensive literature on the subject.

The general HMM is described as follows. Let us suppose that we observe {Y_t}, where the rvs Y_t satisfy the Markov property P(Y_t | Y_{t−1}, Y_{t−2}, ...) = P(Y_t | Y_{t−1}). In addition to {Y_t} there exists a hidden, unobserved, discrete random variable {U_t}, where {U_t} satisfies the Markov property P(U_t | U_{t−1}, U_{t−2}, ...) = P(U_t | U_{t−1}) and drives the dependence in {Y_t}. In other words P(Y_t | U_t, Y_{t−1}, U_{t−1}, ...) = P(Y_t | U_t). To summarise, the HMM is described by the following properties:

i We observe {Y_t} (which can be either continuous or discrete random variables) but do not observe the hidden discrete random variables {U_t}.

ii Both {Y_t} and {U_t} are time-homogeneous Markov random variables, that is P(Y_t | Y_{t−1}, Y_{t−2}, ...) = P(Y_t | Y_{t−1}) and P(U_t | U_{t−1}, U_{t−2}, ...) = P(U_t | U_{t−1}). The distributions P(Y_t), P(Y_t | Y_{t−1}), P(U_t) and P(U_t | U_{t−1}) do not depend on t.

iii The dependence between the {Y_t} is driven by {U_t}, that is P(Y_t | U_t, Y_{t−1}, U_{t−1}, ...) = P(Y_t | U_t).

There are several examples of HMM, but to have a clear interpretation of them, in this section we shall only consider one classical example of a HMM. Let us suppose that the hidden random variable U_t can take N possible values {1, ..., N} and let p_i = P(U_t = i) and p_ij = P(U_t = i | U_{t−1} = j). Moreover, let us suppose that the Y_t are continuous random variables where Y_t | U_t = i ~ N(µ_i, σ_i²) and the conditional random variables Y_t | U_t and Y_τ | U_τ are independent of each other. Our objective is to estimate the parameters θ = {p_i, p_ij, µ_i, σ_i²} given {Y_i}. Let f_i(·; θ) denote the density of the normal distribution N(µ_i, σ_i²).

Remark (HMM and mixture models) Mixture models described in the above section are a particular example of HMM.
In this case the unobserved variables {U t } are iid, where p i = PU t = i U t 1 = j = PU t = i for all i and j. Let us denote the log-likelihood of {Y t } as L T Y;θ this is the observed likelihood. It is clear that constructing an explicit expression for L T is difficult, thus maximising the likelihood is near impossible. In the remark below we derive the observed likelihood. Remark The likelihood of Y = Y 1,...,Y T is L T Y;θ = fy T Y T 1,Y T 2,...;θ...fY 2 Y 1 ;θpy 1 ;θ = fy T Y T 1 ;θ...fy 2 Y 1 ;θfy 1 ;θ. 142
143 Thus the log-likelihood is L T Y;θ = logfy t Y t 1 ;θ+fy 1 ;θ. The distribution of fy 1 ;θ is simply the mixture distribution t=2 fy 1 ;θ = p 1 fy 1 ;θ p N fy 1 ;θ N, where p i = PU t = i. The conditional fy t Y t 1 is more tricky. We start with fy t Y t 1 ;θ = fy t,y t 1 ;θ. fy t 1 ;θ An expression for fy t ;θ is given above. To evaluate fy t,y t 1 ;θ we condition on U t,u t 1 to give using the Markov and conditional independent propery fy t,y t 1 ;θ = i,j fy t,y t 1 U t = i,u t 1 = jpu t = i,u t 1 = j = i,j fy t U t = ipy t 1 U t 1 = jpu t = i U t 1 = jpu t 1 = i = i,j f i Y t ;θ i f j Y t 1 ;θ j p ij p i. Thus we have fy t Y t 1 ;θ = i,j f iy t ;θ i f j Y t 1 ;θ j p ij p i i p. ify t 1 ;θ i We substitute the above into L T Y;θ to give the expression L T Y;θ = Now try to maximise this! t=2 log i,j f iy t ;θ i f j Y t 1 ;θ j p ij p i i p ify t 1 ;θ i N +log p i fy 1 ;θ i Instead we seek an indirect method for maximising the likelihood. By using the EM algorithm we can maximise a likelihood which is a lot easier to evaluate. Let us suppose that we observe {Y t,u t }. Since PY U = PY T Y T 1,...,Y 1,UPY T 1 Y T 2,...,Y 1,U...PY 1 U = T PY t U t, and the distribution of Y t U t is Nµ Ut,σ 2 U t, then the complete likelihood of {Y t,u t } is T T fy t U t ;θ p U1 p Ut U t t=2
144 Thus the log-likelihood of the complete observations {Y t,u t } is L T Y,U;θ = logfy t U t ;θ+ logp Ut U t 1 +logp U1. Of course, we do not observe the complete likelihood, but the above can be used in order to define the function Qθ,θ which is maximised in the EM-algorithm. It is worth mentioning that given the transition probabilities of a discrete Markov chain that is {p i,j } ij one can obtain the marginal probabilities {p i }. Thus it is not necessary to estimate the marginal probabilities {p i } note that the exclusion of {p i } in the log-likelihood, above, gives the conditional complete log-likelihood. We recall that to maximise the observed likelihood L T Y;θ using the EM algorithm involves evaluating Qθ,θ, where T Qθ,θ = E logfy t U t ;θ+ = U = U t=2 logp Ut U t 1 +logp U1 Y,θ t=2 T logfy t U t ;θ+ logp Ut U t 1 +logp U1 pu Y,θ t=2 [logfy t U t ;θ]pu t Y,θ + U [logp Ut U t 1 ]PU t,u t 1 Y,θ +[logp U1 ]PU 1 Y,θ, and U denotesallcombinationsofu. SincePU t Y,θ = PU t Y,θ /PY,θ andpu t,u t 1 Y,θ = PU t,u t 1 Y,θ /PY,θ and PY,θ is common to all U t and is independent of θ we can define t=2 Qθ,θ = U [logfy t U t ;θ]pu t,y,θ + U [logp Ut U t 1 ]PU t,u t 1,Y,θ +[logp U1 ]PU 1,Y,θ, t=2 where Qθ,θ Qθ,θ and the maximum of Qθ,θ with respect to θ is the same as the maximum of Qθ,θ. Thus the quantity Qθ,θ is evaluated and maximised with respect to θ. For a given θ and Y, the conditional probabilities PU t,y,θ and PU t,u t 1,Y,θ can be evaluated through a series of iterative steps. For this example the EM algorithm is i Define an initial value θ 1 Θ. Let θ = θ 1. ii The expectation step, For a fixed θ evaluate PU t,y,θ, PU t,u t 1,Y,θ. Qθ,θ defined in
145 iii The maximisation step: evaluate θ_{k+1} = argmax_{θ∈Θ} Q̃(θ*, θ) by differentiating Q̃(θ*, θ) with respect to θ and equating to zero.

iv If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Else set θ* = θ_{k+1}, go back and repeat steps ii and iii again.
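The "series of iterative steps" mentioned above for evaluating the conditional probabilities are the forward-backward recursions. As a flavour (my own illustration; the forward recursion is standard but is not derived in these notes), the sketch below uses the forward recursion to evaluate the observed log-likelihood L_T(Y;θ) of the Gaussian HMM of this section, with the convention p_ij = P(U_t = i | U_{t−1} = j), so the columns of the transition matrix sum to one.

```python
import numpy as np

def hmm_log_likelihood(y, p_init, P, mus, sigmas):
    """Scaled forward recursion for a Gaussian HMM.

    p_init[i] = p_i = P(U_1 = i); P[i, j] = p_ij = P(U_t = i | U_{t-1} = j);
    Y_t | U_t = i ~ N(mus[i], sigmas[i]^2).
    """
    # emission densities f_i(y_t), one row per time point
    B = np.exp(-0.5 * ((y[:, None] - mus) / sigmas) ** 2) \
        / (sigmas * np.sqrt(2 * np.pi))
    alpha = p_init * B[0]              # alpha_1(i) = p_i f_i(y_1)
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()               # rescale to avoid numerical underflow
    for t in range(1, len(y)):
        # alpha_t(i) = f_i(y_t) * sum_j p_ij alpha_{t-1}(j)
        alpha = B[t] * (P @ alpha)
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

y = np.array([0.1, -1.2, 2.3, 0.7])
p_init = np.array([0.4, 0.6])
mus = np.array([-1.0, 1.0])
sigmas = np.array([1.0, 0.5])
# transition matrix whose columns all equal p_init: the chain is then iid
P_iid = np.column_stack([p_init, p_init])
ll = hmm_log_likelihood(y, p_init, P_iid, mus, sigmas)
```

When every column of P equals p_init the hidden chain is iid, so the recursion must reduce to the mixture log-likelihood Σ_t log Σ_i p_i f_i(Y_t); this gives a simple correctness check.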
147 Chapter 15

The nonparametric density

15.1 Nonparametric density estimation

The nonparametric density estimator

In the sections above we often assumed that {X_t} were iid random variables with density f(x; θ_0), where f was known but θ_0 was unknown. In reality we often do not know the parametric family from which {X_t} comes. We may want an estimator of f without imposing any parametric constraints on it (we can even use this estimator as a guide as to which parametric family would be appropriate for fitting to the data). The most obvious way to estimate f would be to use the histogram. In other words, partition the interval into a sequence a_1 < a_2 < a_3 < ... < a_N, such that [a_1, a_N] is the range of {X_t}. Let N_j denote the number of observations in the interval (a_{j−1}, a_j] and define

f̂(x) = (1/T) Σ_{j=2}^{N} [N_j / (a_j − a_{j−1})] I_{(a_{j−1}, a_j]}(x).

There are a few drawbacks in using f̂(x):

a A plot of f̂(x) demonstrates that it is quite rough, blocky and not particularly appealing to look at.

b Suppose that the frequency in the interval [a_{j−1}, a_j] is very large, and most of the values of X_t which lie in [a_{j−1}, a_j] are close to a_{j−1}. I want an estimator of f(x), where x is less than a_{j−1} but close to a_{j−1}. The estimator f̂(x) does not take into account the proximity of x to the values of X_t which lie in the adjacent interval [a_{j−1}, a_j]. Hence in some sense it is not particularly local.
148 Drawback b can be remedied by considering the following kernel density estimator of f:

f̂_T(x) = (1/T) Σ_{t=1}^{T} (1/h) I_{[x−h/2, x+h/2]}(X_t) = (1/T) Σ_{t=1}^{T} (1/h) I_{[−h/2, h/2]}(x − X_t) = (1/(hT)) Σ_{t=1}^{T} I_{[−1/2, 1/2]}((x − X_t)/h).

Notice that the above estimator is based on counting the number of observations in a window of length h about x. Hence, using the above estimator we do not have to choose the partitions {a_j}, and we avoid the problem that x can be in close proximity to several X_t which lie in a different interval. The only parameter that needs to be chosen is the bandwidth h, which is a big headache. However, as a plot of f̂_T(x) will tell you, f̂_T still suffers from a roughness problem, and this is because a rectangular window is being used. We can replace the rectangular window I_{[−1/2, 1/2]} by a smooth window, which gives a more appealing looking density estimator. Hence, in general, the density estimator of f is

f̂_T(x) = (1/(hT)) Σ_{t=1}^{T} K((x − X_t)/h),

where K is a positive kernel function such that ∫K(x)dx = 1. Observe that the kernel function K can be treated as a density function. Classical examples of kernel functions are the Gaussian and the uniform (which is the rectangular kernel), etc. A different kernel will give a slightly different estimator.

The choice of h depends on the sample size: the larger the sample size, the smaller h should be. To understand why, recall that if h is small then P(x − h/2 ≤ X_t ≤ x + h/2) ≈ h f(x), therefore

f(x) ≈ P(x − h/2 ≤ X_t ≤ x + h/2)/h,

and an estimator of the above is

(number of X_t in [x − h/2, x + h/2])/(hT) = (1/(hT)) Σ_{t=1}^{T} I_{[−1/2, 1/2]}((x − X_t)/h).

Hence, h needs to be small, but not so small that there are seldom any observations in [x − h/2, x + h/2]. Therefore we will suppose that h → 0 as T → ∞.

Remark To understand the kernel density estimator, do some simulations. Draw a random sample from your favourite distribution. In R the density can be estimated using the function density. The function density gives various options. You have the option of using your favourite kernel; choices are gaussian,
149 epanechnikov, rectangular, triangular, biweight, cosine, optcosine. You can also specify the bandwidth using bw. For example,

plot(density(x, bw = 0.5, kernel = "rectangular"))

computes and plots the kernel density estimator for the observations x using the rectangular kernel and bandwidth bw = 0.5. See Figure 15.1 for the plot of the same set of observations (n = 20) using different kernels and different bandwidths. See also the corresponding section in Davison (2002).

We will discuss how to choose h in the section below.

Properties of f̂_T

The first thing to notice about f̂_T is that it is a viable density, in the sense that f̂_T is non-negative and

∫ f̂_T(x) dx = (1/T) Σ_{t=1}^{T} ∫ (1/h) K((x − X_t)/h) dx = 1,

the above being obtained by making the change of variables u_t = (x − X_t)/h. We now want to investigate how good an estimator of f the estimator f̂_T is. We do this by calculating the bias and the variance, which together give the mean squared error.

Assumption

i The function f is Lipschitz continuous. This means that there exists a finite constant B such that |f(x) − f(y)| ≤ B|x − y|. Examples of Lipschitz continuous functions are functions whose first derivative is bounded. Intuitively, a Lipschitz continuous function is a function which is relatively smooth (i.e. not too wiggly and rough; it changes smoothly). The size of B quantifies the slope or gradient: a large B means that in parts f has a steep slope. Often the amount of wiggle is quantified through bounds on the second derivative.

ii f belongs to the class C(L); this means that B ≤ L. Hence a function f ∈ C(L) can have a maximum gradient of L (it cannot be too steep).

iii The kernel function satisfies ∫ |x| K(x) dx < ∞.

iv The kernel function satisfies ∫ |x| K(x)² dx < ∞.
150 Figure 15.1: 20 observations drawn from a standard normal distribution. Top 4 plots use the rectangle kernel, bottom 4 plots use the Gaussian kernel. Each kernel was plotted with bandwidths h = 0.2, 0.4, 0.8, 1.
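In the same spirit as the R simulations suggested above, the estimator f̂_T(x) = (hT)^{−1} Σ_t K((x − X_t)/h) is a few lines to implement directly. The sketch below is my own illustrative Python version (mirroring, not reproducing, R's density function), supporting the Gaussian and rectangular kernels:

```python
import numpy as np

def kde(x_grid, data, h, kernel="gaussian"):
    """Evaluate f_hat_T(x) = (1/(hT)) sum_t K((x - X_t)/h) on a grid."""
    u = (x_grid[:, None] - data[None, :]) / h
    if kernel == "gaussian":
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    else:                              # rectangular kernel K = I_[-1/2, 1/2]
        K = (np.abs(u) <= 0.5).astype(float)
    return K.mean(axis=1) / h          # average over the T observations

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 5000)      # sample from N(0, 1)
grid = np.linspace(-6.0, 6.0, 1201)
fhat = kde(grid, data, h=0.3)
```

Since K is itself a density, f̂_T integrates to one; and with this many observations the estimate at x = 0 should be close to the true value f(0) = (2π)^{−1/2} ≈ 0.399.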
151 Unlike parametric estimators, nonparametric estimators have a sizeable bias, which needs to be tracked. We give a bound for the bias below.

Lemma Suppose Assumption is satisfied. Then we have f(x) − E f̂_T(x) = O(h).

PROOF. We observe that

E f̂_T(x) = E[(1/T) Σ_{t=1}^{T} (1/h) K((x − X_t)/h)] = E[(1/h) K((x − X_1)/h)] = ∫ (1/h) K((x − y)/h) f(y) dy.

Now by making the substitution u = (x − y)/h we have

∫ (1/h) K((x − y)/h) f(y) dy = ∫ K(u) f(x − uh) du.

By the Lipschitz continuity assumption we have |f(x) − f(x − uh)| ≤ B|uh|, hence by using this we have

|∫ K(u) f(x − uh) du − ∫ K(u) f(x) du| ≤ ∫ K(u) |f(x − uh) − f(x)| du ≤ Bh ∫ |u| K(u) du = O(h).

We use the notation f(x) − E f̂_T(x) = O(h) because B ∫|u|K(u)du is finite and constant; hence the magnitude of f(x) − E f̂_T(x) depends on the bandwidth h. We recall that O(h) means that (1/h)|∫K(u)f(x − uh)du − ∫K(u)f(x)du| < ∞ for all h.

The above basically means that f̂_T(x) is a biased estimator of f(x).

i A large h corresponds to a wide window function. This means allowing a lot of the X_t to lie inside the window K((x − X_t)/h); as we will show below, this leads to a small variance, because we have a large (effective) sample.

ii The disadvantage of a large h is, as demonstrated above, that we have a large bias. This is because a window gives a local average: the wider the window, the more we smooth out any local features present in the data (see Figure 15.1).
152 Lemma Suppose Assumption is satisfied. Then we have PROOF. It is easy to see that 1 varˆf T x = var T varˆf T x = O 1 ht. 1 h Kx X t h = 1 T var 1 h Kx X t = 1 1 h T E h Kx X t h We first consider 1 2 E h Kx X t = h 2 1 T [ 1 E h Kx X t h 1 h 2Kx y h 2 fydy = 1 Ku 2 fx+uhdu h Ku 2 du+b uku 2 du, fx h the above is true by changing variables u = x y h. Next we consider E 1 h Kx Xt h. By using the results in Lemma we have Therefore altogether we have 1 E h Kx X t h = fx+oh. varˆf T x = O 1 ht + 1 T = O 1 ht, ]. which gives the desired results. that Using the above two lemmas we can obtain a bound for the mean squared error. We know 2 E ˆf T x fx = varˆf T x+ E ˆfT x 2 fx. In other words, the mean squared error is the variance plus the bias squared. By using Lemmas and above two results we have 2 E ˆf T x fx = var ˆfT x + = O h 2 }{{} bias + 1. }{{} ht variance 2 Eˆf T x fx The above illustrates the trade off which often arises in nonparametric statistics, the larger the bandwidth h we use, the wider the window more observations and the smaller the variance 152
153 but the larger the bias. On the other hand, a small h means a small window: a small bias but a large variance (fewer observations). To decide what order we should choose for h (how small h should be with respect to the sample size), we try to find the h which leads to the smallest mean squared error bound b(h) = h² + (hT)^{−1}. We do this by differentiating b(h) with respect to h and equating to zero. This leads to the observation that h = O(T^{−1/3}) gives the smallest b(h); with this choice of bandwidth the mean squared error satisfies

E(f̂_T(x) − f(x))² = O(T^{−2/3}).

The above result implies that if the function f is Lipschitz continuous then by using the kernel density estimator the rate of convergence is O(T^{−2/3}), and this rate cannot be bettered if we use the kernel density estimator. Indeed if the function were more wiggly (rough), hence not even Lipschitz continuous, then the rate of convergence would be slower. However, if the density f were smoother (this can be quantified by the number of derivatives it has), the mean squared error of f̂_T(x) improves, so long as the kernel K satisfies suitable conditions. For example, in the case that f has a bounded second derivative, E(f̂_T(x) − f(x))² = O(T^{−4/5}).

Remark It is important to mention that the above says that if h → 0 at such a rate that 0 < hT^{1/3} < ∞, then we have E(f̂_T(x) − f(x))² = O(T^{−2/3}). It does not mean that we need to choose h such that it is exactly h = T^{−1/3}, and herein lies the problem. The above is an asymptotic result; it does not tell us how to choose h in practice. There have been several different methods proposed for selecting the bandwidth h, one of the most popular being based on cross-validation (good bandwidth selection methods have been proposed by both Jeff Hart and Simon Sheather).

In the discussion above we have considered one estimator of f(x), and for this estimator we have shown that E(f̂_T(x) − f(x))² = O(T^{−2/3}). It is natural to ask whether there exists another estimator which gives a smaller mean squared error (for example O(T^{−1}) instead of O(T^{−2/3})).
In other words, is there a lower bound for the mean squared error, such that no estimator can do better than this lower bound? We recall that we asked exactly the same question when we were considering the Cramer-Rao lower bound. However, there are two fundamental problems when working with nonparametric estimators.

i Most nonparametric estimators tend to be biased (for example, the kernel density estimator defined above is biased), whereas the Cramer-Rao lower bound is a bound for unbiased
154 estimators (this is why we only consider the variance, and not the mean squared error, for the Cramer-Rao lower bound).

ii The Cramer-Rao bound is a bound for parameters in a parametric family. In other words, we are estimating a finite number of parameters. In nonparametric statistics, we are estimating, in some sense, an infinite number of parameters.

At first glance, there does not seem to be a simple way of obtaining a lower bound for the mean squared error. However, there does exist a bound, called the van Trees or Bayesian Cramer-Rao bound, that gives a lower bound for the mean squared error even when the estimator is biased. This bound is only for parametric distributions, but by using some small tricks we are able to derive a lower bound for E(f̂_T(x) − f(x))².

The Bayesian van Trees inequality

As far as I am aware, it is not possible to obtain directly a lower bound for the mean squared error which allows for biased estimators. However, we derived in Section 2.1 the Bayesian Cramer-Rao bound. We start by recalling this bound. Let us suppose that X_t has the density f(x; θ), where θ is an unknown parameter and θ ∈ Θ, where Θ is the parameter space. But now the parameters in the parameter space are not treated as deterministic but random: θ has a distribution, which we denote as λ(θ). Hence f(x; θ) is the distribution of X_t given the parameter θ; {X_t} is a random sample with density f(x; θ), where θ is a draw from the distribution λ(θ). The prior distribution λ(θ) of θ is a probability distribution that represents the experimenter's opinion about the unknown parameters.

Now we place the mean squared error within the Bayesian framework. We note that the classical mean squared error of an estimator θ̂_T is E((θ̂_T − θ)²), but in the Bayesian framework this is a random quantity, since it depends on θ.
To emphasise that θ is random, we rewrite the classical mean squared error as

E((θ̂_T − θ)² | θ) = ∫ (θ̂ − θ)² f(θ̂ | θ) dθ̂.

Since E((θ̂_T − θ)² | θ) is random, we take the expectation with respect to the distribution λ(θ), i.e.

E_λ[E((θ̂_T − θ)² | θ)] = ∫ E((θ̂_T − θ)² | θ) λ(θ) dθ.

Now suppose that the prior density λ satisfies the following assumption.
155 Assumption θ is defined over the compact interval [a,b] and λx 0 as x a and x b so λa = λb = 0. Furthermore we have the usual regularity conditions Assumption Regularity Conditions 1 Let fx;θ be a probability density function, which satisfies i logfx;θ fx;θdx = 0. ii fx;θdx = fx;θ dx. iii For any g, gx1,...,x T n fx i;θdx = gx 1,...,g T T iv E logfx;θ 2 is strictly positive. fx i;θ Theorem Suppose that {X t } T are iid random variables with density fx;θ, where θ has the prior distribution λ. Suppose Assumptions and hold. Let ˆθ T be an estimator of θ. Then we have where E E θ ˆθ T θ 2 1 TE λ Iθ+Iλ logfx;θ 2 θ logfx;θ 2 Iθ = E = fx;θdx logλθ 2 and Iλ = λθdθ. PROOF. See Section 2.1. We note that the bound above does not require that ˆθ T is an unbiased estimator of θ. We now apply the Bayesian Cramer-Rao bound to obtain a lower bound for any density estimator of fx. See also the books by van Trees Detection, Estimation and Modulation Theory 1968 and the Bayesian Bounds for Parameter Estimation and Nonlinear Filtering/Tracking dx An application of the van Trees inequality to nonparametric density estimation We recall that the class of functions CL is defined as { CL = f; fx fy L x y, } fxdx = 1. In the rest of this section we will prove the following result. 155
156 Theorem Suppose that {X_t} are iid random variables with density f ∈ C(L). Let f̂_T be any estimator of f. Then we have the following lower bound:

sup_{f ∈ C(L)} E(f̂_T(x) − f(x))² ≥ K / T^{2/3},

where K is a finite constant which does not depend on f or f̂_T.

The above bound tells us that no estimator of the worst function in C(L) can do better than K/T^{2/3}. In other words, we cannot obtain an estimator whose mean squared error, for the worst function in the class, beats the rate K/T^{2/3}. The "worst function" is a little ambiguous, but one can think of it as the most wiggly and rough function that the class C(L) permits.

In the section above we derived a lower bound for the mean squared error by integrating over the parameter space. However, those bounds are for finite dimensional parameter spaces (for example, θ could be a mean), whereas the functions in C(L) are infinite dimensional. To resolve this problem we consider a parametric family which is a subset of C(L) (we will assume that L > 1). It is clear that a lower bound for the mean squared error over any subset of C(L) will also be a lower bound for the worst function in C(L); i.e. for any f ∈ C(L) we have

inf_{f̂} E(f̂_T(x) − f(x))² ≤ inf_{ĝ} sup_{g ∈ C(L)} E(ĝ_T(x) − g(x))².

Hence, by obtaining a lower bound over a subset of C(L), we also have a lower bound for the worst function. The problem is that we need to choose an appropriate subset of C(L) which is not too simple, in order to obtain a lower bound which is as close as possible to sup_{f∈C(L)} E(f̂_T(x) − f(x))². More precisely, if we chose a nice parametric family (for example, the Gaussian distributions with known variance but unknown mean, which is a subset of C(L)), we would be likely to obtain a parametric lower bound O(T^{−1}). However, it is likely (we will show this below) that this lower bound is too small: estimators of densities which do not behave well will not attain such a small bound. Instead we choose a parametric family which is rough, but still lies in C(L).
Let W be a density function defined on [0,1], where sup x Wx 1 and sup x W x 1. Define the class of parametric functions { } GM = fx;η;fx;η = ηwmx+1 η/mi [0,1] x,0 < η L/M M 1, hence the parameter space of η is Θ M = {η;0 η L/M M 1 }. Since for 0 x,y 1 that fx;η fy;η ηm x y +1 η/m x y L x y, 156
157 hence the parameter space of η is Θ M = {η;0 η L/M M 1 }. Therefore, we have GM CL. But we observe a special feature of GM, as M, gets large, the functions inside GM can become more and more spikey, but to keep them within the class CL, the magntitude of the spike is reduced by using an ever smaller value of η make a sketch of fx;η. Now we define a prior distribution for η. Let us suppose that λ : [0,1] R is a density defined in the unit interval, with λ0 = 0, λ1 = 0 and Iλ = logλη η 2 ληdη <. Let λ M η = Mλ Mη, where M = M M 1 /L, it is clear that the density λ M η also satisfies Assumption Let ˆη denote any estimator of η, hence fx;ˆη is an estimator of fx;η. Suppose that ˆη GM, we now use Theorem to obtain a lower bound for E λm E fx;ˆη fx 2 η = WMx+I [0,1] x 2 EλM E {ˆη η 2 η } As the sample size T grows we let M also grow hence the spike in the density becomes more peaky. To ensure that GM lies in CL, we have to shrink the parameter space Θ M, hence the prior distribution λ M will have a smaller support. Lemma Suppose that Assumption is satisfied and M 2. Then we have for some finite constant C. E λm E {ˆη η 2 η } C T/M +M 2, PROOF. We first note that by using the Bayesian Cramer-Rao bound we have where and E λm E {ˆη η 2 η } 1 TE λm I M η+iλ M, logfx;η 2 I M η = E η η logλ M θ 2λM Iλ M = θdθ. We now obtain upper bounds for the above terms. To bound I M η we observe that logfx;η 2 WMx I [0,1] x 2 =, η ηwmx+1 η/mi [0,1] x hence for a sufficiently large M we have I M η = 1 0 WMx I [0,1] x 2 η[wmx M 1 I [0,1] x]+i [0,1] x dx K 157 K WMx I [0,1] x 2 dx [WMx 2 2WMx+1]dx K/M.
158 The above inequality is obtained by making a change of variables and using that ∫W(x)dx = 1 and ∫W(x)²dx ≤ 1. Therefore we have E_{λ_M} I_M(η) ≤ K/M. We now obtain an upper bound for I(λ_M). Since λ_M(η) = M̃ λ(M̃η), by definition of I(λ_M) we have

I(λ_M) = ∫ ((∂/∂η) log λ_M(η))² λ_M(η) dη = M̃² ∫ ((∂/∂u) log λ(u))² λ(u) du = M̃² I(λ).

Therefore, by using the above, we have the bound

E_{λ_M} E({η̂ − η}² | η) ≥ 1/(K T/M + M̃² I(λ)) ≥ C/(T/M + M²),

which gives the required result.

Now by using (15.3) and the same argument as in (15.2), we have for all estimators ĝ of g

E_{λ_M} E({f(x; η̂) − f(x; η)}² | η) = (W(Mx) + I_{[0,1]}(x))² E_{λ_M} E({η̂ − η}² | η) ≤ sup_{g∈C(L)} E(ĝ_T(x) − g(x))².

Therefore, by using the lemma above and that (W(Mx) + I_{[0,1]}(x))² ≤ 4, we have for all M

C/(T/M + M²) ≤ 4 E_{λ_M} E({η̂ − η}² | η) ≤ 4 inf_{ĝ} sup_{g∈C(L)} E(ĝ_T(x) − g(x))²,

which means (absorbing the factor 4 into the constant C) that for all M we have

C/(T/M + M²) ≤ inf_{ĝ} sup_{g∈C(L)} E(ĝ_T(x) − g(x))².

Since the above lower bound holds for all G(M) ⊂ C(L) and all M, to make this bound as tight as possible we choose the M which maximises the above lower bound. By differentiating T/M + M² with respect to M and equating to zero, we get M = O(T^{1/3}). This leads to the bound

C/T^{2/3} ≤ inf_{ĝ} sup_{g∈C(L)} E(ĝ_T(x) − g(x))²,

which proves the theorem above. This basically means that any estimator of the worst function in C(L) cannot do better than the rate O(T^{−2/3}). In other words, there exist functions which are only Lipschitz continuous whose nonparametric estimators cannot better the O(T^{−2/3}) rate.
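The two rate calculations of this chapter (minimising h² + (hT)^{−1} over the bandwidth h, and differentiating T/M + M² over M above) share the same T^{1/3} structure. A quick numerical check of the bandwidth version (purely illustrative; this is a check of the rate, not a practical bandwidth selector):

```python
import numpy as np

def optimal_bandwidth(T):
    """Minimise b(h) = h^2 + 1/(hT) (squared bias + variance) over a grid."""
    h = np.linspace(1e-3, 1.0, 200001)
    b = h ** 2 + 1.0 / (h * T)
    return h[np.argmin(b)]

# calculus: db/dh = 2h - 1/(h^2 T) = 0  =>  h* = (2T)^(-1/3) = O(T^(-1/3)),
# so h* T^(1/3) should be the constant 2^(-1/3) for every sample size
ratios = [optimal_bandwidth(T) * T ** (1 / 3) for T in (100, 1000, 10000)]
# each ratio is approximately 2^(-1/3) = 0.794
```

Plugging h* back into b(h) gives b(h*) = O(T^{−2/3}), matching the minimax rate derived above.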
159 Remark Now let us return to the kernel density estimator (15.1), where we showed that E(f̂_T(x) − f(x))² = O(T^{−2/3}). The theorem above means that (15.1) is a rate-optimal estimator for functions which are Lipschitz continuous (which includes functions which have a bounded first derivative). Furthermore, this rate cannot be bettered without further smoothness assumptions on the function f. However, note the fundamental difference between parametric and nonparametric statistics. Nonparametric estimators are more flexible than parametric estimators, but the cost of this flexibility is that the rates of convergence are worse than the parametric rates, where the mean squared error is usually O(T^{−1}). For more details see Gill and Levit (Bernoulli, 1995).
161 Chapter 16

The loss function and estimating equations

16.1 Loss functions

Up until now our main focus has been on parameter estimation via the maximum likelihood. However, the negative log-likelihood is one criterion belonging to a group of criteria known as loss functions. Loss functions are usually distances, such as the l_1 and l_2 distances. Typically we estimate a parameter by minimising the loss function, using as the estimator the parameter which minimises the loss. Usually (but not always) the way to minimise the loss function is to differentiate it and equate it to zero. This motivates another means of estimating parameters: finding the solution of an equation (such functions are usually called estimating functions). For many examples there is an estimating function which corresponds to a loss function, and a loss function which corresponds to an estimating equation. But this will not always be the case.

Example (Examples where the loss function does not have a derivative) Consider the Laplacian (also known as the double exponential), which is defined as

f(y; θ, µ) = (1/(2θ)) exp(−|y − µ|/θ) = { (1/(2θ)) exp((y − µ)/θ)   y < µ ;  (1/(2θ)) exp(−(y − µ)/θ)   y ≥ µ }.

We observe {Y_t} and our objective is to estimate the location parameter µ; for now the scale parameter θ is not of interest. The log-likelihood is

L_T(µ) = −T log(2θ) − (1/θ) Σ_{i=1}^{T} |Y_i − µ|.
162 We maximise the above to estimate µ. We see that this is equivalent to minimising the loss function

L_T(µ) = Σ_{i=1}^{T} |Y_i − µ| = Σ_{Y_i > µ} (Y_i − µ) + Σ_{Y_i ≤ µ} (µ − Y_i).

If we make a plot of L_T over µ, and consider how L_T behaves at the ordered observations {Y_(t)}, we see that it is piecewise linear: continuous, but non-differentiable at the points Y_t. On closer inspection, if T is odd we see that L_T has its minimum at µ = Y_((T+1)/2), the middle order statistic, which is the sample median (to prove this to yourself, take an example of four observations and construct L_T; you can then extend this argument via a pseudo-induction method). Heuristically, the derivative of the loss function is the number of Y_i less than µ minus the number of Y_i greater than or equal to µ; using this reasoning gives another argument as to why the sample median minimises the loss. Note that it is a little ambiguous what the minimum is when T is even.

In summary, the normal distribution gives rise to the l_2-loss function and the sample mean. In contrast, the Laplacian gives rise to the l_1-loss function and the sample median.

Consider the generalisation of the Laplacian, usually called the asymmetric Laplacian, which is proportional to

f(y; θ, p) ∝ { exp((1 − p)(y − µ)/θ)   y < µ ;  exp(−p(y − µ)/θ)   y ≥ µ },

where 0 < p < 1. The corresponding negative log-likelihood is proportional to the loss function

L_T(µ) = Σ_{Y_i > µ} p(Y_i − µ) + Σ_{Y_i ≤ µ} (1 − p)(µ − Y_i).

Using similar arguments to those in part i, it can be shown that the minimum of L_T is approximately the pth sample quantile.

16.2 Estimating Functions

See also Section 7.2 in Davison (2002) and Lars Peter Hansen (1982, Econometrica). Estimating functions give a unification and generalisation of maximum likelihood methods and the method of moments. It should be noted that they are a close cousin of the generalised method of moments and the generalised estimating equation. We first consider a few examples, and will later describe a feature common to all these examples.
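Before turning to the examples, the two quantile claims above (the l_1 loss is minimised by the sample median, and the asymmetric version by approximately the pth sample quantile) are easy to check numerically. The grid search below is purely illustrative (my own code, not a method from the notes):

```python
import numpy as np

def check_loss(theta, y, p):
    """Asymmetric l1 loss: p(y - theta) for y > theta, (1-p)(theta - y) otherwise."""
    u = y - theta
    return np.sum(np.where(u > 0, p * u, (p - 1) * u))

rng = np.random.default_rng(0)
y = rng.exponential(1.0, 1001)            # odd sample size => unique median
grid = np.linspace(y.min(), y.max(), 20001)

# p = 1/2: this is (half) the l1 loss sum_i |Y_i - theta|
losses = np.array([check_loss(t, y, 0.5) for t in grid])
theta_hat = grid[np.argmin(losses)]       # grid minimiser: the sample median
```

The same search with p = 0.9 lands (up to grid resolution) on the 0.9 sample quantile.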
163 Example i Let us suppose that {Y t } are iid random variables with Y t Nµ,σ 2. The log-likelihood in proportional to L T µ,σ 2 = 1 2 logσ2 1 2σ 2 X t µ 2. We know that to estimate µ and σ 2 we use the µ and σ 2 which are the solution of 1 2σ σ 4 X t µ 2 = 0 1 2σ 2 X t µ 2 = ii In general suppose {Y t } are iid random variables with Y i f ;θ. The log-likelihood is L T θ = T logfθ;y t. If the regularity conditions are satisfied then to estimate θ we use the solution of L T θ = iii Let us suppose that {Y t } are iid random variables with a Weibull distribution fx;θ = α φ x φ α exp x/φ α, where α,φ > 0. We know that see Section 9.1, Example EX = φγ1+α 1 and EX 2 = φ 2 Γ1+ 2α 1. Therefore EX φγ1 + α 1 = 0 and EX 2 φ 2 Γ1 + 2α 1 = 0. Hence by solving 1 T X t φγ1+α 1 = 0 1 T Xt 2 φ 2 Γ1+2α 1 = 0, 16.3 we obtain estimators of α and Γ. This is essentially a method of moments estimator of the parameters in a Weibull distribution. iv We can generalise the above. It can be shown that EX r = φ r Γ1+rα 1. Therefore, for any distinct s and r we can estimate α and Γ using the solution of 1 T Xt r φ r Γ1+rα 1 = 0 1 T Xt s φ s Γ1+sα 1 = v Consider the simple linear regression Y t = αx t +ε t, with Eε t = 0 and varε t = 1, the least squares estimator of α is the solution of 1 T Y t ax t x t =
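The pair of moment equations in (16.3) can be solved numerically. The sketch below is my own implementation (the bisection bracket and sample sizes are illustrative choices): eliminating φ via the ratio of the two moments leaves an equation in α alone, since E(X) = φΓ(1 + α^{−1}) and E(X²) = φ²Γ(1 + 2α^{−1}) give E(X²)/E(X)² = Γ(1 + 2α^{−1})/Γ(1 + α^{−1})², which is monotonically decreasing in α. Note that Python's random.weibullvariate(scale, shape) matches the density in (iii) with φ = scale and α = shape.

```python
import math
import random

def weibull_mom(x):
    """Method-of-moments estimator for the Weibull, solving (16.3)."""
    m1 = sum(x) / len(x)
    m2 = sum(v * v for v in x) / len(x)
    ratio = m2 / m1 ** 2          # depends on the shape alpha alone

    def g(a):                     # decreasing in a; equals ratio at true alpha
        return math.gamma(1 + 2 / a) / math.gamma(1 + 1 / a) ** 2

    lo, hi = 0.1, 50.0
    for _ in range(200):          # bisection on g(a) = ratio
        mid = 0.5 * (lo + hi)
        if g(mid) > ratio:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    phi = m1 / math.gamma(1 + 1 / alpha)
    return alpha, phi

random.seed(0)
# draw from the Weibull with scale phi = 2.0 and shape alpha = 1.5
sample = [random.weibullvariate(2.0, 1.5) for _ in range(20000)]
alpha_hat, phi_hat = weibull_mom(sample)
```

With 20000 observations the estimates land close to the true (α, φ) = (1.5, 2.0), illustrating that the solution of the unbiased estimating equations is consistent.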
164 We observe that all the above estimators can be written as the solution of a homogenous equations - see equations 16.1, 16.2, 16.3, 16.4 and In other words, for each case we can define a random function G T θ, such that the above estimators are the solutions of G T θ T = 0. In the case that {Y t } are iid then G T θ = T gy t;θ, for some function gy t ;θ. The function G T θ is called an estimating function. All the function G T, defined above, satisfy the unbiased property which we define below. Definition An estimating function G T is called unbiased if at the true parameter θ 0 G T satisfies E G T θ 0 = 0. Hence the estimating function is alternative way of viewing a parameter estimator. Until now, parameter estimators have been defined in terms of the maximum of the likelihood. However, an alternative method for defining an estimator is as the solution of a function. For example, suppose that {Y t } are random variables, whose distribution depends in some way on the parameter θ 0. We want to estimate θ 0, and we know that there exists a function such that Gθ 0 = 0. Therefore using the data {Y t } we can define a random function, G T where EG T θ = Gθ. Hence we can use the parameter θ T, which satisfies G T θ = 0, as an estimator of θ. We observe that such estimators include most maximum likelihood estimators and method of moment estimators. Example Based on the examples above we see that i The estimating function is G T µ,σ = ii The estimating function is G T θ = L Tθ. 1 2σ σ 4 T X t µ 2 1 2σ 2 T X t µ 2. iii The estimating function is 1 T T G T α,φ = X t φγ1+α 1 1 T T X2 t φ 2 Γ1+2α 1 iv The estimating function is 1 T T G T α,φ = Xs t φ s Γ1+sα 1 1 T T Xr t φ r Γ1+rα
165 v The estimating function is

G_T(a) = (1/T) Σ_{t=1}^{T} (Y_t − a x_t) x_t.

The advantage of this approach is that sometimes the solution of an estimating equation will have a smaller finite sample variance than the MLE (even though, under certain conditions, the MLE will asymptotically attain the Cramer-Rao bound, which is the smallest variance). Moreover, MLE estimators are based on the assumption that the distribution is known (else the estimator is misspecified - see Section 9.1.1); however, sometimes an estimating equation can be free of such assumptions. Recall that in Example v,

E[(1/T) Σ_{t=1}^{T} (Y_t − αx_t) x_t] = 0,   (16.6)

is true regardless of the distribution of {ε_t}, and is also true if the {Y_t} are dependent random variables. We recall that Example v is the least squares estimator, which is a consistent estimator of the parameters in a variety of settings (see Rao (1973), Linear Statistical Inference and its Applications, and your STAT612 notes).

Example In many statistical situations it is relatively straightforward to find a suitable estimating function, rather than the likelihood. Consider the time series {X_t} which satisfies

X_t = a_1 X_{t−1} + a_2 X_{t−2} + σ(a_1, a_2) ε_t,

where {ε_t} are iid random variables. We do not know the distribution of ε_t, but because ε_t is independent of X_{t−1} and X_{t−2}, by multiplying the above equation by X_{t−1} and X_{t−2} in turn and taking expectations we have

E(X_t X_{t−1}) = a_1 E(X²_{t−1}) + a_2 E(X_{t−1} X_{t−2})
E(X_t X_{t−2}) = a_1 E(X_{t−1} X_{t−2}) + a_2 E(X²_{t−2}).

Since the above time series is stationary (we have not formally defined this, but basically it means the properties of {X_t} do not vary over time), it can be shown that (1/T) Σ_t X_t X_{t−r} is an estimator of E(X_t X_{t−r}) and that E[(1/T) Σ_t X_t X_{t−r}] = E(X_t X_{t−r}).
Hence, replacing the expectations above with their estimators, we obtain the estimating equations

    G_1(a_1,a_2) = (1/T) Σ_t X_t X_{t−1} − a_1 (1/T) Σ_t X²_{t−1} − a_2 (1/T) Σ_t X_{t−1} X_{t−2}
    G_2(a_1,a_2) = (1/T) Σ_t X_t X_{t−2} − a_1 (1/T) Σ_t X_{t−1} X_{t−2} − a_2 (1/T) Σ_t X²_{t−2}.

We now show that under certain conditions θ̂_T is a consistent estimator of θ_0.
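As an aside, the two estimating equations above are linear in (a_1, a_2) - they are the Yule-Walker equations - so once the sample moments are computed they can be solved directly. The following is a minimal numerical sketch on simulated data (all values and variable names are illustrative, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a stationary AR(2): X_t = a1 X_{t-1} + a2 X_{t-2} + sigma * eps_t
a1, a2, sigma, T = 0.5, -0.3, 1.0, 20000
n = T + 200                          # extra observations as burn-in
X = np.zeros(n)
eps = rng.standard_normal(n)
for t in range(2, n):
    X[t] = a1 * X[t - 1] + a2 * X[t - 2] + sigma * eps[t]
X = X[200:]                          # discard burn-in

# Sample moments (1/T) sum_t X_t X_{t-r}, the empirical versions of E[X_t X_{t-r}]
def acov(r):
    return np.mean(X[r:] * X[:T - r]) if r > 0 else np.mean(X * X)

c0, c1, c2 = acov(0), acov(1), acov(2)

# G_1 = G_2 = 0 is the 2x2 linear (Yule-Walker) system:
#   c1 = a1 c0 + a2 c1,   c2 = a1 c1 + a2 c0
A = np.array([[c0, c1], [c1, c0]])
a_hat = np.linalg.solve(A, np.array([c1, c2]))
print(a_hat)   # close to (0.5, -0.3)
```

No distributional assumption on ε_t was used, which is precisely the appeal of the estimating-function approach here.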
Theorem Suppose that G_T(θ) is an unbiased estimating function, where G_T(θ̂_T) = 0 and E[G_T(θ_0)] = 0.

(i) If θ is a scalar, for every T, G_T(θ) is a continuous, monotonically decreasing function in θ, and G_T(θ) →P E[G_T(θ)] pointwise, then θ̂_T →P θ_0.

(ii) If sup_θ |G_T(θ) − E[G_T(θ)]| →P 0 and E[G_T(θ)] is uniquely zero at θ_0, then θ̂_T →P θ_0.

PROOF. The proof of case (i) is relatively straightforward (see also page 318 in Davison (2002)), and is best understood by making a plot of G_T(θ). We first note that for any fixed ε > 0,

    G_T(θ_0 − ε) →P E[G_T(θ_0 − ε)] > 0,   (16.7)

where the expectation is positive because E[G_T(·)] is monotonically decreasing for all T and E[G_T(θ_0)] = 0. Now, since G_T(θ) is monotonically decreasing, θ̂_T ≤ θ_0 − ε implies G_T(θ̂_T) − G_T(θ_0 − ε) ≥ 0 and vice versa, hence

    P( θ̂_T − θ_0 ≤ −ε ) = P( G_T(θ̂_T) − G_T(θ_0 − ε) ≥ 0 ) = P( G_T(θ_0 − ε) ≤ 0 ),

since G_T(θ̂_T) = 0. But by (16.7), G_T(θ_0 − ε) →P E[G_T(θ_0 − ε)] > 0, thus P(G_T(θ_0 − ε) ≤ 0) → 0 and P(θ̂_T − θ_0 ≤ −ε) → 0 as T → ∞. A similar argument can be used to show that P(θ̂_T − θ_0 ≥ ε) → 0 as T → ∞. As the above is true for all ε > 0, together they imply that θ̂_T →P θ_0 as T → ∞.

The proof of (ii) is more involved, but essentially follows the lines of the proof of the corresponding consistency theorem (an exercise for those wanting to do some more theory).

We now show normality, which will give us the variance of the limiting distribution of θ̂_T.

Theorem Let us suppose that {Y_t} are iid random variables and G_T(θ) = (1/T) Σ_{t=1}^T g(Y_t;θ). Suppose that θ̂_T →P θ_0 and that the first and second derivatives of G_T have a finite expectation (we will assume that θ is a scalar to simplify notation). Then we have, as T → ∞,

    √T (θ̂_T − θ_0) →D N( 0,  var(g(Y_t;θ_0)) / [ E( ∂g(Y_t;θ)/∂θ |_{θ_0} ) ]² ).
PROOF. We use the standard Taylor expansion to prove the result (which you should be expert in by now). A Taylor expansion gives

    G_T(θ̂_T) = G_T(θ_0) + (θ̂_T − θ_0) G_T'(θ̄_T)   (16.8)
    ⟹ √T (θ̂_T − θ_0) = −[ E(G_T'(θ))|_{θ_0} ]⁻¹ √T G_T(θ_0) + o_p(1),

where θ̄_T lies between θ_0 and θ̂_T, and we have used that G_T'(θ̄_T) →P E(G_T'(θ))|_{θ_0} as T → ∞. Now, since √T G_T(θ_0) = T^{−1/2} Σ_t g(Y_t;θ_0) is a normalised sum of iid random variables, we have

    √T G_T(θ_0) →D N( 0, var(g(Y_t;θ_0)) ),   (16.9)

since E[g(Y_t;θ_0)] = 0. Therefore (16.8) and (16.9) together give

    √T (θ̂_T − θ_0) →D N( 0,  var(g(Y_t;θ_0)) / [ E( ∂g(Y_t;θ)/∂θ |_{θ_0} ) ]² ),

as required.

Remark In general, if {Y_t} are not iid random variables, then under certain conditions we have

    √T (θ̂_T − θ_0) →D N( 0,  T var(G_T(θ_0)) / [ E( ∂G_T(θ)/∂θ |_{θ_0} ) ]² ).

Example (The Huber estimator) We describe the Huber estimator, which is a well-known estimator of the mean that is robust to outliers. The estimator can be written as the solution of an estimating function. Let us suppose that {Y_t} are iid random variables with mean θ, and density function which is symmetric about the mean θ. So that outliers do not affect the estimator, a robust method of estimation is to truncate the outliers and define the function

    g_c(Y_t;θ) = −c        if Y_t < θ − c
               = Y_t − θ   if θ − c ≤ Y_t ≤ θ + c
               = c         if Y_t > θ + c.

The estimating equation is G_{c,T}(θ) = Σ_t g_c(Y_t;θ), and we use as an estimator of θ the θ̂_T which solves G_{c,T}(θ̂_T) = 0.
(i) In the case that c = ∞, we observe that G_{∞,T}(θ) = Σ_t (Y_t − θ), and the estimator is θ̂_T = Ȳ. Hence without truncation, the estimator of the mean is the sample mean.

(ii) In the case that c is small, then we have truncated many observations.

It is clear that var(g(Y_t;θ_0)) / [E(∂g(Y_t;θ)/∂θ |_{θ_0})]² ≥ I(θ_0)⁻¹, where I(θ) is the Fisher information. Hence for extremely large sample sizes, the MLE is the best estimator.

16.2.1 Optimal estimating functions

As illustrated in Examples (iii) and (iv), there are several different estimators of the same parameters. A natural question is: which estimator does one use? We can answer this question by using the asymptotic normality result above. Roughly speaking, the theorem implies (though this is not true in the strictest sense) that

    var(θ̂_T) ≈ (1/T) var(g(Y_t;θ_0)) / [ E( ∂g(Y_t;θ)/∂θ |_{θ_0} ) ]².

In the case that θ is a multivariate vector, the above generalises to

    var(θ̂_T) ≈ (1/T) [ E(∂g(Y_t;θ)/∂θ) ]⁻¹ var(g(Y_t;θ_0)) [ E(∂g(Y_t;θ)/∂θ) ]⁻ᵀ.

Hence, we should choose the estimating function which leads to the smallest variance.

Example Consider Example (iv), where we are estimating the Weibull parameters φ and α. Different s and r give different estimating equations, but all estimate the same parameters. To decide which s and r to use, we calculate the asymptotic variance for different s and r and choose the pair with the smallest variance. Substituting the matrix of expected derivatives,

    E[ ∂g(Y_t;α,φ)/∂(φ,α) ] = ( −sφ^{s−1}Γ(1+sα⁻¹)   sφ^s α⁻²Γ'(1+sα⁻¹)
                                −rφ^{r−1}Γ(1+rα⁻¹)   rφ^r α⁻²Γ'(1+rα⁻¹) ),

and the variance of the sample moments,

    (1/T) ( var(X_t^s)          cov(X_t^s, X_t^r)
            cov(X_t^s, X_t^r)   var(X_t^r)        ),

into the expression above, and using that E[X_t^r] = φ^r Γ(1+rα⁻¹), we can calculate the variance for different r and s and choose the estimating function with the smallest variance.

We now consider an example which is a close precursor to GEE (generalised estimating equations). Usually GEE is a method for estimating the parameters in generalised linear models where there is dependence in the data.
Example Suppose that {Y_t} are independent random variables with mean {µ_t(θ_0)} and variance {V_t(θ_0)}, where the parametric forms of {µ_t(·)} and {V_t(·)} are known but θ_0 is unknown. A possible estimating function is

    G_T(θ) = Σ_t ( Y_t − µ_t(θ) ),

where we note that E[G_T(θ_0)] = 0. However, this function does not take the variance into account; instead, let us consider the general weighted function

    G_T^W(θ) = Σ_t w_t(θ)( Y_t − µ_t(θ) ).

Again we observe that E[G_T^W(θ_0)] = 0. But we need to know which weights w_t(θ) to use; hence we choose the weights which minimise the asymptotic variance. Since {Y_t} are independent, we observe that

    var( G_T^W(θ_0) ) = Σ_t w_t(θ_0)² V_t(θ_0)

and

    E[ ∂G_T^W(θ)/∂θ |_{θ_0} ] = E[ Σ_t { w_t'(θ_0)(Y_t − µ_t(θ_0)) − w_t(θ_0) µ_t'(θ_0) } ] = −Σ_t w_t(θ_0) µ_t'(θ_0).

Hence we have

    var(θ̂_T) ≈ Σ_t w_t(θ_0)² V_t(θ_0) / ( Σ_t w_t(θ_0) µ_t'(θ_0) )².

Now we want to choose the weights, and thus the estimating function, with the smallest variance; therefore we look for weights which minimise the above. Since the above is a ratio, and shrinking a weight w_t(θ) shrinks both the numerator and the denominator, we do the minimisation using Lagrange multipliers. That is, set Σ_t w_t(θ_0)µ_t'(θ_0) = 1 and minimise

    Σ_t w_t(θ_0)² V_t(θ_0) + λ( Σ_t w_t(θ_0)µ_t'(θ_0) − 1 )

with respect to {w_t(θ_0)} and λ. Partially differentiating the above with respect to w_t(θ_0) and λ and setting to zero gives 2w_t(θ_0)V_t(θ_0) + λµ_t'(θ_0) = 0, so that w_t(θ_0) = −λµ_t'(θ_0)/(2V_t(θ_0)); that is, the optimal weights are proportional to µ_t'(θ_0)/V_t(θ_0). Hence the optimal estimating function is

    G_T(θ) = Σ_t ( µ_t'(θ)/V_t(θ) ) ( Y_t − µ_t(θ) ).
Observe that the above equation resembles the derivative of the weighted least squares criterion

    Σ_t ( Y_t − µ_t(θ) )² / V_t(θ).

An important difference, however, is that the term involving the derivative of V_t(θ) has been ignored.

Remark We conclude this section by mentioning that one generalisation of estimating equations is the generalised method of moments (GMM). We observe the random vectors {Y_t}, and it is known that there exists a function g(·;θ) such that E[g(Y_t;θ_0)] = 0. To estimate θ_0, rather than find the solution of (1/T) Σ_t g(Y_t;θ) = 0, a matrix M_T is defined, and the parameter which minimises

    ( (1/T) Σ_t g(Y_t;θ) )' M_T ( (1/T) Σ_t g(Y_t;θ) )

is used as an estimator of θ_0.
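As a small illustration of the GMM idea, the sketch below estimates the mean of a Poisson distribution from the two moment conditions E[Y] − θ = 0 and E[Y²] − θ − θ² = 0, taking M_T to be the identity matrix and using a simple grid search for the minimiser. The Poisson model, parameter values and names are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0
Y = rng.poisson(theta0, size=5000)

# Precompute the sample moments appearing in (1/T) sum_t g(Y_t; theta)
m1 = Y.mean()
m2 = (Y ** 2).mean()

def gbar(theta):
    # Two moment conditions for Poisson(theta): E[Y] = theta, E[Y^2] = theta + theta^2
    return np.array([m1 - theta, m2 - theta - theta ** 2])

M = np.eye(2)    # identity weighting matrix M_T

def objective(theta):
    g = gbar(theta)
    return g @ M @ g

# One scalar parameter, so a grid search over a bracket suffices for this sketch
grid = np.linspace(0.5, 4.0, 3501)
theta_hat = grid[int(np.argmin([objective(th) for th in grid]))]
print(theta_hat)   # close to 2.0
```

In practice M_T is usually chosen to estimate the inverse variance of the moment conditions (the "efficient" weighting), but the identity matrix already gives a consistent estimator.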
Chapter 17

Generalised Linear Models

17.1 Generalised Linear Models

Generalised linear models (GLMs) are a generalisation of ordinary least squares regression. See also Davison, Green (1984) and Dobson and Barnett. To motivate the GLM approach, let us briefly review linear models.

An overview of linear models

Let us consider the two competing nested linear models

    Restricted model:  Y_t = β_0 + Σ_{j=1}^q β_j x_{t,j} + ε_t,
    Full model:        Y_t = β_0 + Σ_{j=1}^q β_j x_{t,j} + Σ_{j=q+1}^p β_j x_{t,j} + ε_t,   (17.1)

where {ε_t} are iid random variables with mean zero and variance σ². Let us suppose that we observe {Y_t, x_{t,j}}_{t=1}^T, where {Y_t} are normal. The classical method for testing H_0: the restricted model against H_A: the full model is the F-test (ANOVA). That is, let S_R² denote the residual sum of squares under the null, S_F² the residual sum of squares under the alternative, and σ̂_F² = S_F²/(T−p). Then the F-statistic is

    F = ( (S_R² − S_F²)/(p − q) ) / σ̂_F²,
where

    S_F² = Σ_{t=1}^T ( Y_t − Σ_{j=1}^p β̂_j^F x_{t,j} )²,   σ̂_F² = S_F²/(T − p),   S_R² = Σ_{t=1}^T ( Y_t − Σ_{j=1}^q β̂_j^R x_{t,j} )²,

and under the null F ~ F_{p−q, T−p}. Moreover, if the sample size is large, (p−q)F →D χ²_{p−q}. We recall that the residuals of the full model are r_t = Y_t − β̂_0 − Σ_{j=1}^q β̂_j x_{t,j} − Σ_{j=q+1}^p β̂_j x_{t,j}, and the residual sum of squares S_F² is used to measure how well the linear model fits the data (see Chapter 8 and the STAT612 notes).

The F-test and ANOVA are designed specifically for linear models. In this section the aim is to generalise model specification, estimation, testing and residuals to a larger class of models. To generalise, we will be using a log-likelihood framework. To see how this fits in with linear regression, let us now see how ANOVA and the log-likelihood ratio test are related. Suppose that σ² is known; then the log-likelihood ratio test statistic for the above hypothesis is

    (1/σ²)( S_R² − S_F² ) ~ χ²_{p−q},

where we note that since {ε_t} is Gaussian, this is the exact distribution and not an asymptotic result. In the case that σ² is unknown and has to be replaced by its estimator σ̂_F², then we can either use the approximation

    ( S_R² − S_F² ) / σ̂_F² →D χ²_{p−q},   as T → ∞,

or the exact distribution

    ( (S_R² − S_F²)/(p − q) ) / σ̂_F² ~ F_{p−q, T−p},

which returns us to the F-statistic.
On the other hand, if the variance σ² is unknown we can go the log-likelihood ratio route. The log-likelihood ratio test statistic reduces to

    T log( S_R²/S_F² ) = T log( 1 + (S_R² − S_F²)/S_F² ) →D χ²_{p−q}.

We recall that by using the expansion log(1+x) = x + O(x²) we obtain

    T log( S_R²/S_F² ) = T ( S_R² − S_F² )/S_F² + o_p(1).

Now we know the above is approximately χ²_{p−q}. But it is straightforward to see that by dividing by (p−q) and replacing T with (T−p) we have

    ( (T−p)/(p−q) ) log( S_R²/S_F² ) = ( (S_R² − S_F²)/(p−q) ) / σ̂_F² + o_p(1) = F + o_p(1).

Hence we have transformed the log-likelihood ratio test into the F-test, which we discussed at the start of this section. The ANOVA and log-likelihood methods are asymptotically equivalent. In the case that {ε_t} are non-Gaussian, but the model is linear with iid errors, the above results also hold. However, in the case that the regressors have a nonlinear influence on the response and/or the response is not normal, we need to take an alternative approach. Throughout this section we will encounter such models. We will start by focussing on the following two problems:

(i) How to model the relationship between the response and the regressors when the response is non-Gaussian and the model is nonlinear.

(ii) How to generalise ANOVA to nonlinear models.

Motivation

Let us suppose {Y_t} are independent random variables where it is believed that the regressors x_t (x_t is a p-dimensional vector) have an influence on {Y_t}. Let us suppose that Y_t is a binary random variable taking either zero or one, and E(Y_t) = P(Y_t = 1) = π_t. How do we model the relationship between Y_t and x_t? A simple approach is to use a linear model, i.e. let E(Y_t) = β'x_t. But a major problem with this approach is that E(Y_t) is a probability, and for many values of β, β'x_t will lie outside the unit interval - hence a linear model is not
meaningful. However, we can make a nonlinear transformation which maps the linear predictor into the unit interval. For example, let

    E(Y_t) = π_t = exp(β'x_t)/(1 + exp(β'x_t));

this transformation lies between zero and one. Hence we could just use nonlinear regression to estimate the parameters. That is, rewrite the model as

    Y_t = exp(β'x_t)/(1 + exp(β'x_t)) + ε_t,

and use the estimator

    β̂_T = arg min_β Σ_t ( Y_t − exp(β'x_t)/(1 + exp(β'x_t)) )²

as an estimator of β. However, this approach also has its drawbacks. One obvious drawback is that the variance var(ε_t) = var(Y_t) = π_t(1−π_t) depends on the parameters. Hence, alternative methods are required.

The GLM approach is a general framework for a wide class of distributions. We recall that in Section 3.1 we considered maximum likelihood estimation for iid random variables which come from the natural exponential family. Distributions in this family include the normal, binary, binomial and Poisson, amongst others. We recall that the natural exponential family has the form

    f(y;θ) = exp( yθ − κ(θ) + c(y) ),

where κ(θ) = b(η⁻¹(θ)). To be a little more general, we will suppose that the distribution can be written as

    f(y;θ) = exp( (yθ − κ(θ))/φ + c(y,φ) ),   (17.2)

where φ is a nuisance parameter called the dispersion parameter (it plays the role of the variance in linear models) and θ is the parameter of interest. We recall that examples of exponential family models include:

(i) The exponential distribution is already in natural exponential form, with θ = −λ and φ = 1; the log density is log f(y;λ) = −λy + log λ.

(ii) For the binomial distribution we let θ = log(π/(1−π)) and φ = 1; since log(π/(1−π)) is invertible, this gives

    log f(y;θ) = yθ − n log(1 + e^θ) + log C(n,y),

where C(n,y) denotes the binomial coefficient.
(iii) For the normal distribution we have

    log f(y;µ,σ²) = −(y−µ)²/(2σ²) − (1/2)log(2πσ²) = (µy − µ²/2)/σ² − y²/(2σ²) − (1/2)log(2πσ²).

Hence φ = σ², θ = µ, κ(θ) = θ²/2 and c(y,φ) = −y²/(2φ) − (1/2)log(2πφ).

(iv) The Poisson log density can be written as

    log f(y;µ) = y log µ − µ − log y!,

hence θ = log µ, κ(θ) = exp(θ) and c(y) = −log y!.

(v) Other members of this family include the Gamma, Beta, Multinomial and inverse Gaussian distributions.

Remark (Moments) Using the lemma in Section 3.1 we have E(Y) = κ'(θ) and var(Y) = φκ''(θ).

GLM is a method which generalises the methods in linear models to the exponential family (recall that the normal model is a subclass of the exponential family). In the GLM setting it is usually assumed that the response variables {Y_t} have a density which comes from the natural exponential family, with log density

    log f(y_t;θ_t) = (y_t θ_t − κ(θ_t))/φ + c(y_t,φ),

where the parameter θ_t depends on the regressors. The regressors influence the response through a linear predictor η_t = β'x_t and a link function, which connects β'x_t to the mean E(Y_t) = µ(θ_t) = κ'(θ_t). More precisely, there is a monotonic link function g which is defined such that g(µ(θ_t)) = η_t = β'x_t. To summarise:

    κ'(θ_t) = µ(θ_t) = E(Y_t) = g⁻¹(η_t),
    θ_t = µ⁻¹( g⁻¹(η_t) ) = θ(η_t),
    var(Y_t) = φ dµ(θ_t)/dθ_t.

The log-likelihood of {Y_t} is

    L_T(β) = Σ_t { ( Y_t θ(η_t) − κ(θ(η_t)) )/φ + c(Y_t,φ) }.

The choice of link function is rather subjective. One of the most popular is the canonical link, which we define below.
Definition (The canonical link function) The canonical link is obtained by letting η_t = θ_t. Hence µ_t = κ'(η_t), and the link g satisfies g(κ'(θ_t)) = g(κ'(η_t)) = η_t.

The canonical link is usually used because it makes the calculations simple. We observe that with the canonical link the log-likelihood of {Y_t} is

    L_T(β) = Σ_t { ( Y_t β'x_t − κ(β'x_t) )/φ + c(Y_t,φ) },

which, as we will show, is relatively simple to maximise.

Example (The log-likelihood and canonical link function)

(i) The canonical link for the exponential f(y_t;λ_t) = λ_t exp(−λ_t y_t) is (up to sign) θ_t = λ_t = β'x_t. The log-likelihood is

    Σ_t ( log(β'x_t) − Y_t β'x_t ).

(ii) The canonical link for the binomial is θ_t = β'x_t = log( π_t/(1−π_t) ), hence π_t = exp(β'x_t)/(1+exp(β'x_t)). The log-likelihood is

    Σ_t ( Y_t β'x_t − n_t log(1 + exp(β'x_t)) + log C(n_t, Y_t) ).

(iii) The canonical link for the normal is θ_t = β'x_t = µ_t. The log-likelihood is

    −Σ_t ( (Y_t − β'x_t)²/(2σ²) + (1/2)log(2πσ²) ),

which is the usual least squares criterion. If the canonical link were not used, we would be in the nonlinear least squares setting; that is, the log-likelihood is

    −Σ_t ( (Y_t − g⁻¹(β'x_t))²/(2σ²) + (1/2)log(2πσ²) ).

(iv) The canonical link for the Poisson is θ_t = β'x_t = log λ_t, hence λ_t = exp(β'x_t). The log-likelihood is

    Σ_t ( Y_t β'x_t − exp(β'x_t) − log Y_t! ).

However, as mentioned above, the canonical link is simply used for its mathematical simplicity. There exist other links, which can often be more suitable.
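As a quick numerical check of the canonical-link algebra, the sketch below evaluates the Bernoulli (binomial with n_t = 1) log-likelihood both in canonical form, Σ_t { Y_t θ_t − κ(θ_t) } with θ_t = β'x_t and κ(θ) = log(1 + e^θ), and directly from the probabilities π_t; the two agree. All data and parameter values are simulated and illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Bernoulli responses with the canonical (logistic) link: theta_t = beta'x_t
beta = np.array([-0.5, 1.0])
x = np.column_stack([np.ones(8), rng.standard_normal(8)])   # design matrix with intercept
eta = x @ beta                                              # linear predictor eta_t = beta'x_t
pi = np.exp(eta) / (1 + np.exp(eta))                        # inverse link

y = rng.binomial(1, pi)

# Canonical form: sum_t { y_t theta_t - kappa(theta_t) },  kappa(theta) = log(1 + e^theta)
loglik_canonical = np.sum(y * eta - np.log1p(np.exp(eta)))

# Direct Bernoulli log-likelihood, for comparison
loglik_direct = np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

print(loglik_canonical, loglik_direct)   # the two values agree
```

The agreement is exact because log π_t = η_t − log(1+e^{η_t}) and log(1−π_t) = −log(1+e^{η_t}).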
Remark (Link functions for the binomial) We recall that the link function is defined as a monotonic function g, where η_t = β'x_t = g(µ_t). The choice of link is up to you. A popular choice is to let g⁻¹ be a well-known distribution function. The motivation for this is that for the binomial distribution µ_t = n_t π_t, where π_t is the probability of a success. Clearly 0 ≤ π_t ≤ 1, hence using a distribution function (or survival function) for g⁻¹ makes sense. Examples include:

(i) The logistic link (this is the canonical link function), where β'x_t = g(µ_t) = log( π_t/(1−π_t) ) = log( µ_t/(n_t − µ_t) ).

(ii) The probit link, where π_t = Φ(β'x_t), Φ is the standard normal distribution function and the link function is β'x_t = g(µ_t) = Φ⁻¹(µ_t/n_t).

(iii) The extreme value link, where the distribution function is F(x) = 1 − exp(−exp(x)). Hence in this case the link function is β'x_t = g(µ_t) = log( −log(1 − µ_t/n_t) ).

Estimating the parameters in a GLM

The score function for GLM

The score function for generalised linear models has a very interesting form, which we will now derive. From now on, we will suppose that φ_t ≡ φ for all t, and that φ is known (though even in the case that it is unknown, this will not change the estimation scheme used). Much of the theory remains true without this restriction, but it makes the derivations a bit cleaner, and is enough for all the models we will encounter.

With this assumption, recall that the log-likelihood is

    L_T(β,φ) = Σ_t { ( Y_t θ_t − κ(θ_t) )/φ + c(Y_t,φ) } = Σ_t l_t,

where

    l_t(β,φ) = ( Y_t θ_t − κ(θ_t) )/φ + c(Y_t,φ),

and θ_t is a function of β through the relationship g(κ'(θ_t)) = η_t = β'x_t. For the MLE of β, we need to solve the likelihood equations

    ∂L_T/∂β_j = Σ_t ∂l_t/∂β_j = 0   for all j = 1,...,p.

To obtain an interesting expression for the above, recall that var(Y_t) = φµ'(θ_t) and η_t = g(µ_t),
and let µ'(θ_t) = V(µ_t). Since V(µ_t) = dµ_t/dθ_t, inverting the derivative gives dθ_t/dµ_t = 1/V(µ_t). Furthermore, since dη_t/dµ_t = g'(µ_t), inverting the derivative gives dµ_t/dη_t = 1/g'(µ_t). By the chain rule for differentiation and using the above, we have

    ∂l_t/∂β_j = (dl_t/dη_t)(∂η_t/∂β_j)   (17.4)
              = (dl_t/dθ_t)(dθ_t/dµ_t)(dµ_t/dη_t)(∂η_t/∂β_j)
              = ( (Y_t − κ'(θ_t))/φ ) · (1/V(µ_t)) · (1/g'(µ_t)) · x_{tj}
              = (Y_t − µ_t) x_{tj} / ( φ V(µ_t) g'(µ_t) ).

Thus the likelihood equations we have to solve for the MLE of β are

    Σ_t (Y_t − µ_t) x_{tj} / ( φ V(µ_t) g'(µ_t) ) = Σ_t ( Y_t − g⁻¹(β'x_t) ) x_{tj} / ( φ V(g⁻¹(β'x_t)) g'(µ_t) ) = 0,   1 ≤ j ≤ p,   (17.5)

since µ_t = g⁻¹(β'x_t). We observe that the above set of equations can be considered as estimating functions (see the previous chapter), since the expectation of each equation is zero. Moreover, it has a very similar structure to the normal equations arising in ordinary least squares.

Example

(i) Normal {Y_t} with mean µ_t = β'x_t. Here we have g(µ_t) = µ_t = β'x_t, so g'(µ_t) = 1; also V(µ_t) ≡ 1 and φ = σ², so the equations become

    (1/σ²) Σ_t ( Y_t − β'x_t ) x_{tj} = 0.

Ignoring the factor σ⁻², the left-hand side is the jth element of the vector X'(Y − Xβ), so the equations reduce to the normal equations of least squares: X'(Y − Xβ) = 0, or equivalently X'Xβ = X'Y.

(ii) Poisson {Y_t} with log link, so that g(µ_t) = log µ_t and hence µ_t = exp(β'x_t). This time g'(µ_t) = 1/µ_t, var(Y_t) = V(µ_t) = µ_t and φ = 1. Substituting µ_t = exp(β'x_t), the equations become

    Σ_t ( Y_t − e^{β'x_t} ) x_{tj} = 0,

which are not linear in β, and no explicit solution is possible in general.
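Since the Poisson likelihood equations above have no explicit solution, they are solved iteratively. For the canonical log link the observed and expected information coincide, so Newton-Raphson and Fisher scoring are the same scheme; the following is a minimal sketch on simulated data (parameter values and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Poisson regression with the canonical log link: mu_t = exp(beta'x_t)
beta_true = np.array([0.3, 0.8])
T = 5000
X = np.column_stack([np.ones(T), rng.uniform(-1, 1, T)])
Y = rng.poisson(np.exp(X @ beta_true))

# Score: X'(Y - mu).  For the canonical link the information is I(beta) = X' diag(mu) X.
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    score = X.T @ (Y - mu)
    info = X.T @ (X * mu[:, None])
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta)   # close to (0.3, 0.8)
```

Each update here is exactly the iteration derived in the next subsections; convergence typically takes only a handful of steps.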
Remark (The GLM score function and weighted least squares) The GLM score has a very interesting relationship with weighted least squares. First recall that the GLM likelihood equations take the form

    Σ_t (Y_t − µ_t) x_{tj} / ( φ V(µ_t) g'(µ_t) ) = Σ_t ( Y_t − g⁻¹(β'x_t) ) x_{tj} / ( φ V_t g'(µ_t) ) = 0,   1 ≤ j ≤ p.   (17.6)

Next let us construct the weighted least squares criterion. Since E(Y_t) = µ_t and var(Y_t) = φV_t, the weighted least squares criterion corresponding to {Y_t} is

    S_T(β) = Σ_t ( Y_t − µ(θ_t) )² / (φV_t) = Σ_t ( Y_t − g⁻¹(β'x_t) )² / (φV_t).

The weighted least squares criterion S_T is independent of the underlying distribution; it has been constructed using only the first two moments of the random variable. Returning to the weighted least squares estimator, we observe that it is the solution of

    ∂S_T/∂β_j = Σ_t (∂s_t(β)/∂µ_t)(∂µ_t/∂β_j) + Σ_t (∂s_t(β)/∂V_t)(∂V_t/∂β_j) = 0,   1 ≤ j ≤ p,

where s_t(β) = ( Y_t − µ(θ_t) )²/(φV_t). Now let us compare ∂S_T/∂β_j with the estimating equations (17.6) corresponding to the GLM. We observe that (17.6) and the first part of the right-hand side above are the same up to a constant factor, that is

    Σ_t (∂s_t(β)/∂µ_t)(∂µ_t/∂β_j) = −2 Σ_t (Y_t − µ_t) x_{tj} / ( φ V(µ_t) g'(µ_t) ).

In other words, the GLM estimating equations corresponding to the exponential family and the weighted least squares estimating equations are closely related, as are the corresponding estimators. However, it is simpler to solve Σ_t (∂s_t/∂µ_t)(∂µ_t/∂β_j) = 0 than ∂S_T/∂β_j = 0. Interestingly, since at the true β the derivatives are such that

    E[ Σ_t (∂s_t(β)/∂µ_t)(∂µ_t/∂β_j) ] = 0   and   E[ ∂S_T/∂β_j ] = 0,

this implies that the other quantity in the sum is also zero, i.e.

    E[ Σ_t (∂s_t(β)/∂V_t)(∂V_t/∂β_j) ] = 0.
The Newton-Raphson scheme

(See also Green (1984).) It is clear from the examples above that usually there does not exist a simple closed-form solution for the likelihood estimator of β. However, we can use the Newton-Raphson scheme to estimate β. We will derive an interesting expression for the iterative scheme; other than being useful for implementation, it also highlights the estimator's connection to weighted least squares.

We recall that Newton-Raphson consists of the scheme

    β^(i+1) = β^(i) − (H^(i))⁻¹ u^(i),

where the p×1 gradient vector u^(i) is

    u^(i) = ( ∂L_T/∂β_1, ..., ∂L_T/∂β_p )' |_{β=β^(i)}

and the p×p Hessian matrix H^(i) is given by

    H^(i)_jk = ∂²L_T(β)/∂β_j∂β_k |_{β=β^(i)},   for j,k = 1,2,...,p,

both u^(i) and H^(i) being evaluated at the current estimate β^(i). By using (17.4), the score function at the ith iteration is

    u^(i)_j = ∂L_T(β)/∂β_j |_{β=β^(i)} = Σ_t ∂l_t/∂β_j |_{β=β^(i)} = Σ_t ( dl_t/dη_t )|_{β=β^(i)} x_{tj}.

The Hessian at the ith iteration is

    H^(i)_jk = Σ_t ∂²l_t/∂β_j∂β_k |_{β=β^(i)} = Σ_t ( d²l_t/dη_t² )|_{β=β^(i)} x_{tj} x_{tk}.   (17.7)
Thus if we write s^(i) for the T×1 vector with

    s^(i)_t = ( dl_t/dη_t )|_{β=β^(i)},

and W^(i) for the diagonal T×T matrix with

    W^(i)_tt = −( d²l_t/dη_t² )|_{β=β^(i)},

we have u^(i) = X's^(i) and H^(i) = −X'W^(i)X, and the Newton-Raphson iteration is

    β^(i+1) = β^(i) − (H^(i))⁻¹ u^(i) = β^(i) + ( X'W^(i)X )⁻¹ X's^(i).

Fisher scoring for GLMs

Typically, partly for reasons of tradition, we use a modification of this in fitting statistical models. The matrix W^(i) is replaced by W̄^(i), another diagonal T×T matrix with

    W̄^(i)_tt = E[ W^(i)_tt ] |_{β^(i)} = E[ −d²l_t/dη_t² ] |_{β^(i)}.

One reason for taking the expectation is that W̄^(i)_tt is then sure to have non-negative (typically positive) diagonal entries, since (see Section 1.1)

    W̄^(i)_tt = E[ −d²l_t/dη_t² ] |_{β^(i)} = var( dl_t/dη_t ) |_{β^(i)},

so that W̄ = var(s), and the matrix is non-negative definite. Using the Fisher score function, the iteration becomes

    β^(i+1) = β^(i) + ( X'W̄^(i)X )⁻¹ X's^(i).

Iteratively reweighted least squares

The iteration

    β^(i+1) = β^(i) + ( X'W̄^(i)X )⁻¹ X's^(i)   (17.8)

is formally similar to the familiar solution for least squares estimates in linear models,

    β̂ = (X'X)⁻¹X'Y,

or more particularly the related weighted least squares estimates:

    β̂ = (X'WX)⁻¹X'WY.
In fact, (17.8) can be rearranged to have exactly this form. Algebraic manipulation gives

    β^(i) = ( X'W̄^(i)X )⁻¹ X'W̄^(i) X β^(i)

and

    ( X'W̄^(i)X )⁻¹ X's^(i) = ( X'W̄^(i)X )⁻¹ X'W̄^(i) (W̄^(i))⁻¹ s^(i).

Therefore, substituting the above into (17.8) gives

    β^(i+1) = ( X'W̄^(i)X )⁻¹ X'W̄^(i) Xβ^(i) + ( X'W̄^(i)X )⁻¹ X'W̄^(i) (W̄^(i))⁻¹ s^(i)
            = ( X'W̄^(i)X )⁻¹ X'W̄^(i) ( Xβ^(i) + (W̄^(i))⁻¹ s^(i) )
            := ( X'W̄^(i)X )⁻¹ X'W̄^(i) Z^(i).

One reason that the above equation is of interest is that it has the form of weighted least squares. More precisely, it has the form of a weighted least squares regression of Z^(i) on X with the diagonal weight matrix W̄^(i). That is, let z^(i)_t denote the tth element of the vector Z^(i); then β^(i+1) minimises the weighted least squares criterion

    Σ_t W̄^(i)_tt ( z^(i)_t − β'x_t )².

Of course, in reality W̄^(i)_tt and z^(i)_t are functions of β^(i), hence the above is often called iteratively reweighted least squares.

Estimation of the dispersion parameter φ

Recall that in the linear model case, the variance σ² did not affect the estimation of β. In the general GLM case, continuing to assume that φ_t ≡ φ, we have

    s_t = dl_t/dη_t = (dl_t/dθ_t)(dθ_t/dµ_t)(dµ_t/dη_t) = ( Y_t − µ_t ) / ( φ V(µ_t) g'(µ_t) )

and

    W̄_tt = var(s_t) = var(Y_t) / { φ V(µ_t) g'(µ_t) }² = φV(µ_t) / { φ V(µ_t) g'(µ_t) }² = 1 / ( φ V(µ_t) g'(µ_t)² ),

so that 1/φ appears as a scale factor in W̄ and s, but otherwise does not appear in the estimating equations or iteration for β. Hence φ does not play a role in the estimation of β. As in the normal/linear case, (a) we are less interested in φ, and (b) we see that φ can be estimated separately from β. Recall that var(Y_t) = φV(µ_t), thus

    E[ ( Y_t − µ_t )² / V(µ_t) ] = φ.
We can use this to suggest a simple estimator for φ:

    φ̂ = (1/(T−p)) Σ_t ( Y_t − µ̂_t )² / V(µ̂_t) = (1/(T−p)) Σ_t ( Y_t − µ̂_t )² / µ'(θ̂_t),

where µ̂_t = g⁻¹(β̂'x_t) and θ̂_t = µ⁻¹(g⁻¹(β̂'x_t)). Recall that the above resembles estimators of the residual variance. Indeed, it can be argued that the distribution of (T−p)φ̂/φ is close to χ²_{T−p}.

Remark We mention that a slight generalisation of the above is when the dispersion parameter satisfies φ_t = a_t φ, where a_t is known. In this case, an estimator of φ is

    φ̂ = (1/(T−p)) Σ_t ( Y_t − µ̂_t )² / ( a_t V(µ̂_t) ) = (1/(T−p)) Σ_t ( Y_t − µ̂_t )² / ( a_t µ'(θ̂_t) ).

Deviance, scaled deviance and residual deviance

The scaled deviance

Instead of minimising a sum of squares (which is done for linear models), we have been maximising a log-likelihood L_T(β). Furthermore, we recall that

    S(β̂) = Σ_t r_t² = Σ_t ( Y_t − β̂_0 − Σ_{j=1}^q β̂_j x_{t,j} − Σ_{j=q+1}^p β̂_j x_{t,j} )²

is a numerical summary of how well the linear model fits; S(β̂) = 0 means a perfect fit. In this section we will define the equivalent of residuals and squared residuals for GLMs.

What is the best we can do in fitting a GLM? Recall

    l_t = ( Y_t θ_t − κ(θ_t) )/φ + c(Y_t,φ),

so

    dl_t/dθ_t = 0  ⟺  Y_t − κ'(θ_t) = 0.

A model that achieves this equality for all t is called saturated (the same terminology is used for linear models). In other words, we need one free parameter for each observation (the alternative g(κ'(θ_t)) = β'x_t is a restriction). Denote the corresponding θ_t by θ̃_t, i.e. the solution of κ'(θ̃_t) = Y_t. Now consider the difference

    2{ l_t(θ̃_t) − l_t(θ_t) } = (2/φ){ Y_t(θ̃_t − θ_t) − κ(θ̃_t) + κ(θ_t) }.
Maximising the likelihood is the same as minimising this quantity, which is always non-negative and is 0 only if there is a perfect fit to the tth observation. This is analogous to linear models, where maximising the normal likelihood is the same as minimising the least squares criterion, which is zero when the fit is perfect. Thus L̃_T = Σ_t l_t(θ̃_t) provides a baseline value for the log-likelihood.

Example (The normal linear model) Here κ(θ_t) = θ_t²/2, κ'(θ_t) = θ_t = µ_t, θ̃_t = Y_t and φ = σ², so

    2{ l_t(θ̃_t) − l_t(θ_t) } = (2/σ²){ Y_t·Y_t − Y_tµ_t − (1/2)Y_t² + (1/2)µ_t² } = ( Y_t − µ_t )²/σ².

Hence 2{ l_t(θ̃_t) − l_t(θ_t) } recovers the classical squared residual in the normal case.

In general, let

    D_t = 2{ Y_t( θ̃_t − θ̂_t ) − κ(θ̃_t) + κ(θ̂_t) }.

We call D = Σ_t D_t the deviance of the model. If φ is present,

    D/φ = 2{ L̃_T − L_T(θ̂) }

is the scaled deviance. Thus the residual deviance plays the same role for GLMs as the residual sum of squares does for linear models.

Interpreting D_t

We will now show that

    D_t = 2{ Y_t( θ̃_t − θ̂_t ) − κ(θ̃_t) + κ(θ̂_t) } ≈ ( Y_t − µ̂_t )² / V(µ̂_t).

To show the above we require expressions for Y_t( θ̃_t − θ̂_t ) and κ(θ̃_t) − κ(θ̂_t). We use Taylor's theorem to expand κ and κ' about θ̂_t, to obtain

    κ(θ̃_t) ≈ κ(θ̂_t) + ( θ̃_t − θ̂_t )κ'(θ̂_t) + (1/2)( θ̃_t − θ̂_t )²κ''(θ̂_t)   (17.9)

and

    κ'(θ̃_t) ≈ κ'(θ̂_t) + ( θ̃_t − θ̂_t )κ''(θ̂_t).   (17.10)
But κ'(θ̃_t) = Y_t, κ'(θ̂_t) = µ̂_t and κ''(θ̂_t) = V(µ̂_t), so (17.9) becomes

    κ(θ̃_t) − κ(θ̂_t) ≈ ( θ̃_t − θ̂_t )µ̂_t + (1/2)( θ̃_t − θ̂_t )²V(µ̂_t)   (17.11)

and (17.10) becomes

    Y_t ≈ µ̂_t + ( θ̃_t − θ̂_t )V(µ̂_t)  ⟹  θ̃_t − θ̂_t ≈ ( Y_t − µ̂_t )/V(µ̂_t).   (17.12)

Now substituting (17.11) and (17.12) into D_t gives

    D_t = 2{ Y_t( θ̃_t − θ̂_t ) − κ(θ̃_t) + κ(θ̂_t) }
        ≈ 2{ Y_t( θ̃_t − θ̂_t ) − ( θ̃_t − θ̂_t )µ̂_t − (1/2)( θ̃_t − θ̂_t )²V(µ̂_t) }
        = 2( θ̃_t − θ̂_t )( Y_t − µ̂_t ) − ( θ̃_t − θ̂_t )²V(µ̂_t)
        ≈ ( Y_t − µ̂_t )² / V(µ̂_t).

Recalling that var(Y_t) = φV(µ_t) and E(Y_t) = µ_t, D_t/φ is like a standardised squared residual. The signed square root of this approximation,

    sign( Y_t − µ̂_t ) √( ( Y_t − µ̂_t )² / V(µ̂_t) ) = ( Y_t − µ̂_t ) / √V(µ̂_t),

is called the Pearson residual. The distribution theory for this is very approximate, but a rule of thumb is that if the model fits, the scaled deviance D/φ (or in practice D/φ̂) ≈ χ²_{T−p}.

Deviance residuals

The analogy with the normal example can be taken further. The square roots of the individual terms in the residual sum of squares are the residuals, Y_t − β̂'x_t. We use the square roots of the individual terms in the deviance in the same way. However, the classical residuals can be both negative and positive, and the deviance residuals should behave in a similar way. But what sign should be used? The most obvious solution is to use

    r_t = −√D_t   if Y_t − µ̂_t < 0,
    r_t = +√D_t   if Y_t − µ̂_t ≥ 0.

We call the quantities {r_t} the deviance residuals.

Diagnostic plots
We recall that for linear models we would often plot the residuals against the regressors, to check whether a linear model is appropriate. One can make useful diagnostic plots which have exactly the same form for GLMs, except that deviance residuals are used instead of ordinary residuals, and linear predictor values instead of fitted values.

Limiting distributions and standard errors of estimators

In the majority of examples we have considered in the previous sections (see, for example, Sections 1.1 and 4.1) we observed iid {Y_t} with distribution f(·;θ). We showed that

    θ̂_T ≈ N( θ_0, 1/(T I(θ_0)) ),

where I(θ) = −∫ ( ∂²log f(x;θ)/∂θ² ) f(x;θ) dx is Fisher's information. However, this result was based on the observations being iid. In the more general setting where {Y_t} are independent but not identically distributed, we have

    β̂ ≈ N_p( β, I(β)⁻¹ ),

where now I(β) is the p×p information matrix of the entire sample; using equation (17.7) we have

    I(β)_jk = −E[ ∂²L_T/∂β_j∂β_k ] = Σ_t E[ −d²l_t/dη_t² ] x_{tj} x_{tk} = ( X'W̄X )_jk.

Thus for large samples we have

    β̂ ≈ N_p( β, ( X'W̄X )⁻¹ ),

where W̄ is evaluated at the MLE β̂.

The analysis of deviance

How can we test hypotheses about models, and in particular decide which explanatory variables to include? The two closely related methods we will consider below are the log-likelihood ratio test and an analogue of the analysis of variance (ANOVA), called the analysis of deviance. Let us concentrate on the simplest case, of testing a full vs. a reduced model. Partition the model matrix X and the parameter vector β as

    X = [ X_1  X_2 ],   β = ( β_1', β_2' )',

where X_1 is T×q and β_1 is q×1 (this is analogous to equation (17.1) for linear models). The full model is η = Xβ = X_1β_1 + X_2β_2 and the reduced model is η = X_1β_1. We wish to test

    H_0: β_2 = 0,

i.e. that the reduced model is adequate for the data.
Define the scaled deviances for the reduced and full models,

    D_R/φ = 2{ L̃_T − sup_{β_1, β_2=0} L_T(β) }   and   D_F/φ = 2{ L̃_T − sup_{β_1, β_2} L_T(β) },

where we recall that L̃_T = Σ_t l_t(θ̃_t) is the likelihood of the saturated model defined above. Taking differences we have

    ( D_R − D_F )/φ = 2{ sup_{β_1,β_2} L_T(β) − sup_{β_1, β_2=0} L_T(β) },

which is the likelihood ratio statistic. The results in Theorem 6.1.1, equation (6.5) (the log-likelihood ratio test for composite hypotheses) also hold for observations which are not identically distributed. Hence, using a generalised version of that theorem, we have

    ( D_R − D_F )/φ →D χ²_{p−q}.

So we can conduct a test of the adequacy of the reduced model by referring ( D_R − D_F )/φ to a χ²_{p−q} distribution, and rejecting H_0 if the statistic is too large (p-value too small).

If φ is not present in the model, then we are good to go. If φ is present and unknown, we estimate it with

    φ̂ = (1/(T−p)) Σ_t ( Y_t − µ̂_t )² / V(µ̂_t) = (1/(T−p)) Σ_t ( Y_t − µ̂_t )² / µ'(θ̂_t),

as above. We can then continue to use the χ²_{p−q} distribution for ( D_R − D_F )/φ̂, but since we are estimating φ we can instead use the statistic

    ( ( D_R − D_F )/(p−q) ) / ( D_F/(T−p) ),

referred to an F_{p−q, T−p} distribution, as in the normal case.

Examples

Example (Question) Suppose that {Y_t} are independent random variables from the canonical exponential family, whose logarithm satisfies

    log f(y;θ_t) = ( yθ_t − κ(θ_t) )/φ + c(y;φ),

where φ is the dispersion parameter. Let E(Y_t) = µ_t, and let η_t = β'x_t = θ_t (hence the canonical link is used), where x_t are regressors which influence Y_t. [14]
(a) (i) Obtain the log-likelihood of {Y_t, x_t}_{t=1}^T.

(ii) Denote the log-likelihood of {Y_t, x_t}_{t=1}^T as L_T(β). Show that

    ∂L_T/∂β_j = Σ_t ( Y_t − µ_t ) x_{t,j}/φ   and   ∂²L_T/∂β_k∂β_j = −Σ_t κ''(θ_t) x_{t,j} x_{t,k}/φ.

(b) Let Y_t have a Gamma distribution, where the log density has the form

    log f(y_t;µ_t) = ( −y_t/µ_t − log µ_t )/ν + { ν⁻¹ log ν⁻¹ − log Γ(ν⁻¹) } + ( ν⁻¹ − 1 ) log y_t,

with E(Y_t) = µ_t, var(Y_t) = νµ_t² and η_t = β'x_t = g(µ_t).

(i) What is the canonical link function for the Gamma distribution? Write down the corresponding likelihood of {Y_t, x_t}_{t=1}^T.

(ii) Suppose that η_t = β'x_t = β_0 + β_1 x_{t,1}. Denote the likelihood as L_T(β_0,β_1). What are the first and second derivatives of L_T(β_0,β_1)?

(iii) Evaluate the Fisher information matrix at β_0 and β_1 = 0.

(iv) Using your answers in (ii), (iii) and the MLE of β_0 with β_1 = 0, derive the score test for testing H_0: β_1 = 0 against H_A: β_1 ≠ 0.

Solution

(a) (i) The general log-likelihood for {Y_t, x_t} with the canonical link function is

    L_T(β,φ) = Σ_t { ( Y_t β'x_t − κ(β'x_t) )/φ + c(Y_t,φ) }.

(ii) In the differentiation, use that κ'(θ_t) = κ'(β'x_t) = µ_t.

(b) (i) For the Gamma distribution the canonical link is (up to sign) θ_t = 1/µ_t = β'x_t. Thus the log-likelihood is

    L_T(β) = (1/ν) Σ_t { −Y_t β'x_t + log(β'x_t) } + Σ_t c(ν⁻¹, Y_t),

where c can be evaluated from the density above.

(ii) Using part (a)(ii) above,

    ∂L_T/∂β_j = (1/ν) Σ_t ( −Y_t + 1/(β'x_t) ) x_{t,j}
    ∂²L_T/∂β_i∂β_j = −(1/ν) Σ_t x_{t,i} x_{t,j} / (β'x_t)².
(iii) Take the expectation of the above at a general β_0 and β_1 = 0.
(iv) Using the above information, use the Wald test, score test or log-likelihood ratio test.

Example (Question): It is a belief amongst farmers that the age of a hen has a negative influence on the number of eggs she lays and on the quality of the eggs. To investigate this, m hens were randomly sampled. On a given day, the total number of eggs and the number of bad eggs that each of the hens lays is recorded. Let N_i denote the total number of eggs hen i lays, Y_i the number of bad eggs she lays and x_i the age of hen i. It is known that the number of eggs a hen lays follows a Poisson distribution and that the quality (good or bad) of a given egg is independent of the other eggs. Let N_i be a Poisson random variable with mean λ_i, where we model λ_i = exp(α_0 + γ_1 x_i), and let π_i denote the probability that hen i lays a bad egg, where we model π_i with

π_i = exp(β_0 + γ_1 x_i)/(1 + exp(β_0 + γ_1 x_i)).

Suppose that α_0, β_0, γ_1 are unknown parameters.
(a) Obtain the likelihood of {(N_i, Y_i)}_{i=1}^m.
(b) Obtain the estimating function (score) of the likelihood and the information matrix.
(c) Obtain an iterative algorithm for estimating the unknown parameters.
(d) For a given (α_0, β_0, γ_1), evaluate the average number of bad eggs a 4 year old hen will lay in one day.
(e) Describe in detail a method for testing H_0: γ_1 = 0 against H_A: γ_1 ≠ 0.

Solution
(a) Since the canonical links are being used, the log-likelihood function is

L_m(α_0, β_0, γ_1) = L_m(Y|N) + L_m(N)
= Σ_{i=1}^m [ Y_i β'x_i − N_i log(1 + exp(β'x_i)) + log (N_i choose Y_i) + N_i α'x_i − exp(α'x_i) − log N_i! ],

where α = (α_0, γ_1)', β = (β_0, γ_1)' and x_i = (1, x_i)'.
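Part (a)(ii) of the first question can be checked numerically for the Poisson special case (φ = 1, κ(θ) = e^θ): the analytic score Σ_t (Y_t − μ_t)x_{t,j} should agree with a finite-difference derivative of the log-likelihood. The data below are hypothetical.

```python
import math

def poisson_loglik(beta, x, y):
    """Log-likelihood (up to the constant log y!) for Poisson with canonical log link."""
    ll = 0.0
    for xt, yt in zip(x, y):
        eta = beta[0] + beta[1] * xt
        ll += yt * eta - math.exp(eta)
    return ll

def analytic_score(beta, x, y):
    """Score for the canonical link: sum_t (y_t - mu_t) x_t."""
    s0 = s1 = 0.0
    for xt, yt in zip(x, y):
        mu = math.exp(beta[0] + beta[1] * xt)
        s0 += yt - mu
        s1 += (yt - mu) * xt
    return [s0, s1]

# hypothetical data and parameter point
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 5.0, 9.0]
beta = [0.3, 0.6]

h = 1e-6
num0 = (poisson_loglik([beta[0] + h, beta[1]], x, y)
        - poisson_loglik([beta[0] - h, beta[1]], x, y)) / (2 * h)
num1 = (poisson_loglik([beta[0], beta[1] + h], x, y)
        - poisson_loglik([beta[0], beta[1] - h], x, y)) / (2 * h)
ana = analytic_score(beta, x, y)
```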
(b) We know that if the canonical link is used the score is

∂L/∂β = Σ_{i=1}^m φ^{−1}( Y_i − κ'(β'x_i) )x_i = Σ_{i=1}^m ( Y_i − μ_i )x_i

and the second derivative is

∂²L/∂β∂β' = −Σ_{i=1}^m φ^{−1}κ''(β'x_i)x_i x_i' = −Σ_{i=1}^m var(Y_i)x_i x_i'.

Using the above, for this question the score is

∂L_m/∂α_0 = Σ_{i=1}^m (N_i − λ_i)
∂L_m/∂β_0 = Σ_{i=1}^m (Y_i − N_i π_i)
∂L_m/∂γ_1 = Σ_{i=1}^m [ (N_i − λ_i) + (Y_i − N_i π_i) ] x_i.

The second derivatives are

∂²L_m/∂α_0² = −Σ_{i=1}^m λ_i, ∂²L_m/∂α_0∂γ_1 = −Σ_{i=1}^m λ_i x_i,
∂²L_m/∂β_0² = −Σ_{i=1}^m N_i π_i(1−π_i), ∂²L_m/∂β_0∂γ_1 = −Σ_{i=1}^m N_i π_i(1−π_i) x_i,
∂²L_m/∂γ_1² = −Σ_{i=1}^m [ λ_i + N_i π_i(1−π_i) ] x_i².

Observing that E(N_i) = λ_i, the information matrix is

I(θ) =
[ Σ λ_i                 0                       Σ λ_i x_i
  0                     Σ λ_iπ_i(1−π_i)         Σ λ_iπ_i(1−π_i) x_i
  Σ λ_i x_i             Σ λ_iπ_i(1−π_i) x_i     Σ ( λ_i + λ_iπ_i(1−π_i) ) x_i² ].

(c) We can estimate θ_0 = (α_0, β_0, γ_1) using Newton–Raphson with Fisher scoring, that is

θ_i = θ_{i−1} + I(θ_{i−1})^{−1} S(θ_{i−1}),

where S(θ_{i−1}) is the score vector evaluated at θ_{i−1}:

S(θ) = ( Σ (N_i − λ_i), Σ (Y_i − N_iπ_i), Σ [ (N_i − λ_i) + (Y_i − N_iπ_i) ] x_i )'.
(d) We note that, given the regressor x_i = 4, the average number of bad eggs is

E(Y_i) = E( E(Y_i|N_i) ) = E(N_i)π_i = λ_iπ_i = exp(α_0 + γ_1 x_i) · exp(β_0 + γ_1 x_i)/(1 + exp(β_0 + γ_1 x_i)).

(e) Give either the log-likelihood ratio test, score test or Wald test.

Example (Question)
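Part (d) can be sketched in code; the parameter values below are hypothetical, chosen only to illustrate the formula E(Y_i) = λ_iπ_i.

```python
import math

def expected_bad_eggs(alpha0, beta0, gamma1, x):
    """E(Y) = E(N) * pi = lambda * pi, with lambda = exp(alpha0 + gamma1*x)
    and pi = logistic(beta0 + gamma1*x)."""
    lam = math.exp(alpha0 + gamma1 * x)
    pi = math.exp(beta0 + gamma1 * x) / (1.0 + math.exp(beta0 + gamma1 * x))
    return lam * pi

# hypothetical parameter values for a 4 year old hen
mean_bad = expected_bad_eggs(alpha0=1.5, beta0=-2.0, gamma1=0.1, x=4.0)
```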
Chapter 18
Count Data

18.1 Proportion data, count data and overdispersion

See Sections 10.4 and 10.5 of Davison (2002) and Agresti (1996) for a thorough theoretical account, and Simonoff (2003) for a very interesting, applied perspective.

In the previous section we generalised the linear model framework to the exponential family. GLMs are often used for modelling count data; in these cases usually the binomial, Poisson or multinomial distributions are used. Hence we observe (Y_t, x_t) for 1 ≤ t ≤ T. Types of data and their distributions:

Binomial: regressors x_t, response Y_t = (Y_{t,1}, Y_{t,2}) = (Y_t, N − Y_t), with
P(Y_{t,1} = k, Y_{t,2} = N−k) = (N choose k) π(β'x_t)^k (1 − π(β'x_t))^{N−k}.

Poisson: regressors x_t, response Y_t, with
P(Y_t = k) = λ(β'x_t)^k exp(−λ(β'x_t))/k!.

Multinomial: regressors x_t, response Y_t = (Y_{t,1}, ..., Y_{t,m}) with Σ_i Y_{t,i} = N, and
P(Y_{t,1} = k_1, ..., Y_{t,m} = k_m) = [ N!/(k_1!···k_m!) ] π_1(β'x_t)^{k_1} ··· π_m(β'x_t)^{k_m}.

From the data we are able to select the distribution we want to fit, and thus estimate β. In this section we will mainly be dealing with count data where the regressors tend to be ordinal (not continuous). This type of data normally comes in the form of a contingency table. One of the most common types of contingency table is the two-by-two table, and we consider this in the section below.
Two by Two Tables

Consider the following 2 × 2 contingency table, cross-classifying colour preference (blue or pink) against gender (male or female), with row, column and overall totals. (The cell counts did not survive transcription.)

Given the above table, it is natural to ask whether there is an association between gender and colour preference. The standard method is to test for independence. However, we could also pose the question in a different way: is the proportion of females who like blue the same as the proportion of males who like blue? In this case we can equivalently test for equality of proportions (this equivalence usually only holds for 2 by 2 tables). There are various methods for testing the above hypothesis:

The log-likelihood ratio test.
The score test.
The Wald test (we have not covered this in much detail, but it is basically the parameter estimator standardised with the square root of the inverse Fisher information).
Pearson residuals (which are the main motivation of the chi-squared test for independence).

The test statistic we construct depends on the test that we wish to conduct and how we choose to model the data. The fact that there can be so many tests for doing the same thing can be quite baffling. But recall that in Section 4 we showed that asymptotically most of these tests are equivalent, meaning that for large sample sizes the test statistics will be close, or close under transformation (this can be shown by making a Taylor expansion). The differences between the tests lie in their finite sample properties.

Example (Test for independence) For example, the chi-squared test for independence is based upon the Pearson residuals

Σ_{i,j} (O_{i,j} − E_{i,j})²/E_{i,j},

where O_{i,j} has a Poisson distribution with mean λ_{i,j} and E_{i,j} is an estimator of λ_{i,j} under the assumption of independence. However, we can also use the log-likelihood ratio test. Both these
tests are asymptotically equivalent, but the log-likelihood ratio test offers an explanation as to why the limiting distribution of both is a chi-squared with (R−1)(C−1) degrees of freedom (see Exercise 4.37, page 135 of Davison (2002)).

Let us consider the alternative approach, testing for equality of proportions. Let π_M denote the proportion of males who prefer pink over blue and π_F the proportion of females who prefer pink over blue. Suppose we want to test H_0: π_F = π_M against H_A: π_F ≠ π_M. One method for testing the above hypothesis is the Wald test for equality of proportions, which gives the test statistic

(π̂_F − π̂_M)I(π)^{1/2} = (π̂_F − π̂_M) / sqrt{ π̂_M(1−π̂_M)/n_1 + π̂_F(1−π̂_F)/n_2 }.

We know that if the sample size is sufficiently large, then under the null the above has a limiting standard normal distribution. It can be shown that this is asymptotically equivalent to the log-likelihood ratio and score tests (show this).

An alternative route for conducting the test is to parameterise π_M and π_F and base the test on the parametrisation. For example, without loss of generality we can rewrite π_M and π_F as

π_F = exp(γ)/(1 + exp(γ)), π_M = exp(γ+δ)/(1 + exp(γ+δ)).

Hence, using this parameterisation, the above test is equivalent to testing H_0: δ = 0 against H_A: δ ≠ 0. We can then use the log-likelihood ratio test (or any of the others). Davison (2002), Section 4, discusses the advantages of using a test based on this reparameterisation in prospective and retrospective studies; a course on biostatistics would give more details.

Count data and contingency tables

Consider the following experiment. Suppose we want to know whether ethnicity plays any role in the number of children a female has. We take a large sample of women and determine each woman's ethnicity and number of children. The data is collected in the form of a 3 × 2 contingency table, with columns Background A and Background B. (The cell counts did not survive transcription.) How can such data arise? There are several ways this data could have been
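The Wald statistic for equality of two proportions can be sketched as follows; since the table's cell counts did not survive transcription, the counts below are hypothetical.

```python
import math

def wald_two_proportions(y_m, n_m, y_f, n_f):
    """Wald statistic for H0: pi_M = pi_F; approximately N(0,1) under H0.
    y_* are 'successes' (e.g. prefer pink), n_* are group sizes."""
    p_m = y_m / n_m
    p_f = y_f / n_f
    se = math.sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)
    return (p_f - p_m) / se

# hypothetical counts: 30/100 males and 45/100 females prefer pink
z = wald_two_proportions(y_m=30, n_m=100, y_f=45, n_f=100)
```

Under the null, |z| is compared with the standard normal quantile (e.g. 1.96 at the 5% level).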
collected, and this influences the model we choose to fit to this data. Consider the general 2-dimensional case, an R × C table, with cells indexed by (i,j). Note that in the above example R = 2 and C = 3.

(a) The subjects arise at random; the study continues until a fixed time elapses. Each subject is categorised according to two variables. Suppose the number in cell (i,j) is Y_ij; then it is reasonable to assume Y_ij ∼ Poisson(λ_ij) for some {λ_ij}, which will be the focus of study. In this case the distribution is

P(Y = y) = Π_{i,j} λ_ij^{y_ij} exp(−λ_ij)/y_ij!.

(b) The total number of subjects is fixed at N, say. The numbers in cells follow a multinomial distribution, (Y_ij) ∼ M(N; (π_ij)) with Σ_{i,j} y_ij = N:

P(Y = y) = [ N!/Π_{i,j} y_ij! ] Π_{i,j} π_ij^{y_ij}.

(c) One margin is fixed: say the column totals y_{+j} = Σ_i y_ij for each j = 1,2,...,C. In each column we have an independent multinomial sample:

P(Y = y) = Π_j [ y_{+j}!/Π_i y_ij! ] Π_i ρ_ij^{y_ij},

where ρ_ij is the probability that a column-j individual is in row i (so ρ_{+j} = Σ_i ρ_ij = 1).

Of course, without knowledge of how the data was collected it is not possible to know which model to use. However, we now show that all the models are closely related, and with a suitable choice of link functions the models lead to the same estimators. We will only show the equivalence between cases (a) and (b); a similar argument can be extended to case (c). We start by showing that if π_ij and λ_ij are related in a certain way, then the log-likelihoods of both the Poisson and the multinomial are effectively the same. Define the following log-likelihoods for the Poisson, the multinomial and the sum of independent Poissons:

L_P(λ) = Σ_{i,j} [ y_ij log λ_ij − λ_ij − log y_ij! ]
L_M(π) = log[ N!/Π_{i,j} y_ij! ] + Σ_{i,j} y_ij log π_ij
L_F(λ_{++}) = N log λ_{++} − λ_{++} − log N!.
We observe that L_P is the log distribution of {y_ij} under Poisson sampling, L_M is the log distribution of {y_ij} under multinomial sampling, and L_F is the log distribution of Σ_{i,j} Y_ij, where the Y_ij are independent Poisson random variables each with mean λ_ij, N = Σ_{i,j} Y_ij and λ_{++} = Σ_{i,j} λ_ij.

Theorem Let L_P, L_M and L_F be defined as above. Suppose λ and π are related through

π_ij = λ_ij / Σ_{s,t} λ_{st} and λ_ij = Cπ_ij,

where C is independent of (i,j). Then we have L_P(λ) = L_M(π) + L_F(C).

PROOF. The proof is straightforward. Consider the log-likelihood of the Poisson:

L_P(λ) = Σ_{i,j} [ y_ij log λ_ij − λ_ij − log y_ij! ]
= Σ_{i,j} [ y_ij log(Cπ_ij) − Cπ_ij − log y_ij! ]
= Σ_{i,j} y_ij log π_ij + N log C − C − Σ_{i,j} log y_ij!   (using Σ y_ij = N and Σ π_ij = 1)
= [ log N! − Σ_{i,j} log y_ij! + Σ_{i,j} y_ij log π_ij ] + [ N log C − C − log N! ]
= L_M(π) + L_F(C),

which leads to the required result.

Remark The above result means that the likelihood of the independent Poissons, conditioned on the total number of participants being N, is equal to the likelihood of the multinomial distribution where the relationship between the probabilities and means is given above. The result basically means that, so long as the probabilities and means are connected through π_ij = λ_ij/Σ_{s,t}λ_{st} and λ_ij = Cπ_ij, it does not matter whether the multinomial distribution or the Poisson distribution is used to do the estimation. Before we show this result we first illustrate a few models which are commonly used for categorical data.

Example Let us consider suitable models for the number of children and ethnicity data. Let us start by fitting a multinomial distribution using the logistic link; we start by modelling β'x. One possible model is

β'x = η + α_1δ_1 + α_2δ_2 + α_3δ_3 + β_1δ̃_1 + β_2δ̃_2,
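The identity L_P(λ) = L_M(π) + L_F(λ_{++}) in the theorem above can be verified numerically on a small (hypothetical) table:

```python
import math

def log_poisson_lik(y, lam):
    """sum_ij [ y_ij log(lam_ij) - lam_ij - log(y_ij!) ] over a flattened table."""
    return sum(yij * math.log(lij) - lij - math.lgamma(yij + 1)
               for yij, lij in zip(y, lam))

def log_multinomial_lik(y, pi):
    n = sum(y)
    out = math.lgamma(n + 1) - sum(math.lgamma(yij + 1) for yij in y)
    return out + sum(yij * math.log(pij) for yij, pij in zip(y, pi))

def log_total_poisson(n, lam_pp):
    """log P(N = n) for N ~ Poisson(lam_++)."""
    return n * math.log(lam_pp) - lam_pp - math.lgamma(n + 1)

# hypothetical 2x2 table (flattened) with arbitrary Poisson means
y   = [3, 7, 2, 8]
lam = [2.0, 6.0, 3.0, 9.0]
lam_pp = sum(lam)
pi = [l / lam_pp for l in lam]
N = sum(y)

lhs = log_poisson_lik(y, lam)
rhs = log_multinomial_lik(y, pi) + log_total_poisson(N, lam_pp)
```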
where δ_i = 1 if the female has i children (and zero otherwise), δ̃_1 = 1 if the female belongs to ethnic group A (and zero otherwise) and δ̃_2 = 1 if the female belongs to ethnic group B (and zero otherwise). The regressors in this example are x = (1, δ_1, ..., δ̃_2). Hence for a given cell (i,j) we have β'x_ij = η_ij = η + α_i + β_j. One condition that we usually impose when doing the estimation is Σ_{i=1}^3 α_i = 0 and β_1 + β_2 = 0. These conditions make the system identifiable: without them there exist other {α̃_i}, {β̃_j} and η̃ such that η_ij = η + α_i + β_j = η̃ + α̃_i + β̃_j.

Now let us understand what the above linear model means in terms of probabilities. Using the logistic link we have

π_ij = g^{−1}(β'x_ij) = exp(η + α_i + β_j) / Σ_{s,t} exp(η + α_s + β_t) = [ exp(α_i)/Σ_s exp(α_s) ] · [ exp(β_j)/Σ_t exp(β_t) ],

where π_ij denotes the probability of having i children and belonging to ethnic group j, and x_ij is a vector with ones in the appropriate places. What we observe is that the above model is multiplicative, that is π_ij = π_{i+}π_{+j}, where π_{i+} = Σ_j π_ij and π_{+j} = Σ_i π_ij. This means that by fitting the above model we are assuming independence between ethnicity and number of children. To model dependence we would use an interaction term in the model,

β'x = η + α_1δ_1 + α_2δ_2 + α_3δ_3 + β_1δ̃_1 + β_2δ̃_2 + Σ_{i,j} γ_ij δ_i δ̃_j,

hence η_ij = η + α_i + β_j + γ_ij. However, for R × C tables an interaction term means the model is saturated, i.e. the MLE of the probability π_ij is simply y_ij/N. But for R × C × L tables we can model interactions without the model becoming saturated. These interactions may have interesting interpretations in terms of the dependence structure between two variables. By using the analysis of deviance (which is effectively the log-likelihood ratio test) we can test whether certain interaction terms are significant; similar things were done for linear models. The distribution theory comes from what we derived in Sections 4 and 16.
Now, if we transform the above probabilities into Poisson means using λ_ij = Cπ_ij, then in the case of no interaction the mean of the Poisson at cell (i,j) is λ_ij = C exp(η + α_i + β_j).
In the above we have considered various methods for modelling the probabilities in the multinomial and Poisson distributions. In the theorem below we show that, so long as the probabilities and the Poisson means are linked in a specific way, the estimators of β will be identical.

Theorem Let us suppose that π_ij and λ_ij are defined by

π_ij = π_ij(β), λ_ij = γC(β)π_ij(β),

where γ and β are the only unknowns. Let

L_P(β,γ) = Σ_{i,j} [ y_ij log(γC(β)π_ij(β)) − γC(β)π_ij(β) ]
L_M(β) = Σ_{i,j} y_ij log π_ij(β)
L_F(β,γ) = N log(γC(β)) − γC(β),

which are the log-likelihoods for the Poisson and multinomial distributions without unnecessary constants (such as y_ij!). Define

(β̂_P, γ̂_P) = argmax L_P(β,γ), β̂_M = argmax L_M(β), γ̂_F = argmax_γ L_F(β̂_M, γ).

Then β̂_P = β̂_M and γ̂_P = γ̂_F = N/C(β̂_M).

PROOF. We first consider L_P(β,γ). Since Σ_{i,j} π_ij(β) = 1 we have

L_P(β,γ) = Σ_{i,j} y_ij log(γC(β)π_ij(β)) − γC(β) Σ_{i,j} π_ij(β)
= Σ_{i,j} y_ij log π_ij(β) + N log(γC(β)) − γC(β).

Now we consider the partial derivatives of L_P, to obtain

∂L_P/∂β = ∂L_M/∂β + γ (∂C(β)/∂β) [ N/(γC(β)) − 1 ] = 0
∂L_P/∂γ = N/γ − C(β) = 0.

Solving the above, β̂_P and γ̂_P satisfy

γ̂_P = N/C(β̂_P), ∂L_M/∂β |_{β=β̂_P} = 0. (18.1)
Now we consider the partial derivatives of L_M and L_F:

∂L_M/∂β = 0, ∂L_F/∂γ = N/γ − C(β) = 0. (18.2)

Comparing the estimators in (18.1) and (18.2), it is clear that the maximum likelihood estimators of β based on the Poisson and the multinomial distributions are the same.

Example Let us consider fitting the Poisson and the multinomial distributions to the data in a contingency table, where π_ij and λ_ij satisfy

λ_ij = exp(η + β'x_ij) and π_ij = exp(β'x_ij) / Σ_{s,t} exp(β'x_{s,t}).

Making a comparison with λ_ij(β) = γC(β)π_ij(β), we see that γ = exp(η) and C(β) = Σ_{s,t} exp(β'x_{s,t}). Then, by using the above theorem, the estimator of β is the parameter which maximises

Σ_{i,j} y_ij log[ exp(β'x_ij) / Σ_{s,t} exp(β'x_{s,t}) ],

and the estimator of γ is the parameter which maximises N log(exp(η)C(β̂)) − exp(η)C(β̂), which gives η̂ = log N − log Σ_{s,t} exp(β̂'x_{s,t}).

Overdispersion

The binomial and Poisson distributions have the disadvantage that they are determined by only one parameter (π in the case of the binomial and λ in the case of the Poisson). This can be a severe disadvantage when it comes to modelling certain types of behaviour in the data. A common type of behaviour in count data is overdispersion, in the sense that the variance appears to be larger than the model variance. For example, if we fit a Poisson to the data the mean is λ, but the variance is larger than λ. Often the data can suggest overdispersion. For example, we showed in Section that the Pearson residuals for the Poisson are

r_t = (Y_t − μ̂_t)/(φ^{1/2}V(μ̂_t)^{1/2}) = (Y_t − μ̂_t)/√μ̂_t.

If the model is correct, the residuals {r_t} should be close to a standard normal distribution. However, in the case of overdispersion it is likely that the estimated variance of r_t will be greater than one. Alternatively, if var(X_t)/E(X_t) depends on μ_t, then we should see some dependence between r_t and μ̂_t in a plot. More quantitatively, this may mean that a goodness-of-fit test gives rise to a large test statistic (small p-value), etc.
The modelling of overdispersion can be introduced in various ways. Below we discuss various ways of introducing overdispersion, mainly for Poisson models.

(i) Zero-inflated models. The number of zeros in count data can sometimes be more inflated than the Poisson or binomial distributions are capable of modelling (for example, if we model the number of times a child visits the dentist, we may observe that there is a large probability the child will not visit the dentist at all). To model this type of behaviour we can use the zero-inflated Poisson model, where

P(Y = 0) = (1−p) + p exp(−λ), P(Y = k) = p exp(−λ)λ^k/k! for k > 0.

We observe that the above is effectively a mixture model. It is straightforward to show that E(Y) = pλ and var(Y) = pλ(1 + λ(1−p)), hence

var(Y)/E(Y) = 1 + λ(1−p).

We observe that there is more dispersion here than for the classical Poisson, where var(X_t)/E(X_t) = 1.

(ii) Modelling overdispersion through moments. The above model is specific to boosting zeros. One can introduce overdispersion by simply modelling the moments. That is, define a pseudo-Poisson model in terms of its moments, where E(X_t) = λ and var(X_t) = λ(1+δ), δ ≥ 0. This method does not specify the distribution, it just gives conditions on the moments. We can do similar moment boosting for the binomial distribution too.

(iii) Modelling overdispersion with another distribution. Another method of introducing overdispersion is to include a latent (unobserved) variable ε. Let us assume that ε is a positive random variable with E(ε) = 1 and var(ε) = ξ. We suppose that the distribution of Y conditioned on ε is Poisson, i.e.

P(Y = k|ε) = (λε)^k exp(−λε)/k!.

To obtain the moments of Y we note that for any random variable Y we have

var(Y) = E(Y²) − (EY)² = E[ var(Y|ε) ] + var[ E(Y|ε) ],
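A short check of the zero-inflated Poisson moment formulas above (the mixing probability p and rate λ below are hypothetical):

```python
import math

def zip_pmf(k, p, lam):
    """Zero-inflated Poisson: with probability 1-p a structural zero,
    otherwise a Poisson(lam) draw."""
    base = p * math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return (1 - p) + base
    return base

p, lam = 0.7, 3.0
# mean and variance by truncated summation (the tail beyond k = 100 is negligible)
mean = sum(k * zip_pmf(k, p, lam) for k in range(100))
var = sum(k * k * zip_pmf(k, p, lam) for k in range(100)) - mean ** 2
```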
where var(Y|ε) = Σ_{k=0}^∞ k²P(Y=k|ε) − ( Σ_{k=0}^∞ kP(Y=k|ε) )² and E(Y|ε) = Σ_{k=0}^∞ kP(Y=k|ε). Applying the above to the conditional Poisson we have

var(Y) = E(λε) + var(λε) = λ + λ²ξ = λ(1+λξ) and E(Y) = E[ E(Y|ε) ] = λ.

The above gives an expression in terms of moments. If we want to derive the distribution of Y, we require the distribution of ε. This is normally hard to verify in practice, but for reasons of simple interpretation we often let ε have a Gamma distribution,

f(ε; ν, κ) = ν^κ ε^{κ−1} Γ(κ)^{−1} exp(−νε),

where ν = κ, hence E(ε) = 1 and var(ε) = 1/ν = ξ. Therefore, in the case that ε is Gamma with density f(ε; ν, ν) = ν^ν ε^{ν−1} Γ(ν)^{−1} exp(−νε), the distribution of Y is

P(Y = k) = ∫ P(Y = k|ε) f(ε; ν, ν) dε = ∫ [ (λε)^k exp(−λε)/k! ] ν^ν ε^{ν−1} Γ(ν)^{−1} exp(−νε) dε
= [ Γ(k+ν)/(Γ(ν)k!) ] (ν/(ν+λ))^ν (λ/(ν+λ))^k.

This is called a negative binomial, because in the case that ν is an integer it resembles a regular binomial (but can take infinitely many different outcomes). The negative binomial only belongs to the exponential family if ν is known and does not need to be estimated. Not all distributions on ε lead to explicit distributions of Y. The Gamma is popular because it leads to an explicit distribution for Y (often it is called the conjugate distribution). A similar model can also be defined to model overdispersion in proportion data, using a random variable whose conditional distribution is binomial (see page 512, Davison (2002)).

Parameter estimation in the presence of overdispersion

See also Section 10.6, Davison (2002). We now consider various methods for estimating the parameters. Some of the methods described below will be based on the estimating functions and derivations from Section , equation .

Let us suppose that {Y_t} are overdispersed random variables with regressors {x_t}, E(Y_t) = μ_t and g(μ_t) = β'x_t. The natural way to estimate the parameters β is to use a likelihood method. However, the moment-based modelling of the overdispersion does not have a model attached, so it is not possible to use a likelihood method, and the modelling of the overdispersion using,
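The Gamma-mixture (negative binomial) moment formulas E(Y) = λ and var(Y) = λ(1+λξ) can be checked by summing the pmf derived above; the values of λ and ξ below are hypothetical, with ν = 1/ξ.

```python
import math

def negbin_pmf(k, lam, nu):
    """P(Y = k) = Gamma(k+nu)/(Gamma(nu) k!) (nu/(nu+lam))^nu (lam/(nu+lam))^k,
    computed on the log scale for numerical stability."""
    logp = (math.lgamma(k + nu) - math.lgamma(nu) - math.lgamma(k + 1)
            + nu * math.log(nu / (nu + lam)) + k * math.log(lam / (nu + lam)))
    return math.exp(logp)

lam, xi = 2.0, 0.5
nu = 1.0 / xi
# moments by truncated summation (the geometric tail beyond k = 400 is negligible)
mean = sum(k * negbin_pmf(k, lam, nu) for k in range(400))
var = sum(k * k * negbin_pmf(k, lam, nu) for k in range(400)) - mean ** 2
```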
say, a Gamma distribution is based on an assumption that is hard to verify in practice (that the latent variable has a Gamma distribution). Hence, though sometimes it is possible to fit an overdispersed model using likelihood methods (for example the negative binomial), often this is undesirable. An alternative approach is to use moment-based/estimating-function methods, which are more robust to misspecification than likelihood methods. In the estimation we discuss below we will focus on the Poisson case, though it can easily be generalised to the non-Poisson case. Let us return to equation (17.5):

Σ_{t=1}^T [ (Y_t − μ_t)x_{tj} / (φV(μ_t)g'(μ_t)) ] = Σ_{t=1}^T [ (Y_t − μ_t)x_{tj} / (φV(μ_t)) ] (dμ_t/dη_t) = 0, 1 ≤ j ≤ p. (18.3)

In the case of the Poisson distribution with the log link, the above is

Σ_{t=1}^T ( Y_t − exp(β'x_t) )x_{tj} = 0, 1 ≤ j ≤ p. (18.4)

We recall that if {Y_t} are Poisson random variables with mean exp(β'x_t), then the limiting distribution of β̂ is

β̂ − β ≈ N_p( 0, (X'WX)^{−1} ),

where

I(β)_{jk} = −E( ∂²L/∂β_j∂β_k ) = −Σ_t E( d²l_t/dη_t² ) x_{tj}x_{tk} = (X'WX)_{jk}

and

W = diag( −E(∂²l_1/∂η_1²), ..., −E(∂²l_T/∂η_T²) ) = diag( exp(β'x_1), ..., exp(β'x_T) ).

However, as we mentioned in Section , equations (18.3) and (18.4) do not have to be treated as derivatives of a likelihood. They can be viewed as estimating functions, since they are only based on the first and second order moments of {Y_t}. Hence they can be used as the basis of an estimation scheme, even if they are not as efficient as the likelihood. In the overdispersion literature the estimating equations (functions) are often called the quasi-likelihood.

Example Let us suppose that {Y_t} are independent random variables with mean exp(β'x_t). We use the solution of the estimating equations

Σ_{t=1}^T g_j(Y_t; β) = Σ_{t=1}^T ( Y_t − exp(β'x_t) )x_{tj} = 0, 1 ≤ j ≤ p,

to estimate β. We now derive the asymptotic variance for two cases.
(i) Modelling overdispersion through moments. Let us suppose that E(Y_t) = exp(β'x_t) and var(Y_t) = (1+δ)exp(β'x_t), δ ≥ 0. Then, if the regularity conditions are satisfied, we can use the results above to obtain the limiting variance. We observe that

E( −∂/∂β Σ_t g(Y_t;β) ) = X' diag( exp(β'x_1), ..., exp(β'x_T) ) X
var( Σ_t g(Y_t;β) ) = (1+δ) X' diag( exp(β'x_1), ..., exp(β'x_T) ) X.

Hence the limiting variance is

(1+δ)(X'WX)^{−1} = (1+δ)[ X' diag( exp(β'x_1), ..., exp(β'x_T) ) X ]^{−1}.

Therefore, in the case that the variance is (1+δ)exp(β'x_t), the variance of the estimator obtained from the estimating equations Σ_t g(Y_t;β) is larger than for the regular Poisson model. But if δ is quite small, the difference is not much. We mention that, to obtain an estimator of the limiting variance, we need to estimate δ.

(ii) Modelling overdispersion when var(Y_t) = exp(β'x_t)(1+exp(β'x_t)ξ). We mention that in this case the estimating equation Σ_t g(Y_t;β) is not fully modelling the variance. In this case we have

E( −∂/∂β Σ_t g(Y_t;β) ) = X'WX and var( Σ_t g(Y_t;β) ) = X'W̃X,

where

W = diag( exp(β'x_1), ..., exp(β'x_T) )
W̃ = diag( exp(β'x_1)(1+ξexp(β'x_1)), ..., exp(β'x_T)(1+ξexp(β'x_T)) ).

Hence the limiting variance is (X'WX)^{−1}(X'W̃X)(X'WX)^{−1}. We mention that the estimating equation can be adapted to take the overdispersion into account in this case. In other words, we can use as an estimator of β the β which solves

Σ_{t=1}^T ( Y_t − exp(β'x_t) )x_{tj} / (1+ξexp(β'x_t)) = 0, 1 ≤ j ≤ p,

though we mention that we probably also have to estimate ξ when estimating β.
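A minimal sketch of solving the estimating equation (18.4) in the intercept-only case, where the solution is β̂ = log Ȳ whatever the true (possibly overdispersed) variance; the counts below are hypothetical. Overdispersion changes only the sampling variance of β̂, not the estimator itself.

```python
import math

def estimating_eq(beta, y):
    """Intercept-only Poisson estimating function: sum_t (y_t - exp(beta))."""
    return sum(yt - math.exp(beta) for yt in y)

def solve_bisection(y, lo=-10.0, hi=10.0):
    """The estimating function is strictly decreasing in beta, so bisection applies."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if estimating_eq(mid, y) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

y = [1.0, 4.0, 0.0, 7.0, 3.0]   # hypothetical, possibly overdispersed counts
beta_hat = solve_bisection(y)    # should equal log(mean(y))
```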
Chapter 19
Survival Analysis with explanatory variables

19.1 Survival analysis and explanatory variables

In this section we build on the introduction to survival analysis given in Section 12. Here we consider the case that some explanatory variables, such as gender, age etc., may have an influence on survival times. See also Section 10.8, Davison (2002).

We recall that in Section 12 the survival times {T_i} were iid random variables, which may or may not be observed. We observe Y_i = min(T_i, c_i) and the indicator variable δ_i, which tells us whether the individual is censored or not; that is, δ_i = 1 if Y_i = T_i (i.e. the ith individual was not censored) and δ_i = 0 otherwise. In this case, we showed that the likelihood with the censoring times {c_i} treated as deterministic is given in (13.4) as

L_n(θ) = Σ_{i=1}^n [ δ_i log f(Y_i;θ) + (1−δ_i) log F(Y_i;θ) ] = Σ_{i=1}^n δ_i log h(Y_i;θ) − Σ_{i=1}^n H(Y_i;θ),

where f(t;θ), F(t;θ) = P(T_i > t), h(t;θ) = f(t;θ)/F(t;θ) and H(t;θ) = ∫_0^t h(y;θ)dy = −log F(t;θ) denote the density, survival function, hazard function and cumulative hazard function respectively.

We now consider the case that the survival times {T_i} are not identically distributed but are determined by some regressors {x_i}. Furthermore, the survival times could be censored; hence we observe {(Y_i, δ_i, x_i)}, where Y_i = min(T_i, c_i). Let us suppose that T_i has the distribution specified by the hazard function h(t; x_i, β) (hence the hazard depends on both parameters and
explanatory variables x_i), and we want to analyse the dependence on x_i. It is straightforward to see that the log-likelihood of {(Y_i, δ_i, x_i)} is

L_n(β) = Σ_{i=1}^n δ_i log h(Y_i; x_i, β) − Σ_{i=1}^n H(Y_i; x_i, β).

There are two main approaches for modelling the hazard function h:

Proportional hazards (PH): the effect of x_i is to scale the hazard function up or down.
Accelerated life (AL): the effect of x_i is to speed up or slow down time.

We recall that from the hazard function we can obtain the density of T_i, though for survival data the hazard function is usually more descriptive. In the sections below we define the proportional hazards and accelerated life hazard functions and consider methods for estimating β.

The proportional hazards model

Proportional hazard functions are used widely in medical applications. Suppose the effect of x is summarised by a one-dimensional non-negative hazard ratio function ψ(x;β), sometimes called the risk score. That is,

h(t;x,β) = ψ(x;β)h_0(t),

where h_0(t) is a fully-specified baseline hazard. We choose the scale of measurement for x so that ψ(0;β) = 1, i.e. h_0(t) = h(t;0,β). It follows that

H(t;x,β) = ψ(x;β)H_0(t), F(t;x,β) = F_0(t)^{ψ(x;β)}, f(t;x,β) = ψ(x;β)F_0(t)^{ψ(x;β)−1}f_0(t).

Recall that in question HW5 we showed that if F(x) is a survival function, then F(x)^γ also defines a survival function, hence it corresponds to a well-defined density. The same is true of the proportional hazards function: by defining h(t;x,β) = ψ(x;β)h_0(t), where h_0 is a hazard function, we have that h(t;x,β) is also a viable hazard function. A common choice is ψ(x;β) = exp(β'x), with β to be estimated. This is called the exponential hazard ratio.
MLE for the PH model with exponential hazard ratio

The likelihood corresponding to h(t;x,β) and H(t;x,β) is

L_n(β) = Σ_{i=1}^n δ_i log( exp(β'x_i)h_0(Y_i) ) − Σ_{i=1}^n exp(β'x_i)H_0(Y_i)
= Σ_{i=1}^n δ_i [ β'x_i + log h_0(Y_i) ] − Σ_{i=1}^n exp(β'x_i)H_0(Y_i),

where the baseline hazards h_0 and H_0 are assumed known. The derivative of the above likelihood is

∂L_n(β)/∂β_j = Σ_{i=1}^n δ_i x_{ij} − Σ_{i=1}^n x_{ij} e^{β'x_i} H_0(Y_i) = 0, 1 ≤ j ≤ p.

In general there is no explicit solution for β̂, but there is in some special cases. For example, suppose the observations fall into k disjoint groups with x_{ij} = 1 if i is in group j and 0 otherwise. Let m_j be the number of uncensored observations in group j, that is m_j = Σ_i δ_i x_{ij}. Then the likelihood equations become

∂L_n(β)/∂β_j = m_j − Σ_i x_{ij} e^{β_j} H_0(Y_i) = 0,

hence the MLE of β_j is

β̂_j = log[ m_j / Σ_i x_{ij} H_0(Y_i) ].

Another case that can be solved explicitly is where there is a single explanatory variable x that takes only non-negative integer values; then ∂L_n(β)/∂β is just a polynomial in e^β and may be solvable. But in general we need to use numerical methods. The numerical methods can be simplified by rewriting the likelihood as a GLM log-likelihood plus an additional term which plays no role in the estimation. This means we can easily estimate β using existing statistical software. We observe that the log-likelihood can be written as

L_n(β) = Σ_{i=1}^n δ_i [ β'x_i + log h_0(Y_i) ] − Σ_{i=1}^n exp(β'x_i)H_0(Y_i)
= Σ_{i=1}^n δ_i [ β'x_i + log H_0(Y_i) ] − Σ_{i=1}^n exp(β'x_i)H_0(Y_i) + Σ_{i=1}^n δ_i log( h_0(Y_i)/H_0(Y_i) ).

Hence the parameter which maximises L_n(β) also maximises L̃_n(β), where

L̃_n(β) = Σ_{i=1}^n δ_i [ β'x_i + log H_0(Y_i) ] − Σ_{i=1}^n exp(β'x_i)H_0(Y_i).
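The explicit group-wise MLE can be sketched directly (the data below are hypothetical; note that the denominator sums H_0(Y_i) over all members of group j, censored or not, while the numerator counts only the uncensored members):

```python
import math

def group_beta_hat(delta, H0Y, groups, j):
    """MLE of beta_j when x_ij indicates membership of group j:
    beta_j = log( m_j / sum_{i in group j} H0(Y_i) ),
    where m_j is the number of uncensored observations in group j."""
    m_j = sum(d for d, g in zip(delta, groups) if g == j)
    denom = sum(h for h, g in zip(H0Y, groups) if g == j)
    return math.log(m_j / denom)

# hypothetical data: censoring indicators, baseline cumulative hazards, group labels
delta  = [1, 0, 1, 1, 0, 1]
H0Y    = [0.5, 1.2, 0.8, 0.3, 0.9, 1.1]
groups = [0, 0, 0, 1, 1, 1]

b0 = group_beta_hat(delta, H0Y, groups, 0)
b1 = group_beta_hat(delta, H0Y, groups, 1)
```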
In other words, the likelihoods L_n(β) and L̃_n(β) lead to the same estimators. This means that we can use L̃_n(β) as a means of estimating β. The interesting feature of L̃_n(β) is that it is the log-likelihood of a Poisson distribution where δ_i is the response variable (though in our case it only takes the values zero and one) with mean λ_i = exp(β'x_i)H_0(Y_i). Hence we can do the estimation of β within the GLM framework, where we use a Poisson log-likelihood with (δ_i, x_i) as the observations and regressors and model the mean λ_i as exp(β'x_i)H_0(Y_i). It is worth mentioning that the above estimation method is based on the assumption that the baseline hazard h_0 is known. This will not always be the case, and we may want to estimate β without placing any distributional assumptions on h_0. This is possible using a Kaplan–Meier semiparametric-type likelihood; the reader is referred to a textbook on survival analysis for further details.

Accelerated life model

An alternative method for modelling the influence of explanatory variables (regressors) on the response is to use the accelerated life model. An individual with explanatory variables x is assumed to experience time speeded up by a non-negative factor ξ(x), where we suppose ξ(0) = 1, i.e. x = 0 represents the baseline again. Thus:

F(t;x) = F_0(ξ(x)t), f(t;x) = ξ(x)f_0(ξ(x)t), h(t;x) = ξ(x)h_0(ξ(x)t).

If there were only a small number of possible values for ξ(x), either through x being very discrete (ordinal) or because of the assumed form for ξ, we could just take the unique values of ξ(x) as parameters and estimate these (the same can be done in the PH case). Except in the case mentioned above, we usually assume a parametric form for ξ and estimate the parameters. As with the PH model, a natural choice is ξ(x) = e^{β'x}. Popular choices for the baseline F_0 are the exponential, gamma, Weibull, log-normal and log-logistic.

MLE for the AL model with exponential speed-up

In this section we will assume that ξ(x) = exp(β'x). Hence F(t;x) = F_0(exp(β'x)t). There are various methods we can use to estimate β.
One possibility is to go the likelihood route,

L_n(β) = Σ_{i=1}^n δ_i [ β'x_i + log h_0( exp(β'x_i)Y_i ) ] − Σ_{i=1}^n H_0( exp(β'x_i)Y_i ),
where the baseline hazard function h_0 is known. But this would mean numerically maximising the likelihood through brute force, and to use such a method we would require a good initial value for β. To obtain a good initial value, we now consider an alternative method for estimating β. Let us define the transformed random variable W = log T + β'x. The distribution function of W is

P(W ≤ w) = P(log T ≤ w − β'x) = 1 − P(T > exp(w − β'x)) = 1 − F(e^{w−β'x}; x) = 1 − F_0(e^w).

Thus W has a distribution that is independent of x, and indeed is completely known if we assume the baseline is fully specified. Hence log T satisfies the linear model

log T_i = μ_0 − β'x_i + ε_i,

where E(W) = μ_0 and the ε_i are iid random variables with mean zero. Hence, if the observations have not been censored, we can estimate β by maximising the log-likelihood

Σ_{i=1}^n [ β'x_i + log f_0( exp(log T_i + β'x_i) ) ].

However, an even simpler method is to use classical least squares to estimate β. In other words, use the μ̂ and β̂ which minimise

Σ_{i=1}^n ( log T_i − μ + β'x_i )²

as estimators of μ_0 and β respectively. This gives us the best minimum variance linear unbiased estimator (MVLUE) of β, but it is worth mentioning that a likelihood-based estimator gives a smaller asymptotic variance. If there is censoring, there are more complicated algorithms for censored linear models, or we can use Newton–Raphson to solve the likelihood equations. Unlike the proportional hazards model, there is no connection between parameter estimation in accelerated life models and GLM.

The relationship between the PH and AL models

The survivor functions under the two models are

PH with hazard ratio function ψ: F(t;x) = F_0(t)^{ψ(x)}, and AL with speed-up function ξ: F(t;x) = F_0(ξ(x)t).

Let us suppose the baseline survival distribution in both cases is the Weibull with

F_0(t) = exp{ −(t/θ)^α }.
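The least-squares route can be sketched as follows; with noiseless (hypothetical) data generated from log T_i = μ_0 − β x_i, ordinary least squares recovers μ_0 and β exactly:

```python
def ols_line(x, z):
    """Least squares fit of z = a + b*x; returns (a_hat, b_hat)."""
    n = len(x)
    xbar = sum(x) / n
    zbar = sum(z) / n
    b = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z)) / \
        sum((xi - xbar) ** 2 for xi in x)
    return zbar - b * xbar, b

# hypothetical noiseless accelerated-life data: log T_i = mu0 - beta * x_i
mu0, beta = 1.0, 0.4
x = [0.0, 1.0, 2.0, 3.0]
logT = [mu0 - beta * xi for xi in x]

a_hat, b_hat = ols_line(x, logT)
beta_hat = -b_hat   # the slope estimates -beta
```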
Hence, using this distribution, the proportional hazards and accelerated life survival functions are

F_PH(t;x) = exp{ −(t/θ)^α ψ(x) } and F_AL(t;x) = exp{ −(tξ(x)/θ)^α }.

Comparing the above survival functions, we see that if ξ(x) = ψ(x)^{1/α}, then we have F_PH(t;x) = F_AL(t;x). In fact, it is quite easy to show that this is the only case where the two models coincide.

Goodness of fit

As in most cases of statistical modelling, we want to verify whether a model is appropriate for a certain data set. In the case of linear models we do this by considering the residual sum of squares, and for GLM we consider the deviance (see Section ). The notion of residual can be extended to survival data. We recall that residuals in general should be pivotal, or asymptotically pivotal, in the sense that their distribution should not depend on the unknown parameters. We now make a transformation of the survival data which is close to pivotal if the survival distribution and model are correctly specified.

Let us first consider the case that the data is not censored. Let T_i denote the survival time, with survival function F_i and cumulative hazard function H_i(t) = −log F_i(t) (later we will introduce its dependence on the explanatory variable x_i). Let us consider the distribution of H_i(T_i):

P(H_i(T_i) ≤ y) = P(−log F_i(T_i) ≤ y) = P(F_i(T_i) ≥ exp(−y)) = P(T_i ≤ F_i^{−1}(exp(−y))) = 1 − F_i(F_i^{−1}(exp(−y))) = 1 − exp(−y).

Hence the distribution of H_i(T_i) is exponential with mean one; in other words, it does not depend on any unknown parameters. Therefore, in the case of uncensored data, to check the adequacy of the model we can fit the survival models {F(t, x_i; β)} to the observations {T_i} and check whether the transformed data {H(T_i, x_i; β̂)} are close to iid exponentials. These are called the Cox–Snell residuals; they can be modified in the case of censoring.
Chapter 20

Nonparametric regression

20.1 A quick glimpse at Nonparametric Regression

Let us return to the linear models which motivated the GLM framework. We recall that in a linear model Y_t = β'x_t + ε_t. We can generalise the linear mean by assuming that Y_t = µ(β'x_t) + ε_t, where the function µ is known and our objective is to estimate β. The above models are known as parametric models. However, sometimes parametric models are not flexible enough; for example, often we do not know the function µ. An alternative, more flexible method is to work within a nonparametric framework.

In this section we will briefly consider methods for estimating the regression function g : [0,1] → R in the model

Y_i = g(x_i) + ε_i,

where {ε_i} are iid random variables with E(ε_i) = 0 and var(ε_i) = σ². One interpretation of the above model is that we observe a corrupted signal Y_i and we want to estimate the underlying signal. We do not assume that g has a parametric form (i.e. linear etc.), but we do assume that g is smooth, in the sense that the first r derivatives exist and are bounded (typically it is assumed that r = 2).

There are many different methods for estimating g. The most popular tend to be:

- Kernel-based methods.
- Local polynomial methods (the kernel method is a special case).
- Splines.
- Series expansions, such as Fourier series, wavelets etc.
- Penalised least squares (often called the roughness penalty approach).
- Wavelet thresholding methods (often called nonlinear estimation).

We now give a very brief tour of some of the methods described above. For a full treatment you should attend a course on nonparametric estimation. We briefly mention that lower bounds for the mean squared error of estimators of g can be obtained using techniques similar to those used in earlier sections.

Local linear regression

Let us suppose that g(x) does not have a known parametric form, but it is smooth, in the sense that its first two derivatives exist. Then by using a Taylor expansion we have

g(x) = g(x_0) + (x − x_0)g'(x_0) + O((x − x_0)²) ≈ g(x_0) + (x − x_0)g'(x_0),

if x and x_0 are close. In other words, g(x) may not have a parametric form, but for x in a neighbourhood of x_0 we can approximate g by a line. That is, g(x) ≈ a_0 + a_1(x − x_0), for some a_0 and a_1, in a neighbourhood of x_0 (i.e. [x_0 − h, x_0 + h]). We observe that the approximating coefficients (a_0, a_1) are only valid in a neighbourhood of x_0; therefore to estimate (a_0, a_1) we only use those regressors in a neighbourhood of x_0. We do this by using local least squares, rather than the global least squares which we would use for a regular linear model. Hence we define the local linear least squares criterion

L_T(x_0; a_0, a_1) = (1/(hT)) Σ_i K((x_i − x_0)/h) ( Y_i − a_0 − a_1(x_i − x_0) )²,

where K is a kernel, similar to the kernel used in nonparametric density estimation. We use

(â_0(x_0), â_1(x_0)) = arg min_{a_0, a_1} L_T(x_0; a_0, a_1)

as estimators of (a_0, a_1). We mention that explicit expressions for (â_0(x_0), â_1(x_0)) can easily be obtained. Moreover, for every x_0 we need to re-evaluate (â_0(x_0), â_1(x_0)), and the estimator of g(x_0) is ĝ(x_0) = â_0(x_0).
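The local linear estimator above can be sketched in a few lines. This is my own minimal implementation (a Gaussian kernel and simulated data, both my choices), not code from the notes:

```python
import numpy as np

# Minimal sketch of the local linear estimator: g_hat(x0) = a0_hat(x0)
# from a kernel-weighted least squares fit in a neighbourhood of x0.
def local_linear(x, y, x0, h):
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)          # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])  # design for a0 + a1*(x - x0)
    A = X.T @ (w[:, None] * X)                      # weighted normal equations
    b = X.T @ (w * y)
    a0, a1 = np.linalg.solve(A, b)
    return a0                                       # local intercept estimates g(x0)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 500)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=500)
print(local_linear(x, y, 0.25, 0.05))   # close to sin(pi/2) = 1
```

The whole curve is recovered by repeating the fit over a grid of x_0 values.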
Remark (The Nadaraya-Watson estimator) The local constant least squares estimator is a simplified version of the above estimator. In this case we locally approximate g by a constant, g(x) ≈ c_0. The local constant least squares criterion is

L_T(x_0; c) = (1/(hT)) Σ_i K((x_i − x_0)/h) (Y_i − c)².

We use as an estimator of g(x_0) the minimiser ĉ(x_0), where

ĉ(x_0) = arg min_c L_T(x_0; c) = Σ_i K((x_i − x_0)/h) Y_i / Σ_i K((x_i − x_0)/h).

The above estimator is usually called the Nadaraya-Watson estimator.

We mentioned above that the Nadaraya-Watson estimator is a simplified version of the local linear regression estimator. The reason that the local linear regression estimator is often preferred to the Nadaraya-Watson estimator is that if the second derivative of g exists, it can be shown that the local linear estimator â_0(x_0) has a smaller bias than ĉ(x_0). More precisely, straightforward calculations show that

E( â_0(x_0) − g(x_0) )² = O( h⁴ + 1/(hT) ),  whereas  E( ĉ(x_0) − g(x_0) )² = O( h² + 1/(hT) ).

We mention that as T grows we observe the corrupted function g more often in the neighbourhood of x_0, hence we are getting more information on the behaviour of the function at x_0. But to reduce the bias we must let the neighbourhood of localisation shrink: h → 0 as T → ∞.

Of course, we can generalise the local linear estimator to a local quadratic estimator. In this case, if the third derivative of g exists, then the bias of the local quadratic estimator will be smaller than the bias of the local linear estimator. But this gain only arises if the third derivative of g is bounded, which can be quite a strong assumption. Moreover, the more parameters we estimate (in local linear regression we estimate two, in local quadratic regression we estimate three), the larger the variance of the estimator. Therefore it is not clear whether we really gain by using higher order polynomials in estimation.

There are various methods for estimating the bandwidth h; these include cross-validation etc.
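The Nadaraya-Watson estimator in the remark above is just a kernel-weighted average, which makes the sketch even shorter; again the Gaussian kernel and the simulated data are my own choices:

```python
import numpy as np

# Minimal sketch of the Nadaraya-Watson (local constant) estimator:
# a kernel-weighted average of the responses near x0.
def nadaraya_watson(x, y, x0, h):
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 500)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=500)
print(nadaraya_watson(x, y, 0.25, 0.05))   # close to sin(pi/2) = 1
```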
See also Section 10.7 in Davison (2002).

Series expansion methods

An alternative method for estimating g is to write it as a series expansion and estimate the coefficients in the expansion. Define L²[0,1] as the set of all functions f : [0,1] → R such that ∫₀¹ f(x)² dx < ∞. Let {φ_k}_k be an orthonormal basis of L²[0,1], meaning that ∫₀¹ φ_{k₁}(x) φ_{k₂}(x) dx = 1 if k₁ = k₂ and zero otherwise. Suppose that g ∈ L²[0,1]; then we can represent g as

g(x) = Σ_{k=−∞}^{∞} a_k φ_k(x),  where  a_k = ∫₀¹ g(x) φ_k(x) dx.

An example of an orthonormal basis is the Fourier basis {exp(2πikx)}_k. Define the truncated function

g_r(x) = Σ_{k=−r}^{r} a_k φ_k(x).

It is clear that as r → ∞ the truncated function g_r(x) gets closer to the true function g(x). The difference g_r(x) − g(x) depends on two main factors: (i) the smoothness of the function g and (ii) the properties of the basis. For example, bounds for the difference between the truncated Fourier series expansion g_r(x) and g(x) are well known.

Recall that we observe Y_i = g(x_i) + ε_i. Under certain conditions on the design {x_i}, we can estimate the coefficient a_k with

â_{T,k} = (1/T) Σ_{i=1}^{T} Y_i φ_k(x_i)

and use

ĝ_{T,r}(x) = Σ_{k=−r}^{r} â_{T,k} φ_k(x)

as an estimator of g_r(x). Since g_r(x) is an approximation of g(x), ĝ_{T,r}(x) is an estimator of g(x). The quality (mean squared error) of this approximation depends on various factors:

- The larger r, the smaller the bias in the estimator, i.e. E(ĝ_{T,r}(x)) − g(x) → 0 as r → ∞.
- However, a large r leads to a large variance var(ĝ_{T,r}(x)), since we have to estimate a large number of coefficients.

From the above we see that, as with kernel estimators, there is a trade-off between the bias and the variance.

Penalised least squares estimators

An alternative method of estimation is penalised least squares. This is a standard least squares estimator of the function g, but to prevent overfitting the least squares criterion is penalised with a penalty that penalises rough functions, which tend to overfit the data. We use the criterion

Σ_i ( Y_i − h(x_i) )² + λ ∫ h''(x)² dx,
where λ is the penalty, which is chosen a priori, as the basis for the estimation. An estimator of g is the function ĝ_T(x) which minimises the above criterion. It is clear from the construction of ĝ_T that the second derivative of ĝ_T(x) must exist. Moreover, Green and Silverman (1994) show that the minimising function ĝ_T(x) can easily be obtained, and is a natural cubic spline (which can be represented in a B-spline basis).
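Returning to the series expansion method above, a minimal sketch of the estimator ĝ_{T,r}; the real cosine basis and all variable names are my own choices (the notes use the complex Fourier basis):

```python
import numpy as np

# Series estimator sketch: a_hat_k = (1/T) sum_i Y_i phi_k(x_i), then
# g_hat(x) = sum_k a_hat_k phi_k(x) with a truncated cosine basis on [0,1].
rng = np.random.default_rng(4)
T = 2000
x = (np.arange(T) + 0.5) / T                 # regular design on [0,1]
g = lambda u: np.cos(2 * np.pi * u) + 0.5
y = g(x) + 0.3 * rng.normal(size=T)

r = 5                                        # truncation level
phi = [np.ones(T)] + [np.sqrt(2) * np.cos(2 * np.pi * k * x) for k in range(1, r + 1)]
a_hat = np.array([np.mean(y * p) for p in phi])

def g_hat(u):
    basis = [1.0] + [np.sqrt(2) * np.cos(2 * np.pi * k * u) for k in range(1, r + 1)]
    return float(np.dot(a_hat, basis))

print(g_hat(0.0))    # close to g(0) = 1.5
```

Increasing r reduces the truncation bias but inflates the variance, which is exactly the trade-off described in the text.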
Chapter 21

A short review of Time Series

21.1 A very short blast of Time Series

Up until now, we have generally assumed that the observations {Y_t} are independent. In situations where {Y_t} is observed over time, the assumption of independence may be unrealistic, especially for observations which are close in observation time. If ignored, dependency in the data can cause many problems. The least bad scenario is that the parameter estimators will be consistent, but the standard errors may be wrong, leading to misleading CIs etc. A worse situation is that the model will be completely misspecified, leading to wrong conclusions. A good introduction to time series is Shumway and Stoffer (2006); more advanced texts are Priestley (1988), Brockwell and Davis (1991) and Pourahmadi.

For this reason, it is important to check for dependency and, if there is evidence of dependence, to model the dependence structure. Often (but not always) the dependency will manifest as correlation between observations. The covariance can be estimated using the empirical covariance at lag r, which is

ĉ_T(r) = (1/T) Σ_{t=1}^{T−|r|} Y_t Y_{t+|r|}.

Now if {Y_t} are independent, then ĉ_T(r) should be close to zero. Indeed, if {Y_t} are iid random variables, then

√T ρ̂_T(r) →_D N(0, 1),

where ρ̂_T(r) = ĉ_T(r)/ĉ_T(0) is the empirical correlation. On the other hand, if {Y_t} is a stationary time series, then ĉ_T(r) is an estimator of the covariance at lag r, which is c(r) = cov(Y_t, Y_{t+r}). Stationarity is an important (sometimes restrictive) assumption that is often used in time series; it basically means that the structure of the observations {Y_t} does not change over time. More precisely, the joint distributions of (Y_{t₁}, ..., Y_{t_k}) and (Y_{τ+t₁}, ..., Y_{τ+t_k}) are the same for all τ and sequences t₁, ..., t_k.
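The empirical covariance and correlation above can be sketched directly (a minimal illustration on iid data, where the √T-scaled correlation should look like a standard normal draw; names are mine):

```python
import numpy as np

# Empirical covariance at lag r (mean assumed zero, as in the display above).
def c_hat(y, r):
    T = len(y)
    r = abs(r)
    return np.sum(y[:T - r] * y[r:]) / T

rng = np.random.default_rng(6)
y = rng.normal(size=5000)                 # iid data, so correlations ~ 0
rho1 = c_hat(y, 1) / c_hat(y, 0)
print(rho1, np.sqrt(len(y)) * rho1)       # sqrt(T) * rho_hat is roughly N(0,1)
```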
There are several tests for non-correlation. Under the assumption of independence, ĉ_T(r) is an estimator of zero, hence many tests for non-correlation are based on ĉ_T(r). The most popular is the Ljung-Box test, which is an adaption of the Box-Pierce test, where we define the test statistic

S_T = T(T + 2) Σ_{r=1}^{h} ρ̂_T(r)² / (T − r).

Under the null of independence, using that √T ρ̂_T(r) →_D N(0, 1), we have that S_T →_D χ²_h. Hence the Ljung-Box test can be used to test for correlation in the observations. If there are correlations, then S_T will be approximately a non-central chi-squared, where the non-centrality parameter is a function of {√T ρ(r)}_{r=1}^{h}. This means we will tend to reject the null if the alternative is true. Adaptions of this test are used in time series to check for goodness of fit.

Remark It is important to mention that there exist dependent time series which are uncorrelated. One famous example is the GARCH model, commonly used to model financial time series: the time series is uncorrelated but dependent. This is a nonlinear time series and does not fit into the framework of linear time series models which is described in this section.

Let us suppose that we reject the null, so there is evidence to suggest that there is linear dependence. If we can assume that the time series is stationary, a commonly used time series model is the autoregressive (AR) process. The observations {Y_t} are said to come from an AR(p) model if they satisfy

Y_t = Σ_{j=1}^{p} a_j Y_{t−j} + ε_t,

where {ε_t} are iid random variables with E(ε_t) = 0 and var(ε_t) = σ². To ensure stationarity, some assumptions are required on the coefficients {a_j}.

Remark (Properties of the autoregressive process) The autoregressive process is said to be p-Markovian, in the sense that the distribution of Y_t given Y_{t−1}, ..., Y_{t−m}, where m > p, is the same as the distribution of Y_t given Y_{t−1}, ..., Y_{t−p}.
In other words, the best predictor of Y_t given the past only requires the observations {Y_{t−j}}_{j=1}^{p}; we cannot improve the prediction by using observations further in the past.

It is worth mentioning that the inverse of the variance/covariance matrix Σ = var(Y_1, ..., Y_T) has a very interesting structure. The elements of the inverse of any variance/covariance matrix give information about the conditional linear dependence between two random variables. Let ρ_{i,j} = (Σ^{−1})_{i,j}. Then it can be shown that, if ρ_{i,j} = 0, then the conditional covariance between Y_i
and Y_j given all the other random variables is zero, i.e. cov(Y_i, Y_j | Y_k; k ≠ i, j) = 0 (this is called the partial covariance). This is quite a useful result and is used widely. It also means that in the case of AR(p) processes the inverse matrix Σ^{−1} will be a band matrix (non-zero only along the diagonal and a few rows off the diagonal).

Recall that we encountered the AR(1) model earlier. To refresh our memory we will return to this example. Let us suppose p = 1 and {Y_t} satisfies the AR(1) model Y_t = aY_{t−1} + ε_t, where |a| < 1. Now, by iterating backwards,

Y_t = a(aY_{t−2} + ε_{t−1}) + ε_t = ... = Σ_{j=0}^{t−1} a^j ε_{t−j} + a^t Y_0,

so Y_t almost surely has the solution Y_t = Σ_{j=0}^{∞} a^j ε_{t−j}. Since the ε_t are iid random variables, we have

var(Y_t) = Σ_{j=0}^{∞} a^{2j} σ² = σ²/(1 − a²),  cov(Y_t, Y_{t+r}) = Σ_{j=0}^{∞} a^{2j+|r|} σ² = σ² a^{|r|}/(1 − a²).   (21.1)

Using conditional arguments, the log-likelihood of {Y_t}_{t=1}^{T} is

L_T(Y; θ) = log f(Y_1; θ) + Σ_{t=2}^{T} log f(Y_t | Y_{t−1}, ..., Y_1; θ)
          = log f_X(Y_1; θ) + Σ_{t=2}^{T} log f(Y_t | Y_{t−1}; θ)
          = log f_X(Y_1; θ) + Σ_{t=2}^{T} log f_ε(Y_t − aY_{t−1}),

where f_ε is the density of {ε_t} and f_X is the marginal density of Y_t. If the innovations ε_t are Gaussian then, since Y_t = Σ_{j=0}^{∞} a^j ε_{t−j} (i.e. Y_t is a sum of Gaussian random variables), Y_t is also Gaussian. In this case the log-likelihood is proportional to

L_T(a, σ) = −(1/2) log( σ²/(1 − a²) ) − Y_1² / ( 2σ²/(1 − a²) ) − (T − 1) log σ − (1/(2σ²)) Σ_{t=2}^{T} (Y_t − aY_{t−1})².

We use the parameters which maximise the above as estimators of σ and a. We note that if we remove the terms coming from the marginal density log f_X(Y_1; θ), we have the conditional log-likelihood

−(T − 1) log σ − (1/(2σ²)) Σ_{t=2}^{T} (Y_t − aY_{t−1})²,
which is often used to estimate the AR(1) parameter a, even if Y_t is non-Gaussian. The maximum of the above is the least squares estimator of a,

â_{LS,T} = Σ_{t=2}^{T} Y_t Y_{t−1} / Σ_{t=2}^{T} Y_{t−1}².

The same method can be used to construct estimators of AR(p) parameters.

It is worth remarking that if Y_t has a non-zero mean and satisfies Y_t = µ + aY_{t−1} + ε_t, then {Y_t} has the solution

Y_t = Σ_{j=0}^{∞} a^j (µ + ε_{t−j}) = µ/(1 − a) + Σ_{j=0}^{∞} a^j ε_{t−j}.

In the case that {ε_t} is Gaussian, the above model has log-likelihood proportional to

L_T(a, σ, µ) = −(1/2) log( σ²/(1 − a²) ) − ( Y_1 − µ/(1 − a) )² / ( 2σ²/(1 − a²) ) − (T − 1) log σ − (1/(2σ²)) Σ_{t=2}^{T} ( Y_t − µ/(1 − a) − a( Y_{t−1} − µ/(1 − a) ) )².

Usually we reparametrise the above set-up and define µ̃ = µ/(1 − a), so that the log-likelihood becomes

L_T(a, σ, µ̃) = −(1/2) log( σ²/(1 − a²) ) − ( Y_1 − µ̃ )² / ( 2σ²/(1 − a²) ) − (T − 1) log σ − (1/(2σ²)) Σ_{t=2}^{T} ( Y_t − µ̃ − a(Y_{t−1} − µ̃) )².

We then use the µ̃, σ² and a which maximise the above as the parameter estimators.

We observe that the variance of the sample mean Ȳ = (1/T) Σ_{t=1}^{T} Y_t is different from that of iid random variables. Using (21.1) we have

var(Ȳ) = (1/T²) Σ_{t,τ=1}^{T} cov(Y_t, Y_τ) = (1/T²) Σ_{t=1}^{T} var(Y_t) + (2/T²) Σ_{r=1}^{T−1} (T − r) cov(Y_t, Y_{t+r}) = σ²/( T(1 − a²) ) + (2/T) Σ_{r=1}^{T−1} (1 − r/T) σ² a^r/(1 − a²).

Under the assumption that Ȳ is asymptotically normal (this is true, but it is not trivial to show), we can then construct CIs for the mean based on the var(Ȳ) derived above. We observe that, due to the dependence, the CI will be different from the CI constructed under the assumption of independence. It is worth noting that the limiting variance is

T var(Ȳ) → σ²/(1 − a²) + 2 Σ_{r=1}^{∞} σ² a^r/(1 − a²) = σ²/(1 − a)²,  as T → ∞.
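The least squares estimator and the sample-mean variance formula above can both be sketched numerically (a minimal illustration on simulated data; all names are mine):

```python
import numpy as np

# Estimate the AR(1) parameter by least squares, and check the limiting
# variance formula T * var(Ybar) -> sigma^2 / (1 - a)^2 numerically.
rng = np.random.default_rng(8)
T, a_true, sigma = 5000, 0.5, 1.0

y = np.zeros(T)
for t in range(1, T):
    y[t] = a_true * y[t - 1] + sigma * rng.normal()

a_ls = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)   # least squares estimator
print(a_ls)                                           # close to 0.5

# Finite-T variance of the sample mean from the display above, vs its limit.
r = np.arange(1, T)
var_ybar = sigma**2 / (T * (1 - a_true**2)) \
    + (2 / T) * np.sum((1 - r / T) * sigma**2 * a_true**r / (1 - a_true**2))
print(T * var_ybar, sigma**2 / (1 - a_true) ** 2)     # both close to 4.0
```

With a = 0.5 the limiting variance is four times the iid value σ², so a CI for the mean that ignores the dependence would be far too narrow.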
Remark (The Yule-Walker estimator) The above gives the least squares estimator of the AR(1) parameter. There is an alternative method of estimation, called the Yule-Walker estimator. Asymptotically both methods are the same, but the Yule-Walker estimator can have finite sample properties which are often more desirable than those of the least squares estimator. We now describe the Yule-Walker estimator. The AR(1) model is Y_t = aY_{t−1} + ε_t. Multiplying both sides of this equation by Y_{t−1} and taking expectations (assuming this is allowed) leads to

E(Y_t Y_{t−1}) = a E(Y_{t−1}²) + E(ε_t Y_{t−1}) = a E(Y_{t−1}²),

since E(ε_t Y_{t−1}) = 0. Hence E(Y_t Y_{t−1}) − a E(Y_{t−1}²) = 0. By estimating E(Y_t Y_{t−1}) and E(Y_{t−1}²) with

ĉ_T(1) = (1/T) Σ_{t=1}^{T−1} Y_t Y_{t+1}  and  ĉ_T(0) = (1/T) Σ_{t=1}^{T} Y_t²,

we have the almost unbiased estimating equation G_T(a) = ĉ_T(1) − a ĉ_T(0) = 0. The Yule-Walker estimator is the solution of this equation,

â_{YW,T} = ĉ_T(1) / ĉ_T(0) = Σ_{t=1}^{T−1} Y_t Y_{t+1} / Σ_{t=1}^{T} Y_t².

An advantage of the Yule-Walker estimator over the least squares estimator is that |â_{YW,T}| < 1, hence it always leads to a viable parameter estimate (recall that for {Y_t} to be a stationary time series we require that |a| < 1). On the other hand, it is not guaranteed that in finite samples |â_{LS,T}| < 1; however, asymptotically â_{LS,T} →_P a, where |a| < 1.

A lot of time series analysis is based on discrete-time series, for example the autoregressive model described above. It is worth concluding this section by discussing, briefly, continuous-time series. Continuous-time models such as the Black-Scholes equation are often used to model financial data (so-called high frequency data). Another application where continuous-time series arise is in the nonparametric, functional data literature. Let us suppose that {X(t); t ∈ [a, b]} is a continuous time process observed on the interval [a, b].
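A minimal sketch comparing the two estimators from the remark above on the same simulated AR(1) series (the simulation set-up is my own):

```python
import numpy as np

# Least squares vs Yule-Walker estimators of the AR(1) parameter.
rng = np.random.default_rng(9)
T, a_true = 5000, 0.8

y = np.zeros(T)
for t in range(1, T):
    y[t] = a_true * y[t - 1] + rng.normal()

a_ls = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)   # least squares
a_yw = np.sum(y[:-1] * y[1:]) / np.sum(y ** 2)        # c_hat(1) / c_hat(0)
print(a_ls, a_yw)    # both close to 0.8
```

The two estimates differ only in the denominator, yet the Yule-Walker denominator includes every Y_t², which is what forces |â_YW| < 1 in every finite sample.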
Let cu, v = covxu, Xv, the covariance can be estimated if we observe several realisations of {Xu}, for example, in longitudual data, where we observe the iid {X i u} and estimate cu,v with ĉ T u,v = 1 T T X iux i v we have assumed EXu = 0. We are not assuming that 221
{X(u)} is stationary. As in principal component analysis (PCA), it is well known that X(u) and c(u, v) admit the representations

c(u, v) = Σ_{k=1}^{∞} λ_k e_k(u) e_k(v),  X(u) = Σ_{k=1}^{∞} ξ_k e_k(u),   (21.2)

where {e_k(u)} are orthogonal eigenfunctions, the λ_k are non-negative eigenvalues with λ_k → 0 as k → ∞, and {ξ_k} are uncorrelated random variables, where ξ_k = ∫_a^b X(u) e_k(u) du. Equation (21.2) can be used as a means of modelling c(u, v) by imposing an orthogonal basis. Moreover, often the eigenfunctions {e_k(u)} can be used as a way of describing X(u) in terms of its main features. See Ramsay and Silverman (2002) and the related literature.
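A minimal numerical sketch of (21.2) on a grid (the simulated eigenfunctions, eigenvalues and all names are my own): estimate c(u, v) from iid realisations and recover the leading eigenvalues by an eigendecomposition of the discretised covariance.

```python
import numpy as np

# Discretised functional PCA sketch: estimate c(u,v) from n iid curves
# observed on an m-point grid, then eigendecompose.
rng = np.random.default_rng(10)
m, n = 50, 400
u = np.linspace(0, 1, m)

# Simulate X_i(u) = xi_1 e_1(u) + xi_2 e_2(u) with known eigenfunctions
# e_1, e_2 and eigenvalues 4 and 1 (var(xi_k) = lambda_k).
e1 = np.sqrt(2) * np.sin(np.pi * u)
e2 = np.sqrt(2) * np.sin(2 * np.pi * u)
xi = rng.normal(size=(n, 2)) * np.sqrt([4.0, 1.0])
X = xi @ np.vstack([e1, e2])              # n x m data matrix, mean zero

c_hat = X.T @ X / n                       # c_hat(u,v) = (1/n) sum X_i(u) X_i(v)
evals, evecs = np.linalg.eigh(c_hat)
lam = evals[::-1][:2] / m                 # rescale by the grid size 1/m
print(lam)                                # roughly [4, 1]
```

The columns of evecs corresponding to the largest eigenvalues approximate the eigenfunctions e_k(u) up to sign and the grid normalisation.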
E3: PROBABILITY AND STATISTICS lecture notes
E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................
LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: [email protected] Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
SF2940: Probability theory Lecture 8: Multivariate Normal Distribution
SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2015 Timo Koski Matematisk statistik 24.09.2015 1 / 1 Learning outcomes Random vectors, mean vector, covariance matrix,
The Heat Equation. Lectures INF2320 p. 1/88
The Heat Equation Lectures INF232 p. 1/88 Lectures INF232 p. 2/88 The Heat Equation We study the heat equation: u t = u xx for x (,1), t >, (1) u(,t) = u(1,t) = for t >, (2) u(x,) = f(x) for x (,1), (3)
Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)
Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
An extension of the factoring likelihood approach for non-monotone missing data
An extension of the factoring likelihood approach for non-monotone missing data Jae Kwang Kim Dong Wan Shin January 14, 2010 ABSTRACT We address the problem of parameter estimation in multivariate distributions
Notes from Week 1: Algorithms for sequential prediction
CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking
CITY UNIVERSITY LONDON. BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION
No: CITY UNIVERSITY LONDON BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION ENGINEERING MATHEMATICS 2 (resit) EX2005 Date: August
Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University
Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision
APPLIED MATHEMATICS ADVANCED LEVEL
APPLIED MATHEMATICS ADVANCED LEVEL INTRODUCTION This syllabus serves to examine candidates knowledge and skills in introductory mathematical and statistical methods, and their applications. For applications
The equivalence of logistic regression and maximum entropy models
The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/
9.2 Summation Notation
9. Summation Notation 66 9. Summation Notation In the previous section, we introduced sequences and now we shall present notation and theorems concerning the sum of terms of a sequence. We begin with a
Introduction to Detection Theory
Introduction to Detection Theory Reading: Ch. 3 in Kay-II. Notes by Prof. Don Johnson on detection theory, see http://www.ece.rice.edu/~dhj/courses/elec531/notes5.pdf. Ch. 10 in Wasserman. EE 527, Detection
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
4: SINGLE-PERIOD MARKET MODELS
4: SINGLE-PERIOD MARKET MODELS Ben Goldys and Marek Rutkowski School of Mathematics and Statistics University of Sydney Semester 2, 2015 B. Goldys and M. Rutkowski (USydney) Slides 4: Single-Period Market
Simple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
MAS2317/3317. Introduction to Bayesian Statistics. More revision material
MAS2317/3317 Introduction to Bayesian Statistics More revision material Dr. Lee Fawcett, 2014 2015 1 Section A style questions 1. Describe briefly the frequency, classical and Bayesian interpretations
THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING
THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING 1. Introduction The Black-Scholes theory, which is the main subject of this course and its sequel, is based on the Efficient Market Hypothesis, that arbitrages
t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).
1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction
4.5 Linear Dependence and Linear Independence
4.5 Linear Dependence and Linear Independence 267 32. {v 1, v 2 }, where v 1, v 2 are collinear vectors in R 3. 33. Prove that if S and S are subsets of a vector space V such that S is a subset of S, then
The Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
Quotient Rings and Field Extensions
Chapter 5 Quotient Rings and Field Extensions In this chapter we describe a method for producing field extension of a given field. If F is a field, then a field extension is a field K that contains F.
5. Continuous Random Variables
5. Continuous Random Variables Continuous random variables can take any value in an interval. They are used to model physical characteristics such as time, length, position, etc. Examples (i) Let X be
Introduction to Probability
Introduction to Probability EE 179, Lecture 15, Handout #24 Probability theory gives a mathematical characterization for experiments with random outcomes. coin toss life of lightbulb binary data sequence
The sample space for a pair of die rolls is the set. The sample space for a random number between 0 and 1 is the interval [0, 1].
Probability Theory Probability Spaces and Events Consider a random experiment with several possible outcomes. For example, we might roll a pair of dice, flip a coin three times, or choose a random real
1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let
Copyright c 2009 by Karl Sigman 1 Stopping Times 1.1 Stopping Times: Definition Given a stochastic process X = {X n : n 0}, a random time τ is a discrete random variable on the same probability space as
Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes
Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes Yong Bao a, Aman Ullah b, Yun Wang c, and Jun Yu d a Purdue University, IN, USA b University of California, Riverside, CA, USA
Chapter 4, Arithmetic in F [x] Polynomial arithmetic and the division algorithm.
Chapter 4, Arithmetic in F [x] Polynomial arithmetic and the division algorithm. We begin by defining the ring of polynomials with coefficients in a ring R. After some preliminary results, we specialize
Big Data - Lecture 1 Optimization reminders
Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics
1 Teaching notes on GMM 1.
Bent E. Sørensen January 23, 2007 1 Teaching notes on GMM 1. Generalized Method of Moment (GMM) estimation is one of two developments in econometrics in the 80ies that revolutionized empirical work in
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
DRAFT. Algebra 1 EOC Item Specifications
DRAFT Algebra 1 EOC Item Specifications The draft Florida Standards Assessment (FSA) Test Item Specifications (Specifications) are based upon the Florida Standards and the Florida Course Descriptions as
