Advanced statistical inference. Suhasini Subba Rao
Advanced statistical inference
Suhasini Subba Rao
August 1, 2012
Chapter 1

Basic Inference

1.1 A review of results in statistical inference

In this section we review some results that you came across in STAT611 or equivalent. We review the Cramer-Rao bound and some properties of the likelihood. In later sections we will use the likelihood as a means of parameter estimation (i.e. the maximum likelihood estimator, which you will have encountered in previous courses) and heuristically argue why the Fisher information, which gives the Cramer-Rao bound, is extremely important.

1.1.1 The likelihood function

Let $\{X_i\}$ be iid random variables with probability function (or probability density function) $f(x;\theta)$, where $f$ is known but the parameter $\theta$ is unknown. The likelihood function is defined as

$$L(X;\theta) = \prod_{i=1}^{T} f(X_i;\theta) \tag{1.1}$$

and the log-likelihood is

$$\mathcal{L}(X;\theta) = \log L(X;\theta) = \sum_{i=1}^{T}\log f(X_i;\theta). \tag{1.2}$$

Example

(i) Suppose that $\{X_t\}$ are iid normal random variables with mean $\mu$ and variance $\sigma^2$. The log-likelihood is proportional to

$$\mathcal{L}_T(X;\mu,\sigma^2) \propto -\frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(X_t-\mu)^2.$$
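As an illustration of (1.2), the normal log-likelihood can be evaluated numerically. The following sketch (Python, with a small hypothetical data set) checks that, for fixed variance, the log-likelihood as a function of $\mu$ is largest at the sample mean.

```python
import math

def normal_loglik(x, mu, sigma2):
    """Log-likelihood (1.2) of an iid N(mu, sigma2) sample."""
    T = len(x)
    return (-0.5 * T * math.log(2 * math.pi * sigma2)
            - sum((xt - mu) ** 2 for xt in x) / (2 * sigma2))

x = [1.2, 0.7, 1.9, 1.1, 0.4]   # hypothetical data
xbar = sum(x) / len(x)

# For fixed sigma2, the log-likelihood in mu is maximised at the sample mean:
assert normal_loglik(x, xbar, 1.0) > normal_loglik(x, xbar + 0.5, 1.0)
assert normal_loglik(x, xbar, 1.0) > normal_loglik(x, xbar - 0.5, 1.0)
```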
(ii) Suppose that $\{X_t\}$ are iid binomial random variables, $X_t \sim \text{Bin}(n,\pi)$. Then the log-likelihood is

$$\mathcal{L}_T(X;\pi) = \sum_{t=1}^{T}\left[\log\binom{n}{X_t} + X_t\log\frac{\pi}{1-\pi} + n\log(1-\pi)\right].$$

(iii) Suppose that $\{X_t\}$ are independent binomial random variables, $X_t \sim \text{Bin}(n_t,\pi_t)$, where the regressors $z_t$ influence the mean of $X_t$ through $\pi_t = g(\beta' z_t)$. Then the log-likelihood is

$$\mathcal{L}_T(X;\beta) = \sum_{t=1}^{T}\left[\log\binom{n_t}{X_t} + X_t\log\frac{g(\beta' z_t)}{1-g(\beta' z_t)} + n_t\log\bigl(1-g(\beta' z_t)\bigr)\right].$$

(iv) Suppose that $\{Y_t\}$ are independent exponential random variables with density $\theta^{-1}\exp(-y/\theta)$. The log-likelihood is

$$\mathcal{L}_T(Y;\theta) = -T\log\theta - \frac{1}{\theta}\sum_{t=1}^{T}Y_t.$$

(v) A generalisation of the exponential distribution which gives more freedom in the shape of the distribution is the Weibull. Suppose that $\{Y_t\}$ are independent Weibull random variables with density $\frac{\alpha y^{\alpha-1}}{\theta^{\alpha}}\exp\bigl(-(y/\theta)^{\alpha}\bigr)$, where $\theta,\alpha>0$ (in the case $\alpha=1$ we recover the regular exponential) and $y$ is defined over the positive real line. The log-likelihood is

$$\mathcal{L}_T(Y;\alpha,\theta) = \sum_{t=1}^{T}\left[\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right].$$

In the case that $\alpha$ is known but $\theta$ is unknown, the log-likelihood is proportional to

$$\mathcal{L}_T(Y;\theta) \propto \sum_{t=1}^{T}\left[-\alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right].$$

1.1.2 Bounds for the variance of an unbiased estimator

We require the following assumptions, often called the regularity assumptions. We state the assumptions and results for scalar $\theta$, but they can easily be extended to the case that $\theta$ is a vector.

Assumption 1.1.1 (Regularity conditions) Let $L_T$ be the likelihood with true parameter $\theta$.

(i) $\int \frac{\partial\log L_T(x;\theta)}{\partial\theta}\,L_T(x;\theta)\,dx = 0$; for iid data this is equivalent to $\int \frac{\partial\log f(x;\theta)}{\partial\theta}\,f(x;\theta)\,dx = 0$.
(ii) $\frac{\partial}{\partial\theta}\int L_T(x;\theta)\,dx = \int\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$.

(iii) $\frac{\partial}{\partial\theta}\int g(x)L_T(x;\theta)\,dx = \int g(x)\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx$, where $g$ is any function which is not a function of $\theta$ (for example an estimator of $\theta$).

(iv) $E\bigl(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\bigr)^2 > 0$.

Theorem (The Cramer-Rao bound) Let $\tilde{\theta}(X)$ be an unbiased estimator of $\theta$, and suppose the likelihood $L_T(X;\theta)$ satisfies the regularity conditions in Assumption 1.1.1. Then we have

$$\operatorname{var}\bigl(\tilde{\theta}(X)\bigr) \ge \left[E\left(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\right)^2\right]^{-1}.$$

PROOF. Recall that $\tilde{\theta}(X)$ is an unbiased estimator of $\theta$, therefore $\int \tilde{\theta}(x)L_T(x;\theta)\,dx = \theta$. Differentiating both sides with respect to $\theta$ gives

$$\int \tilde{\theta}(x)\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 1.$$

Since $\int \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$, we have

$$\int \bigl(\tilde{\theta}(x)-\theta\bigr)\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 1.$$

Multiplying and dividing by $L_T(x;\theta)$ gives

$$\int \bigl(\tilde{\theta}(x)-\theta\bigr)\frac{1}{L_T(x;\theta)}\frac{\partial L_T(x;\theta)}{\partial\theta}\,L_T(x;\theta)\,dx = 1,$$

hence, since $L_T(x;\theta)$ is the density of $X$, we have

$$E\left[\bigl(\tilde{\theta}(X)-\theta\bigr)\frac{\partial\log L_T(X;\theta)}{\partial\theta}\right] = 1.$$

Recalling that the Cauchy-Schwarz inequality is $E(UV) \le (EU^2)^{1/2}(EV^2)^{1/2}$, where equality arises only if $U = aV + b$ with $a$ and $b$ constants, and applying it to the above, we have

$$1 \le \operatorname{var}\bigl(\tilde{\theta}(X)\bigr)\,E\left(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\right)^2,$$

thus giving us the Cramer-Rao inequality. Finally we need to prove that

$$E\left(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\right)^2 = -E\left(\frac{\partial^2\log L_T(X;\theta)}{\partial\theta^2}\right).$$

To prove this result we use the fact that $L_T$ is a density, so that $\int L_T(x;\theta)\,dx = 1$.
Now differentiating the above with respect to $\theta$ gives $\int \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$. By using Assumption 1.1.1(ii) we have

$$\int \frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0 \;\Rightarrow\; \int \frac{\partial\log L_T(x;\theta)}{\partial\theta}\,L_T(x;\theta)\,dx = 0.$$

Differentiating again with respect to $\theta$ and taking the derivative inside gives

$$\int \frac{\partial^2\log L_T(x;\theta)}{\partial\theta^2}\,L_T(x;\theta)\,dx + \int \frac{\partial\log L_T(x;\theta)}{\partial\theta}\,\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx = 0$$
$$\int \frac{\partial^2\log L_T(x;\theta)}{\partial\theta^2}\,L_T(x;\theta)\,dx + \int \frac{\partial\log L_T(x;\theta)}{\partial\theta}\,\frac{1}{L_T(x;\theta)}\frac{\partial L_T(x;\theta)}{\partial\theta}\,L_T(x;\theta)\,dx = 0$$
$$\int \frac{\partial^2\log L_T(x;\theta)}{\partial\theta^2}\,L_T(x;\theta)\,dx + \int \left(\frac{\partial\log L_T(x;\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\,dx = 0.$$

Thus

$$E\left(\frac{\partial^2\log L_T(X;\theta)}{\partial\theta^2}\right) = -E\left(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\right)^2,$$

which gives us the required result.

Corollary (Estimators which attain the C-R bound) Suppose Assumption 1.1.1 is satisfied. Then the estimator $\hat{\theta}(X)$ attains the C-R bound only if it can be written as

$$\hat{\theta}(X) = a(\theta) + b(\theta)\frac{\partial\log L_T(X;\theta)}{\partial\theta}$$

for some functions $a$ and $b$.

PROOF. The proof is clear: it follows from the case in which the Cauchy-Schwarz inequality in the derivation of the C-R bound is an actual equality.

We mention that there exist some well-known distributions which do not satisfy Assumption 1.1.1. These are non-regular distributions. A classical example of a distribution which violates this assumption is the uniform distribution $f(x;\theta) = 1/\theta$ for $x\in[0,\theta]$ and zero elsewhere. Other examples include distributions where the support of the distribution is a function of the parameter. The Cramer-Rao lower bound does not hold (indeed need not even exist) for such distributions, and below we will show why.

Example (The classical example of the uniform) Let us consider the uniform distribution, which has the density $f(x;\theta) = \theta^{-1}I_{[0,\theta]}(x)$. Given the iid uniform
random variables $\{X_t\}$, the likelihood (it is easier to study the likelihood rather than the log-likelihood) is

$$L_T(X_T;\theta) = \frac{1}{\theta^T}\prod_{t=1}^{T} I_{[0,\theta]}(X_t).$$

Since the support of the density involves the unknown parameter, the derivative of $\log L_T(X_T;\theta)$ is not well defined: what is the derivative of $\log I_{[0,\theta]}(X_t)$ with respect to $\theta$? Observe that $\log 0$ is not defined and the derivative at $\theta = X_t$ does not exist, so Assumption 1.1.1(ii) is not satisfied. This is a classical example of a density which does not satisfy the regularity conditions. This means that the inverse of the Fisher information does not give a lower bound for the variance of an estimator. In fact, using $L_T(X_T;\theta)$, the maximum likelihood estimator of $\theta$ is $\hat{\theta}_T = \max_{1\le t\le T} X_t$ (you can see this by making a plot of $L_T(X_T;\theta)$ against $\theta$). It is well known that the distribution of $\max_{1\le t\le T}X_t$ is

$$P\Bigl(\max_{1\le t\le T}X_t \le x\Bigr) = P(X_1\le x,\ldots,X_T\le x) = \prod_{t=1}^{T}P(X_t\le x) = \left(\frac{x}{\theta}\right)^{T},$$

and the density of $\max_{1\le t\le T}X_t$ is $f_{\hat{\theta}_T}(x) = Tx^{T-1}/\theta^{T}$ for $x\in[0,\theta]$.

Exercise: Find the variance of $\hat{\theta}_T$ defined above.

Often we want to estimate a function of $\theta$, say $\tau(\theta)$. The following corollary is a small generalisation of the Cramer-Rao bound.

Corollary Suppose the regularity conditions (Assumption 1.1.1) are satisfied and $T(X)$ is an unbiased estimator of $\tau(\theta)$. Then we have

$$\operatorname{var}\bigl(T(X)\bigr) \ge \tau'(\theta)^2\left[E\left(\frac{\partial\log L_T(X;\theta)}{\partial\theta}\right)^2\right]^{-1}.$$

We now define the notion of sufficiency, which gives us the ingredients for constructing a good estimator (see also Davison).

Definition (Sufficiency) Suppose that $X = (X_1,\ldots,X_T)$ is a random vector. The statistic $s(X)$ is called a sufficient statistic for the parameter $\theta$ if the conditional distribution of $X$ given $s(X)$ is not a function of $\theta$.

Normally it is extremely hard to obtain the sufficient statistic from its definition. However, the factorisation theorem gives us a way of obtaining it.
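The distribution of the maximum can be checked by simulation. The sketch below (Python, with arbitrarily chosen $\theta$ and $T$) compares the empirical probability $P(\max_t X_t \le x_0)$ with the formula $(x_0/\theta)^T$ derived above.

```python
import random

random.seed(1)
theta, T, reps = 2.0, 5, 20000   # hypothetical parameter and sample size

def theta_hat():
    """MLE of theta for a Uniform[0, theta] sample: the sample maximum."""
    return max(random.uniform(0, theta) for _ in range(T))

x0 = 1.5
empirical = sum(theta_hat() <= x0 for _ in range(reps)) / reps
exact = (x0 / theta) ** T   # P(max <= x0) = (x0/theta)^T
assert abs(empirical - exact) < 0.02
```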
Theorem (The Factorisation Theorem) Suppose that the likelihood function can be factorised as $L_T(X;\theta) = h(X)g(s(X);\theta)$, where $h(X)$ is not a function of $\theta$. Then $s(X)$ is a sufficient statistic for $\theta$.
We see that a sufficient statistic contains all the information about the parameter $\theta$.

Theorem (Rao-Blackwell Theorem) Suppose $s(X)$ is a sufficient statistic and $\tilde{\theta}(X)$ is an unbiased estimator of $\theta$. Then if we define the new unbiased estimator $E[\tilde{\theta}(X)\mid s(X)]$, we have

$$\operatorname{var}\bigl(E[\tilde{\theta}(X)\mid s(X)]\bigr) \le \operatorname{var}\bigl(\tilde{\theta}(X)\bigr).$$

The Rao-Blackwell theorem tells us that estimators with the smallest variance must be a function of the sufficient statistic. Of course, this begs the question: is there a unique estimator with the minimum variance? For this we require completeness of the sufficient statistic; uniqueness immediately follows from completeness.

Definition (Completeness) Let $s(X)$ be a sufficient statistic for $\theta$. Then $s(X)$ is a complete sufficient statistic if, for any function $Z$, $E[Z(s(X))] = 0$ for all $\theta$ implies $Z(t) = 0$ for all $t$.

Theorem (Lehmann-Scheffe Theorem) Suppose that $s(X)$ is a complete sufficient statistic and $\tilde{\theta}(s(X))$ is an unbiased estimator of $\theta$. Then $\tilde{\theta}(s(X))$ is the unique minimum variance unbiased estimator of $\theta$.

The theorems above are theoretical, in the sense that under certain conditions they give a lower bound for the variance of a plausible estimator, and practical in the sense that they tell us that the best estimator should be a function of the sufficient statistic. The natural question to ask is how to construct such estimators. One of the most popular estimators in statistics is the maximum likelihood estimator (mle). The mle of $\theta$ is

$$\hat{\theta}_T = \arg\max_{\theta\in\Theta}\mathcal{L}_T(\theta),$$

where $\Theta$ is the parameter space containing all values of $\theta$ with $\int f(x;\theta)\,dx = 1$. There are two reasons the mle is so widely used: (i) it can be shown for a wide range of probability distributions (including, under certain conditions, the exponential family of distributions, defined below) that the mle is a function of the sufficient statistic, hence the mle is often the minimum variance unbiased estimator; (ii) asymptotically, at least, the mle under certain conditions attains the C-R bound.
Of course one can construct examples where the regularity conditions are not satisfied and the mle is not the optimal estimator; examples include estimation of the range of the uniform distribution, where an estimator can be constructed which has a smaller variance than the mle. But for the vast majority of distributions the mle is optimal. It is also worth mentioning that there can exist biased estimators which have a smaller mean squared error than the mle; this intriguing notion is called super-efficiency, which is beyond this course (see Stoica and Ottesten (1996) for a review).
1.1.3 Additional Notes

We will use various distributions in this course; it would be useful if you compiled a list of these distributions and became familiar with them.

Example (Useful transformations)

Question: The distribution function of the random variable $X_t$ is $F_t(x) = 1-\exp(-\lambda_t x)$.

(i) Give a transformation of $X_t$ such that the transformed variable is uniformly distributed on the interval $[0,1]$.

(ii) Suppose that I observe the independent (but not necessarily identically distributed) random variables $\{X_t\}$, and I want to check whether they have the distribution function $F_t(x) = 1-\exp(-\lambda_t x)$. Using (i), suggest a method for checking this.

Answer:

(i) It is well known that if the random variable $X_t$ has the continuous, invertible distribution function $F_t(x)$, then the transformed random variable $Y_t = F_t(X_t)$ is uniformly distributed on the interval $[0,1]$. To see this, note that the distribution of $Y_t$ can be evaluated as

$$P(Y_t\le y) = P\bigl(F_t(X_t)\le y\bigr) = P\bigl(X_t\le F_t^{-1}(y)\bigr) = F_t\bigl(F_t^{-1}(y)\bigr) = y, \qquad y\in[0,1].$$

Thus, to answer the question, we let $Y_t = 1-\exp(-\lambda_t X_t)$, which has a uniform distribution.

(ii) If we want to check whether $X_t$ follows the distribution $F_t(x) = 1-\exp(-\lambda_t x)$, we can make the transformation $Y_t = 1-\exp(-\lambda_t X_t)$ and use, for example, the Kolmogorov-Smirnov test to check whether $\{Y_t\}$ follows a uniform distribution.

Example

Question: Suppose that $Z$ is a Weibull random variable with density

$$f(x;\varphi,\alpha) = \frac{\alpha}{\varphi}\left(\frac{x}{\varphi}\right)^{\alpha-1}\exp\bigl(-(x/\varphi)^{\alpha}\bigr).$$

Show that $E(Z^r) = \varphi^r\,\Gamma\bigl(1+\frac{r}{\alpha}\bigr)$.

Hint: Use $\int_0^{\infty} x^{a}\exp(-x^{b})\,dx = \frac{1}{b}\Gamma\bigl(\frac{a+1}{b}\bigr)$ for $a,b>0$.

This result may be useful in some of the examples given in this course.
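The check in (ii) can be sketched numerically. The following Python simulation (with hypothetical rates $\lambda_t$) applies the probability integral transform and computes the one-sample Kolmogorov-Smirnov statistic directly, rather than calling a testing library.

```python
import math
import random

random.seed(0)
# Hypothetical rates lambda_t: each observation has its own exponential distribution.
rates = [0.5 + 0.1 * t for t in range(200)]
xs = [random.expovariate(lam) for lam in rates]

# Probability integral transform: Y_t = F_t(X_t) = 1 - exp(-lambda_t * X_t) ~ U[0,1].
ys = sorted(1 - math.exp(-lam * x) for lam, x in zip(rates, xs))
n = len(ys)

# Kolmogorov-Smirnov statistic against the uniform distribution function F(y) = y.
ks = max(max((i + 1) / n - y, y - i / n) for i, y in enumerate(ys))
assert ks < 1.5 * 1.36 / math.sqrt(n)   # comfortably below the 5% critical value
```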
Chapter 2

The Bayesian Cramer-Rao Inequality

2.1 The Bayesian Cramer-Rao inequality

The classical Cramér-Rao inequality is useful for assessing the quality of a given estimator, but from the derivation we can clearly see that it only holds if the estimator is unbiased; no such inequality can be derived for estimators which are biased. This can be a problem, for example, in nonparametric regression, where estimators in general will be biased. How does one assess the estimator in such cases? To answer this question we consider an inequality which is similar to the Cramer-Rao inequality but does not require that the estimator be unbiased, so long as we place a prior on the parameter space. This inequality is known as the Bayesian Cramer-Rao or van Trees inequality.

Suppose $\{X_i\}_{i=1}^{T}$ are random variables with distribution $L_T(X;\theta)$, and let $\tilde{\theta}(X)$ be an estimator of $\theta$. We now "Bayesianise" the set-up by placing a prior distribution on the parameter space $\Theta$; the density of this prior we denote as $\lambda$. Let $E[g(X)\mid\theta] = \int g(x)L_T(x;\theta)\,dx$ and let $E_\lambda$ denote the expectation over the prior density $\lambda$. For example,

$$E_\lambda E\bigl[\tilde{\theta}(X)\mid\theta\bigr] = \int_a^b\int_{\mathbb{R}^T}\tilde{\theta}(x)\,L_T(x;\theta)\lambda(\theta)\,dx\,d\theta.$$

Assumption 2.1.1 $\theta$ is defined over the compact interval $[a,b]$ and $\lambda(x)\to 0$ as $x\to a$ and as $x\to b$, so $\lambda(a)=\lambda(b)=0$.

Theorem Suppose Assumptions 1.1.1 and 2.1.1 hold, and let $\tilde{\theta}(X)$ be an estimator of $\theta$. Then we have

$$E_\lambda E\bigl[(\tilde{\theta}(X)-\theta)^2\mid\theta\bigr] \ge \bigl[E_\lambda I(\theta) + I(\lambda)\bigr]^{-1},$$
where

$$E_\lambda I(\theta) = \int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log L_T(x;\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta \quad\text{and}\quad I(\lambda) = \int_a^b\left(\frac{\partial\log\lambda(\theta)}{\partial\theta}\right)^2\lambda(\theta)\,d\theta.$$

PROOF. We first note that since $\lambda(a) = \lambda(b) = 0$, we have

$$\int_a^b \frac{\partial\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\,d\theta = \Bigl[L_T(x;\theta)\lambda(\theta)\Bigr]_a^b = 0.$$

Therefore, using the above,

$$\int_{\mathbb{R}^T}\tilde{\theta}(x)\int_a^b \frac{\partial\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\,d\theta\,dx = 0. \tag{2.1}$$

Now let us consider $\int_{\mathbb{R}^T}\int_a^b \theta\,\frac{\partial(L_T(x;\theta)\lambda(\theta))}{\partial\theta}\,d\theta\,dx$. By integration by parts we have

$$\int_{\mathbb{R}^T}\int_a^b \theta\,\frac{\partial\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\,d\theta\,dx = \int_{\mathbb{R}^T}\Bigl[\theta L_T(x;\theta)\lambda(\theta)\Bigr]_a^b\,dx - \int_{\mathbb{R}^T}\int_a^b L_T(x;\theta)\lambda(\theta)\,d\theta\,dx = -1. \tag{2.2}$$

Subtracting (2.2) from (2.1) we have

$$\int_{\mathbb{R}^T}\int_a^b \bigl(\tilde{\theta}(x)-\theta\bigr)\frac{\partial\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\,d\theta\,dx = 1.$$

Multiplying and dividing by $L_T(x;\theta)\lambda(\theta)$ gives

$$\int_a^b\int_{\mathbb{R}^T}\bigl(\tilde{\theta}(x)-\theta\bigr)\frac{\partial\log\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\,L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = 1.$$

Now, by using the Cauchy-Schwarz inequality, we have

$$1 \le \underbrace{\int_a^b\int_{\mathbb{R}^T}\bigl(\tilde{\theta}(x)-\theta\bigr)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta}_{E_\lambda E[(\tilde{\theta}(X)-\theta)^2\mid\theta]} \;\cdot\; \int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta.$$

Rearranging the above gives

$$E_\lambda E\bigl[(\tilde{\theta}(X)-\theta)^2\mid\theta\bigr] \ge \left[\int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta\right]^{-1}.$$
Finally we want to show that the denominator on the right-hand side above satisfies

$$\int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log\bigl(L_T(x;\theta)\lambda(\theta)\bigr)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = E_\lambda I(\theta) + I(\lambda).$$

Using basic algebra we have

$$\int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log L_T(x;\theta)}{\partial\theta} + \frac{\partial\log\lambda(\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta$$
$$= \underbrace{\int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log L_T(x;\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta}_{E_\lambda I(\theta)} \;+\; 2\int_a^b\int_{\mathbb{R}^T}\frac{\partial\log L_T(x;\theta)}{\partial\theta}\,\frac{\partial\log\lambda(\theta)}{\partial\theta}\,L_T(x;\theta)\lambda(\theta)\,dx\,d\theta \;+\; \int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log\lambda(\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta.$$

For the cross term we note that

$$\int_a^b\int_{\mathbb{R}^T}\frac{\partial\log L_T(x;\theta)}{\partial\theta}\,\frac{\partial\log\lambda(\theta)}{\partial\theta}\,L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = \int_a^b \frac{\partial\log\lambda(\theta)}{\partial\theta}\,\lambda(\theta)\underbrace{\int_{\mathbb{R}^T}\frac{\partial L_T(x;\theta)}{\partial\theta}\,dx}_{=0}\,d\theta = 0,$$

and for the last term, since $\int_{\mathbb{R}^T} L_T(x;\theta)\,dx = 1$,

$$\int_a^b\int_{\mathbb{R}^T}\left(\frac{\partial\log\lambda(\theta)}{\partial\theta}\right)^2 L_T(x;\theta)\lambda(\theta)\,dx\,d\theta = \int_a^b\left(\frac{\partial\log\lambda(\theta)}{\partial\theta}\right)^2\lambda(\theta)\,d\theta = I(\lambda).$$

Therefore the denominator equals $E_\lambda I(\theta) + I(\lambda)$, as required.

We will consider applications of the Bayesian Cramer-Rao bound later, for obtaining lower bounds for nonparametric density estimators.
Chapter 3

The Exponential Family

3.1 The exponential family of distributions

(See also Section 5.2 of Davison.)

It is possible to derive the properties (e.g. mean, variance and maximum likelihood estimators, to be defined properly later on) for every distribution of interest. However, this can be cumbersome, the algebra can be tedious, and we may not see the big picture. Instead, we now consider an umbrella family of distributions which includes several well-known distributions. We will derive a general expression for the mean and variance of such distributions, which will be useful when we consider generalised linear models later in this course, and use these results to show that the maximum likelihood estimator is a function of the sufficient statistic, and thus is the best unbiased estimator under the assumption of completeness. In other words, we show that for this family of distributions the maximum likelihood estimator, which we have encountered many times previously, is indeed the best parameter estimator in terms of minimum variance.

Suppose that the distribution of the random variable $X_t$ can be written in the form

$$f(y;\omega) = \exp\bigl(s(y)\eta(\omega) - b(\omega) + c(y)\bigr). \tag{3.1}$$

If the distribution of $X_t$ (either the probability distribution function for discrete random variables or the probability density function for continuous random variables) has the above representation, then $X_t$ is said to belong to the exponential family of distributions. A large number of well-known distribution functions belong to this family. Hence, by understanding the properties of the exponential family, we can draw conclusions about a large number of distribution functions.

Example

(a) The exponential distribution: $X\sim\text{Exp}(\lambda)$, hence the pdf is $f(y;\lambda) = \lambda\exp(-\lambda y)$, which can be written as

$$\log f(y;\lambda) = -y\lambda + \log\lambda.$$
Therefore $s(y) = y$, $\eta(\lambda) = -\lambda$, $b(\lambda) = -\log\lambda$ and $c(y) = 0$.

(b) The binomial distribution $P(X=y) = \binom{n}{y}\pi^y(1-\pi)^{n-y}$ can be rewritten as

$$\log P(y;\pi) = y\log\frac{\pi}{1-\pi} + n\log(1-\pi) + \log\binom{n}{y}.$$

Therefore $s(y)=y$, $\eta(\pi) = \log\frac{\pi}{1-\pi}$, $b(\pi) = -n\log(1-\pi)$ and $c(y) = \log\binom{n}{y}$.

It should be mentioned that it is straightforward to generalise the exponential family to the case that the parameter is a vector of dimension greater than one. Suppose that $\theta$ is a $p$-dimensional vector. The order-$p$ exponential family consists of the distributions which satisfy

$$f(y;\omega) = \exp\bigl(s(y)'\theta(\omega) - b(\omega) + c(y)\bigr),$$

where $s(y) = (s_1(y),\ldots,s_p(y))'$ with $\{s_i\}$ linearly independent, and $\theta(\omega) = (\theta_1(\omega),\ldots,\theta_p(\omega))'$.

The natural exponential family

If we let $\theta = \eta(\omega)$, where $\eta$ is an invertible function (hence there is a one-to-one correspondence between the space containing $\omega$ and the space containing $\theta$), then we can rewrite (3.1) as

$$f(y;\theta) = \exp\bigl(s(y)\theta - \kappa(\theta) + c(y)\bigr),$$

where $\kappa(\theta) = b(\eta^{-1}(\theta))$. The natural exponential family is the case $s(y) = y$. By transformation we now give examples of distributions in natural form.

(i) The exponential distribution is already in natural exponential form (with natural parameter $\theta = -\lambda$).

(ii) For the binomial distribution we let $\theta = \log\frac{\pi}{1-\pi}$; since $\log\frac{\pi}{1-\pi}$ is invertible, this gives

$$\log f(y;\theta) = y\theta - n\log\bigl(1+\exp(\theta)\bigr) + \log\binom{n}{y}.$$

Hence the parameter of interest, $\pi$, has been transformed; often (later in the course) we fit a model to $\theta$ and transform back to obtain an estimator of $\pi$.

Some properties of the natural exponential family

Distributions which have a natural exponential representation have interesting properties which we now discuss.

Lemma Suppose that $X$ is a random variable which has a natural exponential representation. Then the moment generating function of $X$ is

$$E\bigl(\exp(Xt)\bigr) = \exp\bigl(\kappa(t+\theta)-\kappa(\theta)\bigr).$$

Furthermore, $E(X) = \kappa'(\theta)$ and $\operatorname{var}(X) = \kappa''(\theta)$.
PROOF. Let us suppose that $t$ is sufficiently small such that $f(y;\theta+t)$ is a distribution. The mgf is

$$M_X(t) = E\bigl(\exp(tY)\bigr) = \int \exp(ty)\exp\bigl(\theta y - \kappa(\theta) + c(y)\bigr)\,dy = \exp\bigl(\kappa(\theta+t)-\kappa(\theta)\bigr)\int \exp\bigl((\theta+t)y - \kappa(\theta+t) + c(y)\bigr)\,dy = \exp\bigl(\kappa(\theta+t)-\kappa(\theta)\bigr),$$

since $\int \exp\bigl((\theta+t)y - \kappa(\theta+t) + c(y)\bigr)\,dy = \int f(y;\theta+t)\,dy = 1$. To obtain the moments we recall that $M_X'(0) = E(X)$ and $\operatorname{var}(X) = M_X''(0) - M_X'(0)^2$. Therefore

$$M_X'(t) = \kappa'(\theta+t)\exp\bigl(\kappa(\theta+t)-\kappa(\theta)\bigr), \qquad M_X''(t) = \bigl[\kappa''(\theta+t) + \kappa'(\theta+t)^2\bigr]\exp\bigl(\kappa(\theta+t)-\kappa(\theta)\bigr).$$

Hence $M_X'(0) = \kappa'(\theta)$ and $M_X''(0) = \kappa''(\theta) + \kappa'(\theta)^2$, which gives the result.

Remark The mean and variance of the natural exponential family make obtaining the mle quite simple. We derive this later, but we first observe that since $E(X) = \kappa'(\theta)$, the mean of $X$ is a function of $\theta$, hence we can write $\mu(\theta) = \kappa'(\theta)$. Moreover, since $\operatorname{var}(X) = \kappa''(\theta) > 0$, the derivative of $\mu$, $\mu'(\theta) = \kappa''(\theta)$, is strictly positive. In other words, $\mu(\theta) = \kappa'(\theta)$ is an increasing function of $\theta$. Thus $\mu(\theta)$ is an invertible function: given $\mu(\theta)$, we can uniquely determine $\theta$. This observation will prove useful later when obtaining the mle of $\theta$.
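The identities $E(X) = \kappa'(\theta)$ and $\operatorname{var}(X) = \kappa''(\theta)$ are easy to verify numerically. The sketch below does this for the binomial distribution in natural form, differentiating $\kappa(\theta) = n\log(1+e^{\theta})$ by finite differences and comparing with the familiar binomial mean $n\pi$ and variance $n\pi(1-\pi)$.

```python
import math

# Binomial(n, pi) in natural form: theta = log(pi/(1-pi)), kappa(theta) = n*log(1+e^theta).
n, pi = 10, 0.3
theta = math.log(pi / (1 - pi))

def kappa(t):
    return n * math.log(1 + math.exp(t))

h = 1e-4
kappa1 = (kappa(theta + h) - kappa(theta - h)) / (2 * h)                    # kappa'(theta)
kappa2 = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h**2    # kappa''(theta)

assert abs(kappa1 - n * pi) < 1e-5               # E(X) = kappa'(theta) = n*pi
assert abs(kappa2 - n * pi * (1 - pi)) < 1e-4    # var(X) = kappa''(theta) = n*pi*(1-pi)
```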
Maximum likelihood estimation for the exponential family

Suppose that $\{X_t\}$ are iid random variables which have a natural exponential representation. Then the log-likelihood is

$$\mathcal{L}_T(X;\theta) = \theta\sum_{t=1}^{T}X_t - T\kappa(\theta) + \sum_{t=1}^{T}c(X_t).$$

Hence by using the factorisation theorem we see that the sufficient statistic for $\theta$ is $s(X) = \sum_{t=1}^{T}X_t$. Supposing that the regularity conditions are satisfied, the minimum variance unbiased estimator of $\theta$ should be a function of $s(X)$. We now obtain the maximum likelihood estimator of $\theta$, and derive conditions under which the mle is a function of $s(X)$ (hence, by the Rao-Blackwell theorem and the Lehmann-Scheffe theorem, it is the best estimator). The mle of $\theta$ is

$$\hat{\theta}_T = \arg\max_{\theta\in\Theta}\left\{\theta\sum_{t=1}^{T}X_t - T\kappa(\theta) + \sum_{t=1}^{T}c(X_t)\right\}.$$

The natural way to obtain $\hat{\theta}_T$ is to find the solution of $\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta} = 0$. However, whether $\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_T} = 0$ holds depends on a few conditions. Before we derive these conditions, we first consider the solution of the derivative of $\mathcal{L}_T(X;\theta)$. Differentiating gives

$$\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta} = \sum_{t=1}^{T}X_t - T\kappa'(\theta).$$

Therefore, since $\mu(\theta) = \kappa'(\theta)$ is an invertible function, $\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta} = 0$ when

$$\hat{\theta}_T = \mu^{-1}\left(\frac{1}{T}\sum_{t=1}^{T}X_t\right).$$

Of course, we need to know under what conditions this stationary point is the maximiser over $\Theta$. This really depends on the parameter space $\Theta$.

Definition Let $\Theta$ be the parameter space of $\theta$ and $\mathcal{Y}$ the space of outcomes of the random variable $X$. Let $M = \{\mu = \mu(\theta);\theta\in\Theta\}$ denote the mean space, and let $\bar{\mathcal{Y}}_T = \{\bar{y} = \frac{1}{T}\sum_{t=1}^{T}x_t;\; x_t\in\mathcal{Y}\}$ denote the sample mean space.

Lemma Suppose that $\{X_t\}$ are iid random variables which have a natural exponential representation. If $\bar{\mathcal{Y}}_T \subseteq M$, then

$$\mu^{-1}\left(\frac{1}{T}\sum_{t=1}^{T}X_t\right) = \arg\max_{\theta\in\Theta}\left\{\theta\sum_{t=1}^{T}X_t - T\kappa(\theta) + \sum_{t=1}^{T}c(X_t)\right\}.$$

PROOF. The proof is straightforward, since the first derivative is zero when $\theta = \mu^{-1}(\frac{1}{T}\sum_{t=1}^{T}X_t)$, and this stationary point is the maximiser whenever the sample mean lies in the mean space, i.e. $\bar{\mathcal{Y}}_T \subseteq M$.

Remark (Minimum variance unbiased estimators) Suppose $X_t$ has a distribution in the natural exponential family, the conditions of the above lemma are satisfied, and $s(X)$ is the complete sufficient statistic for $\theta$. Moreover, if $\mu^{-1}(\frac{1}{T}\sum_{t=1}^{T}X_t)$ is an unbiased estimator of $\theta$, then it is the minimum variance unbiased estimator of $\theta$. In general, however, this will not be the case. But by using Slutsky's theorem it can be shown that $\mu^{-1}(\frac{1}{T}\sum_{t=1}^{T}X_t) \xrightarrow{P} \theta$.
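As a concrete instance of $\hat{\theta}_T = \mu^{-1}(\frac{1}{T}\sum_t X_t)$, the sketch below uses the Poisson distribution (also a natural exponential family, with $\kappa(\theta)=e^{\theta}$ and natural parameter $\theta = \log\lambda$; this distribution is not derived above, so take the parametrisation as an assumption) and checks that $\hat{\theta}_T = \log\bar{X}$ is close to $\log\lambda$ for a large simulated sample.

```python
import math
import random

random.seed(2)
# Poisson(lam) in natural form: theta = log(lam), kappa(theta) = e^theta,
# so mu(theta) = kappa'(theta) = e^theta and theta_hat = mu^{-1}(xbar) = log(xbar).
lam, T = 4.0, 5000

def poisson_draw(lam):
    """Knuth's multiplication method, adequate for small lam."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

x = [poisson_draw(lam) for _ in range(T)]
theta_hat = math.log(sum(x) / T)   # mle of the natural parameter
assert abs(theta_hat - math.log(lam)) < 0.05
```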
Remark (Estimating $\omega$) Often we are interested in estimating $\omega$, where $\theta = \eta(\omega)$. Since

$$\frac{\partial\mathcal{L}(X;\theta(\omega))}{\partial\omega} = \frac{\partial\theta(\omega)}{\partial\omega}\left[\sum_{t=1}^{T}X_t - T\kappa'\bigl(\theta(\omega)\bigr)\right],$$

if all conditions regarding the parameter and sample mean spaces are satisfied, then the mle of $\omega$ is

$$\hat{\omega}_T = \eta^{-1}\left(\mu^{-1}\left(\frac{1}{T}\sum_{t=1}^{T}X_t\right)\right).$$

It should be noted that one great advantage of the exponential family of distributions is that the mle is easy to obtain, with explicit expressions! Many of the results above can be generalised to the setting where $\{X_t\}$ are independent but not necessarily identically distributed and there exist regressors $z_t$ which are known to influence the mean of $X_t$. We will revisit this problem when we consider generalised linear models.
Chapter 4

The Maximum Likelihood Estimator

4.1 The maximum likelihood estimator

As illustrated for the exponential family of distributions discussed above, the maximum likelihood estimator of $\theta_0$ (the true parameter) is defined as

$$\hat{\theta}_T = \arg\max_{\theta\in\Theta}\mathcal{L}_T(X;\theta) = \arg\max_{\theta\in\Theta}\mathcal{L}_T(\theta).$$

Often we find that $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_T} = 0$, hence the solution can be obtained by solving the derivative of the log-likelihood (often called the score function). However, if $\theta_0$ lies on or close to the boundary of the parameter space, this will not necessarily be true. Below we consider the sampling properties of $\hat{\theta}_T$ when the true parameter $\theta_0$ lies in the interior of the parameter space $\Theta$.

We note that the maximiser of the likelihood is invariant to invertible transformations of the data. For example, if $X$ has the density $f(\cdot;\theta)$ and we define the transformed random variable $Z = g(X)$, where the function $g$ has an inverse (it is a 1-1 transformation), then it is easy to show that the density of $Z$ is $f\bigl(g^{-1}(z);\theta\bigr)\bigl|\frac{\partial g^{-1}(z)}{\partial z}\bigr|$. Therefore the likelihood of $\{Z_t = g(X_t)\}$ is

$$\prod_{t=1}^{T}f\bigl(g^{-1}(Z_t);\theta\bigr)\left|\frac{\partial g^{-1}(z)}{\partial z}\right|_{z=Z_t} = \prod_{t=1}^{T}f(X_t;\theta)\left|\frac{\partial g^{-1}(z)}{\partial z}\right|_{z=Z_t}.$$

Hence it is proportional (in $\theta$) to the likelihood of $\{X_t\}$, and the maximiser of the likelihood of $\{Z_t = g(X_t)\}$ is the same as the maximiser of the likelihood of $\{X_t\}$.
4.1.1 Evaluating the MLE

Examples

Example $\{X_t\}$ are iid random variables which follow a normal (Gaussian) distribution $N(\mu,\sigma^2)$. The log-likelihood is proportional to

$$\mathcal{L}_T(X;\mu,\sigma^2) \propto -T\log\sigma - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(X_t-\mu)^2.$$

Maximising the above with respect to $\mu$ and $\sigma^2$ gives $\hat{\mu}_T = \bar{X}$ and $\hat{\sigma}^2_T = \frac{1}{T}\sum_{t=1}^{T}(X_t-\bar{X})^2$.

Example Question: $\{Y_t\}$ are iid random variables which follow a Weibull distribution, which has the density $\frac{\alpha y^{\alpha-1}}{\theta^{\alpha}}\exp\bigl(-(y/\theta)^{\alpha}\bigr)$, $\theta,\alpha>0$. Suppose that $\alpha$ is known but $\theta$ is unknown and we need to estimate it. What is the maximum likelihood estimator of $\theta$?

Solution: The log-likelihood of interest is proportional to

$$\mathcal{L}_T(Y;\theta) = \sum_{t=1}^{T}\left[\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right] \propto \sum_{t=1}^{T}\left[-\alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right].$$

The derivative of the log-likelihood with respect to $\theta$ is

$$\frac{\partial\mathcal{L}_T}{\partial\theta} = -\frac{T\alpha}{\theta} + \frac{\alpha}{\theta^{\alpha+1}}\sum_{t=1}^{T}Y_t^{\alpha} = 0.$$

Solving the above gives $\hat{\theta}_T = \bigl(\frac{1}{T}\sum_{t=1}^{T}Y_t^{\alpha}\bigr)^{1/\alpha}$.

Example Notice that if $\alpha$ is given, an explicit solution for the maximiser of the likelihood in the above example can be obtained. Consider instead maximising the likelihood with respect to both $\alpha$ and $\theta$, i.e.

$$\arg\max_{\theta,\alpha}\sum_{t=1}^{T}\left[\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \left(\frac{Y_t}{\theta}\right)^{\alpha}\right].$$
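The closed-form Weibull estimator with known $\alpha$ can be sanity-checked by simulation. This sketch (with arbitrarily chosen parameter values) draws Weibull data by inverse transform and applies $\hat{\theta}_T = (\frac{1}{T}\sum_t Y_t^{\alpha})^{1/\alpha}$.

```python
import math
import random

random.seed(3)
alpha, theta, T = 2.0, 1.5, 5000   # hypothetical parameter values

# Inverse-transform Weibull draws: if U ~ U(0,1) then theta*(-log U)^(1/alpha) ~ Weibull.
y = [theta * (-math.log(random.random())) ** (1 / alpha) for _ in range(T)]

# MLE with alpha known: theta_hat = ( (1/T) * sum Y_t^alpha )^(1/alpha)
theta_hat = (sum(yt ** alpha for yt in y) / T) ** (1 / alpha)
assert abs(theta_hat - theta) < 0.1
```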
The derivatives of the log-likelihood are

$$\frac{\partial\mathcal{L}_T}{\partial\theta} = -\frac{T\alpha}{\theta} + \frac{\alpha}{\theta^{\alpha+1}}\sum_{t=1}^{T}Y_t^{\alpha} = 0$$
$$\frac{\partial\mathcal{L}_T}{\partial\alpha} = \frac{T}{\alpha} + \sum_{t=1}^{T}\log Y_t - T\log\theta - \sum_{t=1}^{T}\left(\frac{Y_t}{\theta}\right)^{\alpha}\log\left(\frac{Y_t}{\theta}\right) = 0.$$

It is clear that an explicit expression for the solution of the above does not exist, and we need to find alternative methods for finding a solution. Below we shall describe numerical routines which can be used in the maximisation. In special cases one can use other methods, such as the profile likelihood (we cover this later on).

Numerical routines

In an ideal world, to maximise a likelihood we would consider the derivative of the likelihood, solve $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_T} = 0$, and an explicit expression would exist for the solution. In reality this rarely happens, as we illustrated in the section above. Usually we will be unable to obtain an explicit expression for the mle, and the maximisation has to be done using alternative, numerical methods. Typically it is relatively straightforward to maximise the likelihood of random variables which belong to the exponential family (numerical algorithms sometimes have to be used, but they tend to be fast and attain the global maximum of the likelihood, not just a local maximum). However, the story becomes more complicated if we consider mixtures of exponential family distributions: these do not belong to the exponential family, and can be difficult to maximise using conventional numerical routines. We give an example of such a distribution here.

Let us suppose that $\{X_t\}$ are iid random variables which follow the classical normal mixture distribution

$$f(y;\theta) = pf_1(y;\theta_1) + (1-p)f_2(y;\theta_2),$$

where $f_1$ is the density of the normal with mean $\mu_1$ and variance $\sigma_1^2$, and $f_2$ is the density of the normal with mean $\mu_2$ and variance $\sigma_2^2$. The log-likelihood is

$$\mathcal{L}_T(X;\theta) = \sum_{t=1}^{T}\log\left[\frac{p}{\sqrt{2\pi\sigma_1^2}}\exp\left(-\frac{(X_t-\mu_1)^2}{2\sigma_1^2}\right) + \frac{1-p}{\sqrt{2\pi\sigma_2^2}}\exp\left(-\frac{(X_t-\mu_2)^2}{2\sigma_2^2}\right)\right].$$

Studying the above, it is clear that there is no explicit solution for the maximiser, hence one needs to use a numerical algorithm to maximise this likelihood.
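Although the mixture log-likelihood has no closed-form maximiser, it is cheap to evaluate, which is all a numerical routine needs. A minimal sketch (with hypothetical data showing two clusters) evaluates it at two parameter settings and confirms that components placed near the clusters score higher.

```python
import math

def mixture_loglik(x, p, mu1, s1sq, mu2, s2sq):
    """Two-component normal mixture log-likelihood (no closed-form maximiser)."""
    def phi(v, mu, s2):
        return math.exp(-(v - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return sum(math.log(p * phi(v, mu1, s1sq) + (1 - p) * phi(v, mu2, s2sq)) for v in x)

x = [-1.0, -0.8, 0.1, 2.9, 3.2, 3.1]   # hypothetical data with two visible clusters

# Components near the two clusters score higher than components far away:
good = mixture_loglik(x, 0.5, -0.6, 1.0, 3.0, 1.0)
bad = mixture_loglik(x, 0.5, 10.0, 1.0, -10.0, 1.0)
assert good > bad
```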
We discuss a few such methods below.
The Newton-Raphson routine

The Newton-Raphson routine is the standard method to numerically maximise the likelihood; this can often be done automatically in R by using the R functions optim or nlm. To apply Newton-Raphson, we have to assume that the derivative of the likelihood exists (this is not always the case: think about $\ell_1$-norm based estimators!) and that the maximum lies inside the parameter space, so that $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_T} = 0$. We choose an initial value $\theta_1$ and apply the routine

$$\theta_n = \theta_{n-1} - \left(\frac{\partial^2\mathcal{L}_T(\theta_{n-1})}{\partial\theta^2}\right)^{-1}\frac{\partial\mathcal{L}_T(\theta_{n-1})}{\partial\theta}.$$

Where this routine comes from will be clear from the Taylor expansion of $\frac{\partial\mathcal{L}_T(\theta_{n-1})}{\partial\theta}$ about $\theta_0$ (see Section 4.1.3). If the likelihood has just one global maximum and no local maxima (hence it is concave), then it is quite easy to maximise. If, on the other hand, the likelihood has a few local maxima and the initial value $\theta_1$ is not chosen close enough to the true maximum, then the routine may converge to a local maximum (not good!). In this case it may be a good idea to run the routine several times with several different initial values $\theta_1^{(i)}$, $i\ge 1$; for each convergence value $\hat{\theta}_T^{(i)}$, evaluate the likelihood $\mathcal{L}_T(\hat{\theta}_T^{(i)})$ and select the value which gives the largest likelihood. It is best to avoid these problems by starting with an informed choice of initial value. Implementing a Newton-Raphson routine without any thought can lead to estimators which take an incredibly long time to converge. If one carefully considers the likelihood, one can often shorten the convergence time by rewriting the likelihood and using faster methods (often based on the Newton-Raphson).

Iterative least squares

This is a method that we shall describe later when we consider generalised linear models. As the name suggests, the algorithm has to be iterated; at each step weighted least squares is implemented (see later in the course).

The EM-algorithm

This works by the introduction of dummy variables, which leads to a new "unobserved" likelihood which can easily be maximised.
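The Newton-Raphson update above can be sketched in a few lines. To keep the example verifiable, we apply it to the exponential log-likelihood $\mathcal{L}(\theta) = -T\log\theta - \frac{1}{\theta}\sum_t x_t$, whose maximiser is known in closed form (the sample mean), and check that the iterations converge to it; the data values are hypothetical.

```python
def newton_raphson(score, hessian, theta0, tol=1e-10, max_iter=100):
    """One-parameter Newton-Raphson: theta_n = theta_{n-1} - score/hessian."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Exponential log-likelihood L(theta) = -T*log(theta) - sum(x)/theta:
x = [0.5, 1.2, 0.8, 2.0, 0.3, 1.1]   # hypothetical data
T, s = len(x), sum(x)
score = lambda th: -T / th + s / th ** 2            # first derivative of L
hessian = lambda th: T / th ** 2 - 2 * s / th ** 3  # second derivative of L

assert abs(newton_raphson(score, hessian, theta0=1.0) - s / T) < 1e-8
```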
In fact, one of the simplest methods of maximising the likelihood of mixture distributions is to use the EM-algorithm. We cover this later in the course; see Example 4.23 on page 117 in Davison.

The likelihood for dependent data

We mention that the likelihood for dependent data can also be constructed, though often the estimation and the asymptotic properties can be a lot harder to derive. Using Bayes' rule (i.e.
$P(A_1,A_2,\ldots,A_T) = P(A_1)\prod_{i=2}^{T}P(A_i\mid A_{i-1},\ldots,A_1)$) we have

$$L_T(X;\theta) = f(X_1;\theta)\prod_{t=2}^{T}f(X_t\mid X_{t-1},\ldots,X_1;\theta).$$

Under certain conditions on $\{X_t\}$ the structure of $\prod_{t=2}^{T}f(X_t\mid X_{t-1},\ldots,X_1;\theta)$ can be simplified. For example, if $\{X_t\}$ were Markovian, then $X_t$ conditioned on the past depends only on the most recent past observation, i.e. $f(X_t\mid X_{t-1},\ldots,X_1;\theta) = f(X_t\mid X_{t-1};\theta)$; in this case the above likelihood reduces to

$$L_T(X;\theta) = f(X_1;\theta)\prod_{t=2}^{T}f(X_t\mid X_{t-1};\theta). \tag{4.1}$$

Example A lot of the material we cover in this class will be for independent observations; however, likelihood methods also work for dependent observations. Consider the AR(1) time series

$$X_t = aX_{t-1} + \varepsilon_t,$$

where $\varepsilon_t$ are iid random variables with mean zero. We will assume that $|a|<1$. We see from the above that the observation $X_{t-1}$ has a linear influence on the next observation, and that the process is Markovian: given $X_{t-1}$, the random variable $X_{t-2}$ has no influence on $X_t$ (to see this, consider the distribution function $P(X_t\le x\mid X_{t-1},X_{t-2})$). Therefore, by using (4.1), the likelihood of $\{X_t\}$ is

$$L_T(X;a) = f(X_1;a)\prod_{t=2}^{T}f_\varepsilon(X_t - aX_{t-1}), \tag{4.2}$$

where $f_\varepsilon$ is the density of $\varepsilon$ and $f(X_1;a)$ is the marginal density of $X_1$. This means the likelihood of $\{X_t\}$ depends only on $f_\varepsilon$ and the marginal density of $X_1$. We use $\hat{a}_T = \arg\max_a L_T(X;a)$ as the mle of $a$. Often we ignore the term $f(X_1;a)$, because it is often hard to obtain (try to work it out; it is relatively easy in the Gaussian case), and consider instead what is called the conditional likelihood

$$Q_T(X;a) = \prod_{t=2}^{T}f_\varepsilon(X_t - aX_{t-1}),$$

with $\tilde{a}_T = \arg\max_a Q_T(X;a)$ as the quasi-mle of $a$.

Exercise: What is the conditional likelihood proportional to in the case that $\{\varepsilon_t\}$ are Gaussian random variables with mean zero?

It should be mentioned that often the conditional likelihood is derived as if the errors $\{\varepsilon_t\}$ were Gaussian, even if they are not. This is often called the quasi- or pseudo-likelihood.
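A short simulation illustrates the conditional likelihood in practice. In the Gaussian case, maximising $Q_T(X;a)$ reduces to least squares in $a$ (which partly answers the exercise above); the sketch below simulates an AR(1) with a hypothetical coefficient and checks that the resulting quasi-mle is close to the truth.

```python
import random

random.seed(4)
a, T = 0.6, 5000   # hypothetical AR(1) coefficient and sample size

# Simulate an AR(1) with iid Gaussian errors.
x = [0.0]
for _ in range(T):
    x.append(a * x[-1] + random.gauss(0.0, 1.0))

# With Gaussian errors, maximising the conditional likelihood Q_T(X; a)
# is least squares: a_hat = sum X_t X_{t-1} / sum X_{t-1}^2.
num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
a_hat = num / den
assert abs(a_hat - a) < 0.05
```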
4.1.2 A quick review of the central limit theorem

In this section we will not endeavour to prove the central limit theorem, whose proof is usually based on showing that the characteristic function (a close cousin of the moment generating function) of the average converges to the characteristic function of the normal distribution. However, we will recall the general statement of the CLT and generalisations of it. The purpose of this section is not to lumber you with unnecessary mathematics, but to help you understand when an estimator is close to normal or not.

Lemma (The famous CLT) Let us suppose that $\{X_t\}$ are iid random variables, with $\mu = E(X_t) < \infty$ and $\sigma^2 = \operatorname{var}(X_t) < \infty$. Define $\bar{X} = \frac{1}{T}\sum_{t=1}^{T}X_t$. Then we have

$$\sqrt{T}(\bar{X}-\mu) \xrightarrow{D} N(0,\sigma^2), \quad\text{or alternatively}\quad \bar{X}-\mu \overset{D}{\approx} N\left(0,\frac{\sigma^2}{T}\right).$$

What this means is that if we have a large enough sample size and plot the histogram of several replications of the average, this should be close to normal.

Remark

(i) The above lemma appears to be restricted to just averages. However, it can be used in several different contexts: averages arise in several different situations, not just as the average of the observations. By judicious algebraic manipulations, one can show that several estimators can be rewritten as an average, or approximately as an average. At first appearance, the mle of the Weibull parameters given earlier does not look like an average; however, when we consider general maximum likelihood estimators below, we will show that they can be rewritten approximately as an average, hence the CLT applies to them too.

(ii) The CLT can be extended in several ways: (a) to random variables whose variances are not all the same (i.e. independent but not identically distributed random variables); (b) to dependent random variables (so long as the dependency decays in some way); (c) to not just averages but weighted averages too (so long as the weights behave in a certain way; the weights should be distributed well over all the random variables). I.e.
suppose that {X_t} are iid random variables. Then it is clear that the sum Σ_{t=1}^{10} X_t will never be normal unless the X_t themselves are normal — observe that 10 is fixed! — but it seems plausible that (1/√n) Σ_{t=1}^n sin(2πt/12) X_t is asymptotically normal, despite this not being the sum of iid random variables.
There exist several theorems which one can use to prove normality. But the take-home message is: look at your estimator and ask whether asymptotic normality looks plausible — you can even check it through simulations.

Example (some problem cases). One should think a little before blindly applying the CLT. Suppose that the iid random variables {X_t} follow a t-distribution with 2 degrees of freedom, i.e. the density function is

f(x) = (Γ(3/2)/√(2π)) (1 + x²/2)^{−3/2}.

Let X̄ = (1/n) Σ_{t=1}^n X_t denote the sample mean. It is well known that the mean of the t-distribution with two degrees of freedom exists, but the variance does not (the distribution is too thick-tailed). Thus the assumptions required for the CLT to hold are violated and X̄ is not asymptotically normal (in fact it follows a stable-law distribution). Intuitively this is clear: recall that the chance of outliers for a t-distribution with a small number of degrees of freedom is large. This prevents even averages from being well behaved — there is a non-negligible chance that an average is also too large or too small.

To see why the variance is infinite, study the form of the t-distribution with two degrees of freedom. For the variance to be finite, the tails of the distribution should converge to zero fast enough; in other words, the probability of outliers should not be too large. The tails of this t-distribution behave like f(x) ≈ C x^{−3} for large x (make a plot in Maple to check), thus the second moment satisfies

E(X²) ≥ ∫_M^∞ C x^{−3} x² dx = ∫_M^∞ C x^{−1} dx = ∞

for some C and M, which is clearly not finite! This argument can be made precise.

The Taylor series expansion — the statistician's tool

The Taylor series is used all over the place in statistics and you should be completely fluent with using it. It can be used to prove consistency of an estimator, to establish normality (based on the assumption that averages converge to a normal distribution), to obtain the limiting variance of an estimator, and so on. We start by demonstrating its use for the log-likelihood.
We recall that the mean value theorem in the univariate case states that

f(x) = f(x_0) + (x − x_0) f′(x̄_1),
f(x) = f(x_0) + (x − x_0) f′(x_0) + ((x − x_0)²/2) f″(x̄_2),

where x̄_1 and x̄_2 both lie between x and x_0. In the case that f is a multivariate function, we have

f(x) = f(x_0) + (x − x_0)′ ∇f(x)|_{x = x̄_1},
f(x) = f(x_0) + (x − x_0)′ ∇f(x)|_{x = x_0} + (1/2)(x − x_0)′ ∇²f(x)|_{x = x̄_2} (x − x_0),
where x̄_1 and x̄_2 both lie between x and x_0.

In the case that f(x) is a vector, the mean value theorem does not directly work: strictly speaking we cannot say that f(x) = f(x_0) + (x − x_0)′ ∇f(x)|_{x = x̄_1}, where a single x̄_1 lies between x and x_0. However, it is quite straightforward to overcome this inconvenience. The mean value theorem does hold pointwise, for every element of the vector f(x) = (f_1(x), ..., f_d(x)); that is, for every 1 ≤ i ≤ d we have

f_i(x) = f_i(x_0) + (x − x_0)′ ∇f_i(x)|_{x = x̄_i},

where x̄_i lies between x and x_0. Thus, if ∇f_i(x)|_{x = x̄_i} ≈ ∇f_i(x)|_{x = x_0}, we do have f(x) ≈ f(x_0) + (x − x_0)′ ∇f(x)|_{x = x_0}. We use this below.

Application 1 (an expression for L_T(θ̂_T) − L_T(θ_0) in terms of θ̂_T − θ_0). The expansion of L_T(θ̂_T) about θ_0 (the true parameter) gives

L_T(θ_0) − L_T(θ̂_T) = (∂L_T(θ)/∂θ)|_{θ̂_T} (θ_0 − θ̂_T) + (1/2)(θ_0 − θ̂_T)′ (∂²L_T(θ)/∂θ²)|_{θ̄_T} (θ_0 − θ̂_T),

where θ̄_T lies between θ_0 and θ̂_T. If θ̂_T lies in the interior of the parameter space (this is an extremely important assumption here), then (∂L_T(θ)/∂θ)|_{θ̂_T} = 0. Moreover, if it can be shown that θ̂_T →P θ_0 (we show this in the section below), then under certain conditions on L_T (such as the existence of the third derivative, etc.) it can be shown that

(∂²L_T(θ)/∂θ²)|_{θ̄_T} ≈ E[(∂²L_T(θ)/∂θ²)|_{θ_0}] = −I(θ_0).

Hence the above is roughly

2( L_T(θ̂_T) − L_T(θ_0) ) ≈ (θ̂_T − θ_0)′ I(θ_0) (θ̂_T − θ_0).

Note that in many of the derivations below we will use (∂²L_T(θ)/∂θ²)|_{θ̄_T} ≈ E[(∂²L_T(θ)/∂θ²)|_{θ_0}] = −I(θ_0). But it should be noted that this is only true if (i) θ̂_T →P θ_0 and (ii) (1/T)(∂²L_T(θ)/∂θ²) converges uniformly to its expectation. We consider below another closely related application.
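The quadratic approximation in Application 1 can be checked numerically in a model where everything is explicit. The following Python sketch is not from the notes: it uses the iid exponential model (mean θ, MLE θ̂ = X̄, Fisher information I(θ) = T/θ²); the numerical values θ_0 = 2 and T = 2000 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# iid Exp(theta) sample (mean theta); the MLE is the sample mean and the
# total Fisher information is I(theta) = T / theta^2.
theta0, T = 2.0, 2000
x = rng.exponential(scale=theta0, size=T)

def loglik(theta):
    # log-likelihood L_T(theta) = -T log(theta) - sum(X_t)/theta
    return -T * np.log(theta) - x.sum() / theta

theta_hat = x.mean()
lhs = 2 * (loglik(theta_hat) - loglik(theta0))            # exact LR quantity
rhs = T * (theta_hat - theta0) ** 2 / theta0 ** 2          # quadratic approx.
print(round(lhs, 3), round(rhs, 3))
```

For large T the two numbers are close, as the Taylor expansion predicts; the gap is of relative order |θ̂ − θ_0|/θ_0.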
Application 2 (an expression for θ̂_T − θ_0 in terms of (∂L_T(θ)/∂θ)|_{θ_0}). The expansion of the p-dimensional vector (∂L_T(θ)/∂θ)|_{θ̂_T} pointwise about θ_0 (the true parameter) gives, for each component 1 ≤ i ≤ p,

(∂L_T(θ)/∂θ_i)|_{θ̂_T} = (∂L_T(θ)/∂θ_i)|_{θ_0} + (∂²L_T(θ)/∂θ∂θ_i)|_{θ̄_{i,T}}′ (θ̂_T − θ_0).

Since (∂L_T(θ)/∂θ)|_{θ̂_T} = 0, by using the same argument as in Application 1 we have

(∂L_T(θ)/∂θ)|_{θ_0} ≈ I(θ_0)(θ̂_T − θ_0).

We mention that U_T(θ_0) = (∂L_T(θ)/∂θ)|_{θ_0} is often called the score (or U-statistic), and we see that the asymptotic sampling properties of U_T determine the sampling properties of θ̂_T − θ_0.

Example (the Weibull). Evaluate the second derivative of the likelihood given in Example 4.1.3 and take its expectation, I(θ,α) = −E(∇²L_T), where ∇² denotes the matrix of second derivatives with respect to the parameters α and θ. Exercise: evaluate I(θ,α). Application 2 implies that the maximum likelihood estimators θ̂_T and α̂_T (recall that no explicit expression for them exists) can be written as

( θ̂_T − θ, α̂_T − α )′ ≈ I(θ,α)^{−1} Σ_{t=1}^T ( −α/θ + α Y_t^α/θ^{α+1},  1/α + log Y_t − log θ − (Y_t/θ)^α log(Y_t/θ) )′.

Sampling properties of the maximum likelihood estimator (see also Davison, p. 118)

These proofs will not be examined, but you should have some idea why the theorem below is true.

We have shown that under certain conditions the maximum likelihood estimator can be the minimum variance unbiased estimator (for example, in the case of the exponential family of distributions). However, for finite samples the MLE may not attain the Cramer-Rao lower bound; hence for finite samples var(θ̂_T) > I(θ)^{−1}. It can be shown, though, that asymptotically the variance of the MLE attains the Cramer-Rao bound: for large samples, the variance of the MLE is close to it. We will prove the result in the case that L_T is the log-likelihood of independent, identically distributed random variables. The proof can be generalised to the case of non-identically distributed random variables. We first state sufficient conditions for this to be true.

Assumption 4.1.1 (Regularity Conditions 2). Let {X_t} be iid random variables with density f(x;θ).
(i) Suppose the conditions in Assumption (Regularity Conditions 1) hold.

(ii) (Almost sure uniform convergence; this part is optional.) We have

sup_{θ ∈ Θ} | (1/T) L_T(X;θ) − E[(1/T) L_T(X;θ)] | →a.s. 0.

We mention that directly verifying uniform convergence can be difficult. However, it can be established by showing that the parameter space is compact, together with pointwise convergence of the likelihood to its expectation and almost sure equicontinuity (in probability).

(iii) (Model identifiability.) For every θ ∈ Θ there does not exist another θ̃ ∈ Θ such that f(x;θ) = f(x;θ̃) for all x.

(iv) The parameter space Θ is compact (and finite dimensional).

(v) sup_{θ} E| (1/T) L_T(X;θ) | < ∞.

We require Assumption 4.1.1(ii),(iii) to show consistency and Assumption 4.1.1(i)–(v) to show asymptotic normality.

Theorem. Suppose Assumption 4.1.1(ii),(iii) holds. Let θ_0 be the true parameter and θ̂_T the MLE. Then we have θ̂_T →a.s. θ_0 (consistency).

PROOF. To prove the result we first need to show that the expected log-likelihood is maximised at the true parameter and that this maximum is unique. In other words, we need to show that

E[(1/T) L_T(X;θ)] − E[(1/T) L_T(X;θ_0)] ≤ 0 for all θ ∈ Θ.

To do this, we have

E[(1/T) L_T(X;θ)] − E[(1/T) L_T(X;θ_0)] = ∫ log( f(x;θ)/f(x;θ_0) ) f(x;θ_0) dx = E[ log( f(X;θ)/f(X;θ_0) ) ].

Now by using Jensen's inequality we have

E[ log( f(X;θ)/f(X;θ_0) ) ] ≤ log E[ f(X;θ)/f(X;θ_0) ] = log ∫ f(x;θ) dx = 0.

Thus E[(1/T) L_T(X;θ)] − E[(1/T) L_T(X;θ_0)] ≤ 0. To prove that equality holds only when θ = θ_0, we use the identifiability assumption, Assumption 4.1.1(iii), which states that f(x;θ) = f(x;θ_0) for all x only when θ = θ_0, and no other parameter value gives equality.
Hence E[(1/T) L_T(X;θ)] is uniquely maximised at θ_0.

Finally, we need to show that θ̂_T →a.s. θ_0. By Assumption 4.1.1(ii) (and the LLN) we have, for all θ ∈ Θ, that (1/T) L_T(X;θ) →a.s. ℓ(θ) := E[(1/T) L_T(X;θ)]. Consider the decomposition

E[(1/T) L_T(X;θ_0)] − E[(1/T) L_T(X;θ̂_T)]
= { E[(1/T) L_T(X;θ_0)] − (1/T) L_T(X;θ_0) } + { (1/T) L_T(X;θ_0) − (1/T) L_T(X;θ̂_T) } + { (1/T) L_T(X;θ̂_T) − E[(1/T) L_T(X;θ̂_T)] }.

The left-hand side is non-negative (θ_0 maximises the expected log-likelihood), and the middle term on the right-hand side is non-positive (θ̂_T maximises L_T). Therefore

0 ≤ E[(1/T) L_T(X;θ_0)] − E[(1/T) L_T(X;θ̂_T)] ≤ 2 sup_{θ ∈ Θ} | E[(1/T) L_T(X;θ)] − (1/T) L_T(X;θ) | →a.s. 0,

by Assumption 4.1.1(ii). Since ℓ(θ) has a unique maximum at θ_0, this implies θ̂_T →a.s. θ_0. Hence we have shown consistency of the MLE. We now need to show asymptotic normality.

Theorem. Suppose Assumption 4.1.1 is satisfied.

(i) The score statistic satisfies

(1/√T) (∂L_T(X;θ)/∂θ)|_{θ_0} →D N( 0, E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ] ).   (4.5)

(ii) The MLE satisfies

√T (θ̂_T − θ_0) →D N( 0, E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ]^{−1} ).
(iii) The log-likelihood ratio satisfies

2( L_T(X;θ̂_T) − L_T(X;θ_0) ) →D χ²_p.

PROOF. First we prove (i). We recall that because {X_t} are iid random variables,

(1/√T)(∂L_T(X;θ)/∂θ)|_{θ_0} = (1/√T) Σ_{t=1}^T (∂log f(X_t;θ)/∂θ)|_{θ_0}.

Hence (∂L_T(X;θ)/∂θ)|_{θ_0} is a sum of iid random variables with mean zero and variance var( (∂log f(X_t;θ)/∂θ)|_{θ_0} ). Therefore, by the CLT for iid random variables, we have (4.5).

We use (i) and the Taylor (mean value) theorem to prove (ii). We first note that by the mean value theorem we have

0 = (1/√T)(∂L_T(X;θ)/∂θ)|_{θ̂_T} = (1/√T)(∂L_T(X;θ)/∂θ)|_{θ_0} + √T(θ̂_T − θ_0) (1/T)(∂²L_T(X;θ)/∂θ²)|_{θ̄_T}.   (4.6)

Now it can be shown (because Θ has compact support, θ̂_T − θ_0 →a.s. 0 and the expectation of the third derivative of L_T is bounded) that

(1/T)(∂²L_T(X;θ)/∂θ²)|_{θ̄_T} →P (1/T) E[(∂²L_T(X;θ)/∂θ²)|_{θ_0}] = E[(∂²log f(X;θ)/∂θ²)|_{θ_0}].   (4.7)

Substituting (4.7) into (4.6) gives

√T(θ̂_T − θ_0) = −( (1/T)(∂²L_T(X;θ)/∂θ²)|_{θ̄_T} )^{−1} (1/√T)(∂L_T(X;θ)/∂θ)|_{θ_0}
= −E[(∂²log f(X;θ)/∂θ²)|_{θ_0}]^{−1} (1/√T)(∂L_T(X;θ)/∂θ)|_{θ_0} + o_p(1).

We mention that the proof above is for univariate (∂²L_T(X;θ)/∂θ²)|_{θ̄_T}, but by redoing the above steps pointwise it easily generalises to the multivariate case. Hence, by substituting (4.5) into the above, we have (ii). It is straightforward to prove (iii) by using

2( L_T(X;θ̂_T) − L_T(X;θ_0) ) ≈ (θ̂_T − θ_0)′ I(θ_0) (θ̂_T − θ_0),

together with (i) and the result that if X ~ N(0,Σ), then AX ~ N(0, A′ΣA).

Example (the Weibull). By the earlier example we have

( θ̂_T − θ, α̂_T − α )′ ≈ I(θ,α)^{−1} Σ_{t=1}^T ( −α/θ + α Y_t^α/θ^{α+1},  1/α + log Y_t − log θ − (Y_t/θ)^α log(Y_t/θ) )′.
Now we observe that the right-hand side consists of a sum of iid random variables (this can be viewed as an average). Since the variance of this sum exists (you can show that it is I(θ,α)), the CLT can be applied and we have

( θ̂_T − θ, α̂_T − α )′ →D N( 0, I(θ,α)^{−1} ).

Remark. (i) We recall that for iid random variables the Fisher information for sample size T is

I(θ) = E[ ((∂log L_T(X;θ)/∂θ)|_{θ_0})² ] = T E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ].

Hence, comparing with the above theorem, we see that for iid random variables, so long as the regularity conditions are satisfied, the MLE asymptotically attains the Cramer-Rao bound, even if for finite samples this is not true. Moreover, since

θ̂_T − θ_0 ≈ I(θ_0)^{−1} (∂L_T(θ)/∂θ)|_{θ_0}

and var( (∂L_T(θ)/∂θ)|_{θ_0} ) = I(θ_0) = O(T), it can be seen that θ̂_T − θ_0 = O_p(T^{−1/2}).

(ii) Under suitable conditions a similar result holds true for data which is not iid.

In summary, the MLE under certain regularity conditions tends to have the smallest variance, and for large samples the variance is close to the lower bound, which is the Cramer-Rao bound. In the case that Assumption 4.1.1 is satisfied, the MLE is said to be asymptotically efficient. This means that for finite samples the MLE may not attain the Cramer-Rao bound, but asymptotically it will.

(iii) A simple application of the above theorem is the derivation of the distribution of I(θ_0)^{1/2}(θ̂_T − θ_0). It is clear that by using the theorem we have

I(θ_0)^{1/2}(θ̂_T − θ_0) →D N(0, I_p),

where I_p is the identity matrix, and hence

(θ̂_T − θ_0)′ I(θ_0) (θ̂_T − θ_0) →D χ²_p.

(iv) Note that these results apply when θ_0 lies inside the parameter space Θ; as θ_0 gets closer to the boundary of the parameter space, they break down.
Remark (generalised estimating equations). Closely related to the MLE are generalised estimating equations (GEE), which are related to the score statistic. These are estimators not based on maximising the likelihood; instead they are obtained by equating a score-like statistic (the derivative of a likelihood) to zero and solving for the unknown parameters. Often they are equivalent to the MLE, but they can be adapted to be useful in their own right, and some adaptations will not be the derivative of any likelihood.

The Fisher information (see also Section 4.3, Davison)

Let us return to the Fisher information. We recall that under certain regularity conditions an unbiased estimator θ̃(X) of a parameter θ_0 satisfies var(θ̃(X)) ≥ I(θ_0)^{−1}, where

I(θ) = E[ (∂L_T(θ)/∂θ)² ] = −E[ ∂²L_T(θ)/∂θ² ]

is the Fisher information. Furthermore, under suitable regularity conditions, the MLE will asymptotically attain this bound. It is reasonable to ask how one can interpret this bound.

(i) Situation 1: I(θ_0) = −E[(∂²L_T(θ)/∂θ²)|_{θ_0}] is large (hence the variance of the MLE will be small). This means that the gradient of ∂L_T(θ)/∂θ is steep. Hence, even for small deviations from θ_0, ∂L_T(θ)/∂θ is likely to be far from zero. This means the MLE θ̂_T is likely to lie in a close neighbourhood of θ_0.

(ii) Situation 2: I(θ_0) = −E[(∂²L_T(θ)/∂θ²)|_{θ_0}] is small (hence the variance of the MLE will be large). In this case the gradient of the likelihood ∂L_T(θ)/∂θ is flatter, and hence ∂L_T(θ)/∂θ ≈ 0 over a large neighbourhood of the true parameter θ_0. Therefore the MLE θ̂_T can lie in a large neighbourhood of θ_0.

This is one explanation as to why I(θ) is called the Fisher information: it contains information on how close any estimator of θ can be. Look at the censoring example, Example 4.20, page 112, Davison.
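The claim that the Fisher information controls how concentrated the MLE is can be checked by Monte Carlo. The sketch below is not from the notes: it uses the iid exponential model, for which the MLE is the sample mean and the total information is I(θ) = T/θ², so var(θ̂) should be close to θ²/T. The values θ_0 = 2, T = 200 and the replication count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)

# Monte Carlo check that var(theta_hat) is close to the inverse Fisher
# information.  For an Exp(theta) sample of size T the MLE is the sample
# mean, and I(theta) = T / theta^2, so var(theta_hat) ~ theta^2 / T.
theta0, T, reps = 2.0, 200, 20000
theta_hats = rng.exponential(scale=theta0, size=(reps, T)).mean(axis=1)
print(round(theta_hats.var(), 4), round(theta0 ** 2 / T, 4))
```

The two printed numbers nearly agree; shrinking θ_0 (which increases I(θ)) shrinks the empirical variance in exactly the way Situation 1 describes.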
Chapter 5

Confidence Intervals

5.1 Confidence Intervals and testing

We first summarise the results of the previous section which will be useful in this section. For convenience, we will assume that the likelihood is that of iid random variables whose density is f(x;θ_0) (it is relatively simple to see how this generalises to general likelihoods, of not necessarily iid random variables). Let us suppose that θ_0 is the true parameter that we wish to estimate. Based on the asymptotic normality theorem of the previous chapter we have

√T(θ̂_T − θ_0) →D N( 0, E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ]^{−1} ),   (5.1)

(1/√T)(∂L_T/∂θ)|_{θ = θ_0} →D N( 0, E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ] ),   (5.2)

and

2( L_T(θ̂_T) − L_T(θ_0) ) →D χ²_p,   (5.3)

where p is the number of parameters in the vector θ. Using any of (5.1), (5.2) and (5.3) we can construct a 95% CI for θ_0.

Constructing confidence intervals using the likelihood (see also Section 4.5, Davison)

One of the main reasons that we show asymptotic normality of an estimator (it is usually not possible to derive normality for finite samples) is to construct confidence intervals (CIs) and to test.
In the case that θ_0 is a scalar (a vector of dimension one), it is easy to use (5.1) to obtain

√T E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ]^{1/2} (θ̂_T − θ_0) →D N(0,1).   (5.4)

Based on the above, the 95% CI for θ_0 is

[ θ̂_T − (1/√T) E[((∂log f(X;θ)/∂θ)|_{θ_0})²]^{−1/2} z_{α/2},  θ̂_T + (1/√T) E[((∂log f(X;θ)/∂θ)|_{θ_0})²]^{−1/2} z_{α/2} ].

The above, of course, requires an estimate of the standardised Fisher information E[((∂log f(X;θ)/∂θ)|_{θ_0})²] = −E[(∂²log f(X;θ)/∂θ²)|_{θ_0}]. Usually we evaluate the second derivative of the log-likelihood (1/T) L_T(θ) and replace θ with the estimator θ̂_T.

Exercise: Use (5.2) to construct a CI for θ_0 based on the score.

The CI constructed above works well if θ is a scalar. But beyond dimension one, constructing a CI based on (5.1) and the p-dimensional normal is extremely difficult. More precisely, if θ_0 is a p-dimensional vector, then the analogous version of (5.4) is

√T E[ (∇log f(X;θ)|_{θ_0})(∇log f(X;θ)|_{θ_0})′ ]^{1/2} (θ̂_T − θ_0) →D N(0, I_p),

and using this it is difficult to obtain a CI for θ_0. One way to construct the CI is to "square" θ̂_T − θ_0 and use

T (θ̂_T − θ_0)′ E[ (∇log f(X;θ)|_{θ_0})(∇log f(X;θ)|_{θ_0})′ ] (θ̂_T − θ_0) →D χ²_p.   (5.5)

Based on the above, a 95% CI is

{ θ : T(θ̂_T − θ)′ E[ (∇log f(X;θ)|_{θ_0})(∇log f(X;θ)|_{θ_0})′ ] (θ̂_T − θ) ≤ χ²_p(0.95) }.   (5.6)

Note that, as in the scalar case, this leads to the interval with the smallest length. A disadvantage of (5.6) is that we have to (a) estimate the information matrix and (b) find all θ such that the above holds; this can be quite unwieldy.

An alternative method, which is asymptotically equivalent to the above but removes the need to estimate the information matrix, is to use (5.3). By (5.3), a 100(1−α)% CI for θ_0 is

{ θ : 2( L_T(θ̂_T) − L_T(θ) ) ≤ χ²_p(1−α) }.   (5.7)

The above is not easy to calculate, but it is feasible.
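The two constructions above can be compared numerically in a model where the log-likelihood is explicit. The following Python sketch is not from the notes: it computes both a Wald-type interval of the form (5.6) (scalar case) and a likelihood-ratio interval of the form (5.7) for the mean of an exponential sample; the values θ_0 = 2, T = 50 and the grid are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# 95% CIs for the mean theta of an Exp(theta) sample, two ways:
#  (a) Wald: theta_hat +/- 1.96 * theta_hat / sqrt(T), using the
#      estimated per-observation Fisher information 1/theta_hat^2;
#  (b) likelihood ratio: all theta with 2(L(theta_hat)-L(theta)) <= 3.84.
theta0, T = 2.0, 50
x = rng.exponential(scale=theta0, size=T)
theta_hat = x.mean()

def loglik(theta):
    return -T * np.log(theta) - x.sum() / theta

# (a) Wald interval (symmetric about the MLE by construction)
half = 1.96 * theta_hat / np.sqrt(T)
wald = (theta_hat - half, theta_hat + half)

# (b) likelihood-ratio interval, found on a fine grid of theta values
grid = np.linspace(0.4 * theta_hat, 3.0 * theta_hat, 100000)
inside = grid[2 * (loglik(theta_hat) - loglik(grid)) <= 3.84]
lr = (inside[0], inside[-1])
print("Wald:", wald)
print("LR:  ", lr)   # asymmetric: extends further to the right
```

For the exponential the likelihood falls off faster below the MLE than above it, so the likelihood-ratio interval is right-skewed about θ̂ while the Wald interval is forced to be symmetric.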
Example. In the case that θ_0 is a scalar, the 95% CI based on (5.7) is

{ θ : L_T(θ) ≥ L_T(θ̂_T) − (1/2) χ²_1(0.95) }.

Both 95% CIs, (5.6) and (5.7), will be very close for relatively large sample sizes. However, one advantage of using (5.7) instead of (5.6) is that it is easier to evaluate — there is no need to obtain the second derivative of the likelihood, etc. Another feature which differentiates the CIs in (5.6) and (5.7) is that the CI based on (5.6) is symmetric about θ̂_T (recall that (X̄ − 1.96σ/√T, X̄ + 1.96σ/√T) is symmetric about X̄), whereas for finite sample sizes the CI for θ_0 based on (5.7) need not be symmetric; since this asymmetry reflects the shape of the likelihood, it is a positive advantage of (5.7) over (5.6). A disadvantage of using (5.7) instead of (5.6) is that the CI based on (5.7) may sometimes consist of more than one interval.

As you can see, if the dimension of θ is large it is quite difficult to evaluate the CI (try it for the simple case that the dimension is two!). Indeed, for dimensions greater than three it is extremely hard. However, in most cases we are only interested in constructing CIs for certain parameters of interest; the other unknown parameters are simply nuisance parameters, and CIs for them are not of interest. For example, for the normal distribution we may only be interested in a CI for the mean but not the variance. It is clear that directly using the log-likelihood ratio to construct CIs (and also to test) would mean also constructing CIs for the nuisance parameters. Therefore, in Chapter 6 we construct a variant of the likelihood, called the profile likelihood, which allows us to deal with nuisance parameters in a more efficient way.

Testing using the likelihood

Let us suppose we wish to test the hypothesis H_0: θ = θ_0 against the alternative H_A: θ ≠ θ_0. We can use any of the results (5.1), (5.2) and (5.3) to do the test; they will lead to slightly different p-values, but asymptotically they are all equivalent, because they are all based on essentially the same derivation.
We now list the three tests that one can use.

The Wald test. The Wald statistic is based on (5.1). We recall from (5.1) that if the null is true, then we have

√T(θ̂_T − θ_0) →D N( 0, E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ]^{−1} ).

Thus we can use as the test statistic

T_1 = √T E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ]^{1/2} (θ̂_T − θ_0) →D N(0,1).

Let us now consider how the test statistic behaves under the alternative H_A: θ = θ_1. If the null is not true, then we have

θ̂_T − θ_0 = (θ̂_T − θ_1) + (θ_1 − θ_0) ≈ I(θ_1)^{−1} Σ_t (∂log f(X_t;θ)/∂θ)|_{θ_1} + (θ_1 − θ_0).

Thus the distribution of the test statistic T_1 becomes centred about √T E[((∂log f(X;θ)/∂θ)|_{θ_0})²]^{1/2} (θ_1 − θ_0). Hence, the larger the sample size, the more likely we are to reject the null.

Remark (types of alternatives). In the case that the alternative is fixed, it is clear that the power of the test goes to 100%. Therefore, to see the effectiveness of the test, one often lets the alternative get closer to the null as T → ∞. For example:

- Suppose θ_1 = θ_0 + δ/T. Then the centre of T_1 is √T E[((∂log f(X;θ)/∂θ)|_{θ_0})²]^{1/2} δ/T → 0. Thus the alternative is too close to the null for us to discriminate between the two.

- Suppose θ_1 = θ_0 + δ/√T. Then the centre of T_1 is E[((∂log f(X;θ)/∂θ)|_{θ_0})²]^{1/2} δ. Therefore the test does have power, but it is not 100%.

In the case that the dimension of θ is greater than one, we instead use the test statistic

T_1 = T(θ̂_T − θ_0)′ E[ (∇log f(X;θ)|_{θ_0})(∇log f(X;θ)|_{θ_0})′ ] (θ̂_T − θ_0),

noting that under the null its distribution is chi-squared with p degrees of freedom.

The score test. The score test is based on the score. We recall from (5.2) that under the null the distribution of the score is

(1/√T)(∂L_T/∂θ)|_{θ = θ_0} →D N( 0, E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ] ).

Thus we use as the test statistic

T_2 = (1/√T) E[ ((∂log f(X;θ)/∂θ)|_{θ_0})² ]^{−1/2} (∂L_T/∂θ)|_{θ = θ_0} →D N(0,1).

An advantage of this test is that the maximum likelihood estimator does not have to be calculated, under either the null or the alternative.

Exercise: What does the test statistic look like under the alternative?
The log-likelihood ratio test. Probably one of the most popular tests is the log-likelihood ratio test. This test is based on (5.3), and the test statistic is

T_3 = 2( L_T(θ̂_T) − L_T(θ_0) ) →D χ²_p.

An advantage of this test statistic is that it is pivotal, in the sense that the Fisher information etc. does not have to be calculated; only the maximum likelihood estimator is needed.

Exercise: What does the test statistic look like under the alternative?

Applications of the log-likelihood ratio to the multinomial distribution

Example (the multinomial distribution). This is a generalisation of the binomial distribution. In this case, at any given trial, m different events can arise (in the binomial case m = 2). Let Z_i denote the outcome of the i-th trial and assume P(Z_i = k) = π_k, where π_1 + ... + π_m = 1. Suppose n trials are conducted, and let Y_1 denote the number of times event 1 arises, Y_2 the number of times event 2 arises, and so on. Then it is straightforward to show that

P(Y_1 = k_1, ..., Y_m = k_m) = ( n choose k_1, ..., k_m ) ∏_{i=1}^m π_i^{k_i}.

If we do not impose any constraints on the probabilities {π_i}, it is straightforward (and very intuitive) to derive the MLE of {π_i} given {Y_i}. Noting that π_m = 1 − Σ_{i=1}^{m−1} π_i, the log-likelihood of the multinomial is proportional to

L_T(π) = Σ_{i=1}^{m−1} y_i log π_i + y_m log( 1 − Σ_{i=1}^{m−1} π_i ).

Differentiating the above with respect to π_i and solving gives the MLE π̂_i = Y_i/n, which is what we would have expected! We observe that though there are m probabilities, due to the constraint π_m = 1 − Σ_{i=1}^{m−1} π_i we only have to estimate m−1 of them.

We mention that the same estimators can also be obtained by using Lagrange multipliers, that is, maximising L_T(π) subject to the parameter constraint Σ_{i=1}^m π_i = 1. To enforce this constraint, we add an additional term to L_T(π) and include the dummy variable λ. That is, we define the constrained likelihood

L_T(π,λ) = Σ_{i=1}^m y_i log π_i + λ( Σ_{i=1}^m π_i − 1 ).
Now, if we maximise L_T(π,λ) with respect to {π_i}_{i=1}^m and λ, we obtain the estimators π̂_i = Y_i/n, the same as the maximisers of L_T(π).

To derive the limiting distribution, we note that the second derivative is, for 1 ≤ i, j ≤ m−1,

∂²L_T(π)/∂π_i∂π_j = −(y_i/π_i²) 1(i = j) − y_m / (1 − Σ_{r=1}^{m−1} π_r)².

Hence, taking expectations, the information matrix is the (m−1)×(m−1) matrix

I(π) = n [ diag(1/π_1, ..., 1/π_{m−1}) + (1/π_m) 11′ ],

i.e. the matrix with entries 1/π_i + 1/π_m on the diagonal and 1/π_m off the diagonal, multiplied by n. Provided no π_i equals 0 or 1 (which would drop the dimension below m and make I(π) singular), the asymptotic distribution of the MLE is normal with variance I(π)^{−1}.

Sometimes the probabilities {π_i} are not free but are determined by a parameter θ, where θ is an r-dimensional vector with r < m, i.e. π_i = π_i(θ). In this case the likelihood of the multinomial is proportional to

L_T(θ) = Σ_{i=1}^{m−1} y_i log π_i(θ) + y_m log( 1 − Σ_{i=1}^{m−1} π_i(θ) ).

Differentiating the above with respect to θ and solving gives the MLE.

Pearson's goodness of fit test. We now derive Pearson's goodness of fit test using the log-likelihood ratio, though Pearson did not use this method to derive his test. Suppose the null is H_0: π_1 = π̃_1, ..., π_m = π̃_m, where {π̃_i} are some pre-set probabilities, and H_A: the probabilities are not the given probabilities. Hence we are testing a restricted model (where we do not have to estimate anything) against the full model (where we estimate the probabilities using π̂_i = Y_i/n). The log-likelihood ratio in this case is

W = 2( max_π L_T(π) − L_T(π̃) ).

Under the null, W →D χ²_{m−1}, because we have to estimate m−1 parameters under the full model. We now derive an expression for W and show that the Pearson
statistic is an approximation of it:

W = 2[ Σ_{i=1}^m Y_i log(Y_i/n) − Σ_{i=1}^m Y_i log π̃_i ] = 2 Σ_{i=1}^m Y_i log( Y_i/(n π̃_i) ).

Recall that Y_i is often called the observed count (O_i = Y_i) and n π̃_i the expected count under the null (E_i = n π̃_i). Then

W = 2 Σ_{i=1}^m O_i log(O_i/E_i) →D χ²_{m−1}.

By using, for x close to a, the Taylor expansion of x log(x/a) about x = a,

x log(x/a) ≈ (x − a) + (1/2)(x − a)²/a.

We let O_i = x and E_i = a; then, assuming the null is true (so E_i ≈ O_i), we have

W = 2 Σ_{i=1}^m Y_i log( Y_i/(n π̃_i) ) ≈ 2 Σ_{i=1}^m [ (O_i − E_i) + (1/2)(O_i − E_i)²/E_i ].

Now we note that Σ_{i=1}^m E_i = Σ_{i=1}^m O_i = n, so the first sum vanishes and the above reduces to

W ≈ Σ_{i=1}^m (O_i − E_i)²/E_i →D χ²_{m−1}.

We recall that the right-hand side is the Pearson test statistic. Hence this is one method for deriving the Pearson chi-squared test for goodness of fit. By using a similar argument, we can also obtain the test statistic of the chi-squared test for independence (and an explanation for its rather strange number of degrees of freedom!).
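The closeness of the log-likelihood ratio statistic W and its Pearson approximation can be seen directly by simulation. The sketch below is not from the notes; the probabilities, n = 10000 and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate a multinomial under the null and compare the log-likelihood
# ratio statistic W with the Pearson statistic X^2; both are
# approximately chi^2_{m-1}, and for large n they nearly coincide.
pi0 = np.array([0.2, 0.3, 0.1, 0.4])
n = 10000
obs = rng.multinomial(n, pi0)      # observed counts O_i
exp_ = n * pi0                     # expected counts E_i under the null
W = 2 * np.sum(obs * np.log(obs / exp_))
X2 = np.sum((obs - exp_) ** 2 / exp_)
print(round(W, 3), round(X2, 3))
```

The gap between the two statistics is of order max_i |O_i − E_i|/E_i relative to their common size, exactly the error of the Taylor expansion used above.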
Chapter 6

The Profile Likelihood

6.1 The Profile Likelihood (see also Section 4.5.2, Davison)

The method of profiling

Let us suppose that the unknown parameter θ can be partitioned as θ = (ψ, λ), where ψ is the p-dimensional parameter of interest (e.g. the mean) and λ is the q-dimensional nuisance parameter (e.g. the variance). We will need to estimate both ψ and λ, but our interest is in testing only the parameter ψ, without any information on λ, and in constructing confidence intervals for ψ without constructing unnecessary confidence intervals for λ (confidence intervals for a large number of parameters are wider than those for a few parameters). To achieve this one often uses the profile likelihood. To motivate the profile likelihood, we first describe a method to estimate ψ and λ in two stages and consider some examples.

Let us suppose that {X_t} are iid random variables with density f(x; ψ, λ), where our objective is to estimate ψ and λ. In this case the log-likelihood is

L_T(ψ,λ) = Σ_{t=1}^T log f(X_t; ψ, λ).

To estimate ψ and λ one can use (ψ̂_T, λ̂_T) = argmax_{ψ,λ} L_T(ψ,λ). However, this can be quite difficult and can lead to expressions which are hard to maximise. Instead, let us consider a different method, which may sometimes be easier to evaluate. Suppose, for now, that ψ is known; then we rewrite the likelihood as L_T(ψ,λ) = L_ψ(λ), to emphasise that ψ is fixed while λ varies. To estimate λ we maximise L_ψ(λ) with respect to λ, i.e.

λ̂_ψ = argmax_λ L_ψ(λ).
In reality ψ is unknown; hence for each ψ we can evaluate λ̂_ψ. Note that for each ψ we have a new curve L_ψ(λ) over λ. Now, to estimate ψ, we evaluate the maximum of L_ψ(λ) over λ, and choose the ψ which maximises over all these curves. In other words, we evaluate

ψ̂_T = argmax_ψ L_ψ(λ̂_ψ) = argmax_ψ L_T(ψ, λ̂_ψ).

A bit of logical deduction shows that ψ̂_T and λ̂_{ψ̂_T} are the maximum likelihood estimators, (ψ̂_T, λ̂_T) = argmax_{ψ,λ} L_T(ψ,λ). We note that we have profiled out the nuisance parameter λ, and the likelihood L_ψ(λ̂_ψ) = L_T(ψ, λ̂_ψ) is completely in terms of the parameter of interest ψ. The advantage of this is best illustrated through some examples.

Example. Let us suppose that {Y_t} are iid random variables from a Weibull distribution with density f(y; α, θ) = (α y^{α−1}/θ^α) exp(−(y/θ)^α). We know from Example 4.1.2 that if α were known, an explicit expression for the MLE of θ could be derived:

θ̂_α = argmax_θ L_α(θ) = argmax_θ Σ_{t=1}^T [ log α + (α−1) log Y_t − α log θ − (Y_t/θ)^α ] = ( (1/T) Σ_{t=1}^T Y_t^α )^{1/α},

where L_α(θ) = Σ_{t=1}^T [ log α + (α−1) log Y_t − α log θ − (Y_t/θ)^α ]. Thus for a given α, the maximum likelihood estimator of θ can be derived. The maximum likelihood estimator of α is then

α̂_T = argmax_α L_T(α, θ̂_α) = argmax_α { T log α + (α−1) Σ_{t=1}^T log Y_t − T log( (1/T) Σ_{t=1}^T Y_t^α ) − T },

where we have used α log θ̂_α = log((1/T) Σ_t Y_t^α) and Σ_t (Y_t/θ̂_α)^α = T. Therefore, the maximum likelihood estimator of θ is ( (1/T) Σ_t Y_t^{α̂_T} )^{1/α̂_T}. We observe that evaluating α̂_T can be tricky, but no worse than maximising the likelihood L_T(α,θ) over α and θ jointly.

As we mentioned above, we often do not have any interest in the nuisance parameter λ and are only interested in testing and constructing CIs for ψ. In this case, we are interested in the limiting distribution of the MLE ψ̂_T. This can easily be derived by observing that

√T( ψ̂_T − ψ, λ̂_T − λ )′ →D N( 0, ( I_ψψ  I_ψλ ; I_λψ  I_λλ )^{−1} ),
where

( I_ψψ  I_ψλ ; I_λψ  I_λλ ) = ( −E[∂²log f(X_t;ψ,λ)/∂ψ²]   −E[∂²log f(X_t;ψ,λ)/∂ψ∂λ] ; −E[∂²log f(X_t;ψ,λ)/∂λ∂ψ]   −E[∂²log f(X_t;ψ,λ)/∂λ²] ).   (6.1)

To derive an exact expression for the limiting variance of √T(ψ̂_T − ψ), we note that the inverse of a block matrix is

( A  B ; C  D )^{−1} = ( (A − B D^{−1} C)^{−1}   −A^{−1} B (D − C A^{−1} B)^{−1} ; −D^{−1} C (A − B D^{−1} C)^{−1}   (D − C A^{−1} B)^{−1} ).

Thus the above implies that

√T(ψ̂_T − ψ) →D N( 0, (I_ψψ − I_ψλ I_λλ^{−1} I_λψ)^{−1} ).

Thus, if ψ is a scalar, we can easily use the above to construct confidence intervals for ψ.

Exercise: How would you estimate I_ψψ − I_ψλ I_λλ^{−1} I_λψ?

The score and the log-likelihood ratio for the profile likelihood

To ease notation, let us suppose that ψ_0 and λ_0 are the true parameters in the distribution. The above gives us the limiting distribution of ψ̂_T − ψ_0, which allows us to test ψ; however, the test ignores any dependence that may exist with the nuisance parameter estimator λ̂_T. An alternative test, which circumvents this issue, is a log-likelihood ratio test of the type

2( max_{ψ,λ} L_T(ψ,λ) − max_λ L_T(ψ_0,λ) ).   (6.2)

However, deriving the limiting distribution of this statistic is a little more complicated than for the log-likelihood ratio test without nuisance parameters, because a direct Taylor expansion does not work. We observe that

2( max_{ψ,λ} L_T(ψ,λ) − max_λ L_T(ψ_0,λ) ) = 2( max_{ψ,λ} L_T(ψ,λ) − L_T(ψ_0,λ_0) ) − 2( max_λ L_T(ψ_0,λ) − L_T(ψ_0,λ_0) ),

and we will show below that by using a few Taylor expansions we can derive the limiting distribution of (6.2). In the theorem below we derive the distribution of the (profile) score and the nested log-likelihood ratio. Please note: you do not have to learn this proof.

Theorem 6.1.1. Suppose Assumption 4.1.1 holds, and suppose that ψ_0, λ_0 are the true parameters. Then we have

(∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} ≈ (∂L_T(ψ,λ)/∂ψ)|_{ψ_0,λ_0} − I_ψλ I_λλ^{−1} (∂L_T(ψ,λ)/∂λ)|_{ψ_0,λ_0},   (6.3)
and

(1/√T)(∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} →D N( 0, I_ψψ − I_ψλ I_λλ^{−1} I_λψ ),   (6.4)

2( L_T(ψ̂_T, λ̂_T) − L_T(ψ_0, λ̂_{ψ_0}) ) →D χ²_p,   (6.5)

where I is defined as in (6.1).

PROOF. We first prove (6.3), which is the basis of the proofs of (6.4) and (6.5); in the remark below we try to interpret (6.3). To avoid notational difficulties, by considering the elements of the vectors (∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} and (∂L_T(ψ,λ)/∂λ)|_{λ_0,ψ_0} as discussed earlier, we will suppose that these are univariate random variables.

Our objective is to find an expression for (∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} in terms of (∂L_T(ψ,λ)/∂λ)|_{λ_0,ψ_0} and (∂L_T(ψ,λ)/∂ψ)|_{λ_0,ψ_0}, which will allow us to obtain its variance and asymptotic distribution easily. Now, making a Taylor expansion of (∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} about (∂L_T(ψ,λ)/∂ψ)|_{λ_0,ψ_0} gives

(∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} ≈ (∂L_T(ψ,λ)/∂ψ)|_{λ_0,ψ_0} + (∂²L_T(ψ,λ)/∂λ∂ψ)|_{λ_0,ψ_0} (λ̂_{ψ_0} − λ_0).

Notice that we have used ≈ instead of = because we replace the second derivative at an intermediate point by its value at the true parameters. Now, if the sample size is large enough, then we can say that

(∂²L_T(ψ,λ)/∂λ∂ψ)|_{λ_0,ψ_0} ≈ E[(∂²L_T(ψ,λ)/∂λ∂ψ)|_{λ_0,ψ_0}].

To see why this is true, consider the case of iid random variables; then

(1/T)(∂²L_T(ψ,λ)/∂λ∂ψ) = (1/T) Σ_{t=1}^T ∂²log f(X_t;ψ,λ)/∂λ∂ψ ≈ E[∂²log f(X_t;ψ,λ)/∂λ∂ψ].

Therefore we have

(∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} ≈ (∂L_T(ψ,λ)/∂ψ)|_{λ_0,ψ_0} − T I_ψλ (λ̂_{ψ_0} − λ_0).   (6.6)

Hence we have the first part of the decomposition of (∂L_T(ψ_0,λ)/∂ψ)|_{λ̂_{ψ_0}} into quantities whose distribution is known; now we need to decompose λ̂_{ψ_0} − λ_0 into known distributions. We first recall that since λ̂_{ψ_0} = argmax_λ L_T(ψ_0,λ), then

(∂L_T(ψ_0,λ)/∂λ)|_{λ̂_{ψ_0}} = 0,

as long as the parameter space is large enough and the maximum is not on the boundary. Therefore, making a Taylor expansion of (∂L_T(ψ_0,λ)/∂λ)|_{λ̂_{ψ_0}} about (∂L_T(ψ_0,λ)/∂λ)|_{λ_0,ψ_0} gives

(∂L_T(ψ_0,λ)/∂λ)|_{λ̂_{ψ_0}} ≈ (∂L_T(ψ_0,λ)/∂λ)|_{λ_0,ψ_0} + (∂²L_T(ψ_0,λ)/∂λ²)|_{λ_0,ψ_0} (λ̂_{ψ_0} − λ_0).
Again using the same trick as in (6.6), we have

0 = (∂L_T(ψ_0,λ)/∂λ)|_{λ̂_{ψ_0}} ≈ (∂L_T(ψ_0,λ)/∂λ)|_{λ_0,ψ_0} − T I_λλ (λ̂_{ψ_0} − λ_0).

Therefore

λ̂_{ψ_0} − λ_0 ≈ (T I_λλ)^{−1} (∂L_T(ψ_0,λ)/∂λ)|_{λ_0,ψ_0}.   (6.7)

Substituting (6.7) into (6.6) gives

(∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0} ≈ (∂L_T(ψ,λ)/∂ψ)|_{λ_0,ψ_0} − I_ψλ I_λλ^{−1} (∂L_T(ψ_0,λ)/∂λ)|_{ψ_0,λ_0},

which is (6.3).

To prove (6.4), i.e. obtain the asymptotic distribution and limiting variance of (∂L_T(ψ,λ)/∂ψ)|_{λ̂_{ψ_0},ψ_0}, we recall that the regular score function satisfies

(1/√T)( (∂L_T(ψ,λ)/∂ψ)|_{ψ_0,λ_0}, (∂L_T(ψ,λ)/∂λ)|_{ψ_0,λ_0} )′ →D N(0, I(θ_0)).

Substituting this into (6.3) immediately gives (6.4): the limiting variance is I_ψψ − 2 I_ψλ I_λλ^{−1} I_λψ + I_ψλ I_λλ^{−1} I_λλ I_λλ^{−1} I_λψ = I_ψψ − I_ψλ I_λλ^{−1} I_λψ.

Finally, to prove (6.5) we use the following decomposition, Taylor expansions, and the trick in (6.6):

2( L_T(ψ̂_T, λ̂_T) − L_T(ψ_0, λ̂_{ψ_0}) )
= 2( L_T(ψ̂_T, λ̂_T) − L_T(ψ_0,λ_0) ) − 2( L_T(ψ_0, λ̂_{ψ_0}) − L_T(ψ_0,λ_0) )
≈ T(θ̂_T − θ_0)′ I(θ_0)(θ̂_T − θ_0) − T(λ̂_{ψ_0} − λ_0)′ I_λλ (λ̂_{ψ_0} − λ_0),   (6.8)

where θ̂_T = (ψ̂_T, λ̂_T) is the MLE. Now we want to rewrite λ̂_{ψ_0} − λ_0 in terms of θ̂_T − θ_0. We start by recalling that from (6.7) we have

λ̂_{ψ_0} − λ_0 ≈ (T I_λλ)^{−1} (∂L_T(ψ_0,λ)/∂λ)|_{λ_0,ψ_0}.

Now we rewrite (∂L_T(ψ_0,λ)/∂λ)|_{λ_0,ψ_0} in terms of θ̂_T − θ_0 by using

0 = (∂L_T(θ)/∂θ)|_{θ̂_T} ≈ (∂L_T(θ)/∂θ)|_{θ_0} − T I(θ_0)(θ̂_T − θ_0),  so that  (∂L_T(θ)/∂θ)|_{θ_0} ≈ T I(θ_0)(θ̂_T − θ_0).

Therefore, concentrating on the subvector (∂L_T(θ)/∂λ)|_{ψ_0,λ_0}, we see that

(∂L_T(θ)/∂λ)|_{ψ_0,λ_0} ≈ T[ I_λψ (ψ̂ − ψ_0) + I_λλ (λ̂ − λ_0) ].   (6.9)
48 Substituting 6.9 into 6.7 gives ˆλ ψ0 λ 0 I 1 λλ I λψˆψ ψ 0 +ˆλ λ 0. Finally substituting the above into 6.8 and making lots of cancellations we have { } 2 L T ˆψ T,ˆλ T L T ψ 0,ˆλ ψ0 Tˆψ ψ 0 I ψψ I ψλ I 1 λ,λ I λ,ψˆψ ψ 0. Finally, since TˆθT θ 0 D N0,Iθ 1, by using inversion formulas for block matrices we have that Tˆψ ψ 0 I ψλ I 1 λ,λ I λ,ψ 1, which gives the desired result. D N0,I ψψ Remark i We first make the rather interesting observation. The limiting variance of L Tψ,λ ψ ψ0,λ 0 is I ψψ, whereas the the limiting variance of L Tψ,λ ψ ˆλψ0,ψ 0 is I ψψ I ψλ I 1 λ,λ I λ,ψ and the limiting variance of Tˆψ ψ 0 is I ψψ I ψλ I 1 λ,λ I λ,ψ 1. ii Look again at the expression L T ψ,λ ψ ˆλψ0,ψ 0 L Tψ,λ ψ λ0,ψ 0 I ψλ I 1 λλ L T ψ 0,λ λ0,ψ λ It is useful to understand where it came from. Consider the problem of linear regression. Suppose X and Y are random variables and we want to construct the best linear predictor of Y given X. We know that the best linear predictor is ŶX = EXY/EY 2 X and the residual and mean squared error is EXY Y ŶX = Y EY 2 X and E Y EXY 2 EY 2 X = EY 2 EXYEY 2 1 EXY. Compare this expression with We see that in some sense L Tψ,λ ψ ˆλψ0,ψ 0 can be treated as the residual error of the projection of L Tψ,λ ψ λ0,ψ 0 onto L Tψ 0,λ λ lambda0,ψ 0. This is quite surprising! We now aim to use the above result. It is immediately clear that 6.5 can be used for both constructing likelihoods and testing. For example, to construct a 95% CI for ψ we can use the mle ˆθ T = ˆψ T,ˆλ T and the profile likelihood and use the 95% CI { { ψ;2 L T ˆψ T,ˆλ T L T ψ,ˆλ ψ } } χ 2 p0.95. As you can see by profiling out the parameter λ, we have avoided the need to also construct a CI for λ too. This has many advantages, from a practical perspective it reduced the dimension of the parameters. 48
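The profile-likelihood confidence interval described above is easy to compute numerically. The following sketch is illustrative only (a normal sample with hypothetical mean 2 and standard deviation 1.5, profiling out the nuisance variance): it traces 2{L_T(ψ̂_T,λ̂_T) − L_T(ψ, λ̂_ψ)} over a grid of ψ values and keeps those below the χ²₁ 95% quantile.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # hypothetical data
T = len(x)

def profile_loglik(mu):
    # profile out the nuisance variance: sigma2_hat(mu) = mean((x - mu)^2)
    s2 = np.mean((x - mu) ** 2)
    return -0.5 * T * np.log(s2) - 0.5 * T

mu_hat = x.mean()                    # maximiser of the profile likelihood
crit = chi2.ppf(0.95, df=1)          # p = 1 parameter of interest
grid = np.linspace(mu_hat - 1.0, mu_hat + 1.0, 2001)
inside = [m for m in grid
          if 2 * (profile_loglik(mu_hat) - profile_loglik(m)) <= crit]
ci = (min(inside), max(inside))
print(ci)
```

Note how the nuisance parameter never appears in the interval construction: it is re-maximised at every grid point through the closed form for σ̂²(μ).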
The log-likelihood ratio test in the presence of nuisance parameters

An application of Theorem 6.1.1 is nested hypothesis testing, as stated at the beginning of this section. (6.5) can be used to test H_0: ψ = ψ_0 against H_A: ψ ≠ ψ_0, since

2{ max_(ψ,λ) L_T(ψ,λ) − max_λ L_T(ψ_0,λ) } →D χ²_p.

Example (χ²-test for independence). It is worth noting that using the profile likelihood one can derive the chi-squared test for independence in much the same way that the Pearson goodness-of-fit test was derived using the log-likelihood ratio test. Do this as an exercise (see Davison, Example 4.37, page 135).

The score test in the presence of nuisance parameters

We recall that we used Theorem 6.1.1 to obtain the distribution of 2{max_(ψ,λ) L_T(ψ,λ) − max_λ L_T(ψ_0,λ)} under the null; we now motivate an alternative test of the same hypothesis which uses the same theorem. We recall that under the null H_0: ψ = ψ_0 the derivative ∂L_T(ψ,λ)/∂λ |_(λ̂_ψ0,ψ0) = 0, but the same is not true of ∂L_T(ψ,λ)/∂ψ |_(λ̂_ψ0,ψ0). However, if the null is true we would expect λ̂_ψ0 to be close to the true λ_0, and ∂L_T(ψ,λ)/∂ψ |_(λ̂_ψ0,ψ0) to be close to zero. Indeed, this is what we showed in (6.4), where under the null

(1/√T) ∂L_T(ψ,λ)/∂ψ |_(λ̂_ψ0,ψ0) →D N(0, I_ψψ − I_ψλ I_λλ^{-1} I_λψ),   (6.11)

where λ̂_ψ0 = argmax_λ L_T(ψ_0,λ).

Therefore (6.11) suggests an alternative test of H_0: ψ = ψ_0 against H_A: ψ ≠ ψ_0: we can use (1/√T) ∂L_T(ψ,λ)/∂ψ |_(λ̂_ψ0,ψ0) as the test statistic. This is called the score or LM test. The log-likelihood ratio test and the score test are asymptotically equivalent. There are advantages and disadvantages to both:

(i) An advantage of the log-likelihood ratio test is that we do not need to calculate the information matrix.

(ii) An advantage of the score test is that we do not have to evaluate the maximum likelihood estimates under the alternative model.

Examples

Example: an application of profiling to frequency estimation

Question
50 Suppose that the observations {X t ;t = 1,...,T} satisfy the following nonlinear regression model X t = Acosωt+Bsinωt+ε t where {ε t } are iid standard normal random variables and 0 < ω < π. The parameters A,B, and ω are real and unknown. Some useful identities are given at the end of the question. i Ignoring constants, obtain the log-likelihood of {X t }. Denote this likelihood as ii Let L T A,B,ω. T S T A,B,ω = Xt 2 2 Show that 2L T A,B,ω+S T A,B,ω = A2 B X t Acosωt+Bsinωt + 2 TA2 +B 2. cos2ω+ab sin2ω. Thus show that L T A,B,ω+ 1 2 S TA,B,ω = O1 ie. the difference does not grow with T. Since L T A,B,ω and 1 2 S TA,B,ω are asymptotically equivalent, for the rest of this question, use 1 2 S TA,B,ω instead of the likelihood L T A,B,ω. iii Obtain the profile likelihood of ω. hint: ProfileouttheparametersAandB,toshowthat ˆω T = argmax ω T X texpitω 2. Suggest, a graphical method for evaluating ˆω T? iv By using the identity expiωt = exp 1 2 it+1ωsin1 2 TΩ sin 1 2 Ω 0 < Ω < 2π T Ω = 0 or 2π show that for 0 < Ω < 2π we have tcosωt = OT t 2 cosωt = OT 2 tsinωt = OT t 2 sinωt = OT 2. 50
51 v By using the results in part iv show that the Fisher Information of L T A,B,ω denoted as IA, B, ω is asymptotically equivalent to 2IA,B,ω = E 2 S T = ω 2 T T B +OT T 0 2 T2 2 A+OT T 2 2 B +OT T2 2 A+OT T 3 3 A2 +B 2 +OT 2. vi Derive the asymptotic variance of maximum likelihood estimator, ˆω T, derived in part iv. Comment on the rate of convergence of ˆω T. Useful information: In this question the following quantities may be useful: expiωt = exp 1 2 it+1ωsin1 2 TΩ sin 1 2 Ω 0 < Ω < 2π T Ω = 0 or 2π the trignometric identities: sin2ω = 2sinΩcosΩ, cos2ω = 2cos 2 Ω 1 = 1 2sin 2 Ω, expiω = cosω+isinω and t = TT +1 2 t 2 = TT +12T Solution i Since {ε t } are standard normal iid random variables the likelihood is L T A,B,ω = 1 2 X t Acosωt Bsinωt 2. 51
ii It is straightforward to show that

−2 L_T(A,B,ω) = Σ_t X_t² − 2 Σ_t X_t (A cos ωt + B sin ωt) + Σ_t (A cos ωt + B sin ωt)²
= Σ_t X_t² − 2 Σ_t X_t (A cos ωt + B sin ωt) + A² Σ_t cos²ωt + B² Σ_t sin²ωt + 2AB Σ_t sin ωt cos ωt
= Σ_t X_t² − 2 Σ_t X_t (A cos ωt + B sin ωt) + (A²/2) Σ_t (1 + cos 2ωt) + (B²/2) Σ_t (1 − cos 2ωt) + AB Σ_t sin 2ωt
= S_T(A,B,ω) + ((A² − B²)/2) Σ_t cos 2ωt + AB Σ_t sin 2ωt.

Now by using (6.13) we have −2 L_T(A,B,ω) = S_T(A,B,ω) + O(1), as required.

iii To obtain the profile likelihood, let us suppose that ω is known. Then the estimators of A and B minimising (1/2) S_T are

Â_T(ω) = (2/T) Σ_t X_t cos ωt,  B̂_T(ω) = (2/T) Σ_t X_t sin ωt.

Thus the profile criterion, using the approximation S_T, is

−(1/2) S_p(ω) = −(1/2) Σ_t X_t² + 2 Σ_t X_t ( Â_T(ω) cos ωt + B̂_T(ω) sin ωt )/2 − (T/4)( Â_T(ω)² + B̂_T(ω)² )
= −(1/2) Σ_t X_t² + (T/4) [ Â_T(ω)² + B̂_T(ω)² ].
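The profile criterion just derived is, up to constants, the periodogram, which also gives the graphical method asked for in part (iii): plot I_T(ω) = |Σ_t X_t e^{itω}|²/T over a frequency grid and read off the peak. A simulation sketch, with hypothetical values A = 1, B = 0.5 and ω = 1.3 (not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
A, B, omega = 1.0, 0.5, 1.3                # hypothetical true values
t = np.arange(1, T + 1)
X = A * np.cos(omega * t) + B * np.sin(omega * t) + rng.standard_normal(T)

# periodogram I_T(w) = |sum_t X_t exp(i t w)|^2 / T on a frequency grid
grid = np.linspace(0.01, np.pi - 0.01, 2000)
I_T = np.abs(np.exp(1j * np.outer(grid, t)) @ X) ** 2 / T
omega_hat = grid[np.argmax(I_T)]           # the profile-likelihood estimator
print(omega_hat)
```

In practice the grid search is refined near the peak; the fast rate of convergence derived in part (vi) means even a coarse grid localises ω̂_T well.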
53 Thus the ω which maximises 1 2 S pω is the parameter that maximises ÂTω 2 + ˆB T ω 2. Since ÂTω 2 + ˆB T ω 2 = 1 2T T X texpitω, we have as required. ˆω T = argmax 1/2S pω = argmax ÂT ω 2 + ˆB T ω 2 ω ω = argmax T X t expitω 2, ω iv Differentiating both sides of6.12 with respect to Ω and considering the real and imaginary terms gives T tcosωt = OT T tsinωt = OT. Differentiating both sides of 6.12 twice wrt to Ω gives the second term. v DifferentiatingS T A,B,ω = T X2 t 2 T X t Acosωt+Bsinωt TA2 +B 2 twice wrt to A,B and ω gives and 2 S T A 2 = T, 2 S T B 2 = T, S T T A = 2 X t cosωt+at S T T B = 2 X t sinωt+bt S T T ω = 2 AX t tsinωt 2 2 S T A B = 0, 2 S T T ω A = 2 X t tsinωt 2 S T T ω B = 2 X t tcosωt BX t tcosωt. 2 S T T ω 2 = 2 t 2 X t Acosωt+Bsinωt. Now taking expectations of the above and using v we have E 2 S T T ω A = 2 tsinωt Acosωt+Bsinωt = 2B = B tsin 2 ωt+2 At sinωt cosωt t1 cos2ωt+a tsin2ωt = 53 TT +1 B +OT = B T OT.
54 Using a similar argument we can show that E 2 S T ω B = AT2 2 +OT and E 2 S T T ω 2 = 2 2 t Acosωt+Bsinωt 2 = A 2 +B 2 TT +12T +1 6 Since E 2 L T 1 2 E 2 S T, this gives the required result. +OT 2 = A 2 +B 2 T 3 /3+OT 2. vi Noting that the asymptotic variance for the profile likelihood estimator ˆω T by subsituting vi into the above we have I ω,ω I ω,ab I 1 A,B I BA,ω 1, A 2 +B T 3 +OT A 2 +B 2 T 3 Thus we observe that the asymptotic variance of ˆω T is OT 3. Typically estimators have a variance of order OT 1, so we see that the estimator ˆω T variance which converges to zero, much faster. Thus the estimator is extremely good compared with the majority of parameter estimators. Example: An application of profiling in survival analysis Question This question also uses some methods from Survival Analysis which is covered later in this course - see Sections 13.1 and Let T i denote the survival time of an electrical component. It is known that the regressors x i influence the survival time T i. To model the influence the regressors have on the survival time the Cox-proportional hazard model is used with the exponential distribution as the baseline distribution and ψx i ;β = expβx i as the link function. More precisely the survival function of T i is F i t = F 0 t ψxi;β, where F 0 t = exp t/θ. Not all the survival times of the electrical components are observed, and there can arise censoring. Hence we observe Y i = mint i,c i, where c i is the censoring time and δ i, where δ i is the indicator variable, where δ i = 0 denotes censoring of the ith component and δ i = 1 denotes that it is not censored. The parameters β and θ are unknown. 54
i Derive the log-likelihood of {Y_i, δ_i}.

ii Compute the profile likelihood of the regression parameters β, profiling out the baseline parameter θ.

Solution

i The survival function and the density are

f_i(t) = ψ(x_i;β) [ F̄_0(t) ]^{ψ(x_i;β) − 1} f_0(t)  and  F̄_i(t) = F̄_0(t)^{ψ(x_i;β)}.

Hence for this example we have

log f_i(t) = log ψ(x_i;β) − (ψ(x_i;β) − 1) t/θ − log θ − t/θ = log ψ(x_i;β) − log θ − ψ(x_i;β) t/θ,
log F̄_i(t) = −ψ(x_i;β) t/θ.

Therefore, writing Y_i for the observed time, the log-likelihood is

L_n(β,θ) = Σ_i δ_i { log ψ(x_i;β) + log f_0(Y_i) + (ψ(x_i;β) − 1) log F̄_0(Y_i) } + Σ_i (1 − δ_i) ψ(x_i;β) log F̄_0(Y_i)
= Σ_i δ_i { log ψ(x_i;β) − log θ } − Σ_i ψ(x_i;β) Y_i/θ.

ii Keeping β fixed, differentiating the above with respect to θ and equating to zero gives

∂L_n/∂θ = −Σ_i δ_i/θ + Σ_i ψ(x_i;β) Y_i/θ² = 0, so θ̂(β) = Σ_i ψ(x_i;β) Y_i / Σ_i δ_i.

Hence the profile likelihood is

l_P(β) = Σ_i δ_i { log ψ(x_i;β) − log θ̂(β) } − Σ_i ψ(x_i;β) Y_i / θ̂(β).

Hence to obtain an estimator of β we maximise the above with respect to β.
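The two-step scheme in part (ii), plug the closed-form θ̂(β) back in and maximise over β, can be sketched numerically. Everything below is illustrative only (hypothetical values β = 0.8, θ = 2, and an assumed exponential censoring mechanism), not part of the question:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
n, beta_true, theta_true = 400, 0.8, 2.0   # hypothetical values
x = rng.normal(size=n)
# F-bar_i(t) = exp(-t * psi_i / theta) with psi_i = exp(beta * x_i),
# so the survival time T_i is exponential with mean theta / psi_i
Tsurv = rng.exponential(theta_true / np.exp(beta_true * x))
c = rng.exponential(3.0, size=n)           # censoring times (assumption)
Y = np.minimum(Tsurv, c)
delta = (Tsurv <= c).astype(float)

def neg_profile(beta):
    psi = np.exp(beta * x)                               # psi(x_i; beta)
    theta_hat = np.sum(psi * Y) / np.sum(delta)          # profiled-out theta(beta)
    return -(np.sum(delta * (beta * x - np.log(theta_hat)))
             - np.sum(psi * Y) / theta_hat)

beta_hat = minimize_scalar(neg_profile, bounds=(-3.0, 3.0), method="bounded").x
print(beta_hat)
```

The one-dimensional search over β is all that is needed; the baseline parameter is re-estimated in closed form at every candidate β.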
56 An application of profiling in semi-parametric regression We now consider how the profile likelihood we use inverted commas here because we do not use the likelihood, but least squares instead can be used in semi-parametric regression. Recently this type of method has been used widely in various semi-parametric models. This section needs a little knowledge of nonparametric regression, which is considered later in this course. Suppose we observe Y t,u t,x t where Y t = βx t +φu t +ε t, Y t,x t,u t are iid random variables and φ is an unknown function. To estimate β, we first profile out φ, which we estimate as if β were known. In other other words, we suppose that β is known and let Y t β = Y t βx t. We then estimate φ using the classic local least estimator, in other words the φ which minimises the criterion ˆφ β u = argmin W b u U t Y t β a 2 t = W bu U t Y t β a t t W bu U t = tw bu U t Y t t t W β W bu U t X t bu U t t W bu U t := G b u βh b u, 6.14 where G b u = tw bu U t Y t t W bu U t and H b u = t W bu U t X t t W bu U t. Thus, given β the estimator of φ and the residuals ε t are ˆφ β u = G b u βh b u and Y t βx t ˆφ β U t. Given the estimated residuals Y t βx t ˆφ β U t we can now use least squares to estimate coefficient β, where L T β = t Yt βx t ˆφ β U t 2 = t Yt βx t G b U t +βh b U t 2 = t Yt G b U t β[x t H b U t ] 2. Therefore, the least squares estimator of β is t ˆβ b,t = [Y t G b U t ][X t H b U t ] t [X t H b U t ] 2. Using β b,t we can then estimate We observe how we have the used the principle of profiling to estimate the unknown parameters. There is a large literature on this, including 56
57 Wahba, Speckman, Carroll, Fan etc. In particular it has been shown that under some conditions on b as T, the estimator ˆβ b,t has the usual T rate of convergence. U t = t T It should be mentioned that using random regressors U t are not necessary. It could be that on a grid. In this case ˆφ β u = argmin a t W b u t T Y tβ a 2 = t W bu t T Y tβ t W bu t T = t W b u t T Y t β t W b u U t X t := G b u βh b u, 6.15 where G b u = t W b u t T Y t and H b u = t W b u t T X t. Using the above estimator of φ we continue as before. 57
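The profiling steps in (6.14) and the resulting least-squares estimator β̂_{b,T} can be sketched in a few lines. The simulation below is illustrative only (hypothetical function φ(u) = cos 2πu, Gaussian kernel, bandwidth b = 0.1; none of these choices come from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
T, beta_true, b = 1000, 1.5, 0.1           # hypothetical values; b is the bandwidth
U = rng.uniform(size=T)
X = rng.normal(size=T) + np.sin(2 * np.pi * U)   # regressor correlated with U
phi = np.cos(2 * np.pi * U)                       # hypothetical phi(u)
Y = beta_true * X + phi + 0.5 * rng.standard_normal(T)

def smooth(Z):
    # Nadaraya-Watson smoother of Z on U, evaluated at each U_t (Gaussian kernel)
    W = np.exp(-0.5 * ((U[:, None] - U[None, :]) / b) ** 2)
    return (W @ Z) / W.sum(axis=1)

G, H = smooth(Y), smooth(X)                # G_b(U_t) and H_b(U_t) as in (6.14)
beta_hat = np.sum((Y - G) * (X - H)) / np.sum((X - H) ** 2)
print(beta_hat)
```

The estimator regresses the smoothing residuals of Y on those of X, which is exactly the projection interpretation of profiling: the nonparametric component is swept out before β is estimated.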
Chapter 7

The Delta Method (very short)

7.1 The delta method and the construction of CIs

Let us suppose that the estimator θ̂_T has the following limiting distribution:

√T(θ̂_T − θ_0) →D N(0, Σ),   (7.1)

where, for example, Σ is the inverse Fisher information. Often we want to obtain the limiting distribution of a function of this estimator. In other words, if θ̂_T is a good estimator of θ_0 it is reasonable to suppose that g(θ̂_T) is a good estimator of g(θ_0). Then the natural question to ask is what the limiting distribution of g(θ̂_T) is. As is almost always the case, we can obtain the limiting distribution using a Taylor expansion.

Lemma 7.1.1 Suppose that the derivative of g exists, is continuous and is non-zero at θ_0. Then we have

√T( g(θ̂_T) − g(θ_0) ) →D N( 0, g′(θ_0)² Σ ).

PROOF. To prove the result, we first note that since g′ is continuous and θ̂_T →P θ_0, we have by the continuous mapping theorem that g′(θ̂_T) →P g′(θ_0). In this case we can make a Taylor expansion of g(θ̂_T) about θ_0 to obtain

√T( g(θ̂_T) − g(θ_0) ) = g′(θ_0) √T(θ̂_T − θ_0) + o_p(1).

Now by using (7.1) we obtain the result. □

We can use the above result to construct CIs for g(θ_0). We mention that the above result can be extended to the multivariate case.
Example

Question: Let us suppose that θ̂_T is an estimator of θ_0, and √T(θ̂_T − θ_0) →D N(0, V).

i Give an estimator of θ_0³.

ii Using a Taylor expansion of g(θ) about θ_0, obtain the asymptotic distribution of the above estimator.

Solution

i θ̂_T³.

ii By the Taylor expansion we have

g(θ̂_T) ≈ g(θ_0) + (θ̂_T − θ_0) g′(θ_0), where θ̂_T − θ_0 ≈ N(0, V/T).

Thus √T( g(θ̂_T) − g(θ_0) ) →D N(0, [g′(θ_0)]² V). Hence for this example we have

√T( θ̂_T³ − θ_0³ ) →D N(0, 9 θ_0⁴ V).
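A quick Monte Carlo check of this answer (with illustrative values θ₀ = 2 and V = 1, not from the notes): draw θ̂_T directly from its limiting normal distribution and compare the variance of √T(θ̂_T³ − θ₀³) with the delta-method prediction 9θ₀⁴V.

```python
import numpy as np

rng = np.random.default_rng(4)
theta0, V, T, reps = 2.0, 1.0, 500, 20_000   # hypothetical values
# draw theta_hat from its limiting law: sqrt(T)(theta_hat - theta0) ~ N(0, V)
theta_hat = rng.normal(theta0, np.sqrt(V / T), size=reps)
Z = np.sqrt(T) * (theta_hat ** 3 - theta0 ** 3)
print(Z.var())           # delta method predicts 9 * theta0**4 * V = 144
```

The small discrepancy from 144 comes from the higher-order Taylor terms, which are O(T^{-1/2}) and vanish as T grows.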
Chapter 8

Non-regular models

8.1 Non-regular models

Estimating the mean on the boundary

There are situations where the parameter to be estimated lies exactly on the boundary of the parameter space. In such cases the limiting distribution of the parameter estimator may not be normal with variance the inverse Fisher information. Even in the case that the parameter is very close to the boundary, very large sample sizes are required for normality to hold. In this case alternative techniques are required. I illustrate one such method for the example below, though it is worth noting that it may be hard to use such methods in more complex situations.

Suppose that X_i ~ N(μ, 1), where the mean μ is unknown. Suppose in addition that it is known that the mean is non-negative; hence the parameter space of the mean is Θ = [0, ∞). In this case X̄ can no longer always be the MLE, because there will be some instances where X̄ < 0, and it makes no sense to estimate μ with a negative value when X̄ is negative. Let us look again at the likelihood on this restricted space:

μ̂_T = argmax_{μ ∈ Θ} L_T(μ) = argmax_{μ ∈ Θ} −(1/2) Σ_t (X_t − μ)².

By the concavity of L_T(μ) with respect to μ we see that the MLE is

μ̂_T = X̄ if X̄ ≥ 0, and μ̂_T = 0 if X̄ < 0.

Hence on this restricted space ∂L_T(μ)/∂μ |_(μ̂_T) need not equal zero, and the usual Taylor expansion method cannot be used to derive normality. Indeed, we will show that the limit is not normal.
We recall that √T(X̄ − μ) →D N(0, 1), or equivalently (1/√T) ∂L_T(μ)/∂μ = √T(X̄ − μ) →D N(0, I(μ)) with I(μ) = 1. Hence if the true parameter is μ_0 = 0, then approximately half the time X̄ will be less than zero and the other half it will be greater than zero. This means that half the time μ̂_T = 0 and the other half it will be greater than zero. Therefore the distribution function of μ̂_T is

P(√T μ̂_T ≤ x) = 0 for x < 0, = 1/2 for x = 0, and = 1/2 + P(0 < √T X̄ ≤ x) for x > 0.

Now we may want to test the hypothesis H_0: μ = 0 against the hypothesis H_A: μ > 0. We would use the log-likelihood ratio W = 2{ L_T(μ̂_T) − L_T(0) }, but now it is unlikely to be a chi-squared, so we need to derive its distribution. It can be argued that under the null, half the time the constrained and unconstrained maximisers coincide, hence

2{ L_T(μ̂_T) − L_T(0) } →D (1/2) δ_0 + (1/2) χ²_1,

a 50:50 mixture of a point mass at zero and a χ²_1. I am not a big fan of this argument; I prefer to use the following. Since L_T(μ̂_T) = −(1/2) Σ_t (X_t − μ̂_T)², we have

2{ L_T(μ̂_T) − L_T(0) } = 2T μ̂_T X̄ − T μ̂_T² = 0 if X̄ ≤ 0 (P(X̄ ≤ 0) = 1/2), and = T X̄² if X̄ > 0 (P(X̄ > 0) = 1/2).

Hence we have

P( 2{ L_T(μ̂_T) − L_T(0) } ≤ x ) = 0 for x < 0, = 1/2 for x = 0, and = 1/2 + (1/2) P(χ²_1 ≤ x) for x > 0,

since T X̄² is asymptotically χ²_1. Therefore, suppose I wanted to test the hypothesis H_0: μ = 0 against H_A: μ > 0; then I would use the above log-likelihood ratio test. In other words, evaluate W = 2{ L_T(μ̂_T) − L_T(0) } and find the p-value p = 1 − [1/2 + (1/2) P(χ²_1 ≤ W)] (for W > 0); depending on it, we can see whether we are able to reject the null.
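The half-point-mass/half-χ²₁ behaviour is easy to see in a simulation. The sketch below draws X̄ under the null μ₀ = 0 and evaluates the likelihood-ratio statistic W (the sample size T = 100 and the seed are arbitrary choices):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
T, reps = 100, 50_000
xbar = rng.normal(0.0, 1.0 / np.sqrt(T), size=reps)   # X-bar under H0: mu0 = 0
W = np.where(xbar > 0, T * xbar ** 2, 0.0)            # W = 2{L_T(mu_hat) - L_T(0)}

print(np.mean(W == 0))                     # about 1/2: mu_hat sits on the boundary
# P(W <= x) = 1/2 + (1/2) P(chi2_1 <= x) for x > 0; check at the 95% chi2 point:
print(np.mean(W <= chi2.ppf(0.95, df=1)))  # about 0.975
```

A practical consequence visible here: using the usual χ²₁ critical value would make the boundary test conservative, since the correct 5%-level critical value is the 90% point of χ²₁.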
Example

Question: The survival times for disease A follow an exponential distribution, where the density has the form f(x) = λ^{-1} exp(−x/λ). Suppose that it is known that at least one third of all people who have disease A survive for more than 2 years.

i Based on the above information, derive a lower bound for λ.

ii Suppose that it is known that λ ≥ λ_0. What is the maximum likelihood estimator of λ?

iii Derive the sampling properties of the maximum likelihood estimator of λ, for the cases λ = λ_0 and λ > λ_0.

Solution

i P(X > x) = exp(−x/λ). Hence we have that P(X > 2) = exp(−2/λ) ≥ 1/3, thus λ ≥ 2/log 3.

ii The log-likelihood is L_T(λ) = −Σ_t X_t/λ − T log λ. Thus λ̂_T = argmax_{λ ∈ [λ_0,∞)} L_T(λ); noting that if the parameter space were unconstrained the maximum would arise at X̄ = (1/T) Σ_t X_t, in the constrained space

λ̂_T = λ_0 if X̄ ≤ λ_0, and λ̂_T = X̄ if X̄ > λ_0.

iii If λ > λ_0, then the true parameter does not lie on the boundary of the parameter space and for a large enough sample we have √T(λ̂_T − λ) →D N(0, var X_t). On the other hand, if λ = λ_0 then we are on the boundary. We know that √T(X̄ − λ_0) →D N(0, var X_t). From this result we see that, for large samples, there is about a 50% chance that X̄ < λ_0 and a 50% chance that X̄ ≥ λ_0. Based on this, it can be argued that the limiting distribution of λ̂_T is a 50:50 mixture: √T(λ̂_T − λ_0) = 0 with probability 1/2 (when λ̂_T = λ_0), and behaves like the positive half of a N(0, var X_t) when λ̂_T > λ_0.

Example (Example 4.39, page 140, in Davison (2002)) In this example Davison reparameterises the t-distribution. It is well known that if the number of degrees of freedom of a t-distribution is one, it is the Cauchy distribution, which has extremely thick tails such that the mean does not exist. At the other extreme, if we let the number of degrees of freedom tend
64 to, then the limit is a normal distribution where all moments exist. In this example, the t-distribution is reparameterised as fy;µ,σ 2,ψ = Γ[ 1+ψ 1 ] 2 ψ 1/21+ σ 2 π 1/2 Γ 1 2π ψy ψ 1 +1/2 µ2 σ 2 It can be shown that lim ψ 1 fy;µ,σ 2,ψ is a t-distribution with one-degree of freedom and at the other end of the spectrum lim ψ 0 fy;µ,σ 2,ψ is a normal distribution. Thus 0 < ψ 1, and the above generalisation allows for fractional orders of the t-distribution. In this example it is assumed that the random variables {X t } have the density fy;µ,σ 2,ψ, and our objective is to estimate ψ, when ψ 0, this the true parameter is on the boundary of the parameter space 0,1] it is just outside it!. Using similar, arguments to those given above, Davison shows that the limiting distribution of the MLE estimator is close to a mixture of distributions as in the above example Regularity conditions which are not satisfied The uniform distribution The standard example where the regularity conditions mainly Assumption 1.1.1ii are not satisfied is the uniform distribution fx;θ = We can see that the likelihood in this case is L T X;θ = { 1 θ 0 x θ 0 otherwise T θ 1 I0 < X t < θ. In this case the the derivative of L T X;θ is not well defined, hence we cannot solve for the derivative. Instead, to obtain the mle we try to reason what the maximum is. We should plot L T X;θ against θ and place X i on the θ axis. We can see that if θ < X i, then L T is zero. Let X i denote the ordered data X 1 X 2,... X T. We see that for θ = X T, we have L T X;θ = X T T, then beyond this point L T X;θ decays ie. L T X;θ = θ T for θ X T. Hence the maximum of the likelihood is ˆθ T = max 1 t T X t. To investigate the limiting behaviour, we need to consider the likelihood. However, since X t are iid we need only consider the density fx;θ. We observe that Assumption 1.1.1ii, is not satisfied, since we cannot exchange the integral and derivative d θ 1 dθ 0 θ dx θ 1dI0 x θ dx, dθ 64
65 hence the Cramer-Rao bound no longer necessarily holds etc. And the limit distribution does not necessarily converge to a normal with the inverse of the Fisher information. In fact you cannot use the standard methods of differentiating the likelihood to obtain the sampling properties of the estimator because the derivative at the true value is not well defined. But you can calculate the limiting distribution of ˆθ T = max 1 t T X t try it. The shifted exponential Let us consider the shifted exponential distribution fx;θ,φ = 1 θ exp x φ x φ,θ,φ > 0. θ We first observe when φ = 0 we have the usual exponential function, φ is simply a shift parameter. It is clear that since the support of the distribution function involves the parameter φ that the regularity condition Assumption 1.1.1ii will not be satisfied try it and see. This means the Cramer-Rao bound does not exist in this case and the distribution of the mle estimators of the parameters will not be normal with the inverse of the Fisher information as its variance. The likelihood for this example is L T X;θ,φ = 1 θ T T exp X t φ Iφ X t. θ We see that we cannot obtain the maximum of L T X;θ,φ by differentiating. Instead let us consider what happens to L T X;θ,φ for different values of φ. We see that for φ > X t for any t, the likelihood is zero. But at φ = X 1 smallest value, the likelihood is 1 T θ T exp X t X 1 θ. But for φ < X 1, L T X;θ,φ starts to decrease because X t φ > X t X 1, hence the likelihood decreases. Thus the MLE for φ is ˆφ T = X 1, notice that this estimator is completely independent of θ. To obtain the mle of θ, differentiate L TX;θ,φ dθ ˆφT =X and equate to zero. 1 We obtain ˆθ T = X ˆφ T. This makes sense because we recall that when φ = 0, then the MLE of θ is ˆθ T = X. We now obtain the distribution of ˆφ T φ = X 1 φ. 
To make the calculation easier we observe that X_t can be rewritten as X_t = φ + E_t, where {E_t} are random variables which have the exponential distribution f(x;θ,0) = θ^{-1} exp(−x/θ). Therefore φ̂_T − φ = min_t E_t, and its distribution function is

P(φ̂_T − φ ≤ x) = P(min_t E_t ≤ x) = 1 − P(min_t E_t > x) = 1 − [exp(−x/θ)]^T.

Therefore the density of φ̂_T − φ is (T/θ) exp(−Tx/θ); in other words, it is exponential with parameter T/θ. Hence the mean of φ̂_T − φ is θ/T (notice it goes to zero as T → ∞) and the variance is
θ²/T². Standardising, we see that the distribution of T(φ̂_T − φ) is exponential with parameter θ^{-1} (since the minimum of T iid exponentials with parameter θ^{-1} is exponential with parameter Tθ^{-1}). Hence we observe that φ̂_T is a biased estimator of φ, but the bias decreases as T → ∞. Moreover, the variance is quite amazing: unlike standard estimators, whose variance decreases at the rate 1/T, the variance of φ̂_T decreases at the rate 1/T². See Davison (2002), page 145, Example 4.43, for more details.

Example Let us suppose that {X_t} are iid exponentially distributed random variables with density f(x) = λ^{-1} exp(−x/λ). Suppose that we only observe X_t if X_t > c; otherwise X_t is not observed.

i Show that the sample mean X̄ = (1/T) Σ_t X_t is a biased estimator of λ.

ii Suppose that λ and c are unknown. Derive the log-likelihood of the observed {X_t} and the maximum likelihood estimators of λ and c.

Solution

i The observations are biased, since

E X̄ = E[X_t | X_t > c] = ∫ x f(x) I(x ≥ c) dx / P(X > c) = [ ∫_c^∞ x f(x) dx ] / e^{−c/λ} = λ (c/λ + 1) e^{−c/λ} / e^{−c/λ} = λ + c.

Note that this is quite logical.

ii We observe that the density of X_t given X_t > c is f(x | X_t > c) = f(x) I(x ≥ c)/P(X > c) = λ^{-1} exp(−(x − c)/λ) I(x ≥ c); this is a shifted exponential. Based on this, the log-likelihood is

Σ_t { log f(X_t) + log I(X_t ≥ c) − log P(X > c) } = Σ_t { −log λ − (1/λ)(X_t − c) + log I(X_t ≥ c) }.

Hence we want to find the λ and c which maximise the above. Here we can use the idea of profiling to estimate the parameters — it does not matter which parameter we profile out.
Suppose we fix λ and maximise the above with respect to c; in this case it is easier to maximise the actual likelihood

L_λ(c) = Π_t (1/λ) exp( −(X_t − c)/λ ) I(X_t ≥ c).

By sketching L_λ(c) as a function of c, we see that it is increasing in c up to min_t X_t and zero beyond it; thus the estimator of c conditional on λ is ĉ = min_t X_t. Now we can estimate λ. Putting ĉ back into the log-likelihood gives the profile likelihood

Σ_t { −log λ − (1/λ)(X_t − ĉ) + log I(X_t ≥ ĉ) }.

Differentiating the above with respect to λ and equating to zero gives Σ_t (X_t − ĉ) = λT. Thus

ĉ = min_t X_t and λ̂_T = (1/T) Σ_t (X_t − ĉ)

are the MLE estimators of c and λ respectively.
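The non-standard behaviour in this section, in particular the O(T⁻²) variance of the shifted-exponential MLE φ̂_T = X_(1), can be checked by simulation. A sketch with hypothetical values θ = 2 and φ = 1 (any positive values would do):

```python
import numpy as np

rng = np.random.default_rng(6)
theta, phi = 2.0, 1.0                      # hypothetical true values

def mle_shift_errors(T, reps=20_000):
    X = phi + rng.exponential(theta, size=(reps, T))
    return X.min(axis=1) - phi             # phi_hat - phi = min_t E_t ~ Exp(T/theta)

e50, e200 = mle_shift_errors(50), mle_shift_errors(200)
print(e50.mean(), e200.mean())             # about theta/T: 0.04 and 0.01
print(e50.var() / e200.var())              # about (200/50)^2 = 16: the 1/T^2 rate
```

Quadrupling the sample size cuts the bias by a factor of 4 but the variance by a factor of 16, in contrast with the usual 1/T rate for regular estimators.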
Chapter 9

Misspecification and the Kullback-Leibler Criterion

9.1 Assessing model fit and the Kullback-Leibler criterion

The Kullback-Leibler criterion is one method for measuring how close two densities are; alternatively, it is a means of measuring how close a conjectured density is to the true density of the observations (which in reality is never observed). Rather than define it here, it will arise naturally from the discussion below on model misspecification.

Model misspecification

Until now we have assumed that the model we are fitting to the data is the correct model and our objective is to estimate the parameter θ. In reality the model we are fitting will often not be the correct model (which is usually unknown). In this situation a natural question to ask is: what are we estimating?

Let us suppose that {X_t} are iid random variables which have the density g(x). Suppose we are fitting the family of densities {f(x;θ); θ ∈ Θ} to the data and are trying to estimate θ. The log-likelihood is L_T(θ) = Σ_t log f(X_t;θ). However, its limit will now be different due to the misspecification; using the LLN (law of large numbers) we have

(1/T) L_T(θ) →a.s. E[ log f(X_t;θ) ] = ∫ log f(x;θ) g(x) dx.   (9.1)
Therefore it is clear that θ̂_T = argmax L_T(θ) is an estimator of

θ_g = argmax_θ ∫ log f(x;θ) g(x) dx.

Hence θ̂_T is not an estimator of any true parameter; it is an estimator of the parameter which best fits the model within the specified family of models. In reality we will often be estimating the best-fitting parameter θ_g. Of course, one would like to know the limiting distribution of θ̂_T − θ_g (it will not be the same as in the correctly specified case). To obtain the limiting distribution we again use a Taylor expansion of L_T(θ) and the approximation

(1/√T) ∂L_T(θ)/∂θ |_(θ_g) ≈ (1/√T) ∂L_T(θ)/∂θ |_(θ̂_T) + i(θ_g) √T(θ̂_T − θ_g),   (9.2)

where i(θ_g) = −E[ ∂²log f(X;θ)/∂θ² |_(θ_g) ]. Now, for us to use the usual asymptotic normality theory we require the following assumption, which is not required in the correctly specified case.

Theorem 9.1.1 Suppose that {X_t} are iid random variables,

∫ ∂log f(x;θ)/∂θ |_(θ_g) g(x) dx = 0,   (9.3)

and the usual regularity conditions are satisfied (exchanging derivative and integral is allowed and the third-order derivative exists). Then we have

(1/√T) ∂L_T(θ)/∂θ |_(θ_g) →D N(0, j(θ_g)),   (9.4)

and

√T(θ̂_T − θ_g) →D N( 0, i(θ_g)^{-1} j(θ_g) i(θ_g)^{-1} ),   (9.5)

where

i(θ_g) = −E[ ∂²log f(X;θ)/∂θ² |_(θ_g) ] = −∫ ∂²log f(x;θ)/∂θ² |_(θ_g) g(x) dx,
j(θ_g) = E[ ( ∂log f(X;θ)/∂θ |_(θ_g) )² ] = ∫ ( ∂log f(x;θ)/∂θ |_(θ_g) )² g(x) dx.

PROOF. If (9.3) is satisfied, then for large enough T we have ∂L_T(θ)/∂θ |_(θ̂_T) = 0, hence from (9.2)

(1/√T) ∂L_T(θ)/∂θ |_(θ_g) ≈ i(θ_g) √T(θ̂_T − θ_g)   (9.6)
⇒ √T(θ̂_T − θ_g) ≈ i(θ_g)^{-1} (1/√T) ∂L_T(θ)/∂θ |_(θ_g).   (9.7)

Hence asymptotic normality of √T(θ̂_T − θ_g) follows from asymptotic normality of (1/√T) ∂L_T(θ)/∂θ |_(θ_g). Under assumption (9.3) the terms { ∂log f(X_t;θ)/∂θ |_(θ_g) } are zero-mean iid random variables; therefore by using the CLT we have

(1/√T) ∂L_T(θ)/∂θ |_(θ_g) →D N(0, j(θ_g)).   (9.8)

Now substituting (9.8) into (9.6) we have

√T(θ̂_T − θ_g) →D N( 0, i(θ_g)^{-1} j(θ_g) i(θ_g)^{-1} ).   (9.9)

This gives the desired result. □

The main thing to observe is that, unlike the case when we have correctly specified the model, it is not true in general that i(θ_g) = j(θ_g). Hence, whereas in the correctly specified case we have √T(θ̂_T − θ_0) →D N(0, i(θ_0)^{-1}), in the misspecified case it is √T(θ̂_T − θ_g) →D N(0, i(θ_g)^{-1} j(θ_g) i(θ_g)^{-1}).

Recall that in the correctly specified case we estimated the information criterion using either

ĵ(θ_0) = (1/T) Σ_t ( ∂log f(X_t;θ)/∂θ |_(θ̂_T) )²  or  î(θ_0) = −(1/T) Σ_t ∂²log f(X_t;θ)/∂θ² |_(θ̂_T).

Both are estimators of the information matrix i(θ_0), since E[ −∂²log f(X_t;θ)/∂θ² |_(θ_0) ] = E[ ( ∂log f(X_t;θ)/∂θ |_(θ_0) )² ]. In the misspecified case we need to use both of the above to obtain estimators of i(θ_g) and j(θ_g); in other words,

î(θ_g) = −(1/T) Σ_t ∂²log f(X_t;θ)/∂θ² |_(θ̂_T)  and  ĵ(θ_g) = (1/T) Σ_t ( ∂log f(X_t;θ)/∂θ |_(θ̂_T) )²

are estimators of i(θ_g) and j(θ_g) respectively. Hence, using these and Theorem 9.1.1, we can construct CIs for θ_g. We can also construct the log-likelihood ratio statistic, but its distribution is not a standard chi-squared distribution (it is a generalised chi-squared distribution).

Example Let us suppose that {X_t} are independent random variables which satisfy the model X_t = g(t/T) + ε_t, where {ε_t} are iid random variables which follow a t-distribution with
72 6-degrees of freedom ie. the variance exists. Thus, as T gets large we observe a corrupted version of g on a finer grid. Since, the function g is unknown, a line if fitted to the data and it is assumed that the noise is Gaussian. In other words, the estimated the slope â T which maximised the criterion L T a = 1 2σ 2 Xt a t 2. T Question: i What is â T estimating? ii What is the limiting distribution of â T? i Rewriting L T a we observe that 1 T L Ta = = P 1 2σ 2 T 1 2σ 2 T 1 2σ 2 T 1 0 t g T +ε t a t 2 T t g T a t T 2σ 2 T gu au ε 2 t + 2 2σ 2 T t g T a t εt T Thus we observe â T is an estimator of the line which best fits the curve g according to the l 2 -distance a g = argmin If you draw a picture, this seems logical. 1 0 gu au 2du. ii Now we derive the distribution of Tâ T a g. We assume and it can be shown that all the regularity conditions are satisfied. Thus we procede to derive the derivatives of the likelihoods. 1 T L T a ag = 1 a Tσ 2 Xt a g t t T T 1 T 2 L T a a 2 ag = 1 Tσ 2 t 2. T Thus taking the mean and expectation and using the definition of the Reimann integral we have 1 T Ia g = 1 T var L T a 1 ag = a Tσ 4 1 T Ja g = 1 T E 2 L T a 1 a 2 ag = Tσ 2 varx t t T t T u 2 du = 1 3σ 2 u 2 du = 1 3σ 2. We observe that in this case despite the mean and the distribution being misspecified we have that 1 T Ia g = 1 T Ja g. Altogether, this gives the limiting distribution T ât a g D N0,3σ 2. 72
We observe that had we fitted a double Laplacian to the data, which has the distribution f(x) = (1/2b) exp(−|x − μ_t|/b), the limit of the estimator would be different, and the limiting distribution would also be different.

The Kullback-Leibler information criterion

The discussion above, in particular (9.1), leads to the definition of the Kullback-Leibler criterion. We recall that the parameter which best fits the model using the maximum likelihood is an estimator of

θ_g = argmax_θ ∫ log f(x;θ) g(x) dx.

θ_g can be viewed as the parameter which best fits the distribution out of the possible distributions in the family. Of course, the word "best" is not particularly precise: it is best according to the criterion ∫ log f(x;θ) g(x) dx. To determine how well this fits the distribution, we compare it to the limit of the likelihood under the correct distribution, which is

∫ log g(x) g(x) dx   (the limit of the likelihood under the correct distribution).

In other words, the closer the difference

∫ log f(x;θ_g) g(x) dx − ∫ log g(x) g(x) dx = ∫ log[ f(x;θ_g)/g(x) ] g(x) dx

is to zero, the better the parameter θ_g fits the distribution g, using this criterion. We recall that, by Jensen's inequality,

∫ log[ f(x;θ)/g(x) ] g(x) dx = E[ log( f(X_t;θ)/g(X_t) ) ] ≤ log E[ f(X_t;θ)/g(X_t) ] = log ∫ f(x;θ) dx = 0,

where equality arises only if f(x;θ) = g(x). Therefore an alternative, but equivalent, interpretation of θ_g is the parameter which maximises

D(g, f_θ) = ∫ log f(x;θ) g(x) dx − ∫ log g(x) g(x) dx = ∫ log[ f(x;θ)/g(x) ] g(x) dx,

i.e. θ_g = argmax_{θ ∈ Θ} D(g, f_θ). (With this sign convention D(g, f_θ) ≤ 0, and −D(g, f_θ) is the usual Kullback-Leibler divergence, so maximising D is the same as minimising the divergence.) We note that D(g, f_θ) is not strictly a distance, since D(g, f_θ) ≠ D(f_θ, g), though it can be symmetrised. D(g, f_θ) is called the Kullback-Leibler criterion. It can be considered a measure of fit between the two distributions: the closer it is to zero, the better the fit. The Kullback-Leibler criterion arises all over the place. We will use it in the section below on model selection.
We observe that θ_g = argmax_{θ ∈ Θ} D(g, f_θ); hence f(x;θ_g) is the best-fitting distribution using the K-L criterion. This does not mean it is the best-fitting distribution according to another criterion: indeed, if we used a different distance measure, we are likely to obtain a different best-fitting distribution. There are many different information criteria. The motivation for the K-L criterion comes from the likelihood. However, in the model misspecification set-up there are alternative methods, beyond likelihood methods, for finding the best-fitting distribution (alternative methods may be more robust — for example, the Renyi information criterion).

Examples

Example An example of misspecification is when we fit the exponential distribution {f(x;θ) = θ^{-1} exp(−x/θ); θ > 0} to observations which come from the Weibull distribution. For example, suppose

g(x) = (α/φ)(x/φ)^{α−1} exp(−(x/φ)^α);  α, φ > 0, x > 0.

We use the likelihood to fit the exponential distribution:

(1/T) L_T(θ) = −(1/T) Σ_t [ log θ + X_t/θ ] →a.s. −log θ − E[X_t]/θ = −log θ − ∫ (x/θ) g(x) dx.

Let θ̂_T = argmax L_T(θ). Tedious algebra shows that θ̂_T is an estimator of

θ_g = argmax_θ { −log θ − E[X_t]/θ } = E[X_t] = φ Γ(1 + α^{-1}).   (9.10)

Therefore by using Theorem 9.1.1 we have

√T( θ̂_T − φ Γ(1 + α^{-1}) ) →D N( 0, i(θ_g)^{-1} j(θ_g) i(θ_g)^{-1} ),

where, since ∂log f(x;θ)/∂θ = −θ^{-1} + xθ^{-2} and ∂²log f(x;θ)/∂θ² = θ^{-2} − 2xθ^{-3},

i(θ_g) = −E[ θ^{-2} − 2Xθ^{-3} ] |_(θ = EX) = 1/[EX]²,
j(θ_g) = E[ ( −θ^{-1} + Xθ^{-2} )² ] |_(θ = EX) = E[X²]/[EX]⁴ − 1/[EX]².

We note that for the Weibull distribution EX = φΓ(1 + α^{-1}) and EX² = φ²Γ(1 + 2α^{-1}), hence the above reduces to

i(θ_g) = 1/[ φΓ(1 + α^{-1}) ]²  and  j(θ_g) = [ Γ(1 + 2α^{-1})/Γ(1 + α^{-1})² − 1 ] / [ φΓ(1 + α^{-1}) ]².   (9.11)
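The sandwich variance in (9.11) can be verified by simulation: the exponential MLE is just the sample mean, so T·var(θ̂_T) should approach i(θ_g)^{-1} j(θ_g) i(θ_g)^{-1} rather than the naive i(θ_g)^{-1}. A sketch with hypothetical values α = 0.5 and φ = 1:

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(7)
alpha, phi = 0.5, 1.0                      # hypothetical Weibull parameters
T, reps = 500, 5_000

theta_g = phi * gamma(1 + 1 / alpha)       # best-fitting exponential mean (= 2 here)
X = phi * rng.weibull(alpha, size=(reps, T))
theta_hat = X.mean(axis=1)                 # exponential MLE = sample mean

EX2 = phi ** 2 * gamma(1 + 2 / alpha)
i_g = 1 / theta_g ** 2                     # as in (9.11)
j_g = EX2 / theta_g ** 4 - 1 / theta_g ** 2
print(T * theta_hat.var())                 # empirical variance of sqrt(T)(theta_hat - theta_g)
print(j_g / i_g ** 2)                      # sandwich i^{-1} j i^{-1}; naive i^{-1} would be 4
```

For these values the naive inverse-information variance is 4 while the sandwich is 20, so a confidence interval that ignores the misspecification would be far too narrow.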
We also observe that to check how well the best fitting exponential fits the Weibull distribution for different values of φ and α, we can use the Kullback-Leibler information criterion. That is, with θ_g = φΓ(1+α^{−1}), evaluate

D(g, f_{θ_g}) = −∫ log( θ_g^{−1} exp(−x/θ_g) / [ (α/φ)(x/φ)^{α−1} exp(−(x/φ)^α) ] ) (α/φ)(x/φ)^{α−1} exp(−(x/φ)^α) dx.   (9.12)

We note by using (9.10) that D(g, f_{θ_g}) should be close to zero when α = 1, since then the Weibull is close to an exponential, and we conjecture that this difference should grow the further α is from one. See also Davison (2002), page 147.

Example Question: Suppose {X_t}_{t=1}^T are independent, identically distributed normal random variables with distribution N(µ,σ²), where µ > 0. Suppose that µ and σ² are unknown. A non-central t-distribution with 11 degrees of freedom,

f(x;a) = C(11)( 1 + (x−a)²/11 )^{−(11+1)/2},

where C(ν) is a finite constant which only depends on the degrees of freedom, is mistakenly fitted to the observations. [8]

i) Suppose we construct the likelihood using the t-distribution with 11 degrees of freedom to estimate a. In reality, what is this MLE actually estimating?

ii) Denote the above ML estimator as â_T. Assuming that standard regularity conditions are satisfied, what is the approximate distribution of â_T?

Solution

i) The MLE seeks to estimate the maximum of E[log f(X;a)] with respect to a. Thus for this example â_T is estimating

a_g = argmax_a E[ −6 log( 1 + (X−a)²/11 ) ] = argmin_a ∫ log( 1 + (x−a)²/11 ) (1/σ) φ( (x−µ)/σ ) dx.

ii) Let a_g be defined as above. Then we have

√T( â_T − a_g ) →^D N( 0, J(a_g)^{−1} I(a_g) J(a_g)^{−1} ),

where

I(a_g) = 36 E[ ( (d/da) log(1 + (X−a)²/11) )² ]|_{a=a_g},
J(a_g) = −6 E[ (d²/da²) log(1 + (X−a)²/11) ]|_{a=a_g}.
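A small simulation can illustrate the limit derived in the exponential/Weibull example above: the misspecified exponential MLE (the sample mean) converges to θ_g = φΓ(1+α^{−1}). This is a minimal sketch; the sample size and the parameter values are arbitrary choices.

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(0)
phi, alpha = 2.0, 1.5                    # Weibull scale and shape (arbitrary)
x = phi * rng.weibull(alpha, size=200_000)

# The MLE of the misspecified exponential model f(x; theta) = theta^{-1} exp(-x/theta)
# is the sample mean; its large-sample limit is theta_g = E(X) = phi * Gamma(1 + 1/alpha).
theta_hat = x.mean()
theta_g = phi * gamma(1 + 1 / alpha)
print(theta_hat, theta_g)
```

With 200,000 draws the sample mean should sit within a few standard errors of θ_g.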
Example Question: The random variable X has a Poisson distribution with P(X = k) = θ^k exp(−θ)/k!. Suppose our hapless researcher fits the geometric distribution π(1−π)^k (k = 0, 1, 2, ...) to the data by using the log-likelihood.

i) What quantity is the misspecified maximum likelihood estimator actually estimating?

ii) How well does the best fitting geometric distribution approximate the Poisson distribution?

iii) Given the data, describe a method the researcher can use to check whether the geometric distribution is an appropriate choice of distribution.

Solution

i) The expectation of the geometric log-likelihood with respect to the Poisson measure is

E[ log( π(1−π)^X ) ] = Σ_{k=0}^∞ ( θ^k e^{−θ}/k! )( log π + k log(1−π) ) = log π + E(X) log(1−π) = log π + θ log(1−π).

Hence if we maximise the likelihood with respect to π we are estimating the maximum of the above, which is π_P = 1/(1+θ).

ii) To measure how well the best fitting geometric distribution fits the correct Poisson distribution we use the K-L divergence criterion K(P(θ), G(1/(1+θ))), which is defined as

K(P(θ), G(1/(1+θ))) = E[ log( θ^X exp(−θ)/X! ) ] − E[ log( (1/(1+θ)) (θ/(1+θ))^X ) ].

Now we observe that

E[ log( (1/(1+θ)) (θ/(1+θ))^X ) ] = −log(1+θ) + E(X) log θ − E(X) log(1+θ) = θ log θ − (θ+1) log(1+θ),

and

E[ log( θ^X exp(−θ)/X! ) ] = E(X) log θ − θ − E( log X! ) = θ log θ − θ − Σ_{k=0}^∞ log(k!) θ^k exp(−θ)/k!.
Thus the K-L distance is

K(P(θ), G(1/(1+θ))) = [ θ log θ − θ − Σ_{k=0}^∞ log(k!) θ^k exp(−θ)/k! ] − [ θ log θ − (θ+1) log(1+θ) ]
= (θ+1) log(1+θ) − θ − Σ_{k=0}^∞ log(k!) θ^k exp(−θ)/k!.

iii) Pearson's goodness of fit test can be used.

Example Question: Let us suppose that the random variable X is a mixture of Weibull distributions

f(x;θ) = p (α_1/φ_1)(x/φ_1)^{α_1−1} exp(−(x/φ_1)^{α_1}) + (1−p)(α_2/φ_2)(x/φ_2)^{α_2−1} exp(−(x/φ_2)^{α_2}).

i) Derive the mean and variance of X.

ii) Obtain the exponential distribution which best fits the above mixture Weibull according to the Kullback-Leibler criterion (recall that the exponential is g(x;λ) = (1/λ) exp(−x/λ)).

Solution

i) Define the r.v. δ ∈ {0,1}, where P(δ = 1) = p and P(δ = 0) = 1−p. Therefore, to evaluate the expectation of X we observe that

E(X) = E{E(X|δ)} = pφ_1Γ(1+1/α_1) + (1−p)φ_2Γ(1+1/α_2).

To obtain the variance use either the expression var(X) = E{E(X²|δ)} − [E(E(X|δ))]² or var(X) = E{var(X|δ)} + var[E(X|δ)], to obtain

var(X) = pφ_1²Γ(1+2/α_1) + (1−p)φ_2²Γ(1+2/α_2) − { pφ_1Γ(1+1/α_1) + (1−p)φ_2Γ(1+1/α_2) }².

ii) The Kullback-Leibler criterion is defined as E_f{ log( g(X;λ)/f(X;θ) ) }. The best fitting exponential distribution is given by the λ which maximises

λ_g = argmax_λ E_f{ log( g(X;λ)/f(X;θ) ) } = argmax_λ E_f{ log g(X;λ) };
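The quantities in the Poisson/geometric example are easy to check numerically. The sketch below (with an arbitrary θ) computes π_P = 1/(1+θ) and evaluates the K-L divergence between the Poisson and the best fitting geometric by truncating the sum at a large k.

```python
import numpy as np
from math import lgamma, log

theta = 3.0                           # Poisson mean (arbitrary choice)
pi_g = 1.0 / (1.0 + theta)            # limit of the misspecified geometric MLE

ks = np.arange(0, 200)                # truncation; Poisson(3) mass beyond k = 200 is negligible
log_pois = ks * log(theta) - theta - np.array([lgamma(k + 1.0) for k in ks])
log_geom = log(pi_g) + ks * log(1.0 - pi_g)   # geometric pi(1-pi)^k on k = 0, 1, ...
p = np.exp(log_pois)
kl = float(np.sum(p * (log_pois - log_geom)))
print(pi_g, kl)
```

The divergence is strictly positive, reflecting that no geometric distribution matches a Poisson exactly.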
note that the above is simply the expectation of the misspecified log-likelihood. Thus

E_f{ log g(X;λ) } = E_f{ −X/λ − log λ } = −E_f(X)/λ − log λ.

By differentiating the above with respect to λ we see that it is maximised when

λ = E_f(X) = pφ_1Γ(1+1/α_1) + (1−p)φ_2Γ(1+1/α_2).

Thus the best fitting exponential has its parameter defined by the mean of the distribution:

λ = E(X) = E{E(X|δ)} = pφ_1Γ(1+1/α_1) + (1−p)φ_2Γ(1+1/α_2).
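Since the best fitting exponential scale is simply the mixture mean, this is easy to verify by Monte Carlo; the parameter values below are arbitrary.

```python
import numpy as np
from math import gamma

p, phi1, alpha1, phi2, alpha2 = 0.3, 1.0, 0.8, 4.0, 2.0   # arbitrary mixture parameters

# Best fitting exponential scale under the K-L criterion: the mixture mean.
lam = p * phi1 * gamma(1 + 1 / alpha1) + (1 - p) * phi2 * gamma(1 + 1 / alpha2)

# Monte Carlo check: the exponential MLE (the sample mean) applied to draws
# from the mixture should be close to lam.
rng = np.random.default_rng(1)
n = 100_000
comp = rng.random(n) < p
x = np.where(comp, phi1 * rng.weibull(alpha1, n), phi2 * rng.weibull(alpha2, n))
print(lam, x.mean())
```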
Chapter 10

Model Selection

10.1 Model selection

See also Section 4.7 of Davison (2002).

Over the past 30 years several different methods have been developed for selecting the best model out of a class of candidate models. The typical problem is that we observe the response variable Y_t which satisfies the model

Y_t = Σ_{j=1}^p a_j x_{t,j} + ε_t.

The natural question to ask is how many regressors should be included in the model. Without checking, we are prone to overfitting the model. One method is to use a log-likelihood ratio test, but this does not allow us to compare the fit of various candidate models to the true distribution, i.e. to say one model is better than another. Without a good checking system, one could easily end up with an over-parametrised model. For example, when fitting a Weibull distribution to observations which come from an exponential distribution, it is unnecessary to fit the extra shape parameter.

There are various ways to approach this problem. One of the classical methods is to use an information criterion, for example the AIC, which penalises the number of parameters. There are different methods for motivating an information criterion; here we motivate it through the Kullback-Leibler criterion (which is also a little ad hoc). The basic idea is that the criterion can be split into two parts: the first part measures the model fit (or bias in the model), while the second part measures the variance which is due to the inclusion of several parameters in the model. To simplify the approach we will assume that {X_t} are iid random variables (though this assumption is not necessary; in fact the AIC was first derived for order selection in time series!).
80 Suppose that X t has the distribution gx, but we want to select the best distribution in the family {fx;θ;θ Θ}. Let Iθ g = E 2 logfx;θ 2 θg logfx;θ 2 Jθ g = E θ=θg = 2 logfx;θ = 2 gxdx θg logfx;θ 2 gxdx θg. Given the observations {X t } we would usually use the MLE to obtain the best estimator, ie. ˆθ T X = argmax L TX;θ = argmax θ Θ θ Θ logfx t ;θ, we have included X in the above to show that the mle depends on it. Hence based on these observations the best fitting model is fx t ;ˆθ T X. To measure how well it fits it is natural to compare it to the true density gx using the K-L criterion Dg,fˆθT X = log fx; ˆθ T X gxdx. gx The problem with the above is that a we cannot evaluate it - because g is unknown b it depends on the sample X. However the second part of Dg,fˆθT X is simply loggxgx, and is the same for all candidate distributions, hence it is just a constant and can be ignored. Therefore rather than consider Dg,fˆθT X we can consider Dg,fˆθT X = logfx; ˆθ T Xgxdx instead. We consider the negative value to stick with convention. However, we observe that Dg,fˆθT X still depends on the sample X. Therefore, a more sensible criterion is to consider the expectation of the above over all random samples X E X Dg,fˆθT X = E X E Y logfy;ˆθ T X. Our objective is to estimate E X EY logfy;ˆθ T X. A crude estimator of E X Dg,fˆθT X would be to use 1 T logfx t ;ˆθ T X However, there are several problems with this crude estimator of E X EY logfy;ˆθ T X. Though it is not immediately obvious E X EY logfy;ˆθ T X penalises overfitted models. 80
81 To understand why, recall that a Taylor expansion of E X EY logfy;ˆθ T X about θ g would give E X EY logfy;ˆθ T X E Y logfy;θ g E X ˆθ T X θ g Iθ g ˆθ T X θ g. The second term on the right of the above grows as the number of parameters grow recall it has a χ 2 -distribution where the number of degrees of freedom is equal to the number of parameters. Hence E X EY logfy;ˆθ T X penalises unnecessary parameters which is an advantage. However, the crude estimator in 10.1 does not, in fact it decreases as the number of parameters increase regardless of whether they usefully fit the model. We now look for an approximation which corrects for this bias. We recall that ˆθ T X is an estimator of θ g hence we start by replacing E X Dg,fˆθT X with E X Dg,fθg to give E X Dg,fˆθT X = E X Dg,fθg + E X Dg,fˆθT X E X Dg,fθg. We consider the difference E X Dg,fˆθT X E X Dg,fθg later, and start by focussing on E X Dg,fθg. Since E X Dg,fθg is unknown we replace it by its average E X Dg,fθg 1 T logfx t ;θ g. Hence we have E X Dg,fˆθT X 1 T logfx t ;θ g + E X Dg,fˆθT X E X Dg,fθg. Of course, θ g is unknown so this is replaced by ˆθ T X to give E X Dg,fˆθT X 1 T 1 + T = 1 T logfx t ;ˆθ T X+ 1 T logfx t ;ˆθ T X 1 T E X Dg,fˆθT X E X Dg,fθg logfx t ;θ g logfx t ;ˆθ T X+I 1 +I Since 1 T T logfx t;ˆθ T X is known, we now bound I 1 and I 2. We mention that the terms I 1 and I 2 are both positive. This is because θ g = argmax Dg,f θ = argmin Dg,fθ and ˆθ T = argmax T logfx t;θ. 81
82 We now bound the two differences above. We first note that by using Taylor expansions and the assumptions that E logfx;θ θ=θg = 0 we have I 1 = E X Dg,fˆθT X Dg,f θg 1 { } = E X E Y logfy t ;ˆθ T X logfy t ;θ g T = 1 T E XE Y L T Y,ˆθ T X L T Y,θ g = 1 T E LT Y,θ XE Y θg ˆθ T X θ g + 1 2T E XE Y ˆθ T X θ g 2 L T Y,θ = 1 T E LT Y,θ XE Y θg ˆθ T X θ g + 1 2T E YE X ˆθ T X θ g 2 L T Y,θ = 1 2T E YE X ˆθ T X θ g 2 L T Y,θ θx ˆθ T X θ g. Now we note that 1 T 2 L T Y,θ 2 θx Iθ g, which gives us I 1 = E X Dg,fˆθT X Dg,f θg 1 2 E X ˆθ T X θ g Iθ g ˆθ T X θ g θx ˆθ T X θ g θx ˆθ T X θ g We now obtain an estimator of I 2 in To do this we make the usual Taylor expansion noting that L Tθ θ=ˆθt = 0 I 2 = 1 T logfx t ;θ g 1 T logfx t ;ˆθ T X ˆθ T X θ g Iθ g ˆθ T X θ g To obtain the final approximations for 10.3 and 10.4 we use 9.9 where TˆθT θ g D N 0,Iθ g 1 Jθ g Iθ g 1. Now by using the above and the relationship that if Z N0,Σ then EZ AZ = trace { AΣ }. Therefore by using the above we have 1 I 2 = T logfx t ;θ g 1 T 1 T ˆθ T X θ g Iθ g ˆθ T X θ g 1 Iθ 2T trace g Iθ g 1 Jθ g Iθ g 1 82 logfx t ;ˆθ T X
and

I_1 = E_X( D̃(g, f_{θ̂_T(X)}) − D̃(g, f_{θ_g}) ) ≈ (1/2T) trace( I(θ_g) I(θ_g)^{−1} J(θ_g) I(θ_g)^{−1} ).   (10.5)

Simplifying the above and substituting into (10.2) gives

E_X( D̃(g, f_{θ̂_T(X)}) ) ≈ −(1/T) Σ_{t=1}^T log f(X_t; θ̂_T(X)) + (1/T) trace( J(θ_g) I(θ_g)^{−1} )
= −(1/T) L_T(X; θ̂_T(X)) + (1/T) trace( J(θ_g) I(θ_g)^{−1} ).

Hence to measure the divergence between f(x;θ) and g we can use the approximation

E_X( D̃(g, f_{θ̂_T(X)}) ) ≈ −(1/T) L_T(X; θ̂_T(X)) + (1/T) trace( J(θ_g) I(θ_g)^{−1} ).   (10.6)

We apply the above to the setting of model selection. The idea is that we have a set of candidate models we want to fit to the data, and we want to select the best model. Suppose there are N different candidate families of models. Let {f_p(x;θ_p); θ_p ∈ Θ_p} denote the pth family. Let L_{p,T}(X;θ_p) = Σ_{t=1}^T log f_p(X_t;θ_p) denote the likelihood associated with the pth family, and let θ̂_{p,T} = argmax_{θ_p∈Θ_p} L_{p,T}(X;θ_p) denote the maximum likelihood estimator for the pth family.

In an ideal world we would compare the different families by selecting the family of distributions {f_p(x;θ_p); θ_p ∈ Θ_p} which minimises the criterion E_X( D̃(g, f_{p,θ̂_{p,T}(X)}) ). However, we do not know this quantity, hence we consider the estimator of it given in (10.6). This requires estimators of J(θ_{p,g}) and I(θ_{p,g}); these can easily be obtained from the data, and we denote them by Ĵ_p and Î_p. We then choose the family of distributions which minimises

min_{1≤p≤N} ( −(1/T) L_{p,T}(X; θ̂_{p,T}) + (1/T) trace( Ĵ_p Î_p^{−1} ) ).

In other words, the order we select is p̂, where

p̂ = argmin_{1≤p≤N} ( −(1/T) L_{p,T}(X; θ̂_{p,T}) + (1/T) trace( Ĵ_p Î_p^{−1} ) ).
Often (but not always) in model selection we assume that the true distribution is nested in many of the candidate models. For example, the true model Y_t = α_0 + α_1 x_{t,1} + ε_t belongs to the set of families defined by Y_{t,p} = α_0 + Σ_{i=1}^p α_i x_{t,i} + ε_t. In this case {α_0 + Σ_{i=1}^p α_i x_{t,i} + ε_t; α ∈ R^{p+1}} denotes the pth family of models. Since the true model is nested in most of the candidate models we are in the correctly specified case. Hence we have J(θ_g) = I(θ_g), in which case trace( J(θ_g) I(θ_g)^{−1} ) = trace( I(θ_g) I(θ_g)^{−1} ) = p. In this case (10.6) reduces to selecting the family which minimises

min_{1≤p≤N} ( −(1/T) L_{p,T}(X; θ̂_{p,T}) + p/T ).

We observe that this penalises the number of parameters. The criterion above is called the AIC (Akaike Information Criterion):

AIC(p) = −(1/T) L_{p,T}(X; θ̂_{p,T}) + p/T.

This is one of the first information criteria. There is a bewildering array of other criteria, including the BIC etc., but most are similar in principle and usually take the form

−(1/T) L_{p,T}(X; θ̂_{p,T}) + pen_T(p),

where pen_T(p) denotes a penalty term (there are many, including the Bayes Information Criterion etc.).

Remark Usually the AIC is defined as AIC(p) = −2 L_{p,T}(X; θ̂_{p,T}) + 2p; it is more a matter of preference whether we include the factor 2T or not.

We observe that as the sample size grows, the weight of the penalisation relative to the likelihood declines (since L_{p,T}(X; θ̂_{p,T}) = O(T)). This fact can mean that the AIC can be problematic. Because it does not increase the weight on the parameters as the sample size grows, the AIC can easily overfit and select a model with a larger number of parameters than is necessary. This idea can be formalised, and it can be shown that the AIC is an inconsistent estimator of the true model order (this means that as the sample size grows, it does not select the true model with probability tending to one - see the lemma below).
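For Gaussian linear regression the profile log-likelihood gives −2 log L = T log(RSS/T) + const, so the AIC is easy to compute. A minimal sketch on simulated data (the true model is linear, i.e. polynomial order one; all values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 500
x = rng.uniform(-1, 1, T)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, T)      # true model has polynomial order 1

def aic(order):
    # Profile Gaussian log-likelihood: -2 log L = T log(RSS/T) + const, so
    # AIC(order) = T log(RSS/T) + 2 * (number of parameters).
    X = np.vander(x, order + 1)                   # polynomial regressors up to `order`
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return T * np.log(rss / T) + 2 * (order + 2)  # coefficients plus the variance

aics = {order: aic(order) for order in range(6)}
best = min(aics, key=aics.get)
print(aics, best)
```

With a strong slope and moderate noise the AIC should clearly reject the constant model, though (as the lemma below shows) it may select an order larger than one.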
Another information criterion is the BIC (this can be obtained using a different reasoning), and is defined as

BIC(p) = −2 L_{p,T}(X; θ̂_{p,T}) + p log T.

The AIC does not place much weight on the number of parameters, whereas the BIC does place a large weight on the parameters. It can be shown that the BIC is a consistent estimator of the model order, so long as the true model is in the class of candidate models. However, it does have a tendency to underfit (selecting a model with too few parameters). On the other hand, in the case that the true model does not belong to any of the families, the AIC can be a more suitable criterion than other criteria.

Lemma (Inconsistency of the AIC) Suppose that we are in the correctly specified case and θ_p is the true model, hence the true model has order p. Then for any q > 0 we have that

lim_{T→∞} P( argmin_{1≤n≤p+q} AIC(n) > p ) > 0,

moreover

lim_{T→∞} P( argmin_{1≤n≤p+q} AIC(n) = p ) < 1.

In other words, the AIC will with positive probability choose a larger order model, and is more likely to select large models as the order q increases.

PROOF. To prove the result we note that the (p+q)-order model will be selected over the p-order model by the AIC if −L_{p+q,T} + (p+q) < −L_{p,T} + p, in other words we select p+q if

L_{p+q,T} − L_{p,T} > q.

Hence

P( argmin_{1≤n≤p+q} AIC(n) > p ) ≥ P( 2(L_{p+q,T} − L_{p,T}) > 2q ).

But we recall that L_{p+q,T} and L_{p,T} are both log-likelihoods, and under the null that the pth order model is the true model we have 2(L_{p+q,T} − L_{p,T}) →^D χ²_q. Since E(χ²_q) = q, we have for any q > 0 that

P( argmin_{1≤n≤p+q} AIC(n) > p ) ≥ P( 2(L_{p+q,T} − L_{p,T}) > 2q ) → P( χ²_q > 2q ) > 0.

Hence with a positive probability the AIC will choose the larger model.
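The overfitting probability P(χ²_q > 2q) in the proof can be evaluated exactly. For an even number of degrees of freedom the χ² survival function has a closed form (via the Poisson cdf), so a short stdlib-only sketch suffices:

```python
from math import exp

def chi2_sf_even(x, df):
    # Survival function of chi^2 with an even number of degrees of freedom:
    # P(chi^2_{2m} > x) = exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
    m = df // 2
    term, total = 1.0, 0.0
    for j in range(m):
        total += term
        term *= (x / 2.0) / (j + 1)
    return exp(-x / 2.0) * total

# P(chi^2_q > 2q): the limiting probability that the AIC prefers q extra parameters.
for q in (2, 4, 10):
    print(q, chi2_sf_even(2 * q, q))
```

The probabilities decrease in q but never reach zero, which is exactly the inconsistency the lemma describes.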
86 This means as the sample size T grows, with a positive probability we will not necessarily select the correct order p, hence the AIC is inconsistent and lim P arg min AICn = p 1. T 1 n p+q Example: Logistic and model selection This example considers model selection for logistic regression, which is covered later in this course. Example Example: Suppose that {Y i } are independent binomial random variables where Y i Bn i,p i. The regressors x 1,i,...,x k,i are believed to influence the probability p i through the logistic link function where p < q. log p i 1 p i = β 0 +β 1 x 1,i +β p x p,i +β p+1 x p+1,i +...+β q x q,i, a Suppose that we wish to test the hypothesis against the alternative H 0 : log H 0 : log p i 1 p i = β 0 +β 1 x 1,i +β p x p,i p i 1 p i = β 0 +β 1 x 1,i +β p x p,i +β p+1 x p+1,i +...+β q x q,i. State the log-likelihood ratio test statistic that one would use to test this hypothesis. If the null is true, state the limiting distribution of the test statistic. b Define the model selection criterion where C is a finite constant, L T,d β d = M n d = 2L n ˆβ d 2Cd Y i β d x id n i log1+expβ d x id+ x id = x 1,i,...,x d,i and ˆβ d = argmax βd L T,d β d. We use ˆd = argmax d M n d as an estimator of the order of the model. Suppose that H 0 defined in part 2a is true, use your answer in 2a to explain whether ni Y i, the model selection criterion M n d consistently estimates the order of model. 86
Solution:

a) The likelihood for both hypotheses is

L_{T,d}(β_d) = Σ_i ( Y_i β_d' x_{id} − n_i log(1 + exp(β_d' x_{id})) + log binom(n_i, Y_i) ).

Thus the log-likelihood ratio test statistic is

T_T = 2( L_{T,q}(β̂_q) − L_{T,p}(β̂_p) ) = 2 Σ_i ( Y_i (β̂_A − β̂_0)' x_i − n_i [ log(1 + exp(β̂_A' x_i)) − log(1 + exp(β̂_0' x_i)) ] ),

where β̂_0 and β̂_A are the maximum likelihood estimators under the null and alternative respectively. If the null is true, then T_T →^D χ²_{q−p} as T → ∞.

b) Under the null we have that T_T = 2( L_{T,q}(β̂_q) − L_{T,p}(β̂_p) ) →^D χ²_{q−p}. By definition, if d̂ = argmax_d M_n(d), then the criterion selects order q over the true order p when M_n(q) − M_n(p) > 0, i.e. when

2( L_{T,q}(β̂_q) − L_{T,p}(β̂_p) ) > 2C(q − p).

Now the LLRT result states that under the null 2( L_{T,q}(β̂_q) − L_{T,p}(β̂_p) ) →^D χ²_{q−p}, thus roughly speaking we can say that

P( 2( L_{T,q}(β̂_q) − L_{T,p}(β̂_p) ) > 2C(q−p) ) → P( χ²_{q−p} > 2C(q−p) ).

As the above is a positive probability, this means that the model selection criterion will select model q over the true smaller model with a positive probability. This argument holds for all q > p; thus the model selection criterion M_n(d) does not consistently estimate d.
where the number of possible regressors {x_{t,j}} is extremely large. In this case, evaluating the MLE for all the p different candidate models, and then making a comparison, can take a huge amount of computational time. In the past 10 years there has been a lot of work on alternative methods of model selection. One such method is called the LASSO: rather than estimating each model individually, parameter estimation is done on the large model using a penalised version of the (negative log-)likelihood

−L_T(θ) + λ Σ_{i=1}^p |θ_i|.

The hope is that by including the penalty λ Σ_{i=1}^p |θ_i| in the likelihood, many of the coefficients of the regressors will be set to zero (or near zero). Since the introduction of the LASSO in 1996, many variants of the LASSO have been proposed, and the LASSO has been applied to several different situations.
Chapter 11

Bootstrap methods

11.1 Asymptotic normality of the sample mean and Edgeworth expansions

So far we have concentrated on showing asymptotic normality for anything under the sun. But this result, as the name suggests, is asymptotic. For finite samples, indeed small samples, this approximation can be quite poor. We recall that if the original data appears to have the following features:

i) Skewness (non-zero third order cumulant).

ii) Thick tails (described by the kurtosis κ_4/σ⁴, where κ_4 is the fourth order cumulant of the random variable),

then the normality result only sets in with very large sample sizes. Heuristically, this is because these features of the population distribution will influence the sampling distribution; the further the population distribution is from normal, the further the sampling distribution will be too.

The above is rather heuristic; it is natural to ask how the above affects the rate of convergence of the distribution of T^{−1/2} Σ_{t=1}^T X_t to normality. To answer this question we recall how we usually prove normality of the sample average T^{−1/2} Σ_{t=1}^T X_t.

Remark (A quick review of cumulants) Suppose the random variable Y has distribution function F (density f); the characteristic function of Y is defined as the Fourier transform

χ_Y(t) = ∫ exp(itx) dF(x) = ∫ exp(itx) f(x) dx.

The rth order cumulant of a random variable is the coefficient of (it)^r/r! in the series expansion of log χ_Y(t). Since cumulants are derived from the logarithm of the characteristic function, they
can be represented in terms of its moments, e.g. κ_1 = E(X), κ_2 = E(X²) − E(X)². For the normal distribution all the cumulants greater than order 2 are zero (this is unique to the normal distribution, and is usually how we prove normality of an estimator).

The joint cumulants (similar to joint moments) of a multivariate random variable can be derived in a similar way: κ_r(Y_1,...,Y_r) is the joint cumulant of Y_1,...,Y_r and is the coefficient of t_1···t_r in the expansion of the log of the characteristic function of the joint distribution of Y_1,...,Y_r. More generally, if (Y_1,...,Y_r) = (X_1,...,X_1,...,X_n,...,X_n), where X_s is repeated r_s times and r_1 + ... + r_n = r, then κ_r(Y_1,...,Y_r) is the coefficient of Π_{s=1}^n t_s^{r_s}/r_s! in the expansion of the log of the characteristic function of X_1,...,X_n. It is interesting to note that if at least one of the random variables in (Y_1,...,Y_r) is independent of the rest, then κ_r(Y_1,...,Y_r) = 0.

The usual method to show normality is to represent the characteristic function of T^{−1/2} Σ_{t=1}^T X_t as a function of the cumulants, and show that all cumulants above the second order cumulant (which is the variance) converge to zero. More precisely, the characteristic function (which is the Fourier transform of the density) of a random variable Y, with mean zero and variance one, is approximately

χ_Y(t) = exp( −t²/2 + (it)³κ_3(Y)/3! + (it)⁴κ_4(Y)/4! + ... ) = exp(−t²/2) exp( (it)³κ_3(Y)/3! + (it)⁴κ_4(Y)/4! + ... ),   (11.1)

where κ_r(Y) denotes the rth cumulant of Y (when this approximation is valid is beyond this course - see the recommended books for details). We recall that for standard normal random variables the characteristic function is χ_Y(t) = exp(−t²/2).

Now let us consider what this means for the distribution of T^{−1/2} Σ_{t=1}^T X_t. To make notation easier, we will standardise, and consider the distribution of S_T = T^{1/2}(X̄ − µ)/σ, where X_t has mean µ and variance σ².
Now by expanding the cumulants (which is like expanding the variance of sums of random variables) we have

κ_r(S_T) = T^{−r/2} Σ_{s_1,...,s_r=1}^T κ_r( (X_{s_1}−µ)/σ, ..., (X_{s_r}−µ)/σ ) = T^{−r/2} Σ_{s=1}^T κ_r( (X_s−µ)/σ ) = T^{−r/2+1} κ_r( (X−µ)/σ ),   (11.2)
since {X_t} are iid random variables. We note that for r > 2, κ_r((X−µ)/σ) = κ_r(X)/σ^r, and we denote by κ_r the rth order cumulant of the standardised random variable (X−µ)/σ.

Now we obtain the characteristic function of the sample mean S_T. By substituting (11.2) into (11.1) we have

χ_{S_T}(t) = exp(−t²/2) exp( (it)³ T^{−1/2} κ_3/3! + (it)⁴ T^{−1} κ_4/4! + ... ).   (11.3)

Since T^{−r/2+1} κ_r → 0 as T → ∞, we have that χ_{S_T}(t) → exp(−t²/2), which is the characteristic function of the normal distribution. It can be shown that if the characteristic function converges to the characteristic function of a normal, then the distribution must converge to the normal. Hence we have the CLT - though we note that we do not require all moments to exist; in fact it is sufficient that only the second moment exists. However, for this heuristic discussion we shall assume that at least four moments exist.

The above is a heuristic proof of the CLT, but already we get a feeling of how fast this convergence should be. The leading term in the above expansion is (it)³ T^{−1/2} κ_3/3!, which suggests that the error in the normal approximation should be of order T^{−1/2}, i.e. P(S_T ≤ x) − Φ(x) = O(T^{−1/2}), where Φ is the cumulative distribution function of the standard normal. Indeed this is the case, and it can be shown by using an Edgeworth expansion. The Edgeworth expansion is effectively an expansion of the distribution G_T(x) = P(S_T ≤ x) in terms of the normal distribution and higher order terms. To see how it arises, use the series expansion of exp( (it)³T^{−1/2}κ_3/3! + (it)⁴T^{−1}κ_4/4! + ... ) to rewrite χ_{S_T}(t) as

χ_{S_T}(t) = exp(−t²/2) exp( (it)³T^{−1/2}κ_3/3! + (it)⁴T^{−1}κ_4/4! + ... ) = exp(−t²/2)( 1 + T^{−1/2} r_1(it) + T^{−1} r_2(it) + ... ),

where

r_1(s) = κ_3 s³/6,  r_2(s) = κ_4 s⁴/24 + κ_3² s⁶/72.

Now, we recall that the characteristic function is the Fourier transform of the distribution function; hence inverting the Fourier transform of the above we have

G_T(x) = P(S_T ≤ x) = Φ(x) + T^{−1/2} p_1(x)φ(x) + T^{−1} p_2(x)φ(x) + T^{−3/2} p_3(x)φ(x) + ...,   (11.4)

where Φ is the distribution function of the standard normal, φ(x) is the standard normal density and

p_1(x) = −κ_3(x²−1)/6 and p_2(x) = −x( κ_4(x²−3)/24 + κ_3²(x⁴−10x²+15)/72 ).
Hence, as expected, we have

P(S_T ≤ x) − Φ(x) = T^{−1/2} p_1(x)φ(x) + T^{−1} p_2(x)φ(x) + ....

Therefore, the error in this approximation is of order O(T^{−1/2}) if there is a skew in the distribution of X_t, and of order O(T^{−1}) if there isn't a skew. The technical details can be found in Hall (1992), Chapter 2.

This means that when we construct confidence intervals there will be some errors. For example, if we construct a 95% CI for the mean using the normal approximation, it may in reality be less than a 95% CI. Now, in the same way that the errors in the probabilities under the normal approximation can be calculated by using an Edgeworth expansion, so can the errors in the quantiles which give the CI, by using what is known as a Cornish-Fisher expansion. To understand this, let us recall what the CI for the mean µ actually means. We recall that if we want to construct a 95% CI we try to find the 2.5% and 97.5% quantiles, ξ_{0.025} and ξ_{0.975}, such that

P( ξ_{0.025} ≤ √T(X̄ − µ)/σ ≤ ξ_{0.975} ) = 0.95,

where ξ_α corresponds to the α quantile of the distribution G_T, which is the distribution of √T(X̄ − µ)/σ. Since √T(X̄ − µ)/σ is asymptotically normally distributed, we approximate ξ_{0.025} and ξ_{0.975} with z_{0.025} and z_{0.975} (which are −1.96 and 1.96); i.e. we approximate the true CI for the mean,

[ X̄ + ξ_{0.025} σ/√T, X̄ + ξ_{0.975} σ/√T ],

with its normal approximation

[ X̄ + z_{0.025} σ/√T, X̄ + z_{0.975} σ/√T ].

Hence it is interesting to see (a) how close z_α and ξ_α are and (b) what the difference between the two CIs is. The Edgeworth expansion can be inverted to go from probabilities to quantiles, and it can be shown (see Hall (1992), Chapter 2.5) that

ξ_α = z_α + T^{−1/2} p̃_1(z_α) + T^{−1} p̃_2(z_α) + ...,

where

p̃_1(z) = −p_1(z),  p̃_2(z) = p_1(z)p_1'(z) − z p_1(z)²/2 − p_2(z).
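The size of the leading O(T^{−1/2}) term above can be sketched numerically. For a standardised exponential, κ_3 = 2, so the magnitude of the first correction at x = 1.96 is |p_1(x)|φ(x)/√T:

```python
from math import sqrt, exp, pi

kappa3 = 2.0                                    # third cumulant of a standardised exponential
x = 1.96
phi_x = exp(-x * x / 2.0) / sqrt(2.0 * pi)      # standard normal density at x
p1_mag = kappa3 * (x * x - 1.0) / 6.0           # |p_1(x)| = |kappa_3| (x^2 - 1) / 6

errs = [p1_mag * phi_x / sqrt(T) for T in (10, 50, 200, 1000)]
print(errs)   # the error of the normal approximation shrinks like T^{-1/2}
```

Even at T = 50 the leading error at the 97.5% point is close to one percentage point, which explains why normal-approximation CIs can undercover for skewed data.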
11.2 The Bootstrap and why it works

For a review of many applications of the bootstrap see Efron and Tibshirani (1993). For the theory behind the bootstrap see the books by Hall (1992), van der Vaart (2000), Lahiri (2003) and Politis and Romano.

11.2.1 The Bootstrap methodology

The heuristics above give us an explanation as to why the asymptotic normality approximation may not be particularly good for small samples. The bootstrap is a form of sampling from the data which tries to capture features in the distribution which the over-simplified normal approximation cannot. Resampling methods have been in the statistical literature for over 50 years. However, it was Efron who proposed the bootstrap as it is today, and really brought to attention its importance in solving various statistical problems. The bootstrap is a tool which allows us to obtain better finite sample approximations of the distributions of estimators. The bootstrap is used all over the place: to estimate the variance, correct bias, construct CIs etc. There are many, many different types of bootstrap. Here we describe two simple versions of the bootstrap for constructing CIs. They can be roughly described as the nonparametric bootstrap and the parametric bootstrap (in my opinion the nonparametric bootstrap is more flexible).

The nonparametric bootstrap confidence interval for the mean

We will assume that {X_t} are iid random variables with mean µ, variance σ² and that the fourth moment exists. To simplify the explanation we will assume the variance of {X_t} is known. All the sampling properties of the bootstrap procedure that we describe also hold when the variance is unknown (in which case we need to use what is called the studentised bootstrap; however, more sophisticated techniques and greater care have to be used to prove the results). Let us consider the sample mean X̄ = (1/T) Σ_{t=1}^T X_t.
As we mentioned above, asymptotically the distribution of √T(X̄ − µ)/σ is normal, and the asymptotic (1−α)100% confidence interval for the mean µ is

[ X̄ + z_{α/2} σ/√T, X̄ + z_{1−α/2} σ/√T ].

But we want to obtain a better approximation of the true confidence interval

[ X̄ + ξ_{α/2} σ/√T, X̄ + ξ_{1−α/2} σ/√T ],

where ξ_α is the α quantile of the distribution G_T, which is the actual distribution of √T(X̄ − µ)/σ. However, we can obtain an estimator of G_T. We recall that we observe the iid
random variables {X_t}_{t=1}^T, where the distribution function of X_t is F. In the nonparametric bootstrap we do not know the distribution F, but we can estimate it with the empirical distribution function

F_T(x) = (1/T) Σ_{t=1}^T I(X_t ≤ x);

we note that though F_T(x) is random (since it depends on the sample), it is a proper distribution function, with mean X̄ and variance σ̂² = (1/T) Σ_{t=1}^T (X_t − X̄)².

Now we recall that G_T(x) is basically the distribution of T^{−1/2} Σ_{t=1}^T (X_t − µ)/σ, where the X_t are independent draws from the unknown distribution F. Hence, if we want to estimate G_T(x) and do not have F to sample from, it is natural to sample from the distribution that we do have available, which is F_T. We use the following algorithm:

i) We sample T independent times from F_T to obtain the bootstrap sample X*_{T,1} = (X*_{1,1},...,X*_{T,1}). Using this we obtain the bootstrap estimator of the mean, X̄*_{T,1} = (1/T) Σ_{t=1}^T X*_{t,1}. As this is a sample from the empirical distribution function F_T, the mean of X̄*_{T,1} is the mean of F_T, which is X̄ (recall that the mean of X̄ is µ, the mean of the distribution F). We note this is equivalent to drawing from {X_1,...,X_T} T times with replacement.

ii) We do this multiple times. In fact one can draw T^T different samples. For each bootstrap sample we calculate the sample mean, so that we have {X̄*_{T,1},..., X̄*_{T,n}}, where n = T^T. Based on this we can construct the bootstrap estimator of the distribution G_T(x), which is

Ĝ_T(x) = (1/T^T) Σ_{k=1}^{T^T} I( √T( X̄*_{T,k} − X̄ )/σ̂ ≤ x );

we use X̄ and σ̂/√T in the definition of Ĝ_T because these are the mean and standard deviation of the bootstrap sample mean based on sampling from F_T. Now, if F_T(x) were the true distribution of X_t, then Ĝ_T(x) = G_T(x). Of course it is not, so Ĝ_T(x) is only an estimator of G_T. In reality it may not be possible to obtain all T^T samples (this is a lot!), but we sample enough times, and in a good way, to obtain a good enough approximation of Ĝ_T(x). We will assume that we can obtain Ĝ_T(x).
iii) Since Ĝ_T(x) is an estimator of G_T we can obtain an estimator of the quantiles, and thus use it to obtain an estimator of the CIs (and hope that it is more accurate than the standard normal approximation). Let ξ̂_α be such that Ĝ_T(ξ̂_α) = α.
iv) The 95% bootstrap CI for the mean µ is [ X̄ + ξ̂_{0.025} σ/√T, X̄ + ξ̂_{0.975} σ/√T ].

The parametric bootstrap confidence interval of an estimator

Let us suppose {X_t} are iid random variables with distribution f(·;θ_0), where the parameter θ_0 is unknown. Suppose we use the MLE to estimate the parameter θ_0, which we denote as θ̂_T. We know that if all the regularity conditions are satisfied then we have √T(θ̂_T − θ_0) →^D N(0, I(θ_0)^{−1}), where

I(θ_0) = ∫ ( (d/dθ) log f(x;θ)|_{θ=θ_0} )² f(x;θ_0) dx.

Of course this is an asymptotic result. If the sample size is small, we may want to obtain a better finite sample approximation of the distribution of √T(θ̂_T − θ_0), to construct better CIs. Let G_T denote the distribution of √T(θ̂_T − θ_0).

i) We sample T independent times from the distribution f(x; θ̂_T), and for each bootstrap sample X*_{T,1} = (X*_{1,1},...,X*_{T,1}) we construct the bootstrap MLE θ̂*_{T,1}. We do this many times; we denote the kth bootstrap estimator as θ̂*_{T,k}.

ii) Unlike the nonparametric bootstrap, there is an infinite number of draws one can make; hence we cannot construct an estimator of G_T using all possible draws from f(x; θ̂_T). But one can construct an estimate of the finite sample distribution of √T(θ̂_T − θ_0) using a large number n of draws. Let

Ĝ_T(x) = (1/n) Σ_{k=1}^n I( √T( θ̂*_{T,k} − θ̂_T ) ≤ x ).

iii) Let ξ̂_α be such that Ĝ_T(ξ̂_α) = α. The 95% bootstrap CI for θ_0 is

[ θ̂_T + T^{−1/2} ξ̂_{0.025}, θ̂_T + T^{−1/2} ξ̂_{0.975} ].

An alternative way to construct the CI is to use the likelihood ratio test. We recall that if f(·;θ_0) is the true distribution then 2( L_T(θ̂_T) − L_T(θ_0) ) →^D χ²_p, and we can use this result to construct the 100(1−α)% CI (see the section on confidence intervals). But if the sample size is small and we believe that the normality result is a poor approximation, then the chi-squared result would also be a poor approximation; we can use a bootstrap method instead. In this case, for every bootstrap estimator θ̂*_{k,T}, we plug it into the log-likelihood

L_T(θ̂*_{k,T}) = Σ_{t=1}^T log f(X_t; θ̂*_{k,T}).
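Steps (i)-(iii) of the parametric bootstrap above can be sketched for the exponential scale parameter, whose MLE is the sample mean (the sample size, number of resamples and true value below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
theta0 = 2.0
x = rng.exponential(theta0, size=40)
T = x.size
theta_hat = x.mean()                  # exponential MLE

B = 5000
boot = np.empty(B)
for b in range(B):
    xs = rng.exponential(theta_hat, size=T)         # sample from f(x; theta_hat)
    boot[b] = np.sqrt(T) * (xs.mean() - theta_hat)  # bootstrap version of sqrt(T)(theta_hat - theta0)

xi_lo, xi_hi = np.quantile(boot, [0.025, 0.975])    # estimates of xi_{0.025}, xi_{0.975}
ci = (theta_hat + xi_lo / np.sqrt(T), theta_hat + xi_hi / np.sqrt(T))
print(ci)
```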
We can then construct an estimator of the distribution function of 2(L_T(θ̂_T) − L_T(θ_0)), which we denote by H_T, as

Ĥ_T(x) = (1/n) Σ_{k=1}^n I( 2(L_T(θ̂_T) − L_T(θ̂_{k,T})) ≤ x ).

Let ξ̂_α be such that Ĥ_T(ξ̂_α) = α. The 100(1−α)% CI for θ based on the log-likelihood ratio is

{ θ : L_T(θ) ≥ L_T(θ̂_T) − ξ̂_{1−α}/2 }.

In reality the parametric bootstrap is not used as much as the nonparametric bootstrap. The main reason is that in the misspecified case the CIs produced have no meaning: they are incorrect, and will not even converge to the CIs produced by the normal approximation with the misspecification-robust variance I(θ_g)^{−1} J(θ_g) I(θ_g)^{−1}.

Using Edgeworth expansions to show why the nonparametric bootstrap works

On first reading the bootstrap may seem a little like magic, but really it is not. We recall that G_T is the distribution of the standardised sample mean √T(X̄ − µ)/σ based on sampling from F. Since in reality F is unobserved and can only be estimated by the empirical distribution function F̂_T, it does not seem unnatural that Ĝ_T can be used as an estimator of G_T. We first state a consistency result; the proof can be found in various places, see for example Hall (1992) or van der Vaart (1998). There are different ways this can be proven; in the more complex setting where we are not estimating the mean, the Mallows distance (a measure of the distance between distributions) may be the most appropriate tool for proving the result.

Theorem (Consistency). Suppose that E(X_t^4) < ∞. Then

√T(X̄*_T − X̄)/σ̂ →D N(0,1),

where X̄*_T denotes the bootstrap sample mean, noting that √T(X̄ − µ)/σ →D N(0,1).

The value of the above result is that it shows the bootstrap distribution Ĝ_T converges to the standard normal, just as G_T converges to the standard normal. Hence we do not lose by using the bootstrap approximation for the CIs. We now show what we can gain by using the bootstrap. Let us recall (11.4):

G_T(x) = P(S_T ≤ x) = Φ(x) + T^{−1/2} p_1(x)φ(x) + T^{−1} p_2(x)φ(x) + T^{−3/2} p_3(x)φ(x) + ...,
97 where Φ is the distribution of the standard normal, φx is the standard normal density and p 1 x = 1 6 κ 3x 2 1 and p 2 x = x{ 1 24 κ 4x κ2 3x 4 10x 2 +15}. We now rewrite the above results in terms of the underlying distribution of the random variables. Let us suppose the distribution of the iid random variables {X t } is F. Then rewriting the above we have G T x = PS T x F = Φx+ 1 T 1/2p 1x Fφx+ 1 T p 2x Fφx+..., 11.6 where p 1 x F,p 2 x F etc. are the polynomials, whose coefficients are determined by the cumulants p 1 x F = 1 X µf 6 κ 3 F x 2 1 σf { 1 X µf p 2 x F = x 24 κ 4 F x X µf σf 72 κ 2 } 3 F x 4 10x 2 +15, σf µf = E F X, σf 2 = E F X 2 E F X 2, X µf κ 3 F σf and E F X = xdfx. κ 3 X F = E F X 3 E F X 3 κ 4 X F = E F X 4 3E F X 2 2 3EXEX 3 +E F X 4 X µf = σf 3/2 κ 3 X F κ 4 F = σf 2 κ 4 X F σf This leads us to something rather fascinating. We recall that the bootstrap distribution Ĝ T x is an approximation of the finite sample distribution G T. G T is determined by the measure F and the bootstrap distribution is based entirely on the random measure ˆF T. Hence conditioning on the distribution ˆF T by using 11.6, we have the Edgeworth expansion of the random measure ĜTx conditioned on ˆF T which is Ĝ T x = PS T x ˆF T = Φx+ 1 T 1/2ˆp 1x ˆF T φx+ 1 T ˆp 2x ˆF T φx+..., where p 1 x ˆF T,p 2 x ˆF T, are random and given by p 1 x ˆF T = 1 6 σˆf T 3/2 κ 3 X ˆFT x 2 1 = 1 6 ˆσ 3/2ˆκ 3 x 2 1 { 1 p 2 x ˆF T = x 4 x 24ˆσ 2ˆκ ˆσ 3ˆκ 1 } 2 3x 4 10x 2 +15, since the mean of ˆF T is X the variance of ˆF T is ˆσ 2, the rth order cumulant of ˆF T is the empirical cumulant ˆκ r. 97
98 Remark Since ˆF T x is the distribution function of a discrete random variables, which gives the weight 1/T to the event X t and zero otherwise we see that X r = 1 X r EˆFT n t, hence we obtain that the mean with respect to ˆF T is X etc. Hence comparing the above with 11.6 we have r where G T x = PS T x F = Φx+ 1 T 1/2p 1x Fφx+ 1 T p 2x Fφx+..., Ĝ T x = PS T x ˆF T = Φx+ 1 T 1/2ˆp 1x ˆF T φx+ 1 T ˆp 2x ˆF T φx+..., ˆp 1 x ˆF T = 1 6 ˆσ 3/2ˆκ 3 x 2 1 { 1 ˆp 2 x ˆF T = x 4 x 24ˆσ 2ˆκ ˆσ 3ˆκ 1 } 2 3x 4 10x p 1 x F = 1 6 σ 3/2 κ 3 x 2 1 { 1 p 2 x F = x 24 σ 2 κ 4 x } 72 σ 3 κ 2 3x 4 10x Therefore taking differences gives G T x ĜTx = 1 T 1/2 p 1 x F ˆp 1 x ˆF T φx+ 1 T p 2 x F ˆp 2 x ˆF T φx Now this is whether the bootstrap distribution becomes very useful, we recall see that p 1 x F contains the third order cumulant, whereas ˆp 1 x ˆF T contains it s estimator. We recall that X µ = O p T 1/2, ˆσ σ = O p T 1/2, the same is true for ˆκ 3 and ˆκ 4, that is ˆκ 3 κ 3 = O p T 1/2, κ 4 ˆκ 4 = O p T 1/2. Substituting this into 11.7 leads to G T x ĜTx = O p 1 T. Let us now compare this result with the normal approximation in This gives us G T x Φx = O p 1 T 1/2 Hence we observe that the bootstrap distribution ĜTx leads to a better approximation of the finite sample distribution than the normal approximation. Now by using the Cornish-Fisher expansions one can show that ˆξ α ξ α = O p T 1 98
compared with ξ_α − z_α = O(T^{−1/2}) for the normal quantile. Hence the confidence intervals constructed using the bootstrap are more accurate than the CIs based on the normal approximation.

Remark i In the case that σ is unknown, a similar result applies, but the calculations become more complicated. However, it is always better to try to transform the estimator into a quantity which is asymptotically pivotal. We recall that a statistic is asymptotically pivotal if its limiting distribution does not depend on the parameters. Asymptotically, the distribution of √T(X̄ − µ) depends on the variance σ². If we bootstrap √T(X̄ − µ) instead of √T(X̄ − µ)/σ and the variance σ² is unknown, we may not gain in terms of approximation accuracy.

ii Please observe that the calculations above are heuristic, since we have not given conditions under which the expansions hold.

iii Similar arguments to those given above can also be applied to bootstrapping other parameters θ besides the mean. They can also be generalised to dependent data and to far more complicated situations.
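As a complement to remark (i), the following sketch (our own illustration, not from the notes) bootstraps the studentised, asymptotically pivotal statistic √T(X̄* − X̄)/σ̂* nonparametrically; the data and all variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(2.0, size=40)   # skewed data, where the normal approximation is crude
T, xbar, s = len(x), x.mean(), x.std(ddof=1)

n_boot = 5000
boot = np.empty(n_boot)
for k in range(n_boot):
    xs = rng.choice(x, size=T, replace=True)                    # resample from F_T
    boot[k] = np.sqrt(T) * (xs.mean() - xbar) / xs.std(ddof=1)  # studentised => pivotal

xi_lo, xi_hi = np.quantile(boot, [0.025, 0.975])
# bootstrap CI for the mean, of the same form as in the text
ci = (xbar + xi_lo * s / np.sqrt(T), xbar + xi_hi * s / np.sqrt(T))
```

Because the statistic is studentised inside each resample, its bootstrap law does not hinge on the unknown σ², which is what makes the higher-order refinement of the Edgeworth argument available.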
Chapter 12

A short description of the empirical distribution

12.1 The empirical distribution and the nonparametric likelihood

See Owen (2001), Empirical Likelihood, for more details. We note that the empirical distribution function is often called the nonparametric maximum likelihood estimator. We now show that if the random variable T takes only discrete values, i.e. values in {t_s}, then the empirical distribution is the maximum likelihood estimator. We then extend this argument to the case of continuous random variables.

Lemma. Suppose {T_i}_{i=1}^n are iid random variables which can only take the discrete values {t_s}_{s=1}^m. Let n_s denote the number of occurrences of t_s. Then the likelihood of {T_i} is

L_n(π_1,...,π_m) = (n choose n_1,...,n_m) ∏_{s=1}^m π_s^{n_s},

so the log-likelihood is

ℒ_n(π_1,...,π_m) = log(n choose n_1,...,n_m) + Σ_{s=1}^m n_s log π_s.

The maximum likelihood estimators of {π_s} subject to the constraint Σ_{s=1}^m π_s = 1 are π̂_s = n_s/n. Equivalently, the distribution function which maximises the above likelihood is the empirical distribution function

F̂_n(x) = (1/n) Σ_{i=1}^n I(T_i ≤ x).

PROOF. It is clear that if P(T = t_s) = π_s, then up to a constant the log-likelihood of {T_i} is

Σ_{i=1}^n Σ_{s=1}^m I(T_i = t_s) log π_s = Σ_{s=1}^m n_s log π_s,
which gives the required likelihood. We need to maximise the above with respect to the parameters {π_s}. However, since {π_s} are probabilities, Σ_{s=1}^m π_s = 1, so the maximisation needs to be done under this constraint. There are various ways this can be done; one method is to substitute the constraint into the last probability (as was done in an earlier section), but a more general method is to use Lagrange multipliers. That is, we maximise the log-likelihood ℒ_n(π_1,...,π_m) subject to the constraint Σ_{s=1}^m π_s = 1 by adding an additional term involving the dummy variable λ, defining the constrained likelihood

ℒ_n(π_1,...,π_m,λ) = Σ_{s=1}^m n_s log π_s + λ(1 − Σ_{s=1}^m π_s).

Now, maximising the above with respect to π_1,...,π_m,λ, we obtain the maximum likelihood estimators π̂_s = n_s/n. We observe that these are the probabilities associated with the empirical distribution function F̂_n(x) = (1/n) Σ_{i=1}^n I(T_i ≤ x). We mention that if we do not observe the event t_i, then the mle of the probability of its occurrence is zero, which is obvious.

We now show a very similar result for continuous random variables. To show this result, consider a candidate distribution G(x) and define the probability P(T = x) = G(x) − G(x−), where G(x−) = P(X < x); for discrete random variables P(T = x) is a genuine probability, while for continuous random variables P(T = x) = g(x)δx, which is not so well defined. Let us suppose we observe {T_i}. Using this notation, it is clear that the likelihood of {T_i} with respect to the candidate distribution G is

L_n(G) = ∏_{i=1}^n ( G(T_i) − G(T_i−) ).

Let z_1 < z_2 < ... < z_m be the distinct values which the sample {T_i}_{i=1}^n takes (clearly m ≤ n) and let n_s denote the number of occurrences of z_s. We now show that the likelihood L_n(G) is maximised when P(T = z_s) = n_s/n (often n_s = 1), hence F̂_n(x) = (1/n) Σ_{i=1}^n I(T_i ≤ x).
To show this, we will show that for any candidate distribution function G we have log L_n(G) − log L_n(F̂_n) ≤ 0, which implies L_n(G) ≤ L_n(F̂_n). Define p_s = G(z_s) − G(z_s−) and p̂_s = n_s/n. Therefore

log L_n(G) − log L_n(F̂_n) = Σ_{s=1}^m n_s log(p_s/p̂_s) = n Σ_{s=1}^m p̂_s log(p_s/p̂_s).
Now we observe that Σ_{s=1}^m p̂_s log(p_s/p̂_s) = E[log(p_S/p̂_S)], where the random variable S is such that P(S = z_s) = p̂_s. Hence using Jensen's inequality we have

E[log(p_S/p̂_S)] ≤ log E[p_S/p̂_S] = log Σ_{s=1}^m p̂_s (p_s/p̂_s) = log Σ_{s=1}^m p_s ≤ 0.

Therefore we have

log L_n(G) − log L_n(F̂_n) ≤ 0,

where equality only arises when p_s = n_s/n for all s. Thus the empirical distribution maximises the likelihood. Hence the empirical distribution is often said to be the maximum of the nonparametric likelihood. We note that the above argument is essentially the nonnegativity of the relative entropy (Kullback-Leibler divergence), an important result in information theory.
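The lemma and the Jensen argument can be checked numerically. This quick sketch is our own addition with made-up data: it confirms that no other probability vector on the observed support beats the empirical probabilities n_s/n.

```python
import numpy as np

rng = np.random.default_rng(2)
t = rng.integers(0, 4, size=30)         # iid draws on the support {0,1,2,3}
vals, n_s = np.unique(t, return_counts=True)
p_hat = n_s / n_s.sum()                 # empirical probabilities n_s / n

def loglik(p):
    # multinomial log-likelihood sum_s n_s log(pi_s), constant term dropped
    return float(np.sum(n_s * np.log(p)))

# the empirical probabilities dominate random candidate probability vectors
candidates = [rng.dirichlet(np.ones(len(vals))) for _ in range(200)]
best_other = max(loglik(q) for q in candidates)
```

Strict concavity of the log-likelihood on the simplex means the empirical vector is the unique maximiser, so any other candidate comes out strictly worse.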
Chapter 13

Survival Analysis

13.1 An introduction to survival analysis

See also Section 5.4 in Davison (2002). There are also a lot of excellent books on survival analysis.

What is survival data?

So-called survival data occur in several different applications. Usually a set of individuals is observed and we record the failure time or lifetime of each individual. We note that an individual does not necessarily need to be a person, but can be an electrical component etc. Examples include:

- Lifetime of a machine component.
- Time until a patient's cure, remission, or passing.
- Time for a subject to perform a task.
- Duration of an economic cycle.

Also, it may not be time we are interested in, but:

- Length of the run of a particle.
- Amount of usage of a machine, e.g. amount of petrol used, etc.

In the case that we do not observe any regressors (explanatory variables) which influence the survival time, such as the gender or age of a patient, we can model the survival times as iid random variables. If the survival times are believed to have the density f(x;θ_0), where f(x;θ)
is known but θ_0 is unknown, then maximum likelihood can be used to estimate θ_0. The standard results discussed in Section 4.1 can easily be applied to this type of data.

The survival, hazard and cumulative hazard functions

Let T denote the survival time of an individual, which has density f. The density f and the distribution function F(x) = ∫_0^x f(u)du are not particularly informative about the chance of survival at a given time point. Instead, the survival, hazard and cumulative hazard functions, which are functions of the density and the distribution function, are used.

The survival function. This is F̄(x) = 1 − F(x). It is straightforward to see that F̄(x) = P(T > x); therefore F̄(x) is the probability of surviving beyond x. It is clear why F̄(x) is called the survival function.

The hazard function. The hazard function is defined as

h(x) = lim_{δx→0} P(x ≤ T < x+δx | T ≥ x)/δx = lim_{δx→0} (F(x+δx) − F(x)) / (δx F̄(x)) = f(x)/F̄(x) = − d log F̄(x)/dx.

We can see from the definition that the hazard function is the chance of failure (though it is a normalised density, not a probability) at time x, given that the individual has survived until time x. The hazard function is similar to the density in the sense that it is a positive function. However, it does not integrate to one; indeed, it is not integrable.

The cumulative hazard function. This is defined as

H(x) = ∫_0^x h(u)du.

It is straightforward to see that

H(x) = ∫_0^x (− d log F̄(u)/du) du = − log F̄(x).

This is just the analogue of the distribution function; however, we observe that unlike the distribution function, H(x) is unbounded. It is straightforward to show that

f(x) = h(x) exp(−H(x)) and F̄(x) = exp(−H(x)).
It is useful to know that given any one of f(x), F̄(x), h(x) and H(x), we can obtain the other functions. Hence there is a one-to-one correspondence between all these functions.

Example (The exponential distribution). Suppose that f(x) = (1/θ) exp(−x/θ). Then the distribution function is F(x) = 1 − exp(−x/θ), and

F̄(x) = exp(−x/θ), h(x) = 1/θ and H(x) = x/θ.

The exponential distribution is widely used. However, it is not very flexible. We observe that the hazard function is constant over time. This is the well-known memoryless property of the exponential distribution. In terms of modelling, it means that the chance of failure in the next instant does not depend on how old the individual is; the exponential distribution cannot model ageing.

(The Weibull distribution.) We recall that this is a generalisation of the exponential distribution, where

f(x) = (α/θ)(x/θ)^{α−1} exp(−(x/θ)^α); α, θ > 0, x > 0.

For the Weibull distribution

F(x) = 1 − exp(−(x/θ)^α), F̄(x) = exp(−(x/θ)^α), h(x) = (α/θ)(x/θ)^{α−1}, H(x) = (x/θ)^α.

Compared to the exponential distribution, the Weibull has a lot more flexibility. Depending on the value of α, the hazard function h(x) can either increase over time or decay over time.

(The shortest lifetime.) Suppose that Y_1,...,Y_k are independent lifetimes and we are interested in the shortest survival time (for example, this could be the shortest survival time of k sibling mice given some medication). Let g_i, Ḡ_i, H_i and h_i denote the density, survival function, cumulative hazard and hazard function of Y_i (we are not assuming iid), and let T = min(Y_1,...,Y_k). Then the survival function of T is

F̄(x) = P(T > x) = ∏_{i=1}^k P(Y_i > x) = ∏_{i=1}^k Ḡ_i(x).
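The one-to-one correspondence between f, F̄, h and H can be verified numerically for the Weibull example above. This small check is our own addition; the parameter values are arbitrary.

```python
import numpy as np

alpha, theta = 1.7, 2.0   # Weibull shape and scale (arbitrary example values)
x = np.linspace(0.1, 6.0, 200)

f = (alpha / theta) * (x / theta) ** (alpha - 1) * np.exp(-(x / theta) ** alpha)
S = np.exp(-(x / theta) ** alpha)                  # survival function F-bar(x)
h = (alpha / theta) * (x / theta) ** (alpha - 1)   # hazard f(x) / F-bar(x)
H = (x / theta) ** alpha                           # cumulative hazard -log F-bar(x)
```

On this grid the identities f = h·exp(−H), F̄ = exp(−H), h = f/F̄ and H = −log F̄ all hold to machine precision, as the algebra above says they must.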
108 Since the cumulative hazard function satisfies Hx = log Fx, the cumulative hazard function of T is k Hx = logg i x = k H i x and the hazard function is hx = k d logg i x dx = k h i x Remark Discrete Data Let us suppose that the survival time are not continuous random variables, but discrete random variables. In other words, T can take any of the values {t i } where 0 t 1 < t 2 <... Examples include the first time an individual visits a hospital after an operation, in this case it is unlikely that the exact time of visit is known, but the date of visit may be recorded. Let PT = t i = p i, using this we can define the survival function, hazard and cumulative hazard function. Survival function The survival function is F i = PT t i = PT = t j = j=i p j. j=i Hazard function The hazard function is h i = Pt i 1 T < t i T t i 1 = PT = t i PT T i 1 p i = = F i 1 F i = 1 F i F i 1 F i 1 F i 1 Now by using the above we have the following useful representation F i = i j=2 F i F i 1 = i i 1 hj = 1 hj, 13.2 j=2 j=1 since h 1 = 0 and F 1 = 1. Cumulative hazard function The cumulative hazard function is H i = i j=1 h j. These expression will be very useful when we consider nonparametric estimators of the survival function F. 108
Censoring and the maximum likelihood

One main feature of survival data which distinguishes it from all the data types that we have considered so far is that it is often incomplete. This means that there are situations where the random survival time is not completely observed (this is often called incomplete data). Usually the incompleteness takes the form of censoring, and this is the type of incompleteness we will consider here. There are many types of censoring; the type we will consider here is right censoring. This means we may not observe the time of failure, and only have the knowledge that the individual survived beyond a certain time point. For example, an individual may, for reasons independent of its survival time, choose to leave the study; in this case, we would only know that the individual survived beyond a certain time point. This is called right censoring. (Left censoring arises when the start or "birth" of an individual is unknown, so it is known when an individual passes away, but the individual's year of birth is unknown; we will not consider this problem here.)

Let us suppose that T_i is the survival time, but this may not be observed, and we observe instead Y_i = min(T_i, c_i), where c_i is the potential censoring time. We do know whether the data has been censored: together with Y_i we observe the indicator variable

δ_i = 1 if T_i ≤ c_i (uncensored), δ_i = 0 if T_i > c_i (censored).

Hence, in survival analysis we typically observe {(Y_i, δ_i)}_{i=1}^n. We use the observations {(Y_i, δ_i)}_{i=1}^n to make inference about the unknown parameters in the model. Let us suppose that T_i has the density f(x;θ_0), where f is known but θ_0 is unknown.

Naive approaches to likelihood construction

There are two naive approaches for estimating θ_0. One method is to ignore the fact that the observations are censored, and to use the times of censoring as if they were failure times.
Hence define the likelihood

L_{1,n}(θ) = Σ_{i=1}^n log f(Y_i;θ),

and use as the parameter estimator θ̂_{1,n} = arg max_{θ∈Θ} L_{1,n}(θ). The fundamental problem with this approach is that it will be biased. To see this, consider the expectation of n^{−1} L_{1,n}(θ), which is

E[log f(Y_i;θ)] = ∫_0^c log f(x;θ) f(x;θ_0) dx + F̄(c;θ_0) log f(c;θ).
We can see that θ_0 does not necessarily maximise the above (differentiate with respect to θ and see whether it does). For example, suppose f is the exponential density, so the parameter we estimate is the mean θ_0. If we treat the censored times as failure times, then the estimated mean will be less than the true mean θ_0. Indeed, if the censored observations form a large proportion of the total number of observations, the estimator will be asymptotically biased, since it will consistently underestimate the mean. Hence this approach should be avoided, since the resulting estimator is biased.

Another method is to construct the likelihood function by filtering out the censored data. In other words, use the log-likelihood function

L_{2,n}(θ) = Σ_{i=1}^n δ_i log f(Y_i;θ),

and let θ̂_{2,n} = arg max_{θ∈Θ} L_{2,n}(θ) be an estimator of θ. It can be shown that if a fixed censor value is used, i.e. Y_i = min(T_i, c), then this estimator is not a consistent estimator of θ_0; it is also biased. As above, consider the expectation of n^{−1} L_{2,n}(θ), which is

E[δ_i log f(Y_i;θ)] = ∫_0^c log f(x;θ) f(x;θ_0) dx.

Thus, we see that θ_0 does not maximise the above.

The likelihood under censoring

The likelihood under censoring can be constructed using either the density and distribution functions or the hazard and cumulative hazard functions. Both are equivalent; which you choose depends on your inclination. The log-likelihood will be a mixture of probabilities and densities, depending on whether the observation was censored or not. We will suppose that we observe (Y_i, δ_i), where Y_i = min(T_i, c_i) and δ_i is the indicator variable. In this section we treat the c_i as if they were deterministic; we consider the case that they are random later.
We first observe that if δ_i = 1, then the density of the individual observation Y_i given δ_i = 1 is

P(Y_i ∈ dx | δ_i = 1) = P(T_i ∈ dx | T_i ≤ c_i) = f(x;θ) dx / F(c_i;θ).

On the other hand, if δ_i = 0, the conditional likelihood of the individual observation Y_i given δ_i = 0 is simply one, since if δ_i = 0 then Y_i = c_i (it is given). Of course, it is clear that P(δ_i = 1) = F(c_i;θ) and P(δ_i = 0) = 1 − F(c_i;θ). Thus, altogether, the joint density of (Y_i, δ_i) is

( f(x;θ)/F(c_i;θ) )^{δ_i} F(c_i;θ)^{δ_i} (1 − F(c_i;θ))^{1−δ_i} = f(x;θ)^{δ_i} (1 − F(c_i;θ))^{1−δ_i}.
Therefore, by using f(Y_i;θ) = h(Y_i;θ)F̄(Y_i;θ) and H(Y_i;θ) = − log F̄(Y_i;θ), the joint log-likelihood of {(Y_i, δ_i)}_{i=1}^n is

L_n(θ) = Σ_{i=1}^n [ δ_i log f(Y_i;θ) + (1−δ_i) log(1 − F(Y_i;θ)) ]
       = Σ_{i=1}^n δ_i [ log h(T_i;θ) − H(T_i;θ) ] − Σ_{i=1}^n (1−δ_i) H(c_i;θ).   (13.4)

Hence we use as the maximum likelihood estimator θ̂_n = arg max L_n(θ).

Example (The exponential distribution). Suppose that the density of T_i is f(x;θ) = θ^{−1} exp(−x/θ); then by using (13.4) the log-likelihood is

L_n(θ) = Σ_{i=1}^n [ δ_i ( − log θ − θ^{−1} Y_i ) − (1−δ_i) θ^{−1} Y_i ].

By differentiating the above, it is straightforward to show that the maximum likelihood estimator is

θ̂_n = ( Σ_i δ_i T_i + Σ_i (1−δ_i) c_i ) / Σ_i δ_i.

However, it is not clear what the limit of θ̂_n will be, whether it is biased etc. But this can be resolved if we suppose the sampling is random (see below).

Types of censoring and consistency of the mle

Often it can be shown that under certain censoring regimes the estimator converges to the true parameter and is asymptotically normal. More precisely,

√n(θ̂_n − θ_0) →D N(0, I(θ_0)^{−1}),

where

I(θ) = − E[ (1/n) ( Σ_i δ_i ∂² log f(Y_i;θ)/∂θ² + Σ_i (1−δ_i) ∂² log F̄(c_i;θ)/∂θ² ) ].

We discuss the behaviour of the likelihood estimator defined in (13.4) for different censoring regimes.

Non-random censoring. Let us suppose that Y_i = min(T_i, c), where c is some deterministic censoring point (for example, the number of years cancer patients are observed). We first show that the expectation of the likelihood is maximised at the true parameter; this
112 under certain conditions means that the mle defined in 13.4 will converge to the true parameter. Taking expectation of L n θ gives E L n θ = TE δ i logft i ;θ+1 δ i logft i ;θ = T c 0 logfx;θfx;θ 0 dx+fc;θ 0 logfc theta. To show that the above is maximum at θ assuming no restrictionson the parameter space we differentiate E L n θ with respect to θ and show that it is zero at θ 0. The derivative at θ 0 is E L n θ θ=θ0 = c 0 = 1 Fc;θ fx;θdx θ=θ0 + Fc;θ θ=θ0 θ=θ0 + Fc;θ θ=θ0 = 0. θ This proves that the expectation of the likelihood is maximum at zero. Now assuming that asymptotic normality can be shown the Fisher information after rescaling with n is c Iθ = 2 2 fx;θ 0 logfx;θdx+fc;θ 0 logfc;θ. 0 We observe that when c = 0 thus all the times are censored, the the Fisher information is zero, thus the asymptotic variance of the mle estimator, ˆθ n is not finite, which makes sense. It is worth noting that under this censoring regime the estimator is consistent, but the variance of the estimator will be larger than when there is no censoring just compare the Fisher informations for both cases. Random censoring In the above we have treated the censoring times as fixed. However, they can also be treated as if they were random. In other words, the censoring times {c i = C i } are random. Usually it is assumed that {C i } are iid random variables which are independent of the survival times if there is dependence, constructing the likelihood would beverydifficult, becausethejointdistributionoft i,c i wouldberequired. Furthermore, it is assumed that the distribution of C does not depend on the unknown parameter θ. Let k and K denote the density and distribution function of {C i }. Then by using the arguments given in 13.3 and?? the likelihood of the joint distribution of {Y i,δ i } n is L n,r θ = n δ i [ logfyi ;θ+log1 KY i ] +1 δ i [ log 1 FC i ;θ +logkc i ]. 112
If the censoring density k(y) does not depend on θ, then the maximum likelihood estimator of θ_0 is identical to the maximum likelihood estimator using the non-random likelihood (or, equivalently, the conditional likelihood). In other words,

θ̂_n = arg max L_n(θ) = arg max L_{n,R}(θ).

Hence the estimators using the two likelihoods are the same; the only difference is the limiting distribution of θ̂_n. We now examine what θ̂_n is actually estimating in the case of random censoring. To ease notation, let us suppose that the censoring times follow an exponential distribution, k(x) = β exp(−βx) and K(x) = 1 − exp(−βx). To see whether θ̂_n is biased, we evaluate the derivative of the likelihood. As both the full likelihood and the conditional likelihood yield the same estimators, we consider the expectation of the conditional log-likelihood. This is

E[L_n(θ)] = n E[ δ_i log f(T_i;θ) ] + n E[ (1−δ_i) log(1 − F(C_i;θ)) ]
          = n E[ log f(T_i;θ) E(δ_i | T_i) ] + n E[ log(1 − F(C_i;θ)) E(1−δ_i | C_i) ].

We observe that E(δ_i | T_i) = P(C_i > T_i | T_i) = exp(−βT_i) and E(1−δ_i | C_i) = P(T_i > C_i | C_i) = 1 − F(C_i;θ_0). Therefore

E[L_n(θ)] = n ∫_0^∞ exp(−βx) log f(x;θ) f(x;θ_0) dx + n ∫_0^∞ (1 − F(c;θ_0)) β exp(−βc) log(1 − F(c;θ)) dc.

Thus, in order for θ̂_n to consistently estimate θ_0, the derivative of the above should be zero at θ_0. On careful examination of the above (differentiating with respect to θ and equating to zero), it can be seen that this will not always be the case. Thus, in the case of random censoring the mle may be biased. Furthermore, the information matrix of θ̂_{n,R} will depend on the censoring distributions k and K. In reality, the censoring distribution will be unknown, and it will be difficult to obtain the information matrix under random censoring. In such cases, it may be better to assume non-random censoring instead.

Example: In the case that T_i is exponential (see the example above), the MLE is

θ̂_n = ( Σ_i δ_i T_i + Σ_i (1−δ_i) C_i ) / Σ_i δ_i.

Now suppose that C_i is random; then it is possible to calculate the limit of the above. Since the numerator and denominator are random, it is not easy to calculate the expectation.
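The limit (derived via Slutsky's theorem in the next paragraph) can be computed explicitly when the censoring times are also exponential. The simulation below is our own sketch: it computes the mle from the example above and compares it with E[min(T_i, C_i)]/P(T_i < C_i), which for this particular exponential pair simplifies back to θ_0, so the mle happens to remain consistent here.

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, beta = 2.0, 1.0   # T ~ Exp(mean theta0); C ~ Exp(rate beta), independent of T
n = 200_000
T = rng.exponential(theta0, size=n)
C = rng.exponential(1.0 / beta, size=n)
y = np.minimum(T, C)      # observed time Y_i = min(T_i, C_i)
delta = (T <= C)          # indicator of an uncensored observation

theta_hat = y.sum() / delta.sum()   # the censored-likelihood mle

# Slutsky limit E[min(T,C)] / P(T < C); here min(T,C) ~ Exp(rate 1/theta0 + beta)
rate = 1.0 / theta0 + beta
limit = (1.0 / rate) / ((1.0 / theta0) / rate)   # simplifies to theta0 for this pair
```

Changing the censoring law away from this convenient pair changes E[min(T,C)]/P(T < C), which is exactly the "check under what conditions θ̂_n is consistent" exercise in the text.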
114 However under certain conditions the denominator does not converge to zero we have by Slutsky s theorem that ˆθ n P n Eδ it i +1 δ i C i n Eδ i = EminT i,c i. PT i < C i If the distribution of C i and T i is known, the above can be calculated exercise. Hence, we are able to calculate the limit of the mle. Check under what conditions ˆθ n is a consistent estimator of θ 0. Type I and Type II censoring Type I sampling In this case, there is an upper bound on the observation time. In other words, if T i c we observe the survival time but if T i > c we do not observe the survival time. This situation can arise, for example, when a study audit ends and there are still individuals who are alive. This is a special case of non-random sampling with c i = c. Type II sampling We observe the first r failure times, T 1,...,T r, but do not observe the n r failure times, whose survival time is greater than T r we have used the ordering notation T 1 T 2... T n. The likelihood for censored discrete data Recall the discrete survival data considered in Remark , where the failures can occur at {t s } where 0 t 1 < t 2 <... We will suppose that the censoring of an individual can occur only at the times {t s }. We will suppose that the survival time probabilities satisfy PT = t s = p s θ, where the parameter θ is unknown but the function p s is known, and we want to estimate θ. Examples, include the geometric distribution which can be considered as the discrete time version of the exponential distribution and the Poisson. As in the continuous case let Y i denote the failure time or the time of censoring of the ith individual and let δ i denote whether the ith individual is censored or not. Hence, we observe {Y i,δ i }. To simplify the exposition let us define d s = number of failures at time t s q s = number censored at time t s. Hence, since the data is discrete observing {Y i,δ i } is equivalent to observing {d s,q s } ie. 
the number of failures and censors at time t s, in terms of likelihood construction. Using {d s,q s } and Remark we now construct the likelihood. We shall start with the usual not log likelihood. Let PT = t s θ = p s θ and PT t s θ = F s θ. Using this notation observe that 114
115 the probability of d s,q s is p s θ ds PT t s qs = p s θ ds F s θ qs, hence the likelihood is n L n θ = Yi θ p δi F Yi θ 1 δi = p s θ ds F s θ qs s=1 = s=1p s θ ds [ p j θ] qs. j=s For most parametric inference the above likelihood is relatively straightforward to maximise. However, in the case that our objective is to do nonparametric estimation where we do not assume a parametric model and directly estimate the probabilities without restricting them to a parametric family, then rewriting the likelihood in terms of the hazard function greatly simplies matters. By using some algebraic manipulations and Remark we now rewrite likelihood in terms of the hazard functions. Using that p s θ = h s θf s 1 θ see equation 13.1 we have L n θ = p s θ ds F s θ qs F s 1 θ ds = s θ s=1 s=1h ds F s θ qs+d s+1. Now, substituting F s θ = s j=1 1 h jθ see equation 13.2 into the above gives there are some typos below - but it is hard to correct! L n θ = s h s θ ds 1 h j θ qs+d s+1 s=1 j=1 m s = h s θ d s+1 1 h j θ qs+d s+1 = s=1 s=1j=1 h s θ ds 1 h s θ m=s qm+d m+1 = s=1 s=1 s=1 h s θ ds 1 h s θ m=s qm+d m+1. To simplify notation define N s to be the number of individuals who are alive just before time t s ie. thenumberofy i t s. ItisclearthatN s = m=s q m+d m. Therefore m=s q m+d m+1 = N s d s above likelihood can be rewritten as L n θ = s=1h s θ ds 1 h s θ Ns ds. We mention that N s is often called the risk set or the number of individuals alive just prior to t s. The log-likelihood is L n θ = s=1 d s logh s θ+n s d s log1 h s θ. Thus for the discrete time case the mle of θ is the parameter which maximises the above likelihood. 115
Nonparametric estimators of the hazard function - the Kaplan-Meier estimator

Let us suppose that {T_i} are iid random variables with distribution function F and survival function F̄. However, we do not know the class of functions from which F or F̄ may come. Instead, we want to estimate F̄ nonparametrically, in order to obtain a good idea of the shape of the survival function. Once we have some idea of its shape, we can conjecture the parametric family which may best fit it.

If the survival times have not been censored, the best nonparametric estimator of the cumulative distribution function F is the empirical distribution function

F̂_n(x) = (1/n) Σ_{i=1}^n I(T_i ≤ x).

In Section 12.1 we showed that the empirical distribution function maximises the nonparametric likelihood. Since F̂_n(x) is an estimator of the distribution function F, it is clear that an estimator of the survival function F̄(x) is

1 − F̂_n(x) = (1/n) Σ_{i=1}^n I(T_i > x).

In the case that the survival data is censored and we observe {(Y_i, δ_i)}, then F̂_n(x) is not a valid estimator of the survival function. An alternative estimator is the Kaplan-Meier estimator, which is a nonparametric estimator of the survival function F̄ that takes censoring into account. We will now derive the Kaplan-Meier estimator for discrete data. The derivation for continuous random variables is more complicated than in the discrete case, but it is similar. It should be mentioned that despite the derivation appearing complex, the actual estimator is extremely simple.

We show that the maximum likelihood estimator of the hazard h_s = P(T = t_s)/P(T ≥ t_s) is

ĥ_s = d_s / N_s,

where d_s is the number of failures at time t_s and N_s is the number of survivors just before time t_s. In many respects, this is a rather intuitive estimator of the hazard function. For example, if there is no censoring then it can be shown that the maximum likelihood estimator of the hazard is

ĥ_s = d_s / Σ_{m≥s} d_m = (number of failures at time t_s) / (number who survive just before time t_s),

which is a very natural estimator.
We now show that the maximum likelihood estimator in the case of censoring is similar. In Section 12.1 we showed that the empirical distribution is the maximimum of the likelihood for non-censored data. We now show that the Kaplan-Meier 116
117 estimator is the maximum likelihood estimator when the data is censored. We recall in Section that the discrete log-likelihood for censored data is L n θ = s=1 = s=1 d s logp s θ ds +q s log[ j=s p j θ] d s logh s θ+n s d s log1 h s θ wherept = t s = p s θ,d s arethenumberoffailuresattimet s,q s arethenumberofindividuals censored at time t s and N s = m=s q m +d m. Now the above likelihood is constructed under the assumption that the distribution has a parametric form and the only unknown is θ. Let us suppose that the probabilities p s do not have a parametric form. In this case the likelihood is L n p 1,p 2,... = s=1 d s logp s +q s log[ j=s p j ] subject to the condition that p j = 1. However, it is quite difficult to directly maximise the above. Instead we use the likelihood rewritten in terms of the hazard function L n h 1,h 2,... = d s logh s +N s d s log1 h s, s=1, and maximise this. The derivative of the above with respect to h s is L n h s = d s h s N s d s 1 h s. Hence by setting the above to zero and solving for h s gives ĥ s = d s N s. If we recall that d s = number of failures at time t s and N s = number of alive just before time t s. Hence the non-parametric estimator of the hazard function is rather logical since the hazard function is the chance of failure at time t, given that no failure has yet occured, ie. ht i = Pt i 1 < T t i T t i 1. Now recalling 13.2 and substituting ĥs into 13.2 gives the survival function estimator ˆF s = s 1 ĥ j. j=1 Rewriting the above, we have the Kaplan-Meier estimator ˆFt s = s j=1 1 d j N j. 117
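The discrete Kaplan-Meier estimator just derived, F̄̂(t_s) = ∏_{j≤s}(1 − d_j/N_j), is simple to compute. The sketch below is our own illustration with made-up data; the helper name `kaplan_meier` is not from the notes.

```python
import numpy as np

def kaplan_meier(y, delta):
    """Kaplan-Meier estimate of the survival function:
    S(t) = prod over failure times t_j <= t of (1 - d_j / N_j)."""
    y, delta = np.asarray(y), np.asarray(delta)
    times = np.unique(y[delta == 1])         # distinct failure times
    surv, s = [], 1.0
    for t in times:
        N = np.sum(y >= t)                   # N_j: number at risk just before t
        d = np.sum((y == t) & (delta == 1))  # d_j: number of failures at t
        s *= 1.0 - d / N
        surv.append(s)
    return times, np.array(surv)

# toy censored sample: delta = 1 marks a failure, 0 a censored time
y     = [1, 2, 2, 3, 4, 5, 5, 6]
delta = [1, 1, 0, 1, 0, 1, 1, 0]
t, S = kaplan_meier(y, delta)
```

Note how the censored observations at times 2, 4 and 6 contribute to the risk sets N_j without ever triggering a factor (1 − d_j/N_j); with no censoring at all, the product collapses to 1 − F̂_n(t), as claimed in the text.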
For continuous random variables d_j ∈ {0,1}, as it is unlikely that two or more survival times are identical, and the Kaplan-Meier estimator can be extended to give

F̂̄(t) = Π_{j: Y_j ≤ t} (1 − d_j/N_j).

We observe that in the case that the survival data are not censored, N_j = Σ_{s≥j} d_s, and the Kaplan-Meier estimator reduces to

F̂̄(t) = Π_{j: Y_j ≤ t} (1 − d_j / Σ_{s≥j} d_s).

It can be shown that this is equal to 1 − F̂_n(t), where F̂_n(t) is the empirical distribution function based on {T_i}_{i=1}^n. Of course, given an estimator it is useful to approximate its variance; some useful approximations are given in Davison (2002).

Examples

Problem: survival analysis, fixed censoring and the Fisher information

Example (Question) Let us suppose that {T_i} are the survival times of lightbulbs. We will assume that {T_i} are iid random variables with density f(·;θ_0) and survival function F̄(·;θ_0), where θ_0 is unknown. The survival times are censored: for a known c > 0 we observe Y_i = min(T_i, c) and δ_i, where δ_i = 1 if Y_i = T_i and zero otherwise.

(a) (i) State the log-likelihood of {(Y_i, δ_i)}.
(ii) We denote the above log-likelihood as L_T(θ). Show that

−E[ ∂²L_T(θ)/∂θ² ]|_{θ=θ_0} = E[ (∂L_T(θ)/∂θ)² ]|_{θ=θ_0},

stating any important assumptions that you may use.

(b) Let us suppose that the above survival times satisfy a Weibull distribution,

f(x;φ,α) = (α/φ)(x/φ)^{α−1} exp(−(x/φ)^α),

and as in part (a) we observe Y_i = min(T_i, c) and δ_i, where c > 0.

(i) Using your answer in part (a)(i), give the log-likelihood of {(Y_i, δ_i)} for this particular distribution (we denote this as L_T(α,φ)) and derive the profile likelihood of α (profile out the nuisance parameter φ). Suppose you wish to test H_0: α = 1 against H_A: α ≠ 1 using the log-likelihood ratio test; what is the limiting distribution of the test statistic under the null?
(ii) Let (φ̂_T, α̂_T) = argmax L_T(α,φ) (the maximum likelihood estimators from the censored likelihood). Do the estimators φ̂_T and α̂_T converge to the true parameters φ and α? (You can assume that φ̂_T and α̂_T converge to some parameters; your objective is to find whether these parameters are φ and α.)

(iii) Obtain the expected Fisher information matrix of the maximum likelihood estimators.

(iv) Using your answer in part (b)(iii), derive the limiting variance of the maximum likelihood estimator α̂_T.

Solution

(a)(i) The log-likelihood is

L_T(θ) = Σ_{l=1}^n δ_l log f(Y_l;θ) + Σ_{l=1}^n (1 − δ_l) log F̄(c;θ).

(a)(ii) Evaluating the first and second derivatives of L_T(θ) with respect to θ gives

∂L_T/∂θ = Σ_l δ_l (1/f) ∂f/∂θ + Σ_l (1 − δ_l) (1/F̄(c;θ)) ∂F̄(c;θ)/∂θ,

∂²L_T/∂θ² = Σ_l δ_l [ (1/f) ∂²f/∂θ² − (1/f²)(∂f/∂θ)² ] + Σ_l (1 − δ_l) [ (1/F̄(c;θ)) ∂²F̄(c;θ)/∂θ² − (1/F̄(c;θ)²)(∂F̄(c;θ)/∂θ)² ].

To evaluate the expectation of the above we use E[δ_l g(Y_l)] = ∫_0^c g(x) f(x;θ) dx and E[1 − δ_l] = F̄(c;θ), which give

E[∂²L_T/∂θ²] = n { ∫_0^c [ ∂²f/∂θ² − (1/f)(∂f/∂θ)² ] dx + ∂²F̄(c;θ)/∂θ² − (1/F̄(c;θ))(∂F̄(c;θ)/∂θ)² }.

However, since ∫_0^c f(x;θ) dx + F̄(c;θ) = 1 for all θ, exchanging derivative and integral (a regularity assumption) and differentiating twice gives

∫_0^c ∂²f/∂θ² dx + ∂²F̄(c;θ)/∂θ² = 0.

Altogether this gives

E[∂²L_T/∂θ²] = −n { ∫_0^c (1/f)(∂f/∂θ)² dx + (1/F̄(c;θ))(∂F̄(c;θ)/∂θ)² }.   (13.6)
Now we need to compare the above with E[(∂L_T/∂θ)²]. Since δ_l² = δ_l, (1−δ_l)² = 1−δ_l, δ_l(1−δ_l) = 0, and each summand has mean zero (so the cross terms over different l vanish), expanding gives

E[(∂L_T/∂θ)²] = Σ_l E[ δ_l (1/f²)(∂f/∂θ)² ] + Σ_l E[ (1−δ_l) (1/F̄(c;θ)²)(∂F̄(c;θ)/∂θ)² ]
             = n { ∫_0^c (1/f)(∂f/∂θ)² dx + (1/F̄(c;θ))(∂F̄(c;θ)/∂θ)² }.

Comparing the above with (13.6) gives the required result.

(b)(i) The log-likelihood is

L_T(α,φ) = Σ_l δ_l { log α − log φ + (α−1)(log Y_l − log φ) − (Y_l/φ)^α } − Σ_l (1−δ_l)(c/φ)^α.

We now profile out the nuisance parameter φ:

∂L_T/∂φ = Σ_l δ_l { −α/φ + α Y_l^α / φ^{α+1} } + Σ_l (1−δ_l) α c^α / φ^{α+1}.

Thus, keeping α fixed, the MLE of φ for that fixed α is

φ̂_α = [ Σ_l ( δ_l Y_l^α + (1−δ_l) c^α ) / Σ_l δ_l ]^{1/α}.

Thus the profile likelihood for α is

L_T(α, φ̂_α) = Σ_l δ_l { log α − log φ̂_α + (α−1)(log Y_l − log φ̂_α) − (Y_l/φ̂_α)^α } − Σ_l (1−δ_l)(c/φ̂_α)^α.

To test H_0: α = 1 against H_A: α ≠ 1 (a test of whether the distribution is exponential), use the log-likelihood ratio test; under the null hypothesis we have

2{ max_α L_T(α, φ̂_α) − max_φ L_T(1, φ) } →^d χ²_1.

(b)(ii) It is very hard to prove the result for exactly this example. However, if we show the result under a general set of assumptions, and show that the Weibull satisfies these, then we have proven the result. It is stated in the question that the MLEs converge to some constants/parameters (we do not need to prove this); the objective is to find what these parameters are. The expected (normalised) censored log-likelihood is

E{ δ_i log f(T_i;θ) + (1−δ_i) log F̄(c;θ) }.
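The profile-likelihood calculation above can be checked numerically. The sketch below simulates censored Weibull data and maximises L_T(α, φ̂_α) over a grid of α values; the simulation settings (φ = 1, α = 1.5, c = 2, grid resolution) are our own illustrative choices:

```python
import math
import random

def profile_phi(alpha, ys, deltas, c):
    """phi_hat(alpha) = [sum(delta*Y^a + (1-delta)*c^a) / sum(delta)]^(1/a)."""
    num = sum(d * y**alpha + (1 - d) * c**alpha for y, d in zip(ys, deltas))
    return (num / sum(deltas)) ** (1.0 / alpha)

def profile_loglik(alpha, ys, deltas, c):
    """Censored Weibull log-likelihood with phi profiled out."""
    phi = profile_phi(alpha, ys, deltas, c)
    ll = 0.0
    for y, d in zip(ys, deltas):
        if d:
            ll += (math.log(alpha) - math.log(phi)
                   + (alpha - 1) * (math.log(y) - math.log(phi))
                   - (y / phi) ** alpha)
        else:
            ll -= (c / phi) ** alpha
    return ll

random.seed(1)
c, alpha0, phi0 = 2.0, 1.5, 1.0
# random.weibullvariate takes (scale, shape)
ts = [random.weibullvariate(phi0, alpha0) for _ in range(2000)]
ys = [min(t, c) for t in ts]
deltas = [1 if t < c else 0 for t in ts]

grid = [a / 100 for a in range(50, 301)]        # alpha in [0.5, 3.0]
alpha_hat = max(grid, key=lambda a: profile_loglik(a, ys, deltas, c))
phi_hat = profile_phi(alpha_hat, ys, deltas, c)
print(alpha_hat, phi_hat)   # should land near the true (1.5, 1.0)
```

A grid search is crude but makes the profiling transparent; in practice a one-dimensional optimiser over α would be used instead.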
The parameter which maximises this is

λ* = argmax_θ E{ δ_i log f(T_i;θ) + (1−δ_i) log F̄(c;θ) }.

It was shown earlier that this parameter is the true parameter of the distribution. Since the Weibull distribution satisfies all the assumptions (in particular, we can exchange integral and derivative), the above result implies (α̂_T, φ̂_T) → (α, φ), the true parameters. Please refer to the notes for the details.

(iii) Differentiate L_T(α,φ) twice with respect to (α,φ) and take expectations (it is not possible to obtain an explicit expression):

I(α,φ) = [ −E ∂²L_T/∂α²   −E ∂²L_T/∂α∂φ ; −E ∂²L_T/∂α∂φ   −E ∂²L_T/∂φ² ] = [ I_αα  I_αφ ; I_αφ  I_φφ ].

(iv) The limiting variance (that is, the variance of the limiting normal distribution) of α̂_T is

I_φφ / ( I_αα I_φφ − I_αφ² ),

using the 2×2 inverse [a b; c d]^{−1} = (ad − bc)^{−1} [d −b; −c a].

Problem: Survival times and random censoring

Example (Question) Let us suppose that T and C are exponentially distributed random variables, where the density of T is (1/λ)exp(−t/λ) and the density of C is (1/µ)exp(−c/µ).

(i) Evaluate the probability P(T < C + x), where x is some finite constant.

(ii) Let us suppose that {T_i} and {C_i} are iid survival and censoring times (T_i and C_i independent of each other), where the densities of T_i and C_i are f_T(t;λ) = (1/λ)exp(−t/λ) and f_C(c;µ) = (1/µ)exp(−c/µ) respectively. Let Y_i = min(T_i, C_i) and δ_i = 1 if Y_i = T_i and zero otherwise. Suppose λ and µ are unknown. We use the following likelihood to estimate λ:

L_n(λ) = Σ_i δ_i log f_T(Y_i;λ) + Σ_i (1−δ_i) log F̄_T(Y_i;λ),

where F̄_T is the survival function. Let λ̂_n = argmax L_n(λ). Show that λ̂_n is an asymptotically biased estimator of λ (you can assume that λ̂_n converges to some constant).
122 iii Based on your results in parts i and ii construct estimators of λ and µ which are asymptotically unbiased. Solutions i PT > x = exp x/λ and PC > c = exp c/µ, thus PT < C +x = PT < c+xf C cdc = ii Differentiating the likelihood 0 1 exp c+x λ 1 µ exp c µ == 1 exp x/λ λ λ+µ. L n λ λ = δ i logf T T i ;λ λ + 1 δ i logf TC i ;λ λ, substituting fx;λ = λ 1 exp x/λ and Fx;λ = exp x/λ into the above and equating to zero gives the solution ˆλ T = δ it i + i 1 δ ic i i δ. i Now we evaluate the expectaton of the numerator and the denominator. Eδ i T i = ET i IT i < C i = ET i EIC i > T i T i = ET i exp T i /µ = texp t/µ 1 λ exp t/λdλ = 1 texp t µ+λ λ µλ = µ+λ2 µ 2 λ 3. Similarly we can show that E1 δ i C i = µ+λ2 λ 2 µ 2. Finally, we evaluate the denominator Eδ i = PT < C = 1 lambda µ+λ = µ µ+λ. Therefore my Slutsky s theorem we have µ+λ 2 P + µ+λ2 λ ˆλ T 2 µ 2 µ 2 λ 3 µ µ+λ = µ+λ4 µ 3 λ 3. Clearly if µ, this is a biased estimator of λ contrast this with the case that the censoring times are fixed at a common value c. 122
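The closed form in part (i), P(T < C + x) = 1 − (λ/(λ+µ)) e^{−x/λ} with λ and µ the means of T and C, is easy to confirm by simulation. A quick sketch (parameter values are arbitrary; note that `random.expovariate` takes a rate, i.e. 1/mean):

```python
import math
import random

random.seed(0)
lam, mu, x = 2.0, 3.0, 1.0        # means of T and C, and the constant x
n = 200_000

hits = sum(random.expovariate(1 / lam) < random.expovariate(1 / mu) + x
           for _ in range(n))
mc = hits / n
closed_form = 1 - (lam / (lam + mu)) * math.exp(-x / lam)
print(mc, closed_form)            # the two values should agree to ~2-3 decimals
```

With 200,000 replications the Monte Carlo standard error is about 0.001, so agreement to two decimal places is expected.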
(iii) By part (i) we observe that the sample proportion of the events {T_i < C_i} satisfies

p̂_T = (1/n) Σ_i δ_i →^P µ/(µ+λ),

and by part (ii) λ̂_T converges in probability to a known function of λ and µ. Thus, by using p̂_T and λ̂_T, we can solve for λ and µ to obtain asymptotically unbiased estimators of these parameters. An alternative method is to construct the likelihood of µ based on left censoring. This will give a biased estimator of µ; however, this together with the biased estimator of λ in part (ii) can be solved to give consistent estimators of µ and λ.

Problem: survival times and fixed censoring

Example (Question) Let us suppose that {T_i}_{i=1}^n are survival times which are assumed to be iid (independent, identically distributed) random variables following an exponential distribution with density f(x;λ) = (1/λ)exp(−x/λ), where the parameter λ is unknown. The survival times may be censored at a known time c > 0: we observe Y_i = min(T_i, c) and the dummy variable δ_i = 1 if Y_i = T_i (no censoring) and δ_i = 0 if Y_i = c (the survival time is censored).

(a) State the censored log-likelihood for this data set, and show that the estimator of λ is

λ̂_n = [ Σ_{i=1}^n δ_i T_i + Σ_{i=1}^n (1−δ_i) c ] / Σ_{i=1}^n δ_i.

(b) By using the above, show that when c > 0, λ̂_n is a consistent estimator of the parameter λ.

(c) Derive the expected information for this estimator and comment on how the information behaves for various values of c.

Solution

(a) Since P(Y_i ≥ c) = exp(−c/λ), the censored log-likelihood is

L_n(λ) = Σ_{i=1}^n [ −δ_i log λ − δ_i Y_i/λ − (1−δ_i) c/λ ].

Thus differentiating the above with respect to λ and equating to zero gives the MLE

λ̂_n = [ Σ_{i=1}^n δ_i T_i + Σ_{i=1}^n (1−δ_i) c ] / Σ_{i=1}^n δ_i.
(b) To show that the above estimator is consistent, we use Slutsky's lemma to obtain

λ̂_n →^P [ E(δT) + E((1−δ)c) ] / E(δ).

To show that λ = [E(δT) + E((1−δ)c)]/E(δ), we calculate each of the expectations:

E(δT) = ∫_0^c y (1/λ) e^{−y/λ} dy = −c e^{−c/λ} + λ(1 − e^{−c/λ}),
E((1−δ)c) = c P(T > c) = c e^{−c/λ},
E(δ) = P(T ≤ c) = 1 − e^{−c/λ}.

Substituting the above into the ratio gives λ̂_n →^P λ as n → ∞.

(c) To obtain the expected information we differentiate the log-likelihood twice and take expectations, which gives

I(λ) = n E(δ_i)/λ² = n(1 − e^{−c/λ})/λ².

Note that it can be shown that for the censored likelihood E[(∂L_n/∂λ)²] = −E[∂²L_n/∂λ²]. We observe that the larger c is, the larger the information, and thus the smaller the limiting variance.
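Both the consistency claim and the information I(λ) = n(1 − e^{−c/λ})/λ² can be checked by simulation. A sketch with our own parameter choices (λ = 2, c = 1.5):

```python
import math
import random

def censored_exp_mle(ts, c):
    """MLE (sum delta*T + sum (1-delta)*c) / sum delta for fixed censoring at c."""
    ys = [min(t, c) for t in ts]
    deltas = [1 if t <= c else 0 for t in ts]
    return sum(ys) / sum(deltas)   # sum(ys) equals sum(delta*T) + sum((1-delta)*c)

random.seed(0)
lam, c, n = 2.0, 1.5, 100_000
ts = [random.expovariate(1 / lam) for _ in range(n)]   # mean-lam exponentials

lam_hat = censored_exp_mle(ts, c)
info = n * (1 - math.exp(-c / lam)) / lam**2           # expected information I(lambda)
print(lam_hat, 1 / info)   # estimate near 2.0; 1/I approximates its variance
```

With these settings roughly half the observations are censored, yet λ̂_n remains consistent; shrinking c toward 0 shrinks I(λ) and inflates the variance, matching the comment in part (c).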
Chapter 14

The Expectation-Maximisation Algorithm

14.1 The EM algorithm - a method for maximising the likelihood

Let us suppose that we observe Y = {Y_i}_{i=1}^n. The joint density of Y is f(Y;θ_0), where θ_0 is an unknown parameter. Our objective is to estimate θ_0. The log-likelihood of Y is

L_n(Y;θ) = log f(Y;θ).

Observe that we have not specified that {Y_i} are iid random variables. This is because the procedure that we will describe below is extremely general, and the observations do not need to be either independent or identically distributed (indeed, a very interesting extension of this procedure is to time series with missing data, first proposed in Shumway and Stoffer (1982) and Engle and Watson).

Our objective is to estimate θ_0 in the situation where either evaluating the log-likelihood L_n or maximising L_n is difficult, hence an alternative means of maximising L_n is required. Often there may exist unobserved data U = {U_i}_{i=1}^m for which the likelihood of (Y,U) can be easily evaluated. It is through these unobserved data that we find an alternative method for maximising L_n.

Example Let us suppose that {T_i}_{i=1}^{n+m} are iid survival times with density f(x;θ_0). Some of these times are censored and we observe {Y_i}_{i=1}^{n+m}, where Y_i = min(T_i, c). To simplify notation we will suppose that Y_i = T_i for 1 ≤ i ≤ n, hence for 1 ≤ i ≤ n the survival time is observed,
but Y_i = c for n+1 ≤ i ≤ n+m. Using the earlier results, the log-likelihood of Y is

L_n(Y;θ) = Σ_{i=1}^n log f(Y_i;θ) + Σ_{i=n+1}^{n+m} log F̄(Y_i;θ).

The observations {Y_i}_{i=n+1}^{n+m} can be treated as if they were missing. Define the complete observations U = {T_i}_{i=n+1}^{n+m}; hence U contains the unobserved survival times. Then the likelihood of (Y,U) is

L_n(Y,U;θ) = Σ_{i=1}^{n+m} log f(T_i;θ).

Usually it is a lot easier to maximise L_n(Y,U) than L_n(Y).

We now formally describe the EM-algorithm. As mentioned in the discussion above, it is easier to deal with the joint likelihood of (Y,U) than with the likelihood of Y itself, hence let us consider this likelihood in detail. Let us suppose that the joint log-likelihood of (Y,U) is

L_n(Y,U;θ) = log f(Y,U;θ).

This likelihood is often called the complete likelihood; we will assume that if U were known, then this likelihood would be easy to obtain and differentiate. We will also assume that the density f(U|Y;θ) is known and is easy to evaluate. By using Bayes' theorem it is straightforward to show that

log f(Y,U;θ) = log f(Y;θ) + log f(U|Y;θ),   (14.1)

that is, L_n(Y,U;θ) = L_n(Y;θ) + log f(U|Y;θ).

Of course, in reality log f(Y,U;θ) is unknown, because U is unobserved. However, let us consider the expected value of log f(Y,U;θ) given what we observe, Y. That is,

Q(θ_0,θ) = E[ log f(Y,U;θ) | Y, θ_0 ] = ∫ log f(Y,u;θ) f(u|Y,θ_0) du,   (14.2)

where f(u|Y,θ_0) is the conditional density of U given Y and the unknown parameter θ_0. Hence if f(u|Y,θ_0) were known, then Q(θ_0,θ) could be evaluated.

Remark It is worth noting that Q(θ_0,θ) = E[log f(Y,U;θ)|Y,θ_0] can be viewed as the best predictor of the complete likelihood (involving both observed and unobserved data) given what is observed, Y. We recall that the conditional expectation is the best predictor of U in terms of mean squared error, that is, the function of Y which minimises the mean squared error: E(U|Y) = argmin_g E(U − g(Y))².
The EM algorithm is based on iterating Q in such a way that at each step we obtain a θ which gives a larger value of Q (and, as we will show later, a larger L_n(Y;θ)). We describe the EM-algorithm below.

The EM-algorithm:

(i) Define an initial value θ_1 ∈ Θ. Let θ* = θ_1.

(ii) (The expectation step, at the (k+1)-th step.) For the fixed θ*, evaluate

Q(θ*,θ) = E[ log f(Y,U;θ) | Y, θ* ] = ∫ log f(Y,u;θ) f(u|Y,θ*) du

for all θ ∈ Θ.

(iii) (The maximisation step.) Evaluate θ_{k+1} = argmax_{θ∈Θ} Q(θ*,θ). We note that the maximisation can be done by finding the solution of

E[ ∂ log f(Y,U;θ)/∂θ | Y, θ* ] = 0.

(iv) If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Otherwise set θ* = θ_{k+1}, and repeat steps (ii) and (iii).

We use θ̂_n as an estimator of θ_0. To understand why this iteration is connected to maximising L_n(Y;θ) and, under certain conditions, gives a good estimator of θ_0 (in the sense that θ̂_n is close to the parameter which maximises L_n), let us return to (14.1). Taking the expectation of log f(Y,U;θ) conditioned on Y (at parameter value θ*) we have

Q(θ*,θ) = E[ log f(Y,U;θ) | Y, θ* ] = log f(Y;θ) + E[ log f(U|Y;θ) | Y, θ* ].

Observe that this is like (14.2), but the distribution used in the expectation is f(u|Y,θ*) instead of f(u|Y,θ_0). Define

D(θ*,θ) = E[ log f(U|Y;θ) | Y, θ* ] = ∫ log f(u|Y;θ) f(u|Y,θ*) du.

Hence we have

Q(θ*,θ) = L_n(θ) + D(θ*,θ).   (14.3)

Now we recall that at the (k+1)-th iteration of the EM-algorithm, θ_{k+1} maximises Q(θ_k,θ) over all θ ∈ Θ, hence Q(θ_k,θ_{k+1}) ≥ Q(θ_k,θ_k). In the lemma below we show that L_n(θ_{k+1}) ≥ L_n(θ_k), hence at each iteration of the EM-algorithm we obtain a θ_{k+1} which does not decrease the likelihood relative to the previous iteration.
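Steps (i)-(iv) translate into a short generic loop. The sketch below is our own, not the notes': here the E-step returns the conditional expectations ("weights") the M-step needs, and as a toy use we estimate only the mixing weight p of a two-component normal mixture whose component densities are fully known, so the M-step has a closed form (the mean weight maximises Q in p):

```python
import math
import random

def em(theta0, e_step, m_step, tol=1e-10, max_iter=1000):
    """Generic EM loop following steps (i)-(iv): alternate E and M steps
    until successive parameter values are sufficiently close."""
    theta = theta0
    for _ in range(max_iter):
        stats = e_step(theta)       # expectation step: E[...|Y, theta*]
        theta_new = m_step(stats)   # maximisation step: argmax of Q
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

def phi(y, m):
    """Density of a normal with mean m and variance 1."""
    return math.exp(-0.5 * (y - m) ** 2) / math.sqrt(2 * math.pi)

# Toy data from 0.3 * N(0,1) + 0.7 * N(4,1); only p is unknown.
random.seed(2)
ys = [random.gauss(0, 1) if random.random() < 0.3 else random.gauss(4, 1)
      for _ in range(5000)]

e_step = lambda p: [p * phi(y, 0) / (p * phi(y, 0) + (1 - p) * phi(y, 4))
                    for y in ys]            # posterior weight of component 1
m_step = lambda w: sum(w) / len(w)          # Q is maximised at the mean weight
p_hat = em(0.5, e_step, m_step)
print(p_hat)   # close to the true mixing weight 0.3
```

Because the components are well separated, very little information is "missing" and the loop converges in a handful of iterations, anticipating the convergence-rate discussion later in the chapter.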
Lemma We have L_n(θ_{k+1}) ≥ L_n(θ_k). Moreover, under certain conditions θ_k converges to the maximum likelihood estimator argmax_θ L_n(Y;θ) (we do not prove this part of the result here).

PROOF. From (14.3) it is clear that

Q(θ_k,θ_{k+1}) − Q(θ_k,θ_k) = [ L_n(θ_{k+1}) − L_n(θ_k) ] + [ D(θ_k,θ_{k+1}) − D(θ_k,θ_k) ].   (14.4)

We will now show that D(θ_k,θ_{k+1}) − D(θ_k,θ_k) ≤ 0; the result follows from this. We observe that

D(θ_k,θ_{k+1}) − D(θ_k,θ_k) = ∫ log [ f(u|Y,θ_{k+1}) / f(u|Y,θ_k) ] f(u|Y,θ_k) du.

Now, by using Jensen's inequality (which we have used several times previously), we have

D(θ_k,θ_{k+1}) − D(θ_k,θ_k) ≤ log ∫ f(u|Y,θ_{k+1}) du = 0.

Therefore D(θ_k,θ_{k+1}) − D(θ_k,θ_k) ≤ 0, and by (14.4)

L_n(θ_{k+1}) − L_n(θ_k) ≥ Q(θ_k,θ_{k+1}) − Q(θ_k,θ_k) ≥ 0,

and we obtain the desired result L_n(θ_{k+1}) ≥ L_n(θ_k). □

Remark (The Fisher information) The Fisher information of the observed likelihood L_n(Y;θ) is

I_n(θ_0) = −E[ ∂² log f(Y;θ)/∂θ² ]|_{θ=θ_0}.

As in Section 4.1, I_n(θ_0)^{−1} is the asymptotic variance of the limiting distribution of θ̂_n. To understand how much is lost by not having a complete set of observations, we now rewrite the Fisher information in terms of the complete data and the missing data. By using (14.1), I_n(θ_0) can be rewritten as

I_n(θ_0) = −E[ ∂² log f(Y,U;θ)/∂θ² ]|_{θ=θ_0} + E[ ∂² log f(U|Y;θ)/∂θ² ]|_{θ=θ_0} = I_n^C(θ_0) − I_n^M(θ_0).

In the case that θ is univariate, it is clear that I_n^C(θ_0) ≥ I_n^M(θ_0). Hence, as one would expect, the complete data set (Y,U) contains more information about the unknown parameter than Y alone.
If U is fully determined by Y, then it can be shown that I_n^M(θ_0) = 0, and no information has been lost.

From a practical point of view, one is interested in how many iterations of the EM-algorithm are required to obtain an estimator sufficiently close to the MLE. Let

J_n^C(θ_0) = −E[ ∂² log f(Y,U;θ)/∂θ² |_{θ=θ_0} | Y, θ_0 ],
J_n^M(θ_0) = −E[ ∂² log f(U|Y;θ)/∂θ² |_{θ=θ_0} | Y, θ_0 ].

By differentiating (14.1) twice with respect to the parameter θ we have

J_n(θ_0) = −∂² log f(Y;θ)/∂θ² |_{θ=θ_0} = J_n^C(θ_0) − J_n^M(θ_0).

Now, it can be shown that the rate of convergence of the algorithm depends on the ratio J_n^C(θ_0)^{−1} J_n^M(θ_0). The closer the largest eigenvalue of J_n^C(θ_0)^{−1} J_n^M(θ_0) is to one, the slower the rate of convergence, and a large number of iterations is required. On the other hand, if the largest eigenvalue of J_n^C(θ_0)^{−1} J_n^M(θ_0) is close to zero, then the rate of convergence is fast (a small number of iterations suffices for convergence to the MLE).

14.2.1 Censored data

Let us return to the example at the start of this section, and construct the EM-algorithm for censored data. We recall that the log-likelihoods for censored data and complete data are

L_n(Y;θ) = Σ_{i=1}^n log f(Y_i;θ) + Σ_{i=n+1}^{n+m} log F̄(Y_i;θ)

and

L_n(Y,U;θ) = Σ_{i=1}^n log f(Y_i;θ) + Σ_{i=n+1}^{n+m} log f(T_i;θ).

To implement the EM-algorithm we need to evaluate the expectation step Q(θ*,θ). It is easy to see that

Q(θ*,θ) = E[ L_n(Y,U;θ) | Y, θ* ] = Σ_{i=1}^n log f(Y_i;θ) + Σ_{i=n+1}^{n+m} E[ log f(T_i;θ) | Y, θ* ].

To obtain E[log f(T_i;θ)|Y,θ*] for n+1 ≤ i ≤ n+m, we note that

E[ log f(T_i;θ) | Y, θ* ] = E[ log f(T_i;θ) | T_i ≥ c, θ* ] = (1/F̄(c;θ*)) ∫_c^∞ [ log f(u;θ) ] f(u;θ*) du.
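For the censored-data EM above, the exponential case is fully explicit: if f(x;θ) = θ^{−1}e^{−x/θ}, then by memorylessness E[T_i | T_i ≥ c] = c + θ, so the E-step imputes c + θ* for each censored time and the M-step is the complete-data mean. The sketch below (our own simulation settings) also illustrates the rate discussion: the EM map here is affine with slope m/(n+m), the missing fraction of the data, which is exactly the ratio J^M/J^C for this model.

```python
import random

random.seed(3)
theta0, c = 2.0, 1.5
ts = [random.expovariate(1 / theta0) for _ in range(4000)]
obs = [t for t in ts if t <= c]          # fully observed survival times
n = len(obs)
m = len(ts) - n                          # number censored at c

theta = 1.0                              # deliberately poor initial value
for _ in range(200):
    # E-step: E[T | T >= c, theta*] = c + theta* for the exponential
    imputed_total = sum(obs) + m * (c + theta)
    # M-step: complete-data MLE of the exponential mean
    theta_new = imputed_total / (n + m)
    if abs(theta_new - theta) < 1e-12:
        break
    theta = theta_new

direct_mle = (sum(obs) + m * c) / n      # closed-form censored MLE
print(theta, direct_mle)                 # the EM limit equals the direct MLE
```

Solving the fixed-point equation θ* = (Σ_obs Y_i + m(c + θ*))/(n+m) by hand recovers θ* = (Σ_obs Y_i + mc)/n, so the iteration converges to the direct censored MLE, as the lemma guarantees.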
Therefore we have

Q(θ*,θ) = Σ_{i=1}^n log f(Y_i;θ) + (m/F̄(c;θ*)) ∫_c^∞ [ log f(u;θ) ] f(u;θ*) du.

We also note that the derivative of Q(θ*,θ) with respect to θ is

∂Q(θ*,θ)/∂θ = Σ_{i=1}^n (1/f(Y_i;θ)) ∂f(Y_i;θ)/∂θ + (m/F̄(c;θ*)) ∫_c^∞ (1/f(u;θ)) (∂f(u;θ)/∂θ) f(u;θ*) du.

Hence for this example, the EM-algorithm is:

(i) Define an initial value θ_1 ∈ Θ. Let θ* = θ_1.

(ii) (The expectation step.) For the fixed θ*, evaluate ∂Q(θ*,θ)/∂θ as above.

(iii) (The maximisation step.) Solve ∂Q(θ*,θ)/∂θ = 0; let θ_{k+1} be such that ∂Q(θ*,θ)/∂θ |_{θ=θ_{k+1}} = 0.

(iv) If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Otherwise set θ* = θ_{k+1}, and repeat steps (ii) and (iii).

14.2.2 Mixture distributions

We now consider a useful application of the EM-algorithm: the estimation of parameters in mixture distributions. Let us suppose that {Y_i}_{i=1}^n are iid random variables with density

f(y;θ) = p f_1(y;θ_1) + (1−p) f_2(y;θ_2),

where θ = (p, θ_1, θ_2) are unknown parameters. For the purpose of identifiability we will suppose that θ_1 ≠ θ_2, p ≠ 1 and p ≠ 0. The log-likelihood of {Y_i} is

L_n(Y;θ) = Σ_{i=1}^n log[ p f_1(Y_i;θ_1) + (1−p) f_2(Y_i;θ_2) ].   (14.5)

Now, maximising the above directly can be extremely difficult. As an illustration, consider the example below.
Example Let us suppose that f_1(y;θ_1) and f_2(y;θ_2) are normal densities. Then the log-likelihood is

L_n(Y;θ) = Σ_{i=1}^n log[ p (2πσ_1²)^{−1/2} exp(−(Y_i−µ_1)²/(2σ_1²)) + (1−p) (2πσ_2²)^{−1/2} exp(−(Y_i−µ_2)²/(2σ_2²)) ].

We observe that this is extremely difficult to maximise. On the other hand, if the Y_i were simply normally distributed then the log-likelihood would be extremely simple:

L_n(Y;θ) ∝ −(1/2) Σ_{i=1}^n [ log σ_1² + (Y_i−µ_1)²/σ_1² ].

In other words, the simplicity of maximising the log-likelihood of the exponential family of distributions (see Section 3.1) is lost for mixtures of distributions.

We now use the EM-algorithm as an indirect but simple method of maximising (14.5). In this example, it is not clear what observations are missing. However, let us consider one possible interpretation of the mixture distribution. Define the random variables (δ_i, Y_i), where δ_i ∈ {1,2} with

P(δ_i = 1) = p,  P(δ_i = 2) = 1−p,

and the conditional densities f(Y_i = y | δ_i = 1) = f_1(y;θ_1) and f(Y_i = y | δ_i = 2) = f_2(y;θ_2). Therefore, it is clear from the above that the density of Y_i is

f(y;θ) = p f_1(y;θ_1) + (1−p) f_2(y;θ_2).

Hence, one interpretation of the mixture model is that there is a hidden, unobserved random variable which determines the state (or distribution) of Y_i. A simple example: Y_i is the height of an individual and δ_i is the gender; however, δ_i is unobserved and only the height is observed. Often a mixture distribution has a physical interpretation, similar to the height example, but sometimes it can be used to parametrically model a wide class of densities.

Based on the discussion above, U = {δ_i} can be treated as the missing observations. The likelihood of (Y_i, δ_i) is

[ p_1 f_1(Y_i;θ_1) ]^{I(δ_i=1)} [ p_2 f_2(Y_i;θ_2) ]^{I(δ_i=2)} = p_{δ_i} f_{δ_i}(Y_i;θ_{δ_i}),

where we set p_1 = p and p_2 = 1−p. Therefore the log-likelihood of {(Y_i, δ_i)} is

L_n(Y,U;θ) = Σ_{i=1}^n [ log p_{δ_i} + log f_{δ_i}(Y_i;θ_{δ_i}) ].
We now need to evaluate

Q(θ*,θ) = E[ L_n(Y,U;θ) | Y, θ* ] = Σ_{i=1}^n { E[ log p_{δ_i} | Y_i, θ* ] + E[ log f_{δ_i}(Y_i;θ_{δ_i}) | Y_i, θ* ] }.

We see that the above expectation is taken with respect to the distribution of δ_i conditioned on Y_i and the parameter θ*. By using conditioning arguments it is easy to see that

P(δ_i = 1 | Y_i = y, θ*) = P(δ_i = 1, Y_i = y; θ*)/P(Y_i = y; θ*) = p* f_1(y;θ_1*) / [ p* f_1(y;θ_1*) + (1−p*) f_2(y;θ_2*) ] =: w_1(θ*; y),
P(δ_i = 2 | Y_i = y, θ*) = (1−p*) f_2(y;θ_2*) / [ p* f_1(y;θ_1*) + (1−p*) f_2(y;θ_2*) ] =: w_2(θ*; y) = 1 − w_1(θ*; y).

Therefore

Q(θ*,θ) = Σ_{i=1}^n [ log p + log f_1(Y_i;θ_1) ] w_1(θ*; Y_i) + Σ_{i=1}^n [ log(1−p) + log f_2(Y_i;θ_2) ] w_2(θ*; Y_i).

Now, maximising the above with respect to p, θ_1 and θ_2 will in general be much easier than maximising L_n(Y;θ). For this example the EM algorithm is:

(i) Define an initial value θ_1 ∈ Θ. Let θ* = θ_1.

(ii) (The expectation step.) For the fixed θ*, evaluate Q(θ*,θ) above.

(iii) (The maximisation step.) Evaluate θ_{k+1} = argmax_{θ∈Θ} Q(θ*,θ) by differentiating Q(θ*,θ) with respect to θ and equating to zero. Since the parameters p and (θ_1, θ_2) are in separate subfunctions, they can be maximised separately.

(iv) If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Otherwise set θ* = θ_{k+1}, and repeat steps (ii) and (iii).

Exercise: Derive the EM algorithm in the case that f_1 and f_2 are normal densities.

It is straightforward to see that the arguments above can be generalised to the case that the density of Y_i is a mixture of r different densities. However, we observe that the selection of r can be quite ad hoc. There are methods for choosing r; these include the reversible jump MCMC methods proposed by Peter Green.
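One possible solution to the exercise above: when f_1 and f_2 are normal, the M-step has closed forms, namely weighted means and variances with the weights w_1(θ*; Y_i). The sketch below is our own implementation (the initial values and simulation settings are arbitrary choices, not from the notes):

```python
import math
import random

def norm_pdf(y, mu, var):
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gaussian_mixture_em(ys, p, mu1, var1, mu2, var2, iters=300):
    """EM for p*N(mu1,var1) + (1-p)*N(mu2,var2)."""
    for _ in range(iters):
        # E-step: responsibilities w_i = P(delta_i = 1 | Y_i, current theta)
        w = [p * norm_pdf(y, mu1, var1) /
             (p * norm_pdf(y, mu1, var1) + (1 - p) * norm_pdf(y, mu2, var2))
             for y in ys]
        # M-step: weighted means and variances maximise Q in closed form
        s1 = sum(w)
        s2 = len(ys) - s1
        mu1 = sum(wi * y for wi, y in zip(w, ys)) / s1
        mu2 = sum((1 - wi) * y for wi, y in zip(w, ys)) / s2
        var1 = sum(wi * (y - mu1) ** 2 for wi, y in zip(w, ys)) / s1
        var2 = sum((1 - wi) * (y - mu2) ** 2 for wi, y in zip(w, ys)) / s2
        p = s1 / len(ys)
    return p, mu1, var1, mu2, var2

random.seed(4)
ys = [random.gauss(0, 1) if random.random() < 0.4 else random.gauss(5, 2)
      for _ in range(4000)]
p, mu1, var1, mu2, var2 = gaussian_mixture_em(ys, 0.5, -1.0, 1.0, 4.0, 1.0)
print(p, mu1, mu2)   # roughly 0.4, 0 and 5 for this simulation
```

As the notes warn, EM for mixtures can converge to a local maximum, so in practice one reruns the algorithm from several initial values and keeps the run with the largest observed likelihood.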
133 Example Question: Suppose that the regressors x t are believed to influence the response variable Y t. The distribution of Y t is PY t = y = p λy t1 exp λ t1y y! where λ t1 = expβ 1 x t and λ t2 = expβ 2 x t. +1 p λy t2 exp λ t2y, y! i State minimum conditions on the parameters, for the above model to be identifiable? ii Carefully explain giving details of Qθ,θ and the EM stages how the EM-algorithm can be used to obtain estimators of β 1,β 2 and p. iii Derive the derivative of Qθ,θ, and explain how the derivative may be useful in the maximisation stage of the EM-algorithm. iv Given an initial value, will the EM-algorithm always find the maximum of the likelihood? Explain how one can check whether the parameter which maximises the EM-algorithm, maximises the likelihood. Solution i 0 < p < 1 and β 1 β 2 these are minimum assumptions, there could be more which is hard to account for given the regressors x t. ii We first observe that PY t = y is a mixture of two Poisson distributions where each has the canonical link function. Define the unobserved variables, {U t }, which are iid and where PU t = 1 = p and PU t = 2 = 1 p and PY = y U i = 1 = λy t1 exp λ t1y y! and PY = y U i = 2 = λy t2 exp λ t2y y!. Therefore, we have logfy t,u t,θ = Y t β u t x t expβ u t x t +logy t!+logp, where θ = β 1,β 2,p. Thus, ElogfY t,u t,θ Y t,θ is ElogfY t,u t,θ Y t,θ = Y t β 1x t expβ 1x t +logy t!+logp πθ,y t + Y t β 2x t expβ 2x t +logy t!+logp 1 πθ,y t. where PU i Y t,θ is evaluated as PU i = 1 Y t,θ = πθ,y t = pf 1 Y t,θ pf 1 Y t,θ +1 pf 2 Y t,θ, 133
134 with f 1 Y t,θ = expβ 1 x ty t exp Y t expβ 1 x t Y t! Thus Qθ,θ is Qθ,θ = f 1 Y t,θ = expβ 1 x ty t exp Y t expβ 1 x t. Y t! Y t β 1x t expβ 1x t +logy t!+logp πθ,y t + Y t β 2x t expβ 2x t +logy t!+log1 p 1 πθ,y t. Using the above, the EM algorithm is the following: a Start with an initial value which is an estimator of β 1,β 2 and p, denote this as θ. b For every θ evaluate Qθ,θ. c Evaluate argmax θ Qθ,θ. Denote the maximum as θ and return to step b. d Keep iterating until the maximums are sufficiently close. iii The derivative of Qθ,θ is Qθ,θ β 1 = Qθ,θ β 2 = Qθ,θ p = Y t expβ 1x t x t πθ,y t Y t expβ 2x t x t 1 πθ,y t 1 p πθ,y t 1 1 p 1 πθ,y t. Thus maximisation of Qθ,θ can be achieved by solving for the above equations using iterative weighted least squares. iv Depending on the initial value, the EM-algorithm may only locate a local maximum. To check whether we have found the global maximum, we can start the EM-algorithm with several different initial values and check where they converge. Example Question 2 Let us suppose that F 1 t and F 2 t are two survival functions. Let x denote a univariate regressor. [25] i Show that Ft;x = pf 1 t expβ 1x +1 pf 2 t expβ 2x is a valid survival function and obtain the corresponding density function. 134
135 ii Suppose that T i are survival times and x i is a univariate regressor which exerts an influence an T i. Let Y i = mint i,c, where c is a common censoring time. {T i } are independent random variables with survival function Ft;x i = pf 1 t expβ 1x i +1 pf 2 t expβ 2x i, where both F 1 and F 2 are known, but p, β 1 and β 2 are unknown. State the censored likelihood and show that the EM-algorithm together with iterative least squares in the maximisation step can be used to maximise this likelihood sufficient details need to be given such that your algorithm can be easily coded. Solution i Since F 1 and F 2 are monotonically decreasing positive functions where F 1 0 = F 2 0 = 1 and F 1 = F 2 = 0, then it immediately follows that Ft,x = pf 1 t eβ 1 x +1 pf 2 t eβ 2 x is the same use that df 1t dt = f 1 t, thus Ft;x is a survival function. Ft, x = pe β1x f 1 tf 1 t eβ 1 x 1 1 pe β2x f 2 tf 2 t eβ 2 x 1 t ft;x = pe β1x f 1 tf 1 t eβ 1 x 1 +1 pe β2x f 2 tf 2 t eβ 2 x 1 ii The censored log likelihood is L n β 1,β 2,p = [δ i logfy i ;β 1,β 2,p+1 δ i logfy i ;β 1,β 2,p]. Clearly, directly maximizing the above is extremely difficult. Thus we look for an alternative method via the EM algorithm. Define the unobserved variable I i = { 1 with PIi = 1 = p = p 1 2 with PI i = 2 = 1 p = p 2. Then the joint density of ϕ i,δ i,i i is } { } δ i {logp Ii +β Ii x+logf Ii Y i +e β I x i 1logF Ii Y i +1 δ i logp Ii +e β I x i logf Ii Y i. Thus the complete log likelihood is L T Y,δ,I i ;β 1,β 2,p = {δ i [logp Ii +β Ii x+logf Ii Y i +e β I x i 1logF Ii Y i +1 δ i [logp Ii +e β I i x logf Ii Y i ]} 135
136 Now we need to calculate PI i Y i,δ i. We have ω δ i = PI i = 1 Y i,δ i = 1,p α,β α 1,β α 2 = p α e βα 1 x f 1 Y i F 1 Y i eβα 1 x 1 p α e βα 1 x f 1 Y i F 1 Y i eβα 1 x 1 +1 p α e βα 2 x f 2 Y i F 2 Y i eβα 2 x 1 ω δ i=0 i = PI i = 1 Y i,δ i = 0,p α,β α 1,β α 2 = p α F 1 Y i eβα 1 x p α F 1 Y i eβα 1 x +1 p α F 2 Y i eβα 2 x Therefore the complete likelihood conditioned on what we observe is Qθ,θ α = {δ i ω δ i i [logp+β 1x i +logf 1 Y i +e β 1x i 1logF 1 Y i ] + 1 δ i ω 1 δ i i + [ ] logp+e β 1x i logf 1 Y i } {δ i 1 ω δ i i [log1 p+β 2x i +logf 2 Y i +e β 2x i 1logF 2 Y i ] + 1 δ i 1 ω 1 δ i i [log1 p+e β 2x i logf 2 Y i ]} The conditional likelihood, above, looks unwieldy. However, the parameter estimates can to be separated. First, differentiating with respect to p gives Q T p = δ i ω δ 1 i i p + T ω 1 δ i i 1 δ i 1 T p δ i 1 ω δ i i 1 T 1 p 1 ω 1 δ 1 i i 1 δ i 1 p. Equating the above to zero we have the estimator ˆp = a a+b, where a = b = δ i ω δ i i + ω 1 δ i i 1 δ i T δ i 1 ω δ i i 1 ω 1 δ i i 1 δ i. Now we consider the estimates of β 1 and β 2 at the i th iteration step. Differentiating Q wrt 136
137 to β 1 and β 2 gives Q β 1 = Q β 2 = 2 Q β Q β 2 2 = = {δ i ω δ i i [ ] 1+e β 1x i logf 1 Y i +1 δ i ω 1 δ i i e β 1x i logf 1 Y i }x i = 0 ] {δ i 1 ω δ i i [1+e β 2x i logf 2 Y i δ i ω δ i i eβ 1x i logf 1 Y i +1 δ i ω 1 δ i i e β 1x i logf 1 Y i x 2 i +1 δ i 1 ω 1 δ i i e β 2x i logf 2 Y i }x i = 0 δ i 1 ω δ i i eβ 2x i logf 2 Y i +1 δ i 1 ω 1 δ i i e β 2x i logf 2 Y i x 2 i. Thus to estimate β 1,β 2 at the j th iteration we use [ β j 1 β j 2 ] = [ β j 1 1 β j 1 2 ] + 2 Q β Q β β j 1 [ Q β 1 Q β 2 ] β j 1 Thus β j 1 = β j Q β Q β 1 β j 1. And similarly for β j 2. Now we can rewrite Q 2 1 Q β1 2 β 1 β j 1 as X ω j 1 1 X 1 X S j 1 1 where ω j 1 X = x 1,x 2,...,x T, 1 = diag[ω j 1 S j 1 1 = ω j 1 1i S j 1 ij S j S j 1 1T = δ i ω s i eβj 1 i i 11,...,ω j 1 1T ],,with = δ i ω δ i i [1+eβj 1 logf 1 Y i +1 δ i ω 1 δ i i e βj 1 i x i logf 1 Y i 1 x i logf 1 Y i ]+1 δ i ω 1 δ i i e βj 1 1 x i logf 1 Y i ] Thus altogether in the EM-algorithm we have: Start with initial value β 0 1,β0 2,p0 Step 1 Set β 1,r 1,β 2,r 1,p r 1 = β1,β 2,p. Evaluate ω δ i i and ω 1 δ i i these probabilies/weights stay the same throughout the iterative least squares. 137
138 Step 2 Maximize Qθ,θ by using the algorithm p r = ar a r+b r where a r,b r are defined previously. Now evaluate β j 1 = β j 1 1 +X ω j 1 1 X 1 X S j01 1 same for β j 2, β j 2 +X ω j 1 2 X 1 X 1 S j 1 2 iterate until convergence. Step 3 Let β 1r,β 2r,p r be the limit of the iterative least squares, go back to step 1 until convergence. Example Question Let us suppose that X and Z are independent positive random variables with densities f X and f Z respectively. i Derive the density function of 1/X. ii Show that the density of XZ is or equivalently c 1 f Z cyf X c 1 dc. b Consider the linear regression model 1 x f Z y x f Xxdx 14.7 Y i = α x i +σ i ε i where ε i follows a standard normal distribution mean zero and variance 1 and σ 2 i follows a Gamma distribution fσ 2 ;λ = σ2κ 1 λ κ exp λσ 2, σ 2 0, Γκ with κ > 0. Let us suppose that α and λ are unknown parameters but κ is a known parameter. i Give an expression of the log-likelihood of Y i and explain why it is difficult to compute the maximum likelihood estimate? ii As an alternative to directly maximising the likelihood, the EM algorithm can be used instead. Derive the EM-algorithm for this case. In your derivation explain what quantities will have to be evaluated numerically. 138
139 Solution 3 a i P1/X c = PX > 1/c = 1 F X 1/c F X distribution function of X. Therefore the density of 1/X is f 1/X c = 1/c 2 f X 1/c. ii We first note that PXZ y X = x = PZ y/x. Therefore the density of XZ X is f XZ X y = x 1 f Z y x. Using this we obtain the density of XZ f XZ y = = PXZ = y X = xf X x = f XZ X y xf X xdx 1 x f Z y x f Xxdx 14.8 Or equivalently we can condition on 1 X to obtain f XZ y = PXZ = y 1 X = xf 1/Xc = = c 1 f Z cyf X c 1 dc. cf Z cy 1 c 2f Xc 1 dc Note that with a change of variables c = 1/x we can show that both integrals are equivalent. b i We recall that Y i = α x i +σ i ε i. Therefore the log-likelihood of Y i is L n α,λ = = n logf σε Y i α x i n 1 log x f σ Y i α x i ;λf ε xdx, x where we use 14.8 to obtain the density of f σε, f σ ;λ is the density of a square root Gamma random variable and f ε is the density of a normal. It is clear either it is very hard or impossible to obtain an explicit expression for f σε. ii Let U = σ 2 1,...,σ2 n denote the unobserved variances which are unobserved and Y = Y 1,...,Y n which is observed. The complete unobserved log-likelihood of U,Y is L T Y,U;α,λ = n logσi [ σi 2 Yi α ] 2 x i +κ 1logσ 2 i +κlogλ λσi 2 Of course, the above can not be evaluated, since U is unobserved. Instead we evaluate the condition expectation of the above with respect to what is observed. Thus the 139
140 conditioned likelihood with respect to Y and the parameters α,λ is Qα,λ = E L T Y,U;α,λ Y,α,λ = n { E logσi 2 σ 2 i ε 2 i = Y i α x i 2,λ 1 [Yi +E σi 2 σ2 iε 2 i = Y i α x i 2,λ α ] 2 x i +κ 1E logσi 2 σ2 iε 2 i = Y i α x i 2,λ +κlogλ } λeσi 2 σ2 iε 2 i = Y i α x i 2,λ. We note that the above is true because conditioning on Y i and α, means that σ 2 i ε2 i = Y i α x i 2 is observed. Thus by evaluating Q at each stage we can implement the EM algorithm: The EMalgorithm i Define an initial value θ 1 Θ. Let θ = θ 1. ii The expectation step The k+1-step, For a fixed θ evaluate logfy,u;θ Qθ,θ = E logfy,u;θ Y,θ = fu Y,θ du, for all θ Θ. iii The maximisation step Evaluate θ k+1 = argmax θ Θ Qθ,θ. We note that the maximisation can be done by finding the solution of logfy,u;θ E Y,θ = 0. iv If θ k and θ k+1 are sufficiently close to each other stop the algorithm and set ˆθ n = θ k+1. Else set θ = θ k+1, go back and repeat steps ii and iii again. The useful feature of this EM-algorithm is that if the weights 1 E σi 2 σiε 2 2 i = Y i α x i 2,λ Eσi σ 2 iε 2 2 i = Y i α x i 2,λ. are known. Then we donot need to numerically maximise Qα,λ at each stage. This 140
141 is because the derivative of Qα,λ leads to an explicit solution for α and λ: Qα, λ α Qα, λ α n 1 [Yi = 2 E σ 2 σiε 2 2 i = Y i α x i 2,λ α ] x i xi = 0 i n κ = λ Eσ2 i σiε 2 2 i = Y i α x i 2,λ = 0. It is straightfoward to see that the above can easily be solved for α and λ. Of course we need to evaluate the weights 1 E σiε 2 2 i = Y i α x i 2,λ σ 2 i Eσi σ 2 iε 2 2 i = Y i α x i 2,λ. This is done numerically, by noting that for a general g the conditional expectation is Egσ 2 σ 2 ε 2 = y = gσ 2 f σ 2 σ 2 ε 2σ2 ydσ 2. Thus to obtain the density of f σ 2 σ 2 ε 2 we note that Pσ2 < s σ 2 ε 2 = y = Py < ε 2 s = Pε 2 y/s = 1 Pε 2 y/s. Hence the density of σ 2 given σ 2 ε 2 = y is f σ 2 σ 2 ε 2s ε2 = 1 f s 2 ε 2y/s, where f ε 2 is a chi-squared distribution with one degree of freedom. Hence Egσ 2 σ 2 ε 2 = y = gσ 2 1 σ 2f ε 2y/σ2 dσ 2. Using the above we can numerically evaluate the above conditional expectations and thus Qα, λ. We keep iterating until we get convergence Hidden Markov Models Finally, we consider applications of the the EM-algorithm to parameter estimation in Hidden Markov Models HMM. This is a model where the EM-algorithm pretty much surpasses any other likelihood maximisation methodology. It is worth mentioning that the EM-algorithm in this setting is often called the Baum-Welch algorithm. Hidden Markov models are a generalisation of mixture distributions, however unlike mixture distibutions it is difficult to derive an explicit expression for the likelihood of a Hidden Markov Models. HMM are a general class of models which are widely used in several applications including speech recongition, and can easily be generalised to the Bayesian set-up. A nice description of them can be found on Wikipedia. 141
142 In this section we will only briefly cover how the EM-algorithm can be used for HMM. We do not attempt to address any of the issues surrounding how the maximisation is done; interested readers should refer to the extensive literature on the subject.

The general HMM is described as follows. Let us suppose that we observe {Y_t}, where the rvs Y_t satisfy the Markov property P(Y_t | Y_{t−1}, Y_{t−2}, ...) = P(Y_t | Y_{t−1}). In addition to {Y_t} there exists a hidden, unobserved, discrete random variable {U_t}, where {U_t} satisfies the Markov property P(U_t | U_{t−1}, U_{t−2}, ...) = P(U_t | U_{t−1}) and drives the dependence in {Y_t}. In other words P(Y_t | U_t, Y_{t−1}, U_{t−1}, ...) = P(Y_t | U_t). To summarise, the HMM is described by the following properties:

i We observe {Y_t} (which can be either continuous or discrete random variables) but do not observe the hidden discrete random variables {U_t}.

ii Both {Y_t} and {U_t} are time-homogeneous Markov random variables, that is P(Y_t | Y_{t−1}, Y_{t−2}, ...) = P(Y_t | Y_{t−1}) and P(U_t | U_{t−1}, U_{t−2}, ...) = P(U_t | U_{t−1}). The distributions P(Y_t), P(Y_t | Y_{t−1}), P(U_t) and P(U_t | U_{t−1}) do not depend on t.

iii The dependence between the {Y_t} is driven by {U_t}, that is P(Y_t | U_t, Y_{t−1}, U_{t−1}, ...) = P(Y_t | U_t).

There are several examples of HMM, but to have a clear interpretation of them, in this section we shall only consider one classical example of a HMM. Let us suppose that the hidden random variable U_t can take N possible values {1, ..., N} and let p_i = P(U_t = i) and p_ij = P(U_t = i | U_{t−1} = j). Moreover, let us suppose that the Y_t are continuous random variables where Y_t | U_t = i ~ N(µ_i, σ_i²) and the conditional random variables Y_t | U_t and Y_τ | U_τ are independent of each other. Our objective is to estimate the parameters θ = {p_i, p_ij, µ_i, σ_i²} given {Y_i}. Let f_i(·; θ) denote the density of the normal distribution N(µ_i, σ_i²).

Remark (HMM and mixture models) Mixture models described in the above section are a particular example of HMM.
In this case the unobserved variables {U t } are iid, where p i = PU t = i U t 1 = j = PU t = i for all i and j. Let us denote the log-likelihood of {Y t } as L T Y;θ this is the observed likelihood. It is clear that constructing an explicit expression for L T is difficult, thus maximising the likelihood is near impossible. In the remark below we derive the observed likelihood. Remark The likelihood of Y = Y 1,...,Y T is L T Y;θ = fy T Y T 1,Y T 2,...;θ...fY 2 Y 1 ;θpy 1 ;θ = fy T Y T 1 ;θ...fy 2 Y 1 ;θfy 1 ;θ. 142
143 Thus the log-likelihood is L T Y;θ = logfy t Y t 1 ;θ+fy 1 ;θ. The distribution of fy 1 ;θ is simply the mixture distribution t=2 fy 1 ;θ = p 1 fy 1 ;θ p N fy 1 ;θ N, where p i = PU t = i. The conditional fy t Y t 1 is more tricky. We start with fy t Y t 1 ;θ = fy t,y t 1 ;θ. fy t 1 ;θ An expression for fy t ;θ is given above. To evaluate fy t,y t 1 ;θ we condition on U t,u t 1 to give using the Markov and conditional independent propery fy t,y t 1 ;θ = i,j fy t,y t 1 U t = i,u t 1 = jpu t = i,u t 1 = j = i,j fy t U t = ipy t 1 U t 1 = jpu t = i U t 1 = jpu t 1 = i = i,j f i Y t ;θ i f j Y t 1 ;θ j p ij p i. Thus we have fy t Y t 1 ;θ = i,j f iy t ;θ i f j Y t 1 ;θ j p ij p i i p. ify t 1 ;θ i We substitute the above into L T Y;θ to give the expression L T Y;θ = Now try to maximise this! t=2 log i,j f iy t ;θ i f j Y t 1 ;θ j p ij p i i p ify t 1 ;θ i N +log p i fy 1 ;θ i Instead we seek an indirect method for maximising the likelihood. By using the EM algorithm we can maximise a likelihood which is a lot easier to evaluate. Let us suppose that we observe {Y t,u t }. Since PY U = PY T Y T 1,...,Y 1,UPY T 1 Y T 2,...,Y 1,U...PY 1 U = T PY t U t, and the distribution of Y t U t is Nµ Ut,σ 2 U t, then the complete likelihood of {Y t,u t } is T T fy t U t ;θ p U1 p Ut U t t=2
144 Thus the log-likelihood of the complete observations {Y t,u t } is L T Y,U;θ = logfy t U t ;θ+ logp Ut U t 1 +logp U1. Of course, we do not observe the complete likelihood, but the above can be used in order to define the function Qθ,θ which is maximised in the EM-algorithm. It is worth mentioning that given the transition probabilities of a discrete Markov chain that is {p i,j } ij one can obtain the marginal probabilities {p i }. Thus it is not necessary to estimate the marginal probabilities {p i } note that the exclusion of {p i } in the log-likelihood, above, gives the conditional complete log-likelihood. We recall that to maximise the observed likelihood L T Y;θ using the EM algorithm involves evaluating Qθ,θ, where T Qθ,θ = E logfy t U t ;θ+ = U = U t=2 logp Ut U t 1 +logp U1 Y,θ t=2 T logfy t U t ;θ+ logp Ut U t 1 +logp U1 pu Y,θ t=2 [logfy t U t ;θ]pu t Y,θ + U [logp Ut U t 1 ]PU t,u t 1 Y,θ +[logp U1 ]PU 1 Y,θ, and U denotesallcombinationsofu. SincePU t Y,θ = PU t Y,θ /PY,θ andpu t,u t 1 Y,θ = PU t,u t 1 Y,θ /PY,θ and PY,θ is common to all U t and is independent of θ we can define t=2 Qθ,θ = U [logfy t U t ;θ]pu t,y,θ + U [logp Ut U t 1 ]PU t,u t 1,Y,θ +[logp U1 ]PU 1,Y,θ, t=2 where Qθ,θ Qθ,θ and the maximum of Qθ,θ with respect to θ is the same as the maximum of Qθ,θ. Thus the quantity Qθ,θ is evaluated and maximised with respect to θ. For a given θ and Y, the conditional probabilities PU t,y,θ and PU t,u t 1,Y,θ can be evaluated through a series of iterative steps. For this example the EM algorithm is i Define an initial value θ 1 Θ. Let θ = θ 1. ii The expectation step, For a fixed θ evaluate PU t,y,θ, PU t,u t 1,Y,θ. Qθ,θ defined in
145 iii The maximisation step: evaluate θ_{k+1} = argmax_{θ∈Θ} Q̃(θ*, θ) by differentiating Q̃(θ*, θ) with respect to θ and equating to zero.

iv If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Else set θ* = θ_{k+1}, go back and repeat steps ii and iii again.
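The "series of iterative steps" mentioned above for evaluating the conditional probabilities are the forward-backward recursions. As a flavour (my own illustration; the forward recursion is standard but is not derived in these notes), the sketch below uses the forward recursion to evaluate the observed log-likelihood L_T(Y;θ) of the Gaussian HMM of this section, with the convention p_ij = P(U_t = i | U_{t−1} = j), so the columns of the transition matrix sum to one.

```python
import numpy as np

def hmm_log_likelihood(y, p_init, P, mus, sigmas):
    """Scaled forward recursion for a Gaussian HMM.

    p_init[i] = p_i = P(U_1 = i); P[i, j] = p_ij = P(U_t = i | U_{t-1} = j);
    Y_t | U_t = i ~ N(mus[i], sigmas[i]^2).
    """
    # emission densities f_i(y_t), one row per time point
    B = np.exp(-0.5 * ((y[:, None] - mus) / sigmas) ** 2) \
        / (sigmas * np.sqrt(2 * np.pi))
    alpha = p_init * B[0]              # alpha_1(i) = p_i f_i(y_1)
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()               # rescale to avoid numerical underflow
    for t in range(1, len(y)):
        # alpha_t(i) = f_i(y_t) * sum_j p_ij alpha_{t-1}(j)
        alpha = B[t] * (P @ alpha)
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

y = np.array([0.1, -1.2, 2.3, 0.7])
p_init = np.array([0.4, 0.6])
mus = np.array([-1.0, 1.0])
sigmas = np.array([1.0, 0.5])
# transition matrix whose columns all equal p_init: the chain is then iid
P_iid = np.column_stack([p_init, p_init])
ll = hmm_log_likelihood(y, p_init, P_iid, mus, sigmas)
```

When every column of P equals p_init the hidden chain is iid, so the recursion must reduce to the mixture log-likelihood Σ_t log Σ_i p_i f_i(Y_t); this gives a simple correctness check.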
147 Chapter 15

The nonparametric density

15.1 Nonparametric density estimation

The nonparametric density estimator

In the sections above we often assumed that {X_t} were iid random variables with density f(x; θ_0), where f was known but θ_0 was unknown. In reality we often do not know the parametric family from which {X_t} comes. We may want an estimator of f without imposing any parametric constraints on it (we can even use this estimator as a guide as to which parametric family would be appropriate for fitting to the data). The most obvious way to estimate f would be to use the histogram. In other words, partition the interval into a sequence a_1 < a_2 < a_3 < ... < a_N, such that [a_1, a_N] is the range of {X_t}. Let N_j denote the number of observations in the interval (a_{j−1}, a_j] and define

f̂(x) = (1/T) Σ_{j=2}^{N} [N_j / (a_j − a_{j−1})] I_{(a_{j−1}, a_j]}(x).

There are a few drawbacks in using f̂(x):

a A plot of f̂(x) demonstrates that it is quite rough, blocky and not particularly appealing to look at.

b Suppose that the frequency in the interval [a_{j−1}, a_j] is very large, and most of the values of X_t which lie in [a_{j−1}, a_j] are close to a_{j−1}. I want an estimator of f(x), where x is less than a_{j−1} but close to a_{j−1}. The estimator f̂(x) does not take into account the proximity of x to the values of X_t which lie in the adjacent interval [a_{j−1}, a_j]. Hence in some sense it is not particularly local.
148 Drawback b can be remedied by considering the following kernel density estimator of f:

f̂_T(x) = (1/T) Σ_{t=1}^{T} (1/h) I_{[x−h/2, x+h/2]}(X_t) = (1/T) Σ_{t=1}^{T} (1/h) I_{[−h/2, h/2]}(x − X_t) = (1/(hT)) Σ_{t=1}^{T} I_{[−1/2, 1/2]}((x − X_t)/h).

Notice that the above estimator is based on counting the number of observations in a window of length h about x. Hence, using the above estimator we do not have to choose the partitions {a_j}, and we avoid the problem that x can be in close proximity to several X_t which lie in a different interval. The only parameter that needs to be chosen is the bandwidth h, which is a big headache. However, as a plot of f̂_T(x) will tell you, f̂_T still suffers from a roughness problem, and this is because a rectangular window is being used. We can replace the rectangular window I_{[−1/2, 1/2]} by a smooth window, which gives a more appealing looking density estimator. Hence, in general, the density estimator of f is

f̂_T(x) = (1/(hT)) Σ_{t=1}^{T} K((x − X_t)/h),

where K is a positive kernel function such that ∫K(x)dx = 1. Observe that the kernel function K can be treated as a density function. Classical examples of kernel functions are the Gaussian and the uniform (which is the rectangular kernel), etc. A different kernel will give a slightly different estimator.

The choice of h depends on the sample size: the larger the sample size, the smaller h should be. To understand why, recall that if h is small then P(x − h/2 ≤ X_t ≤ x + h/2) ≈ h f(x), therefore

f(x) ≈ P(x − h/2 ≤ X_t ≤ x + h/2)/h,

and an estimator of the above is

(number of X_t in [x − h/2, x + h/2])/(hT) = (1/(hT)) Σ_{t=1}^{T} I_{[−1/2, 1/2]}((x − X_t)/h).

Hence, h needs to be small, but not so small that there are seldom any observations in [x − h/2, x + h/2]. Therefore we will suppose that h → 0 as T → ∞.

Remark To understand the kernel density estimator, do some simulations. Draw a random sample from your favourite distribution. In R the density can be estimated using the function density. The function density gives various options. You have the option of using your favourite kernel; choices are gaussian,
149 epanechnikov, rectangular, triangular, biweight, cosine, optcosine. You can also specify the bandwidth using bw. For example,

plot(density(x, bw = 0.5, kernel = "rectangular"))

computes and plots the kernel density estimator for the observations x using the rectangular kernel and bandwidth bw = 0.5. See Figure 15.1 for the plot of the same set of observations (n = 20) using different kernels and different bandwidths. See also the corresponding section in Davison (2002).

We will discuss how to choose h in the section below.

Properties of f̂_T

The first thing to notice about f̂_T is that it is a viable density, in the sense that f̂_T is non-negative and

∫ f̂_T(x) dx = (1/T) Σ_{t=1}^{T} ∫ (1/h) K((x − X_t)/h) dx = 1,

the above being obtained by making the change of variables u_t = (x − X_t)/h. We now want to investigate how good an estimator of f the estimator f̂_T is. We do this by calculating the bias and the variance, which together give the mean squared error.

Assumption

i The function f is Lipschitz continuous. This means that there exists a finite constant B such that |f(x) − f(y)| ≤ B|x − y|. Examples of Lipschitz continuous functions are functions whose first derivative is bounded. Intuitively, a Lipschitz continuous function is a function which is relatively smooth (i.e. not too wiggly and rough; it changes smoothly). The size of B quantifies the slope or gradient: a large B means that in parts f has a steep slope. Often the amount of wiggle is quantified through bounds on the second derivative.

ii f belongs to the class C(L); this means that B ≤ L. Hence a function f ∈ C(L) can have a maximum gradient of L (it cannot be too steep).

iii The kernel function satisfies ∫ |x| K(x) dx < ∞.

iv The kernel function satisfies ∫ |x| K(x)² dx < ∞.
150 Figure 15.1: 20 observations drawn from a standard normal distribution. Top 4 plots use the rectangle kernel, bottom 4 plots use the Gaussian kernel. Each kernel was plotted with bandwidths h = 0.2, 0.4, 0.8, 1.
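In the same spirit as the R simulations suggested above, the estimator f̂_T(x) = (hT)^{−1} Σ_t K((x − X_t)/h) is a few lines to implement directly. The sketch below is my own illustrative Python version (mirroring, not reproducing, R's density function), supporting the Gaussian and rectangular kernels:

```python
import numpy as np

def kde(x_grid, data, h, kernel="gaussian"):
    """Evaluate f_hat_T(x) = (1/(hT)) sum_t K((x - X_t)/h) on a grid."""
    u = (x_grid[:, None] - data[None, :]) / h
    if kernel == "gaussian":
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    else:                              # rectangular kernel K = I_[-1/2, 1/2]
        K = (np.abs(u) <= 0.5).astype(float)
    return K.mean(axis=1) / h          # average over the T observations

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 5000)      # sample from N(0, 1)
grid = np.linspace(-6.0, 6.0, 1201)
fhat = kde(grid, data, h=0.3)
```

Since K is itself a density, f̂_T integrates to one; and with this many observations the estimate at x = 0 should be close to the true value f(0) = (2π)^{−1/2} ≈ 0.399.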
151 Unlike parametric estimators, nonparametric estimators have a sizeable bias, which needs to be tracked. We give a bound for the bias below.

Lemma Suppose Assumption is satisfied. Then we have f(x) − E f̂_T(x) = O(h).

PROOF. We observe that

E f̂_T(x) = E[(1/T) Σ_{t=1}^{T} (1/h) K((x − X_t)/h)] = E[(1/h) K((x − X_1)/h)] = ∫ (1/h) K((x − y)/h) f(y) dy.

Now by making the substitution u = (x − y)/h we have

∫ (1/h) K((x − y)/h) f(y) dy = ∫ K(u) f(x − uh) du.

By the Lipschitz continuity assumption we have |f(x) − f(x − uh)| ≤ B|uh|, hence by using this we have

|∫ K(u) f(x − uh) du − ∫ K(u) f(x) du| ≤ ∫ K(u) |f(x − uh) − f(x)| du ≤ Bh ∫ |u| K(u) du = O(h).

We use the notation f(x) − E f̂_T(x) = O(h) because B ∫|u|K(u)du is finite and constant; hence the magnitude of f(x) − E f̂_T(x) depends on the bandwidth h. We recall that O(h) means that (1/h)|∫K(u)f(x − uh)du − ∫K(u)f(x)du| < ∞ for all h.

The above basically means that f̂_T(x) is a biased estimator of f(x).

i A large h corresponds to a wide window function. This means allowing a lot of the X_t to lie inside the window K((x − X_t)/h); as we will show below, this leads to a small variance, because we have a large (effective) sample.

ii The disadvantage of a large h is, as demonstrated above, that we have a large bias. This is because a window gives a local average: the wider the window, the more we smooth out any local features present in the data (see Figure 15.1).
152 Lemma Suppose Assumption is satisfied. Then we have PROOF. It is easy to see that 1 varˆf T x = var T varˆf T x = O 1 ht. 1 h Kx X t h = 1 T var 1 h Kx X t = 1 1 h T E h Kx X t h We first consider 1 2 E h Kx X t = h 2 1 T [ 1 E h Kx X t h 1 h 2Kx y h 2 fydy = 1 Ku 2 fx+uhdu h Ku 2 du+b uku 2 du, fx h the above is true by changing variables u = x y h. Next we consider E 1 h Kx Xt h. By using the results in Lemma we have Therefore altogether we have 1 E h Kx X t h = fx+oh. varˆf T x = O 1 ht + 1 T = O 1 ht, ]. which gives the desired results. that Using the above two lemmas we can obtain a bound for the mean squared error. We know 2 E ˆf T x fx = varˆf T x+ E ˆfT x 2 fx. In other words, the mean squared error is the variance plus the bias squared. By using Lemmas and above two results we have 2 E ˆf T x fx = var ˆfT x + = O h 2 }{{} bias + 1. }{{} ht variance 2 Eˆf T x fx The above illustrates the trade off which often arises in nonparametric statistics, the larger the bandwidth h we use, the wider the window more observations and the smaller the variance 152
153 but the larger the bias. On the other hand, a small h means a small window: a small bias but a large variance (fewer observations). To decide what order we should choose for h (how small h should be with respect to the sample size), we try to find the h which leads to the smallest mean squared error bound b(h) = h² + (hT)^{−1}. We do this by differentiating b(h) with respect to h and equating to zero. This leads to the observation that h = O(T^{−1/3}) gives the smallest b(h); with this choice of bandwidth the mean squared error satisfies

E(f̂_T(x) − f(x))² = O(T^{−2/3}).

The above result implies that if the function f is Lipschitz continuous then by using the kernel density estimator the rate of convergence is O(T^{−2/3}), and this rate cannot be bettered if we use the kernel density estimator. Indeed if the function were more wiggly (rough), hence not even Lipschitz continuous, then the rate of convergence would be slower. However, if the density f were smoother (this can be quantified by the number of derivatives it has), the mean squared error of f̂_T(x) improves, so long as the kernel K satisfies suitable conditions. For example, in the case that f has a bounded second derivative, E(f̂_T(x) − f(x))² = O(T^{−4/5}).

Remark It is important to mention that the above says that if h → 0 at such a rate that 0 < hT^{1/3} < ∞, then we have E(f̂_T(x) − f(x))² = O(T^{−2/3}). It does not mean that we need to choose h such that it is exactly h = T^{−1/3}, and herein lies the problem. The above is an asymptotic result; it does not tell us how to choose h in practice. There have been several different methods proposed for selecting the bandwidth h, one of the most popular being based on cross-validation (good bandwidth selection methods have been proposed by both Jeff Hart and Simon Sheather).

In the discussion above we have considered one estimator of f(x), and for this estimator we have shown that E(f̂_T(x) − f(x))² = O(T^{−2/3}). It is natural to ask whether there exists another estimator which gives a smaller mean squared error (for example O(T^{−1}) instead of O(T^{−2/3})).
In other words, is there a lower bound for the mean squared error, such that no estimator can do better than this lower bound? We recall that we asked exactly the same question when we were considering the Cramer-Rao lower bound. However, there are two fundamental problems when working with nonparametric estimators.

i Most nonparametric estimators tend to be biased (for example, the kernel density estimator defined above is biased), whereas the Cramer-Rao lower bound is a bound for unbiased
154 estimators (this is why we only consider the variance, and not the mean squared error, for the Cramer-Rao lower bound).

ii The Cramer-Rao bound is a bound for parameters in a parametric family. In other words, we are estimating a finite number of parameters. In nonparametric statistics, we are estimating, in some sense, an infinite number of parameters.

At first glance, there does not seem to be a simple way of obtaining a lower bound for the mean squared error. However, there does exist a bound, called the van Trees or Bayesian Cramer-Rao bound, that gives a lower bound for the mean squared error even when the estimator is biased. This bound is only for parametric distributions, but by using some small tricks we are able to derive a lower bound for E(f̂_T(x) − f(x))².

The Bayesian van Trees inequality

As far as I am aware, it is not possible to obtain directly a lower bound for the mean squared error which allows for biased estimators. However, we derived in Section 2.1 the Bayesian Cramer-Rao bound. We start by recalling this bound. Let us suppose that X_t has the density f(x; θ), where θ is an unknown parameter and θ ∈ Θ, where Θ is the parameter space. But now the parameters in the parameter space are not treated as deterministic but random: θ has a distribution, which we denote as λ(θ). Hence f(x; θ) is the distribution of X_t given the parameter θ; {X_t} is a random sample with density f(x; θ), where θ is a draw from the distribution λ(θ). The prior distribution λ(θ) of θ is a probability distribution that represents the experimenter's opinion about the unknown parameters.

Now we place the mean squared error within the Bayesian framework. We note that the classical mean squared error of an estimator θ̂_T is E((θ̂_T − θ)²), but in the Bayesian framework this is a random quantity, since it depends on θ.
To emphasise that θ is random, we rewrite the classical mean squared error as

E((θ̂_T − θ)² | θ) = ∫ (θ̂ − θ)² f(θ̂ | θ) dθ̂.

Since E((θ̂_T − θ)² | θ) is random, we take the expectation with respect to the distribution λ(θ), i.e.

E_λ[E((θ̂_T − θ)² | θ)] = ∫ E((θ̂_T − θ)² | θ) λ(θ) dθ.

Now suppose that the prior density λ satisfies the following assumption.
155 Assumption θ is defined over the compact interval [a,b] and λx 0 as x a and x b so λa = λb = 0. Furthermore we have the usual regularity conditions Assumption Regularity Conditions 1 Let fx;θ be a probability density function, which satisfies i logfx;θ fx;θdx = 0. ii fx;θdx = fx;θ dx. iii For any g, gx1,...,x T n fx i;θdx = gx 1,...,g T T iv E logfx;θ 2 is strictly positive. fx i;θ Theorem Suppose that {X t } T are iid random variables with density fx;θ, where θ has the prior distribution λ. Suppose Assumptions and hold. Let ˆθ T be an estimator of θ. Then we have where E E θ ˆθ T θ 2 1 TE λ Iθ+Iλ logfx;θ 2 θ logfx;θ 2 Iθ = E = fx;θdx logλθ 2 and Iλ = λθdθ. PROOF. See Section 2.1. We note that the bound above does not require that ˆθ T is an unbiased estimator of θ. We now apply the Bayesian Cramer-Rao bound to obtain a lower bound for any density estimator of fx. See also the books by van Trees Detection, Estimation and Modulation Theory 1968 and the Bayesian Bounds for Parameter Estimation and Nonlinear Filtering/Tracking dx An application of the van Trees inequality to nonparametric density estimation We recall that the class of functions CL is defined as { CL = f; fx fy L x y, } fxdx = 1. In the rest of this section we will prove the following result. 155
156 Theorem Suppose that {X_t} are iid random variables with density f ∈ C(L). Let f̂_T be any estimator of f. Then we have the following lower bound:

sup_{f ∈ C(L)} E(f̂_T(x) − f(x))² ≥ K / T^{2/3},

where K is a finite constant which does not depend on f or f̂_T.

The above bound tells us that no estimator of the worst function in C(L) can do better than K/T^{2/3}. In other words, we cannot obtain an estimator whose mean squared error, for the worst function in the class, beats the rate K/T^{2/3}. The "worst function" is a little ambiguous, but one can think of it as the most wiggly and rough function that the class C(L) permits.

In the section above we derived a lower bound for the mean squared error by integrating over the parameter space. However, those bounds are for finite dimensional parameter spaces (for example, θ could be a mean), whereas the functions in C(L) are infinite dimensional. To resolve this problem we consider a parametric family which is a subset of C(L) (we will assume that L > 1). It is clear that a lower bound for the mean squared error over any subset of C(L) will also be a lower bound for the worst function in C(L); i.e. for any f ∈ C(L) we have

inf_{f̂} E(f̂_T(x) − f(x))² ≤ inf_{ĝ} sup_{g ∈ C(L)} E(ĝ_T(x) − g(x))².

Hence, by obtaining a lower bound over a subset of C(L), we also have a lower bound for the worst function. The problem is that we need to choose an appropriate subset of C(L) which is not too simple, in order to obtain a lower bound which is as close as possible to sup_{f∈C(L)} E(f̂_T(x) − f(x))². More precisely, if we chose a nice parametric family (for example, the Gaussian distributions with known variance but unknown mean, which is a subset of C(L)), we would be likely to obtain a parametric lower bound O(T^{−1}). However, it is likely (we will show this below) that this lower bound is too small: estimators of densities which do not behave well will not attain such a small bound. Instead we choose a parametric family which is rough, but still lies in C(L).
Let W be a density function defined on [0,1], where sup x Wx 1 and sup x W x 1. Define the class of parametric functions { } GM = fx;η;fx;η = ηwmx+1 η/mi [0,1] x,0 < η L/M M 1, hence the parameter space of η is Θ M = {η;0 η L/M M 1 }. Since for 0 x,y 1 that fx;η fy;η ηm x y +1 η/m x y L x y, 156
157 hence the parameter space of η is Θ M = {η;0 η L/M M 1 }. Therefore, we have GM CL. But we observe a special feature of GM, as M, gets large, the functions inside GM can become more and more spikey, but to keep them within the class CL, the magntitude of the spike is reduced by using an ever smaller value of η make a sketch of fx;η. Now we define a prior distribution for η. Let us suppose that λ : [0,1] R is a density defined in the unit interval, with λ0 = 0, λ1 = 0 and Iλ = logλη η 2 ληdη <. Let λ M η = Mλ Mη, where M = M M 1 /L, it is clear that the density λ M η also satisfies Assumption Let ˆη denote any estimator of η, hence fx;ˆη is an estimator of fx;η. Suppose that ˆη GM, we now use Theorem to obtain a lower bound for E λm E fx;ˆη fx 2 η = WMx+I [0,1] x 2 EλM E {ˆη η 2 η } As the sample size T grows we let M also grow hence the spike in the density becomes more peaky. To ensure that GM lies in CL, we have to shrink the parameter space Θ M, hence the prior distribution λ M will have a smaller support. Lemma Suppose that Assumption is satisfied and M 2. Then we have for some finite constant C. E λm E {ˆη η 2 η } C T/M +M 2, PROOF. We first note that by using the Bayesian Cramer-Rao bound we have where and E λm E {ˆη η 2 η } 1 TE λm I M η+iλ M, logfx;η 2 I M η = E η η logλ M θ 2λM Iλ M = θdθ. We now obtain upper bounds for the above terms. To bound I M η we observe that logfx;η 2 WMx I [0,1] x 2 =, η ηwmx+1 η/mi [0,1] x hence for a sufficiently large M we have I M η = 1 0 WMx I [0,1] x 2 η[wmx M 1 I [0,1] x]+i [0,1] x dx K 157 K WMx I [0,1] x 2 dx [WMx 2 2WMx+1]dx K/M.
158 The above inequality is obtained by making a change of variables and using that ∫W(x)dx = 1 and ∫W(x)²dx ≤ 1. Therefore we have E_{λ_M} I_M(η) ≤ K/M. We now obtain an upper bound for I(λ_M). Since λ_M(η) = M̃ λ(M̃η), by definition of I(λ_M) we have

I(λ_M) = ∫ ((∂/∂η) log λ_M(η))² λ_M(η) dη = M̃² ∫ ((∂/∂u) log λ(u))² λ(u) du = M̃² I(λ).

Therefore, by using the above, we have the bound

E_{λ_M} E({η̂ − η}² | η) ≥ 1/(K T/M + M̃² I(λ)) ≥ C/(T/M + M²),

which gives the required result.

Now by using (15.3) and the same argument as in (15.2), we have for all estimators ĝ of g

E_{λ_M} E({f(x; η̂) − f(x; η)}² | η) = (W(Mx) + I_{[0,1]}(x))² E_{λ_M} E({η̂ − η}² | η) ≤ sup_{g∈C(L)} E(ĝ_T(x) − g(x))².

Therefore, by using the lemma above and that (W(Mx) + I_{[0,1]}(x))² ≤ 4, we have for all M

C/(T/M + M²) ≤ 4 E_{λ_M} E({η̂ − η}² | η) ≤ 4 inf_{ĝ} sup_{g∈C(L)} E(ĝ_T(x) − g(x))²,

which means (absorbing the factor 4 into the constant C) that for all M we have

C/(T/M + M²) ≤ inf_{ĝ} sup_{g∈C(L)} E(ĝ_T(x) − g(x))².

Since the above lower bound holds for all G(M) ⊂ C(L) and all M, to make this bound as tight as possible we choose the M which maximises the above lower bound. By differentiating T/M + M² with respect to M and equating to zero, we get M = O(T^{1/3}). This leads to the bound

C/T^{2/3} ≤ inf_{ĝ} sup_{g∈C(L)} E(ĝ_T(x) − g(x))²,

which proves the theorem above. This basically means that any estimator of the worst function in C(L) cannot do better than the rate O(T^{−2/3}). In other words, there exist functions which are only Lipschitz continuous whose nonparametric estimators cannot better the O(T^{−2/3}) rate.
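The two rate calculations of this chapter (minimising h² + (hT)^{−1} over the bandwidth h, and differentiating T/M + M² over M above) share the same T^{1/3} structure. A quick numerical check of the bandwidth version (purely illustrative; this is a check of the rate, not a practical bandwidth selector):

```python
import numpy as np

def optimal_bandwidth(T):
    """Minimise b(h) = h^2 + 1/(hT) (squared bias + variance) over a grid."""
    h = np.linspace(1e-3, 1.0, 200001)
    b = h ** 2 + 1.0 / (h * T)
    return h[np.argmin(b)]

# calculus: db/dh = 2h - 1/(h^2 T) = 0  =>  h* = (2T)^(-1/3) = O(T^(-1/3)),
# so h* T^(1/3) should be the constant 2^(-1/3) for every sample size
ratios = [optimal_bandwidth(T) * T ** (1 / 3) for T in (100, 1000, 10000)]
# each ratio is approximately 2^(-1/3) = 0.794
```

Plugging h* back into b(h) gives b(h*) = O(T^{−2/3}), matching the minimax rate derived above.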
159 Remark Now let us return to the kernel density estimator (15.1), where we showed that E(f̂_T(x) − f(x))² = O(T^{−2/3}). The theorem above means that (15.1) is a rate-optimal estimator for functions which are Lipschitz continuous (which includes functions which have a bounded first derivative). Furthermore, this rate cannot be bettered without further smoothness assumptions on the function f. However, note the fundamental difference between parametric and nonparametric statistics. Nonparametric estimators are more flexible than parametric estimators, but the cost of this flexibility is that the rates of convergence are worse than the parametric rates, where the mean squared error is usually O(T^{−1}). For more details see Gill and Levit (Bernoulli, 1995).
161 Chapter 16

The loss function and estimating equations

16.1 Loss functions

Up until now our main focus has been on parameter estimation via the maximum likelihood. However, the negative log-likelihood is one criterion belonging to a group of criteria known as loss functions. Loss functions are usually distances, such as the l_1 and l_2 distances. Typically we estimate a parameter by minimising the loss function, using as the estimator the parameter which minimises the loss. Usually (but not always) the way to minimise the loss function is to differentiate it and equate it to zero. This motivates another means of estimating parameters: finding the solution of an equation (such functions are usually called estimating functions). For many examples there is an estimating function which corresponds to a loss function, and a loss function which corresponds to an estimating equation. But this will not always be the case.

Example (Examples where the loss function does not have a derivative) Consider the Laplacian (also known as the double exponential), which is defined as

f(y; θ, µ) = (1/(2θ)) exp(−|y − µ|/θ) = { (1/(2θ)) exp((y − µ)/θ)   y < µ ;  (1/(2θ)) exp(−(y − µ)/θ)   y ≥ µ }.

We observe {Y_t} and our objective is to estimate the location parameter µ; for now the scale parameter θ is not of interest. The log-likelihood is

L_T(µ) = −T log(2θ) − (1/θ) Σ_{i=1}^{T} |Y_i − µ|.
162 We maximise the above to estimate µ. We see that this is equivalent to minimising the loss function

L_T(µ) = Σ_{i=1}^{T} |Y_i − µ| = Σ_{Y_i > µ} (Y_i − µ) + Σ_{Y_i ≤ µ} (µ − Y_i).

If we make a plot of L_T over µ, and consider how L_T behaves at the ordered observations {Y_(t)}, we see that it is piecewise linear: continuous, but non-differentiable at the points Y_t. On closer inspection, if T is odd we see that L_T has its minimum at µ = Y_((T+1)/2), the middle order statistic, which is the sample median (to prove this to yourself, take an example of four observations and construct L_T; you can then extend this argument via a pseudo-induction method). Heuristically, the derivative of the loss function is the number of Y_i less than µ minus the number of Y_i greater than or equal to µ; using this reasoning gives another argument as to why the sample median minimises the loss. Note that it is a little ambiguous what the minimum is when T is even.

In summary, the normal distribution gives rise to the l_2-loss function and the sample mean. In contrast, the Laplacian gives rise to the l_1-loss function and the sample median.

Consider the generalisation of the Laplacian, usually called the asymmetric Laplacian, which is proportional to

f(y; θ, p) ∝ { exp((1 − p)(y − µ)/θ)   y < µ ;  exp(−p(y − µ)/θ)   y ≥ µ },

where 0 < p < 1. The corresponding negative log-likelihood is proportional to the loss function

L_T(µ) = Σ_{Y_i > µ} p(Y_i − µ) + Σ_{Y_i ≤ µ} (1 − p)(µ − Y_i).

Using similar arguments to those in part i, it can be shown that the minimum of L_T is approximately the pth sample quantile.

16.2 Estimating Functions

See also Section 7.2 in Davison (2002) and Lars Peter Hansen (1982, Econometrica). Estimating functions give a unification and generalisation of maximum likelihood methods and the method of moments. It should be noted that they are a close cousin of the generalised method of moments and the generalised estimating equation. We first consider a few examples, and will later describe a feature common to all these examples.
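Before turning to the examples, the two quantile claims above (the l_1 loss is minimised by the sample median, and the asymmetric version by approximately the pth sample quantile) are easy to check numerically. The grid search below is purely illustrative (my own code, not a method from the notes):

```python
import numpy as np

def check_loss(theta, y, p):
    """Asymmetric l1 loss: p(y - theta) for y > theta, (1-p)(theta - y) otherwise."""
    u = y - theta
    return np.sum(np.where(u > 0, p * u, (p - 1) * u))

rng = np.random.default_rng(0)
y = rng.exponential(1.0, 1001)            # odd sample size => unique median
grid = np.linspace(y.min(), y.max(), 20001)

# p = 1/2: this is (half) the l1 loss sum_i |Y_i - theta|
losses = np.array([check_loss(t, y, 0.5) for t in grid])
theta_hat = grid[np.argmin(losses)]       # grid minimiser: the sample median
```

The same search with p = 0.9 lands (up to grid resolution) on the 0.9 sample quantile.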
163 Example i Let us suppose that {Y t } are iid random variables with Y t Nµ,σ 2. The log-likelihood in proportional to L T µ,σ 2 = 1 2 logσ2 1 2σ 2 X t µ 2. We know that to estimate µ and σ 2 we use the µ and σ 2 which are the solution of 1 2σ σ 4 X t µ 2 = 0 1 2σ 2 X t µ 2 = ii In general suppose {Y t } are iid random variables with Y i f ;θ. The log-likelihood is L T θ = T logfθ;y t. If the regularity conditions are satisfied then to estimate θ we use the solution of L T θ = iii Let us suppose that {Y t } are iid random variables with a Weibull distribution fx;θ = α φ x φ α exp x/φ α, where α,φ > 0. We know that see Section 9.1, Example EX = φγ1+α 1 and EX 2 = φ 2 Γ1+ 2α 1. Therefore EX φγ1 + α 1 = 0 and EX 2 φ 2 Γ1 + 2α 1 = 0. Hence by solving 1 T X t φγ1+α 1 = 0 1 T Xt 2 φ 2 Γ1+2α 1 = 0, 16.3 we obtain estimators of α and Γ. This is essentially a method of moments estimator of the parameters in a Weibull distribution. iv We can generalise the above. It can be shown that EX r = φ r Γ1+rα 1. Therefore, for any distinct s and r we can estimate α and Γ using the solution of 1 T Xt r φ r Γ1+rα 1 = 0 1 T Xt s φ s Γ1+sα 1 = v Consider the simple linear regression Y t = αx t +ε t, with Eε t = 0 and varε t = 1, the least squares estimator of α is the solution of 1 T Y t ax t x t =
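The pair of moment equations in (16.3) can be solved numerically. The sketch below is my own implementation (the bisection bracket and sample sizes are illustrative choices): eliminating φ via the ratio of the two moments leaves an equation in α alone, since E(X) = φΓ(1 + α^{−1}) and E(X²) = φ²Γ(1 + 2α^{−1}) give E(X²)/E(X)² = Γ(1 + 2α^{−1})/Γ(1 + α^{−1})², which is monotonically decreasing in α. Note that Python's random.weibullvariate(scale, shape) matches the density in (iii) with φ = scale and α = shape.

```python
import math
import random

def weibull_mom(x):
    """Method-of-moments estimator for the Weibull, solving (16.3)."""
    m1 = sum(x) / len(x)
    m2 = sum(v * v for v in x) / len(x)
    ratio = m2 / m1 ** 2          # depends on the shape alpha alone

    def g(a):                     # decreasing in a; equals ratio at true alpha
        return math.gamma(1 + 2 / a) / math.gamma(1 + 1 / a) ** 2

    lo, hi = 0.1, 50.0
    for _ in range(200):          # bisection on g(a) = ratio
        mid = 0.5 * (lo + hi)
        if g(mid) > ratio:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    phi = m1 / math.gamma(1 + 1 / alpha)
    return alpha, phi

random.seed(0)
# draw from the Weibull with scale phi = 2.0 and shape alpha = 1.5
sample = [random.weibullvariate(2.0, 1.5) for _ in range(20000)]
alpha_hat, phi_hat = weibull_mom(sample)
```

With 20000 observations the estimates land close to the true (α, φ) = (1.5, 2.0), illustrating that the solution of the unbiased estimating equations is consistent.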
164 We observe that all the above estimators can be written as the solution of a homogenous equations - see equations 16.1, 16.2, 16.3, 16.4 and In other words, for each case we can define a random function G T θ, such that the above estimators are the solutions of G T θ T = 0. In the case that {Y t } are iid then G T θ = T gy t;θ, for some function gy t ;θ. The function G T θ is called an estimating function. All the function G T, defined above, satisfy the unbiased property which we define below. Definition An estimating function G T is called unbiased if at the true parameter θ 0 G T satisfies E G T θ 0 = 0. Hence the estimating function is alternative way of viewing a parameter estimator. Until now, parameter estimators have been defined in terms of the maximum of the likelihood. However, an alternative method for defining an estimator is as the solution of a function. For example, suppose that {Y t } are random variables, whose distribution depends in some way on the parameter θ 0. We want to estimate θ 0, and we know that there exists a function such that Gθ 0 = 0. Therefore using the data {Y t } we can define a random function, G T where EG T θ = Gθ. Hence we can use the parameter θ T, which satisfies G T θ = 0, as an estimator of θ. We observe that such estimators include most maximum likelihood estimators and method of moment estimators. Example Based on the examples above we see that i The estimating function is G T µ,σ = ii The estimating function is G T θ = L Tθ. 1 2σ σ 4 T X t µ 2 1 2σ 2 T X t µ 2. iii The estimating function is 1 T T G T α,φ = X t φγ1+α 1 1 T T X2 t φ 2 Γ1+2α 1 iv The estimating function is 1 T T G T α,φ = Xs t φ s Γ1+sα 1 1 T T Xr t φ r Γ1+rα
165 v The estimating function is

G_T(a) = (1/T) Σ_{t=1}^{T} (Y_t − a x_t) x_t.

The advantage of this approach is that sometimes the solution of an estimating equation will have a smaller finite sample variance than the MLE (even though, under certain conditions, the MLE will asymptotically attain the Cramer-Rao bound, which is the smallest variance). Moreover, MLE estimators are based on the assumption that the distribution is known (else the estimator is misspecified - see Section 9.1.1); however, sometimes an estimating equation can be free of such assumptions. Recall that in Example v,

E[(1/T) Σ_{t=1}^{T} (Y_t − αx_t) x_t] = 0,   (16.6)

is true regardless of the distribution of {ε_t}, and is also true if the {Y_t} are dependent random variables. We recall that Example v is the least squares estimator, which is a consistent estimator of the parameters in a variety of settings (see Rao (1973), Linear Statistical Inference and its Applications, and your STAT612 notes).

Example In many statistical situations it is relatively straightforward to find a suitable estimating function, rather than the likelihood. Consider the time series {X_t} which satisfies

X_t = a_1 X_{t−1} + a_2 X_{t−2} + σ(a_1, a_2) ε_t,

where {ε_t} are iid random variables. We do not know the distribution of ε_t, but because ε_t is independent of X_{t−1} and X_{t−2}, by multiplying the above equation by X_{t−1} and X_{t−2} in turn and taking expectations we have

E(X_t X_{t−1}) = a_1 E(X²_{t−1}) + a_2 E(X_{t−1} X_{t−2})
E(X_t X_{t−2}) = a_1 E(X_{t−1} X_{t−2}) + a_2 E(X²_{t−2}).

Since the above time series is stationary (we have not formally defined this, but basically it means the properties of {X_t} do not vary over time), it can be shown that (1/T) Σ_t X_t X_{t−r} is an estimator of E(X_t X_{t−r}) and that E[(1/T) Σ_t X_t X_{t−r}] = E(X_t X_{t−r}).
Hence, replacing the expectations above with their estimators, we obtain the estimating equations

    G_1(a_1,a_2) = (1/T) Σ_t X_t X_{t−1} − a_1 (1/T) Σ_t X²_{t−1} − a_2 (1/T) Σ_t X_{t−1} X_{t−2}
    G_2(a_1,a_2) = (1/T) Σ_t X_t X_{t−2} − a_1 (1/T) Σ_t X_{t−1} X_{t−2} − a_2 (1/T) Σ_t X²_{t−2}.

We now show that under certain conditions θ̂_T is a consistent estimator of θ_0.
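As an aside, the two estimating equations above are linear in (a_1, a_2) - they are the Yule-Walker equations - so once the sample moments are computed they can be solved directly. The following is a minimal numerical sketch on simulated data (all values and variable names are illustrative, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a stationary AR(2): X_t = a1 X_{t-1} + a2 X_{t-2} + sigma * eps_t
a1, a2, sigma, T = 0.5, -0.3, 1.0, 20000
n = T + 200                          # extra observations as burn-in
X = np.zeros(n)
eps = rng.standard_normal(n)
for t in range(2, n):
    X[t] = a1 * X[t - 1] + a2 * X[t - 2] + sigma * eps[t]
X = X[200:]                          # discard burn-in

# Sample moments (1/T) sum_t X_t X_{t-r}, the empirical versions of E[X_t X_{t-r}]
def acov(r):
    return np.mean(X[r:] * X[:T - r]) if r > 0 else np.mean(X * X)

c0, c1, c2 = acov(0), acov(1), acov(2)

# G_1 = G_2 = 0 is the 2x2 linear (Yule-Walker) system:
#   c1 = a1 c0 + a2 c1,   c2 = a1 c1 + a2 c0
A = np.array([[c0, c1], [c1, c0]])
a_hat = np.linalg.solve(A, np.array([c1, c2]))
print(a_hat)   # close to (0.5, -0.3)
```

No distributional assumption on ε_t was used, which is precisely the appeal of the estimating-function approach here.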
Theorem Suppose that G_T(θ) is an unbiased estimating function, where G_T(θ̂_T) = 0 and E[G_T(θ_0)] = 0.

(i) If θ is a scalar, for every T, G_T(θ) is a continuous, monotonically decreasing function in θ, and G_T(θ) →P E[G_T(θ)] pointwise, then θ̂_T →P θ_0.

(ii) If sup_θ |G_T(θ) − E[G_T(θ)]| →P 0 and E[G_T(θ)] is uniquely zero at θ_0, then θ̂_T →P θ_0.

PROOF. The proof of case (i) is relatively straightforward (see also page 318 in Davison (2002)), and is best understood by making a plot of G_T(θ). We first note that for any fixed ε > 0,

    G_T(θ_0 − ε) →P E[G_T(θ_0 − ε)] > 0,   (16.7)

where the expectation is positive because E[G_T(·)] is monotonically decreasing for all T and E[G_T(θ_0)] = 0. Now, since G_T(θ) is monotonically decreasing, θ̂_T ≤ θ_0 − ε implies G_T(θ̂_T) − G_T(θ_0 − ε) ≥ 0 and vice versa, hence

    P( θ̂_T − θ_0 ≤ −ε ) = P( G_T(θ̂_T) − G_T(θ_0 − ε) ≥ 0 ) = P( G_T(θ_0 − ε) ≤ 0 ),

since G_T(θ̂_T) = 0. But by (16.7), G_T(θ_0 − ε) →P E[G_T(θ_0 − ε)] > 0, thus P(G_T(θ_0 − ε) ≤ 0) → 0 and P(θ̂_T − θ_0 ≤ −ε) → 0 as T → ∞. A similar argument can be used to show that P(θ̂_T − θ_0 ≥ ε) → 0 as T → ∞. As the above is true for all ε > 0, together they imply that θ̂_T →P θ_0 as T → ∞.

The proof of (ii) is more involved, but essentially follows the lines of the proof of the corresponding consistency theorem (an exercise for those wanting to do some more theory).

We now show normality, which will give us the variance of the limiting distribution of θ̂_T.

Theorem Let us suppose that {Y_t} are iid random variables and G_T(θ) = (1/T) Σ_{t=1}^T g(Y_t;θ). Suppose that θ̂_T →P θ_0 and that the first and second derivatives of G_T have a finite expectation (we will assume that θ is a scalar to simplify notation). Then we have, as T → ∞,

    √T (θ̂_T − θ_0) →D N( 0,  var(g(Y_t;θ_0)) / [ E( ∂g(Y_t;θ)/∂θ |_{θ_0} ) ]² ).
PROOF. We use the standard Taylor expansion to prove the result (which you should be expert in by now). A Taylor expansion gives

    G_T(θ̂_T) = G_T(θ_0) + (θ̂_T − θ_0) G_T'(θ̄_T)   (16.8)
    ⟹ √T (θ̂_T − θ_0) = −[ E(G_T'(θ))|_{θ_0} ]⁻¹ √T G_T(θ_0) + o_p(1),

where θ̄_T lies between θ_0 and θ̂_T, and we have used that G_T'(θ̄_T) →P E(G_T'(θ))|_{θ_0} as T → ∞. Now, since √T G_T(θ_0) = T^{−1/2} Σ_t g(Y_t;θ_0) is a normalised sum of iid random variables, we have

    √T G_T(θ_0) →D N( 0, var(g(Y_t;θ_0)) ),   (16.9)

since E[g(Y_t;θ_0)] = 0. Therefore (16.8) and (16.9) together give

    √T (θ̂_T − θ_0) →D N( 0,  var(g(Y_t;θ_0)) / [ E( ∂g(Y_t;θ)/∂θ |_{θ_0} ) ]² ),

as required.

Remark In general, if {Y_t} are not iid random variables, then under certain conditions we have

    √T (θ̂_T − θ_0) →D N( 0,  T var(G_T(θ_0)) / [ E( ∂G_T(θ)/∂θ |_{θ_0} ) ]² ).

Example (The Huber estimator) We describe the Huber estimator, which is a well-known estimator of the mean that is robust to outliers. The estimator can be written as the solution of an estimating function. Let us suppose that {Y_t} are iid random variables with mean θ, and density function which is symmetric about the mean θ. So that outliers do not affect the estimator, a robust method of estimation is to truncate the outliers and define the function

    g_c(Y_t;θ) = −c        if Y_t < θ − c
               = Y_t − θ   if θ − c ≤ Y_t ≤ θ + c
               = c         if Y_t > θ + c.

The estimating equation is G_{c,T}(θ) = Σ_t g_c(Y_t;θ), and we use as an estimator of θ the θ̂_T which solves G_{c,T}(θ̂_T) = 0.
(i) In the case that c = ∞, we observe that G_{∞,T}(θ) = Σ_t (Y_t − θ), and the estimator is θ̂_T = Ȳ. Hence without truncation, the estimator of the mean is the sample mean.

(ii) In the case that c is small, then we have truncated many observations.

It is clear that var(g(Y_t;θ_0)) / [E(∂g(Y_t;θ)/∂θ |_{θ_0})]² ≥ I(θ_0)⁻¹, where I(θ) is the Fisher information. Hence for extremely large sample sizes, the MLE is the best estimator.

16.2.1 Optimal estimating functions

As illustrated in Examples (iii) and (iv), there are several different estimators of the same parameters. A natural question is: which estimator does one use? We can answer this question by using the asymptotic normality result above. Roughly speaking, the theorem implies (though this is not true in the strictest sense) that

    var(θ̂_T) ≈ (1/T) var(g(Y_t;θ_0)) / [ E( ∂g(Y_t;θ)/∂θ |_{θ_0} ) ]².

In the case that θ is a multivariate vector, the above generalises to

    var(θ̂_T) ≈ (1/T) [ E(∂g(Y_t;θ)/∂θ) ]⁻¹ var(g(Y_t;θ_0)) [ E(∂g(Y_t;θ)/∂θ) ]⁻ᵀ.

Hence, we should choose the estimating function which leads to the smallest variance.

Example Consider Example (iv), where we are estimating the Weibull parameters φ and α. Different s and r give different estimating equations, but all estimate the same parameters. To decide which s and r to use, we calculate the asymptotic variance for different s and r and choose the pair with the smallest variance. Substituting the matrix of expected derivatives,

    E[ ∂g(Y_t;α,φ)/∂(φ,α) ] = ( −sφ^{s−1}Γ(1+sα⁻¹)   sφ^s α⁻²Γ'(1+sα⁻¹)
                                −rφ^{r−1}Γ(1+rα⁻¹)   rφ^r α⁻²Γ'(1+rα⁻¹) ),

and the variance of the sample moments,

    (1/T) ( var(X_t^s)          cov(X_t^s, X_t^r)
            cov(X_t^s, X_t^r)   var(X_t^r)        ),

into the expression above, and using that E[X_t^r] = φ^r Γ(1+rα⁻¹), we can calculate the variance for different r and s and choose the estimating function with the smallest variance.

We now consider an example which is a close precursor to GEE (generalised estimating equations). Usually GEE is a method for estimating the parameters in generalised linear models where there is dependence in the data.
Example Suppose that {Y_t} are independent random variables with mean {µ_t(θ_0)} and variance {V_t(θ_0)}, where the parametric forms of {µ_t(·)} and {V_t(·)} are known but θ_0 is unknown. A possible estimating function is

    G_T(θ) = Σ_t ( Y_t − µ_t(θ) ),

where we note that E[G_T(θ_0)] = 0. However, this function does not take the variance into account; instead, let us consider the general weighted function

    G_T^W(θ) = Σ_t w_t(θ)( Y_t − µ_t(θ) ).

Again we observe that E[G_T^W(θ_0)] = 0. But we need to know which weights w_t(θ) to use; hence we choose the weights which minimise the asymptotic variance. Since {Y_t} are independent, we observe that

    var( G_T^W(θ_0) ) = Σ_t w_t(θ_0)² V_t(θ_0)

and

    E[ ∂G_T^W(θ)/∂θ |_{θ_0} ] = E[ Σ_t { w_t'(θ_0)(Y_t − µ_t(θ_0)) − w_t(θ_0) µ_t'(θ_0) } ] = −Σ_t w_t(θ_0) µ_t'(θ_0).

Hence we have

    var(θ̂_T) ≈ Σ_t w_t(θ_0)² V_t(θ_0) / ( Σ_t w_t(θ_0) µ_t'(θ_0) )².

Now we want to choose the weights, and thus the estimating function, with the smallest variance; therefore we look for weights which minimise the above. Since the above is a ratio, and shrinking a weight w_t(θ) shrinks both the numerator and the denominator, we do the minimisation using Lagrange multipliers. That is, set Σ_t w_t(θ_0)µ_t'(θ_0) = 1 and minimise

    Σ_t w_t(θ_0)² V_t(θ_0) + λ( Σ_t w_t(θ_0)µ_t'(θ_0) − 1 )

with respect to {w_t(θ_0)} and λ. Partially differentiating the above with respect to w_t(θ_0) and λ and setting to zero gives 2w_t(θ_0)V_t(θ_0) + λµ_t'(θ_0) = 0, so that w_t(θ_0) = −λµ_t'(θ_0)/(2V_t(θ_0)); that is, the optimal weights are proportional to µ_t'(θ_0)/V_t(θ_0). Hence the optimal estimating function is

    G_T(θ) = Σ_t ( µ_t'(θ)/V_t(θ) ) ( Y_t − µ_t(θ) ).
Observe that the above equation resembles the derivative of the weighted least squares criterion

    Σ_t ( Y_t − µ_t(θ) )² / V_t(θ).

An important difference, however, is that the term involving the derivative of V_t(θ) has been ignored.

Remark We conclude this section by mentioning that one generalisation of estimating equations is the generalised method of moments (GMM). We observe the random vectors {Y_t}, and it is known that there exists a function g(·;θ) such that E[g(Y_t;θ_0)] = 0. To estimate θ_0, rather than find the solution of (1/T) Σ_t g(Y_t;θ) = 0, a matrix M_T is defined, and the parameter which minimises

    ( (1/T) Σ_t g(Y_t;θ) )' M_T ( (1/T) Σ_t g(Y_t;θ) )

is used as an estimator of θ_0.
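As a small illustration of the GMM idea, the sketch below estimates the mean of a Poisson distribution from the two moment conditions E[Y] − θ = 0 and E[Y²] − θ − θ² = 0, taking M_T to be the identity matrix and using a simple grid search for the minimiser. The Poisson model, parameter values and names are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0
Y = rng.poisson(theta0, size=5000)

# Precompute the sample moments appearing in (1/T) sum_t g(Y_t; theta)
m1 = Y.mean()
m2 = (Y ** 2).mean()

def gbar(theta):
    # Two moment conditions for Poisson(theta): E[Y] = theta, E[Y^2] = theta + theta^2
    return np.array([m1 - theta, m2 - theta - theta ** 2])

M = np.eye(2)    # identity weighting matrix M_T

def objective(theta):
    g = gbar(theta)
    return g @ M @ g

# One scalar parameter, so a grid search over a bracket suffices for this sketch
grid = np.linspace(0.5, 4.0, 3501)
theta_hat = grid[int(np.argmin([objective(th) for th in grid]))]
print(theta_hat)   # close to 2.0
```

In practice M_T is usually chosen to estimate the inverse variance of the moment conditions (the "efficient" weighting), but the identity matrix already gives a consistent estimator.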
Chapter 17

Generalised Linear Models

17.1 Generalised Linear Models

Generalised linear models (GLMs) are a generalisation of ordinary least squares regression. See also Davison, Green (1984) and Dobson and Barnett. To motivate the GLM approach, let us briefly review linear models.

An overview of linear models

Let us consider the two competing nested linear models

    Restricted model:  Y_t = β_0 + Σ_{j=1}^q β_j x_{t,j} + ε_t,
    Full model:        Y_t = β_0 + Σ_{j=1}^q β_j x_{t,j} + Σ_{j=q+1}^p β_j x_{t,j} + ε_t,   (17.1)

where {ε_t} are iid random variables with mean zero and variance σ². Let us suppose that we observe {Y_t, x_{t,j}}_{t=1}^T, where {Y_t} are normal. The classical method for testing H_0: the restricted model against H_A: the full model is the F-test (ANOVA). That is, let S_R² denote the residual sum of squares under the null, S_F² the residual sum of squares under the alternative, and σ̂_F² = S_F²/(T−p). Then the F-statistic is

    F = ( (S_R² − S_F²)/(p − q) ) / σ̂_F²,
where

    S_F² = Σ_{t=1}^T ( Y_t − Σ_{j=1}^p β̂_j^F x_{t,j} )²,   σ̂_F² = S_F²/(T − p),   S_R² = Σ_{t=1}^T ( Y_t − Σ_{j=1}^q β̂_j^R x_{t,j} )²,

and under the null F ~ F_{p−q, T−p}. Moreover, if the sample size is large, (p−q)F →D χ²_{p−q}. We recall that the residuals of the full model are r_t = Y_t − β̂_0 − Σ_{j=1}^q β̂_j x_{t,j} − Σ_{j=q+1}^p β̂_j x_{t,j}, and the residual sum of squares S_F² is used to measure how well the linear model fits the data (see Chapter 8 and the STAT612 notes).

The F-test and ANOVA are designed specifically for linear models. In this section the aim is to generalise model specification, estimation, testing and residuals to a larger class of models. To generalise, we will be using a log-likelihood framework. To see how this fits in with linear regression, let us now see how ANOVA and the log-likelihood ratio test are related. Suppose that σ² is known; then the log-likelihood ratio test statistic for the above hypothesis is

    (1/σ²)( S_R² − S_F² ) ~ χ²_{p−q},

where we note that since {ε_t} is Gaussian, this is the exact distribution and not an asymptotic result. In the case that σ² is unknown and has to be replaced by its estimator σ̂_F², then we can either use the approximation

    ( S_R² − S_F² ) / σ̂_F² →D χ²_{p−q},   as T → ∞,

or the exact distribution

    ( (S_R² − S_F²)/(p − q) ) / σ̂_F² ~ F_{p−q, T−p},

which returns us to the F-statistic.
On the other hand, if the variance σ² is unknown we can go the log-likelihood ratio route. The log-likelihood ratio test statistic reduces to

    T log( S_R²/S_F² ) = T log( 1 + (S_R² − S_F²)/S_F² ) →D χ²_{p−q}.

We recall that by using the expansion log(1+x) = x + O(x²) we obtain

    T log( S_R²/S_F² ) = T ( S_R² − S_F² )/S_F² + o_p(1).

Now we know the above is approximately χ²_{p−q}. But it is straightforward to see that by dividing by (p−q) and replacing T with (T−p) we have

    ( (T−p)/(p−q) ) log( S_R²/S_F² ) = ( (S_R² − S_F²)/(p−q) ) / σ̂_F² + o_p(1) = F + o_p(1).

Hence we have transformed the log-likelihood ratio test into the F-test, which we discussed at the start of this section. The ANOVA and log-likelihood methods are asymptotically equivalent. In the case that {ε_t} are non-Gaussian, but the model is linear with iid errors, the above results also hold. However, in the case that the regressors have a nonlinear influence on the response and/or the response is not normal, we need to take an alternative approach. Throughout this section we will encounter such models. We will start by focussing on the following two problems:

(i) How to model the relationship between the response and the regressors when the response is non-Gaussian and the model is nonlinear.

(ii) How to generalise ANOVA to nonlinear models.

Motivation

Let us suppose {Y_t} are independent random variables where it is believed that the regressors x_t (x_t is a p-dimensional vector) have an influence on {Y_t}. Let us suppose that Y_t is a binary random variable taking either zero or one, and E(Y_t) = P(Y_t = 1) = π_t. How do we model the relationship between Y_t and x_t? A simple approach is to use a linear model, i.e. let E(Y_t) = β'x_t. But a major problem with this approach is that E(Y_t) is a probability, and for many values of β, β'x_t will lie outside the unit interval - hence a linear model is not
meaningful. However, we can make a nonlinear transformation which maps the linear predictor into the unit interval. For example, let

    E(Y_t) = π_t = exp(β'x_t)/(1 + exp(β'x_t));

this transformation lies between zero and one. Hence we could just use nonlinear regression to estimate the parameters. That is, rewrite the model as

    Y_t = exp(β'x_t)/(1 + exp(β'x_t)) + ε_t,

and use the estimator

    β̂_T = arg min_β Σ_t ( Y_t − exp(β'x_t)/(1 + exp(β'x_t)) )²

as an estimator of β. However, this approach also has its drawbacks. One obvious drawback is that the variance var(ε_t) = var(Y_t) = π_t(1−π_t) depends on the parameters. Hence, alternative methods are required.

The GLM approach is a general framework for a wide class of distributions. We recall that in Section 3.1 we considered maximum likelihood estimation for iid random variables which come from the natural exponential family. Distributions in this family include the normal, binary, binomial and Poisson, amongst others. We recall that the natural exponential family has the form

    f(y;θ) = exp( yθ − κ(θ) + c(y) ),

where κ(θ) = b(η⁻¹(θ)). To be a little more general, we will suppose that the distribution can be written as

    f(y;θ) = exp( (yθ − κ(θ))/φ + c(y,φ) ),   (17.2)

where φ is a nuisance parameter called the dispersion parameter (it plays the role of the variance in linear models) and θ is the parameter of interest. We recall that examples of exponential family models include:

(i) The exponential distribution is already in natural exponential form, with θ = −λ and φ = 1; the log density is log f(y;λ) = −λy + log λ.

(ii) For the binomial distribution we let θ = log(π/(1−π)) and φ = 1; since log(π/(1−π)) is invertible, this gives

    log f(y;θ) = yθ − n log(1 + e^θ) + log C(n,y),

where C(n,y) denotes the binomial coefficient.
(iii) For the normal distribution we have

    log f(y;µ,σ²) = −(y−µ)²/(2σ²) − (1/2)log(2πσ²) = (µy − µ²/2)/σ² − y²/(2σ²) − (1/2)log(2πσ²).

Hence φ = σ², θ = µ, κ(θ) = θ²/2 and c(y,φ) = −y²/(2φ) − (1/2)log(2πφ).

(iv) The Poisson log density can be written as

    log f(y;µ) = y log µ − µ − log y!,

hence θ = log µ, κ(θ) = exp(θ) and c(y) = −log y!.

(v) Other members of this family include the Gamma, Beta, Multinomial and inverse Gaussian distributions.

Remark (Moments) Using the lemma in Section 3.1 we have E(Y) = κ'(θ) and var(Y) = φκ''(θ).

GLM is a method which generalises the methods in linear models to the exponential family (recall that the normal model is a subclass of the exponential family). In the GLM setting it is usually assumed that the response variables {Y_t} have a density which comes from the natural exponential family, with log density

    log f(y_t;θ_t) = (y_t θ_t − κ(θ_t))/φ + c(y_t,φ),

where the parameter θ_t depends on the regressors. The regressors influence the response through a linear predictor η_t = β'x_t and a link function, which connects β'x_t to the mean E(Y_t) = µ(θ_t) = κ'(θ_t). More precisely, there is a monotonic link function g which is defined such that g(µ(θ_t)) = η_t = β'x_t. To summarise:

    κ'(θ_t) = µ(θ_t) = E(Y_t) = g⁻¹(η_t),
    θ_t = µ⁻¹( g⁻¹(η_t) ) = θ(η_t),
    var(Y_t) = φ dµ(θ_t)/dθ_t.

The log-likelihood of {Y_t} is

    L_T(β) = Σ_t { ( Y_t θ(η_t) − κ(θ(η_t)) )/φ + c(Y_t,φ) }.

The choice of link function is rather subjective. One of the most popular is the canonical link, which we define below.
Definition (The canonical link function) The canonical link is obtained by letting η_t = θ_t. Hence µ_t = κ'(η_t), and the link g satisfies g(κ'(θ_t)) = g(κ'(η_t)) = η_t.

The canonical link is usually used because it makes the calculations simple. We observe that with the canonical link the log-likelihood of {Y_t} is

    L_T(β) = Σ_t { ( Y_t β'x_t − κ(β'x_t) )/φ + c(Y_t,φ) },

which, as we will show, is relatively simple to maximise.

Example (The log-likelihood and canonical link function)

(i) The canonical link for the exponential f(y_t;λ_t) = λ_t exp(−λ_t y_t) is (up to sign) θ_t = λ_t = β'x_t. The log-likelihood is

    Σ_t ( log(β'x_t) − Y_t β'x_t ).

(ii) The canonical link for the binomial is θ_t = β'x_t = log( π_t/(1−π_t) ), hence π_t = exp(β'x_t)/(1+exp(β'x_t)). The log-likelihood is

    Σ_t ( Y_t β'x_t − n_t log(1 + exp(β'x_t)) + log C(n_t, Y_t) ).

(iii) The canonical link for the normal is θ_t = β'x_t = µ_t. The log-likelihood is

    −Σ_t ( (Y_t − β'x_t)²/(2σ²) + (1/2)log(2πσ²) ),

which is the usual least squares criterion. If the canonical link were not used, we would be in the nonlinear least squares setting; that is, the log-likelihood is

    −Σ_t ( (Y_t − g⁻¹(β'x_t))²/(2σ²) + (1/2)log(2πσ²) ).

(iv) The canonical link for the Poisson is θ_t = β'x_t = log λ_t, hence λ_t = exp(β'x_t). The log-likelihood is

    Σ_t ( Y_t β'x_t − exp(β'x_t) − log Y_t! ).

However, as mentioned above, the canonical link is simply used for its mathematical simplicity. There exist other links, which can often be more suitable.
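As a quick numerical check of the canonical-link algebra, the sketch below evaluates the Bernoulli (binomial with n_t = 1) log-likelihood both in canonical form, Σ_t { Y_t θ_t − κ(θ_t) } with θ_t = β'x_t and κ(θ) = log(1 + e^θ), and directly from the probabilities π_t; the two agree. All data and parameter values are simulated and illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Bernoulli responses with the canonical (logistic) link: theta_t = beta'x_t
beta = np.array([-0.5, 1.0])
x = np.column_stack([np.ones(8), rng.standard_normal(8)])   # design matrix with intercept
eta = x @ beta                                              # linear predictor eta_t = beta'x_t
pi = np.exp(eta) / (1 + np.exp(eta))                        # inverse link

y = rng.binomial(1, pi)

# Canonical form: sum_t { y_t theta_t - kappa(theta_t) },  kappa(theta) = log(1 + e^theta)
loglik_canonical = np.sum(y * eta - np.log1p(np.exp(eta)))

# Direct Bernoulli log-likelihood, for comparison
loglik_direct = np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

print(loglik_canonical, loglik_direct)   # the two values agree
```

The agreement is exact because log π_t = η_t − log(1+e^{η_t}) and log(1−π_t) = −log(1+e^{η_t}).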
Remark (Link functions for the binomial) We recall that the link function is defined as a monotonic function g, where η_t = β'x_t = g(µ_t). The choice of link is up to you. A popular choice is to let g⁻¹ be a well-known distribution function. The motivation for this is that for the binomial distribution µ_t = n_t π_t, where π_t is the probability of a success. Clearly 0 ≤ π_t ≤ 1, hence using a distribution function (or survival function) for g⁻¹ makes sense. Examples include:

(i) The logistic link (this is the canonical link function), where β'x_t = g(µ_t) = log( π_t/(1−π_t) ) = log( µ_t/(n_t − µ_t) ).

(ii) The probit link, where π_t = Φ(β'x_t), Φ is the standard normal distribution function and the link function is β'x_t = g(µ_t) = Φ⁻¹(µ_t/n_t).

(iii) The extreme value link, where the distribution function is F(x) = 1 − exp(−exp(x)). Hence in this case the link function is β'x_t = g(µ_t) = log( −log(1 − µ_t/n_t) ).

Estimating the parameters in a GLM

The score function for GLM

The score function for generalised linear models has a very interesting form, which we will now derive. From now on, we will suppose that φ_t ≡ φ for all t, and that φ is known (though even in the case that it is unknown, this will not change the estimation scheme used). Much of the theory remains true without this restriction, but it makes the derivations a bit cleaner, and is enough for all the models we will encounter.

With this assumption, recall that the log-likelihood is

    L_T(β,φ) = Σ_t { ( Y_t θ_t − κ(θ_t) )/φ + c(Y_t,φ) } = Σ_t l_t,

where

    l_t(β,φ) = ( Y_t θ_t − κ(θ_t) )/φ + c(Y_t,φ),

and θ_t is a function of β through the relationship g(κ'(θ_t)) = η_t = β'x_t. For the MLE of β, we need to solve the likelihood equations

    ∂L_T/∂β_j = Σ_t ∂l_t/∂β_j = 0   for all j = 1,...,p.

To obtain an interesting expression for the above, recall that var(Y_t) = φµ'(θ_t) and η_t = g(µ_t),
and let µ'(θ_t) = V(µ_t). Since V(µ_t) = dµ_t/dθ_t, inverting the derivative gives dθ_t/dµ_t = 1/V(µ_t). Furthermore, since dη_t/dµ_t = g'(µ_t), inverting the derivative gives dµ_t/dη_t = 1/g'(µ_t). By the chain rule for differentiation and using the above, we have

    ∂l_t/∂β_j = (dl_t/dη_t)(∂η_t/∂β_j)   (17.4)
              = (dl_t/dθ_t)(dθ_t/dµ_t)(dµ_t/dη_t)(∂η_t/∂β_j)
              = ( (Y_t − κ'(θ_t))/φ ) · (1/V(µ_t)) · (1/g'(µ_t)) · x_{tj}
              = (Y_t − µ_t) x_{tj} / ( φ V(µ_t) g'(µ_t) ).

Thus the likelihood equations we have to solve for the MLE of β are

    Σ_t (Y_t − µ_t) x_{tj} / ( φ V(µ_t) g'(µ_t) ) = Σ_t ( Y_t − g⁻¹(β'x_t) ) x_{tj} / ( φ V(g⁻¹(β'x_t)) g'(µ_t) ) = 0,   1 ≤ j ≤ p,   (17.5)

since µ_t = g⁻¹(β'x_t). We observe that the above set of equations can be considered as estimating functions (see the previous chapter), since the expectation of each equation is zero. Moreover, it has a very similar structure to the normal equations arising in ordinary least squares.

Example

(i) Normal {Y_t} with mean µ_t = β'x_t. Here we have g(µ_t) = µ_t = β'x_t, so g'(µ_t) = 1; also V(µ_t) ≡ 1 and φ = σ², so the equations become

    (1/σ²) Σ_t ( Y_t − β'x_t ) x_{tj} = 0.

Ignoring the factor σ⁻², the left-hand side is the jth element of the vector X'(Y − Xβ), so the equations reduce to the normal equations of least squares: X'(Y − Xβ) = 0, or equivalently X'Xβ = X'Y.

(ii) Poisson {Y_t} with log link, so that g(µ_t) = log µ_t and hence µ_t = exp(β'x_t). This time g'(µ_t) = 1/µ_t, var(Y_t) = V(µ_t) = µ_t and φ = 1. Substituting µ_t = exp(β'x_t), the equations become

    Σ_t ( Y_t − e^{β'x_t} ) x_{tj} = 0,

which are not linear in β, and no explicit solution is possible in general.
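Since the Poisson likelihood equations above have no explicit solution, they are solved iteratively. For the canonical log link the observed and expected information coincide, so Newton-Raphson and Fisher scoring are the same scheme; the following is a minimal sketch on simulated data (parameter values and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Poisson regression with the canonical log link: mu_t = exp(beta'x_t)
beta_true = np.array([0.3, 0.8])
T = 5000
X = np.column_stack([np.ones(T), rng.uniform(-1, 1, T)])
Y = rng.poisson(np.exp(X @ beta_true))

# Score: X'(Y - mu).  For the canonical link the information is I(beta) = X' diag(mu) X.
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    score = X.T @ (Y - mu)
    info = X.T @ (X * mu[:, None])
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta)   # close to (0.3, 0.8)
```

Each update here is exactly the iteration derived in the next subsections; convergence typically takes only a handful of steps.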
Remark (The GLM score function and weighted least squares) The GLM score has a very interesting relationship with weighted least squares. First recall that the GLM likelihood equations take the form

    Σ_t (Y_t − µ_t) x_{tj} / ( φ V(µ_t) g'(µ_t) ) = Σ_t ( Y_t − g⁻¹(β'x_t) ) x_{tj} / ( φ V_t g'(µ_t) ) = 0,   1 ≤ j ≤ p.   (17.6)

Next let us construct the weighted least squares criterion. Since E(Y_t) = µ_t and var(Y_t) = φV_t, the weighted least squares criterion corresponding to {Y_t} is

    S_T(β) = Σ_t ( Y_t − µ(θ_t) )² / (φV_t) = Σ_t ( Y_t − g⁻¹(β'x_t) )² / (φV_t).

The weighted least squares criterion S_T is independent of the underlying distribution; it has been constructed using only the first two moments of the random variable. Returning to the weighted least squares estimator, we observe that it is the solution of

    ∂S_T/∂β_j = Σ_t (∂s_t(β)/∂µ_t)(∂µ_t/∂β_j) + Σ_t (∂s_t(β)/∂V_t)(∂V_t/∂β_j) = 0,   1 ≤ j ≤ p,

where s_t(β) = ( Y_t − µ(θ_t) )²/(φV_t). Now let us compare ∂S_T/∂β_j with the estimating equations (17.6) corresponding to the GLM. We observe that (17.6) and the first part of the right-hand side above are the same up to a constant factor, that is

    Σ_t (∂s_t(β)/∂µ_t)(∂µ_t/∂β_j) = −2 Σ_t (Y_t − µ_t) x_{tj} / ( φ V(µ_t) g'(µ_t) ).

In other words, the GLM estimating equations corresponding to the exponential family and the weighted least squares estimating equations are closely related, as are the corresponding estimators. However, it is simpler to solve Σ_t (∂s_t/∂µ_t)(∂µ_t/∂β_j) = 0 than ∂S_T/∂β_j = 0. Interestingly, since at the true β the derivatives are such that

    E[ Σ_t (∂s_t(β)/∂µ_t)(∂µ_t/∂β_j) ] = 0   and   E[ ∂S_T/∂β_j ] = 0,

this implies that the other quantity in the sum is also zero, i.e.

    E[ Σ_t (∂s_t(β)/∂V_t)(∂V_t/∂β_j) ] = 0.
The Newton-Raphson scheme

(See also Green (1984).) It is clear from the examples above that usually there does not exist a simple closed-form solution for the likelihood estimator of β. However, we can use the Newton-Raphson scheme to estimate β. We will derive an interesting expression for the iterative scheme; other than being useful for implementation, it also highlights the estimator's connection to weighted least squares.

We recall that Newton-Raphson consists of the scheme

    β^(i+1) = β^(i) − (H^(i))⁻¹ u^(i),

where the p×1 gradient vector u^(i) is

    u^(i) = ( ∂L_T/∂β_1, ..., ∂L_T/∂β_p )' |_{β=β^(i)}

and the p×p Hessian matrix H^(i) is given by

    H^(i)_jk = ∂²L_T(β)/∂β_j∂β_k |_{β=β^(i)},   for j,k = 1,2,...,p,

both u^(i) and H^(i) being evaluated at the current estimate β^(i). By using (17.4), the score function at the ith iteration is

    u^(i)_j = ∂L_T(β)/∂β_j |_{β=β^(i)} = Σ_t ∂l_t/∂β_j |_{β=β^(i)} = Σ_t ( dl_t/dη_t )|_{β=β^(i)} x_{tj}.

The Hessian at the ith iteration is

    H^(i)_jk = Σ_t ∂²l_t/∂β_j∂β_k |_{β=β^(i)} = Σ_t ( d²l_t/dη_t² )|_{β=β^(i)} x_{tj} x_{tk}.   (17.7)
Thus if we write s^(i) for the T×1 vector with

    s^(i)_t = ( dl_t/dη_t )|_{β=β^(i)},

and W^(i) for the diagonal T×T matrix with

    W^(i)_tt = −( d²l_t/dη_t² )|_{β=β^(i)},

we have u^(i) = X's^(i) and H^(i) = −X'W^(i)X, and the Newton-Raphson iteration is

    β^(i+1) = β^(i) − (H^(i))⁻¹ u^(i) = β^(i) + ( X'W^(i)X )⁻¹ X's^(i).

Fisher scoring for GLMs

Typically, partly for reasons of tradition, we use a modification of this in fitting statistical models. The matrix W^(i) is replaced by W̄^(i), another diagonal T×T matrix with

    W̄^(i)_tt = E[ W^(i)_tt ] |_{β^(i)} = E[ −d²l_t/dη_t² ] |_{β^(i)}.

One reason for taking the expectation is that W̄^(i)_tt is then sure to have non-negative (typically positive) diagonal entries, since (see Section 1.1)

    W̄^(i)_tt = E[ −d²l_t/dη_t² ] |_{β^(i)} = var( dl_t/dη_t ) |_{β^(i)},

so that W̄ = var(s), and the matrix is non-negative definite. Using the Fisher score function, the iteration becomes

    β^(i+1) = β^(i) + ( X'W̄^(i)X )⁻¹ X's^(i).

Iteratively reweighted least squares

The iteration

    β^(i+1) = β^(i) + ( X'W̄^(i)X )⁻¹ X's^(i)   (17.8)

is formally similar to the familiar solution for least squares estimates in linear models,

    β̂ = (X'X)⁻¹X'Y,

or more particularly the related weighted least squares estimates:

    β̂ = (X'WX)⁻¹X'WY.
In fact, (17.8) can be rearranged to have exactly this form. Algebraic manipulation gives

    β^(i) = ( X'W̄^(i)X )⁻¹ X'W̄^(i) X β^(i)

and

    ( X'W̄^(i)X )⁻¹ X's^(i) = ( X'W̄^(i)X )⁻¹ X'W̄^(i) (W̄^(i))⁻¹ s^(i).

Therefore, substituting the above into (17.8) gives

    β^(i+1) = ( X'W̄^(i)X )⁻¹ X'W̄^(i) Xβ^(i) + ( X'W̄^(i)X )⁻¹ X'W̄^(i) (W̄^(i))⁻¹ s^(i)
            = ( X'W̄^(i)X )⁻¹ X'W̄^(i) ( Xβ^(i) + (W̄^(i))⁻¹ s^(i) )
            := ( X'W̄^(i)X )⁻¹ X'W̄^(i) Z^(i).

One reason that the above equation is of interest is that it has the form of weighted least squares. More precisely, it has the form of a weighted least squares regression of Z^(i) on X with the diagonal weight matrix W̄^(i). That is, let z^(i)_t denote the tth element of the vector Z^(i); then β^(i+1) minimises the weighted least squares criterion

    Σ_t W̄^(i)_tt ( z^(i)_t − β'x_t )².

Of course, in reality W̄^(i)_tt and z^(i)_t are functions of β^(i), hence the above is often called iteratively reweighted least squares.

Estimation of the dispersion parameter φ

Recall that in the linear model case, the variance σ² did not affect the estimation of β. In the general GLM case, continuing to assume that φ_t ≡ φ, we have

    s_t = dl_t/dη_t = (dl_t/dθ_t)(dθ_t/dµ_t)(dµ_t/dη_t) = ( Y_t − µ_t ) / ( φ V(µ_t) g'(µ_t) )

and

    W̄_tt = var(s_t) = var(Y_t) / { φ V(µ_t) g'(µ_t) }² = φV(µ_t) / { φ V(µ_t) g'(µ_t) }² = 1 / ( φ V(µ_t) g'(µ_t)² ),

so that 1/φ appears as a scale factor in W̄ and s, but otherwise does not appear in the estimating equations or iteration for β. Hence φ does not play a role in the estimation of β. As in the normal/linear case, (a) we are less interested in φ, and (b) we see that φ can be estimated separately from β. Recall that var(Y_t) = φV(µ_t), thus

    E[ ( Y_t − µ_t )² / V(µ_t) ] = φ.
We can use this to suggest a simple estimator for φ:

    φ̂ = (1/(T−p)) Σ_t ( Y_t − µ̂_t )² / V(µ̂_t) = (1/(T−p)) Σ_t ( Y_t − µ̂_t )² / µ'(θ̂_t),

where µ̂_t = g⁻¹(β̂'x_t) and θ̂_t = µ⁻¹(g⁻¹(β̂'x_t)). Recall that the above resembles estimators of the residual variance. Indeed, it can be argued that the distribution of (T−p)φ̂/φ is close to χ²_{T−p}.

Remark We mention that a slight generalisation of the above is when the dispersion parameter satisfies φ_t = a_t φ, where a_t is known. In this case, an estimator of φ is

    φ̂ = (1/(T−p)) Σ_t ( Y_t − µ̂_t )² / ( a_t V(µ̂_t) ) = (1/(T−p)) Σ_t ( Y_t − µ̂_t )² / ( a_t µ'(θ̂_t) ).

Deviance, scaled deviance and residual deviance

The scaled deviance

Instead of minimising a sum of squares (which is done for linear models), we have been maximising a log-likelihood L_T(β). Furthermore, we recall that

    S(β̂) = Σ_t r_t² = Σ_t ( Y_t − β̂_0 − Σ_{j=1}^q β̂_j x_{t,j} − Σ_{j=q+1}^p β̂_j x_{t,j} )²

is a numerical summary of how well the linear model fits; S(β̂) = 0 means a perfect fit. In this section we will define the equivalent of residuals and squared residuals for GLMs.

What is the best we can do in fitting a GLM? Recall

    l_t = ( Y_t θ_t − κ(θ_t) )/φ + c(Y_t,φ),

so

    dl_t/dθ_t = 0  ⟺  Y_t − κ'(θ_t) = 0.

A model that achieves this equality for all t is called saturated (the same terminology is used for linear models). In other words, we need one free parameter for each observation (the alternative g(κ'(θ_t)) = β'x_t is a restriction). Denote the corresponding θ_t by θ̃_t, i.e. the solution of κ'(θ̃_t) = Y_t. Now consider the difference

    2{ l_t(θ̃_t) − l_t(θ_t) } = (2/φ){ Y_t(θ̃_t − θ_t) − κ(θ̃_t) + κ(θ_t) }.
Maximising the likelihood is the same as minimising this quantity, which is always non-negative and is 0 only if there is a perfect fit to the tth observation. This is analogous to linear models, where maximising the normal likelihood is the same as minimising the least squares criterion, which is zero when the fit is perfect. Thus L̃_T = Σ_t l_t(θ̃_t) provides a baseline value for the log-likelihood.

Example (The normal linear model) Here κ(θ_t) = θ_t²/2, κ'(θ_t) = θ_t = µ_t, θ̃_t = Y_t and φ = σ², so

    2{ l_t(θ̃_t) − l_t(θ_t) } = (2/σ²){ Y_t·Y_t − Y_tµ_t − (1/2)Y_t² + (1/2)µ_t² } = ( Y_t − µ_t )²/σ².

Hence 2{ l_t(θ̃_t) − l_t(θ_t) } recovers the classical squared residual in the normal case.

In general, let

    D_t = 2{ Y_t( θ̃_t − θ̂_t ) − κ(θ̃_t) + κ(θ̂_t) }.

We call D = Σ_t D_t the deviance of the model. If φ is present,

    D/φ = 2{ L̃_T − L_T(θ̂) }

is the scaled deviance. Thus the residual deviance plays the same role for GLMs as the residual sum of squares does for linear models.

Interpreting D_t

We will now show that

    D_t = 2{ Y_t( θ̃_t − θ̂_t ) − κ(θ̃_t) + κ(θ̂_t) } ≈ ( Y_t − µ̂_t )² / V(µ̂_t).

To show the above we require expressions for Y_t( θ̃_t − θ̂_t ) and κ(θ̃_t) − κ(θ̂_t). We use Taylor's theorem to expand κ and κ' about θ̂_t, to obtain

    κ(θ̃_t) ≈ κ(θ̂_t) + ( θ̃_t − θ̂_t )κ'(θ̂_t) + (1/2)( θ̃_t − θ̂_t )²κ''(θ̂_t)   (17.9)

and

    κ'(θ̃_t) ≈ κ'(θ̂_t) + ( θ̃_t − θ̂_t )κ''(θ̂_t).   (17.10)
But κ'(θ̃_t) = Y_t, κ'(θ̂_t) = µ̂_t and κ''(θ̂_t) = V(µ̂_t), so (17.9) becomes

    κ(θ̃_t) − κ(θ̂_t) ≈ ( θ̃_t − θ̂_t )µ̂_t + (1/2)( θ̃_t − θ̂_t )²V(µ̂_t)   (17.11)

and (17.10) becomes

    Y_t ≈ µ̂_t + ( θ̃_t − θ̂_t )V(µ̂_t)  ⟹  θ̃_t − θ̂_t ≈ ( Y_t − µ̂_t )/V(µ̂_t).   (17.12)

Now substituting (17.11) and (17.12) into D_t gives

    D_t = 2{ Y_t( θ̃_t − θ̂_t ) − κ(θ̃_t) + κ(θ̂_t) }
        ≈ 2{ Y_t( θ̃_t − θ̂_t ) − ( θ̃_t − θ̂_t )µ̂_t − (1/2)( θ̃_t − θ̂_t )²V(µ̂_t) }
        = 2( θ̃_t − θ̂_t )( Y_t − µ̂_t ) − ( θ̃_t − θ̂_t )²V(µ̂_t)
        ≈ ( Y_t − µ̂_t )² / V(µ̂_t).

Recalling that var(Y_t) = φV(µ_t) and E(Y_t) = µ_t, D_t/φ is like a standardised squared residual. The signed square root of this approximation,

    sign( Y_t − µ̂_t ) √( ( Y_t − µ̂_t )² / V(µ̂_t) ) = ( Y_t − µ̂_t ) / √V(µ̂_t),

is called the Pearson residual. The distribution theory for this is very approximate, but a rule of thumb is that if the model fits, the scaled deviance D/φ (or in practice D/φ̂) ≈ χ²_{T−p}.

Deviance residuals

The analogy with the normal example can be taken further. The square roots of the individual terms in the residual sum of squares are the residuals, Y_t − β̂'x_t. We use the square roots of the individual terms in the deviance in the same way. However, the classical residuals can be both negative and positive, and the deviance residuals should behave in a similar way. But what sign should be used? The most obvious solution is to use

    r_t = −√D_t   if Y_t − µ̂_t < 0,
    r_t = +√D_t   if Y_t − µ̂_t ≥ 0.

We call the quantities {r_t} the deviance residuals.

Diagnostic plots
We recall that for linear models we would often plot the residuals against the regressors, to check whether a linear model is appropriate. One can make useful diagnostic plots which have exactly the same form for GLMs, except that deviance residuals are used instead of ordinary residuals, and linear predictor values instead of fitted values.

Limiting distributions and standard errors of estimators

In the majority of examples we have considered in the previous sections (see, for example, Sections 1.1 and 4.1) we observed iid {Y_t} with distribution f(·;θ). We showed that

    θ̂_T ≈ N( θ_0, 1/(T I(θ_0)) ),

where I(θ) = −∫ ( ∂²log f(x;θ)/∂θ² ) f(x;θ) dx is Fisher's information. However, this result was based on the observations being iid. In the more general setting where {Y_t} are independent but not identically distributed, we have

    β̂ ≈ N_p( β, I(β)⁻¹ ),

where now I(β) is the p×p information matrix of the entire sample; using equation (17.7) we have

    I(β)_jk = −E[ ∂²L_T/∂β_j∂β_k ] = Σ_t E[ −d²l_t/dη_t² ] x_{tj} x_{tk} = ( X'W̄X )_jk.

Thus for large samples we have

    β̂ ≈ N_p( β, ( X'W̄X )⁻¹ ),

where W̄ is evaluated at the MLE β̂.

The analysis of deviance

How can we test hypotheses about models, and in particular decide which explanatory variables to include? The two closely related methods we will consider below are the log-likelihood ratio test and an analogue of the analysis of variance (ANOVA), called the analysis of deviance. Let us concentrate on the simplest case, of testing a full vs. a reduced model. Partition the model matrix X and the parameter vector β as

    X = [ X_1  X_2 ],   β = ( β_1', β_2' )',

where X_1 is T×q and β_1 is q×1 (this is analogous to equation (17.1) for linear models). The full model is η = Xβ = X_1β_1 + X_2β_2 and the reduced model is η = X_1β_1. We wish to test

    H_0: β_2 = 0,

i.e. that the reduced model is adequate for the data.
Define the scaled deviances for the reduced and full models,

    D_R/φ = 2{ L̃_T − sup_{β_1, β_2=0} L_T(β) }   and   D_F/φ = 2{ L̃_T − sup_{β_1, β_2} L_T(β) },

where we recall that L̃_T = Σ_t l_t(θ̃_t) is the likelihood of the saturated model defined above. Taking differences we have

    ( D_R − D_F )/φ = 2{ sup_{β_1,β_2} L_T(β) − sup_{β_1, β_2=0} L_T(β) },

which is the likelihood ratio statistic. The results in Theorem 6.1.1, equation (6.5) (the log-likelihood ratio test for composite hypotheses) also hold for observations which are not identically distributed. Hence, using a generalised version of that theorem, we have

    ( D_R − D_F )/φ →D χ²_{p−q}.

So we can conduct a test of the adequacy of the reduced model by referring ( D_R − D_F )/φ to a χ²_{p−q} distribution, and rejecting H_0 if the statistic is too large (p-value too small).

If φ is not present in the model, then we are good to go. If φ is present and unknown, we estimate it with

    φ̂ = (1/(T−p)) Σ_t ( Y_t − µ̂_t )² / V(µ̂_t) = (1/(T−p)) Σ_t ( Y_t − µ̂_t )² / µ'(θ̂_t),

as above. We can then continue to use the χ²_{p−q} distribution for ( D_R − D_F )/φ̂, but since we are estimating φ we can instead use the statistic

    ( ( D_R − D_F )/(p−q) ) / ( D_F/(T−p) ),

referred to an F_{p−q, T−p} distribution, as in the normal case.

Examples

Example (Question) Suppose that {Y_t} are independent random variables from the canonical exponential family, whose logarithm satisfies

    log f(y;θ_t) = ( yθ_t − κ(θ_t) )/φ + c(y;φ),

where φ is the dispersion parameter. Let E(Y_t) = µ_t, and let η_t = β'x_t = θ_t (hence the canonical link is used), where x_t are regressors which influence Y_t. [14]
(a) (i) Obtain the log-likelihood of {Y_t, x_t}_{t=1}^T.

(ii) Denote the log-likelihood of {Y_t, x_t}_{t=1}^T as L_T(β). Show that

    ∂L_T/∂β_j = Σ_t ( Y_t − µ_t ) x_{t,j}/φ   and   ∂²L_T/∂β_k∂β_j = −Σ_t κ''(θ_t) x_{t,j} x_{t,k}/φ.

(b) Let Y_t have a Gamma distribution, where the log density has the form

    log f(y_t;µ_t) = ( −y_t/µ_t − log µ_t )/ν + { ν⁻¹ log ν⁻¹ − log Γ(ν⁻¹) } + ( ν⁻¹ − 1 ) log y_t,

with E(Y_t) = µ_t, var(Y_t) = νµ_t² and η_t = β'x_t = g(µ_t).

(i) What is the canonical link function for the Gamma distribution? Write down the corresponding likelihood of {Y_t, x_t}_{t=1}^T.

(ii) Suppose that η_t = β'x_t = β_0 + β_1 x_{t,1}. Denote the likelihood as L_T(β_0,β_1). What are the first and second derivatives of L_T(β_0,β_1)?

(iii) Evaluate the Fisher information matrix at β_0 and β_1 = 0.

(iv) Using your answers in (ii), (iii) and the MLE of β_0 with β_1 = 0, derive the score test for testing H_0: β_1 = 0 against H_A: β_1 ≠ 0.

Solution

(a) (i) The general log-likelihood for {Y_t, x_t} with the canonical link function is

    L_T(β,φ) = Σ_t { ( Y_t β'x_t − κ(β'x_t) )/φ + c(Y_t,φ) }.

(ii) In the differentiation, use that κ'(θ_t) = κ'(β'x_t) = µ_t.

(b) (i) For the Gamma distribution the canonical link is (up to sign) θ_t = 1/µ_t = β'x_t. Thus the log-likelihood is

    L_T(β) = (1/ν) Σ_t { −Y_t β'x_t + log(β'x_t) } + Σ_t c(ν⁻¹, Y_t),

where c can be evaluated from the density above.

(ii) Using part (a)(ii) above,

    ∂L_T/∂β_j = (1/ν) Σ_t ( −Y_t + 1/(β'x_t) ) x_{t,j}
    ∂²L_T/∂β_i∂β_j = −(1/ν) Σ_t x_{t,i} x_{t,j} / (β'x_t)².
(iii) Take the expectation of the above at a general β_0 and β_1 = 0.
(iv) Using the above information, use the Wald test, score test or log-likelihood ratio test.

Example (Question): It is a belief amongst farmers that the age of a hen has a negative influence on the number of eggs she lays and on the quality of the eggs. To investigate this, m hens were randomly sampled. On a given day, the total number of eggs and the number of bad eggs that each of the hens lays is recorded. Let N_i denote the total number of eggs hen i lays, Y_i the number of bad eggs she lays and x_i the age of hen i. It is known that the number of eggs a hen lays follows a Poisson distribution and that the quality (good or bad) of a given egg is independent of the other eggs. Let N_i be a Poisson random variable with mean λ_i, where we model λ_i = exp(α_0 + γ_1 x_i), and let π_i denote the probability that hen i lays a bad egg, where we model π_i with

π_i = exp(β_0 + γ_1 x_i)/(1 + exp(β_0 + γ_1 x_i)).

Suppose that α_0, β_0, γ_1 are unknown parameters.
(a) Obtain the likelihood of {(N_i, Y_i)}_{i=1}^m.
(b) Obtain the estimating function (score) of the likelihood and the information matrix.
(c) Obtain an iterative algorithm for estimating the unknown parameters.
(d) For a given (α_0, β_0, γ_1), evaluate the average number of bad eggs a 4 year old hen will lay in one day.
(e) Describe in detail a method for testing H_0: γ_1 = 0 against H_A: γ_1 ≠ 0.

Solution
(a) Since the canonical links are being used, the log-likelihood function is

L_m(α_0, β_0, γ_1) = L_m(Y|N) + L_m(N)
= Σ_{i=1}^m [ Y_i β'x_i − N_i log(1 + exp(β'x_i)) + log (N_i choose Y_i) + N_i α'x_i − exp(α'x_i) − log N_i! ],

where α = (α_0, γ_1)', β = (β_0, γ_1)' and x_i = (1, x_i)'.
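Part (a)(ii) of the first question can be checked numerically for the Poisson special case (φ = 1, κ(θ) = e^θ): the analytic score Σ_t (Y_t − μ_t)x_{t,j} should agree with a finite-difference derivative of the log-likelihood. The data below are hypothetical.

```python
import math

def poisson_loglik(beta, x, y):
    """Log-likelihood (up to the constant log y!) for Poisson with canonical log link."""
    ll = 0.0
    for xt, yt in zip(x, y):
        eta = beta[0] + beta[1] * xt
        ll += yt * eta - math.exp(eta)
    return ll

def analytic_score(beta, x, y):
    """Score for the canonical link: sum_t (y_t - mu_t) x_t."""
    s0 = s1 = 0.0
    for xt, yt in zip(x, y):
        mu = math.exp(beta[0] + beta[1] * xt)
        s0 += yt - mu
        s1 += (yt - mu) * xt
    return [s0, s1]

# hypothetical data and parameter point
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 5.0, 9.0]
beta = [0.3, 0.6]

h = 1e-6
num0 = (poisson_loglik([beta[0] + h, beta[1]], x, y)
        - poisson_loglik([beta[0] - h, beta[1]], x, y)) / (2 * h)
num1 = (poisson_loglik([beta[0], beta[1] + h], x, y)
        - poisson_loglik([beta[0], beta[1] - h], x, y)) / (2 * h)
ana = analytic_score(beta, x, y)
```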
(b) We know that if the canonical link is used the score is

∂L/∂β = Σ_{i=1}^m φ^{−1}( Y_i − κ'(β'x_i) )x_i = Σ_{i=1}^m ( Y_i − μ_i )x_i

and the second derivative is

∂²L/∂β∂β' = −Σ_{i=1}^m φ^{−1}κ''(β'x_i)x_i x_i' = −Σ_{i=1}^m var(Y_i)x_i x_i'.

Using the above, for this question the score is

∂L_m/∂α_0 = Σ_{i=1}^m (N_i − λ_i)
∂L_m/∂β_0 = Σ_{i=1}^m (Y_i − N_i π_i)
∂L_m/∂γ_1 = Σ_{i=1}^m [ (N_i − λ_i) + (Y_i − N_i π_i) ] x_i.

The second derivatives are

∂²L_m/∂α_0² = −Σ_{i=1}^m λ_i, ∂²L_m/∂α_0∂γ_1 = −Σ_{i=1}^m λ_i x_i,
∂²L_m/∂β_0² = −Σ_{i=1}^m N_i π_i(1−π_i), ∂²L_m/∂β_0∂γ_1 = −Σ_{i=1}^m N_i π_i(1−π_i) x_i,
∂²L_m/∂γ_1² = −Σ_{i=1}^m [ λ_i + N_i π_i(1−π_i) ] x_i².

Observing that E(N_i) = λ_i, the information matrix is

I(θ) =
[ Σ λ_i                 0                       Σ λ_i x_i
  0                     Σ λ_iπ_i(1−π_i)         Σ λ_iπ_i(1−π_i) x_i
  Σ λ_i x_i             Σ λ_iπ_i(1−π_i) x_i     Σ ( λ_i + λ_iπ_i(1−π_i) ) x_i² ].

(c) We can estimate θ_0 = (α_0, β_0, γ_1) using Newton–Raphson with Fisher scoring, that is

θ_i = θ_{i−1} + I(θ_{i−1})^{−1} S(θ_{i−1}),

where S(θ_{i−1}) is the score vector evaluated at θ_{i−1}:

S(θ) = ( Σ (N_i − λ_i), Σ (Y_i − N_iπ_i), Σ [ (N_i − λ_i) + (Y_i − N_iπ_i) ] x_i )'.
(d) We note that, given the regressor x_i = 4, the average number of bad eggs is

E(Y_i) = E( E(Y_i|N_i) ) = E(N_i)π_i = λ_iπ_i = exp(α_0 + γ_1 x_i) · exp(β_0 + γ_1 x_i)/(1 + exp(β_0 + γ_1 x_i)).

(e) Give either the log-likelihood ratio test, score test or Wald test.

Example (Question)
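Part (d) can be sketched in code; the parameter values below are hypothetical, chosen only to illustrate the formula E(Y_i) = λ_iπ_i.

```python
import math

def expected_bad_eggs(alpha0, beta0, gamma1, x):
    """E(Y) = E(N) * pi = lambda * pi, with lambda = exp(alpha0 + gamma1*x)
    and pi = logistic(beta0 + gamma1*x)."""
    lam = math.exp(alpha0 + gamma1 * x)
    pi = math.exp(beta0 + gamma1 * x) / (1.0 + math.exp(beta0 + gamma1 * x))
    return lam * pi

# hypothetical parameter values for a 4 year old hen
mean_bad = expected_bad_eggs(alpha0=1.5, beta0=-2.0, gamma1=0.1, x=4.0)
```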
Chapter 18
Count Data

18.1 Proportion data, count data and overdispersion

See Sections 10.4 and 10.5 of Davison (2002) and Agresti (1996) for a thorough theoretical account, and Simonoff (2003) for a very interesting, applied perspective.

In the previous section we generalised the linear model framework to the exponential family. GLMs are often used for modelling count data; in these cases usually the binomial, Poisson or multinomial distributions are used. Hence we observe (Y_t, x_t) for 1 ≤ t ≤ T. Types of data and their distributions:

Binomial: regressors x_t, response Y_t = (Y_{t,1}, Y_{t,2}) = (Y_t, N − Y_t), with
P(Y_{t,1} = k, Y_{t,2} = N−k) = (N choose k) π(β'x_t)^k (1 − π(β'x_t))^{N−k}.

Poisson: regressors x_t, response Y_t, with
P(Y_t = k) = λ(β'x_t)^k exp(−λ(β'x_t))/k!.

Multinomial: regressors x_t, response Y_t = (Y_{t,1}, ..., Y_{t,m}) with Σ_i Y_{t,i} = N, and
P(Y_{t,1} = k_1, ..., Y_{t,m} = k_m) = [ N!/(k_1!···k_m!) ] π_1(β'x_t)^{k_1} ··· π_m(β'x_t)^{k_m}.

From the data we are able to select the distribution we want to fit, and thus estimate β. In this section we will mainly be dealing with count data where the regressors tend to be ordinal (not continuous). This type of data normally comes in the form of a contingency table. One of the most common types of contingency table is the two-by-two table, and we consider this in the section below.
Two by Two Tables

Consider the following 2 × 2 contingency table, cross-classifying colour preference (blue or pink) against gender (male or female), with row, column and overall totals. (The cell counts did not survive transcription.)

Given the above table, it is natural to ask whether there is an association between gender and colour preference. The standard method is to test for independence. However, we could also pose the question in a different way: is the proportion of females who like blue the same as the proportion of males who like blue? In this case we can equivalently test for equality of proportions (this equivalence usually only holds for 2 by 2 tables). There are various methods for testing the above hypothesis:

The log-likelihood ratio test.
The score test.
The Wald test (we have not covered this in much detail, but it is basically the parameter estimator standardised with the square root of the inverse Fisher information).
Pearson residuals (which are the main motivation of the chi-squared test for independence).

The test statistic we construct depends on the test that we wish to conduct and how we choose to model the data. The fact that there can be so many tests for doing the same thing can be quite baffling. But recall that in Section 4 we showed that asymptotically most of these tests are equivalent, meaning that for large sample sizes the test statistics will be close, or close under transformation (this can be shown by making a Taylor expansion). The differences between the tests lie in their finite sample properties.

Example (Test for independence) For example, the chi-squared test for independence is based upon the Pearson residuals

Σ_{i,j} (O_{i,j} − E_{i,j})²/E_{i,j},

where O_{i,j} has a Poisson distribution with mean λ_{i,j} and E_{i,j} is an estimator of λ_{i,j} under the assumption of independence. However, we can also use the log-likelihood ratio test. Both these
tests are asymptotically equivalent, but the log-likelihood ratio test offers an explanation as to why the limiting distribution of both is a chi-squared with (R−1)(C−1) degrees of freedom (see Exercise 4.37, page 135 of Davison (2002)).

Let us consider the alternative approach, testing for equality of proportions. Let π_M denote the proportion of males who prefer pink over blue and π_F the proportion of females who prefer pink over blue. Suppose we want to test H_0: π_F = π_M against H_A: π_F ≠ π_M. One method for testing the above hypothesis is the Wald test for equality of proportions, which gives the test statistic

(π̂_F − π̂_M)I(π)^{1/2} = (π̂_F − π̂_M) / sqrt{ π̂_M(1−π̂_M)/n_1 + π̂_F(1−π̂_F)/n_2 }.

We know that if the sample size is sufficiently large, then under the null the above has a limiting standard normal distribution. It can be shown that this is asymptotically equivalent to the log-likelihood ratio and score tests (show this).

An alternative route for conducting the test is to parameterise π_M and π_F and base the test on the parametrisation. For example, without loss of generality we can rewrite π_M and π_F as

π_F = exp(γ)/(1 + exp(γ)), π_M = exp(γ+δ)/(1 + exp(γ+δ)).

Hence, using this parameterisation, the above test is equivalent to testing H_0: δ = 0 against H_A: δ ≠ 0. We can then use the log-likelihood ratio test (or any of the others). Davison (2002), Section 4, discusses the advantages of using a test based on this reparameterisation in prospective and retrospective studies; a course on biostatistics would give more details.

Count data and contingency tables

Consider the following experiment. Suppose we want to know whether ethnicity plays any role in the number of children a female has. We take a large sample of women and determine each woman's ethnicity and number of children. The data is collected in the form of a 3 × 2 contingency table, with columns Background A and Background B. (The cell counts did not survive transcription.) How can such data arise? There are several ways this data could have been
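The Wald statistic for equality of two proportions can be sketched as follows; since the table's cell counts did not survive transcription, the counts below are hypothetical.

```python
import math

def wald_two_proportions(y_m, n_m, y_f, n_f):
    """Wald statistic for H0: pi_M = pi_F; approximately N(0,1) under H0.
    y_* are 'successes' (e.g. prefer pink), n_* are group sizes."""
    p_m = y_m / n_m
    p_f = y_f / n_f
    se = math.sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)
    return (p_f - p_m) / se

# hypothetical counts: 30/100 males and 45/100 females prefer pink
z = wald_two_proportions(y_m=30, n_m=100, y_f=45, n_f=100)
```

Under the null, |z| is compared with the standard normal quantile (e.g. 1.96 at the 5% level).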
collected, and this influences the model we choose to fit to this data. Consider the general 2-dimensional case, an R × C table, with cells indexed by (i,j). Note that in the above example R = 2 and C = 3.

(a) The subjects arise at random; the study continues until a fixed time elapses. Each subject is categorised according to two variables. Suppose the number in cell (i,j) is Y_ij; then it is reasonable to assume Y_ij ∼ Poisson(λ_ij) for some {λ_ij}, which will be the focus of study. In this case the distribution is

P(Y = y) = Π_{i,j} λ_ij^{y_ij} exp(−λ_ij)/y_ij!.

(b) The total number of subjects is fixed at N, say. The numbers in cells follow a multinomial distribution, (Y_ij) ∼ M(N; (π_ij)) with Σ_{i,j} y_ij = N:

P(Y = y) = [ N!/Π_{i,j} y_ij! ] Π_{i,j} π_ij^{y_ij}.

(c) One margin is fixed: say the column totals y_{+j} = Σ_i y_ij for each j = 1,2,...,C. In each column we have an independent multinomial sample:

P(Y = y) = Π_j [ y_{+j}!/Π_i y_ij! ] Π_i ρ_ij^{y_ij},

where ρ_ij is the probability that a column-j individual is in row i (so ρ_{+j} = Σ_i ρ_ij = 1).

Of course, without knowledge of how the data was collected it is not possible to know which model to use. However, we now show that all the models are closely related, and with a suitable choice of link functions the models lead to the same estimators. We will only show the equivalence between cases (a) and (b); a similar argument can be extended to case (c). We start by showing that if π_ij and λ_ij are related in a certain way, then the log-likelihoods of both the Poisson and the multinomial are effectively the same. Define the following log-likelihoods for the Poisson, the multinomial and the sum of independent Poissons:

L_P(λ) = Σ_{i,j} [ y_ij log λ_ij − λ_ij − log y_ij! ]
L_M(π) = log[ N!/Π_{i,j} y_ij! ] + Σ_{i,j} y_ij log π_ij
L_F(λ_{++}) = N log λ_{++} − λ_{++} − log N!.
We observe that L_P is the log distribution of {y_ij} under Poisson sampling, L_M is the log distribution of {y_ij} under multinomial sampling, and L_F is the log distribution of Σ_{i,j} Y_ij, where the Y_ij are independent Poisson random variables each with mean λ_ij, N = Σ_{i,j} Y_ij and λ_{++} = Σ_{i,j} λ_ij.

Theorem Let L_P, L_M and L_F be defined as above. Suppose λ and π are related through

π_ij = λ_ij / Σ_{s,t} λ_{st} and λ_ij = Cπ_ij,

where C is independent of (i,j). Then we have L_P(λ) = L_M(π) + L_F(C).

PROOF. The proof is straightforward. Consider the log-likelihood of the Poisson:

L_P(λ) = Σ_{i,j} [ y_ij log λ_ij − λ_ij − log y_ij! ]
= Σ_{i,j} [ y_ij log(Cπ_ij) − Cπ_ij − log y_ij! ]
= Σ_{i,j} y_ij log π_ij + N log C − C − Σ_{i,j} log y_ij!   (using Σ y_ij = N and Σ π_ij = 1)
= [ log N! − Σ_{i,j} log y_ij! + Σ_{i,j} y_ij log π_ij ] + [ N log C − C − log N! ]
= L_M(π) + L_F(C),

which leads to the required result.

Remark The above result means that the likelihood of the independent Poissons, conditioned on the total number of participants being N, is equal to the likelihood of the multinomial distribution where the relationship between the probabilities and means is given above. The result basically means that, so long as the probabilities and means are connected through π_ij = λ_ij/Σ_{s,t}λ_{st} and λ_ij = Cπ_ij, it does not matter whether the multinomial distribution or the Poisson distribution is used to do the estimation. Before we show this result we first illustrate a few models which are commonly used for categorical data.

Example Let us consider suitable models for the number of children and ethnicity data. Let us start by fitting a multinomial distribution using the logistic link; we start by modelling β'x. One possible model is

β'x = η + α_1δ_1 + α_2δ_2 + α_3δ_3 + β_1δ̃_1 + β_2δ̃_2,
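The identity L_P(λ) = L_M(π) + L_F(λ_{++}) in the theorem above can be verified numerically on a small (hypothetical) table:

```python
import math

def log_poisson_lik(y, lam):
    """sum_ij [ y_ij log(lam_ij) - lam_ij - log(y_ij!) ] over a flattened table."""
    return sum(yij * math.log(lij) - lij - math.lgamma(yij + 1)
               for yij, lij in zip(y, lam))

def log_multinomial_lik(y, pi):
    n = sum(y)
    out = math.lgamma(n + 1) - sum(math.lgamma(yij + 1) for yij in y)
    return out + sum(yij * math.log(pij) for yij, pij in zip(y, pi))

def log_total_poisson(n, lam_pp):
    """log P(N = n) for N ~ Poisson(lam_++)."""
    return n * math.log(lam_pp) - lam_pp - math.lgamma(n + 1)

# hypothetical 2x2 table (flattened) with arbitrary Poisson means
y   = [3, 7, 2, 8]
lam = [2.0, 6.0, 3.0, 9.0]
lam_pp = sum(lam)
pi = [l / lam_pp for l in lam]
N = sum(y)

lhs = log_poisson_lik(y, lam)
rhs = log_multinomial_lik(y, pi) + log_total_poisson(N, lam_pp)
```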
where δ_i = 1 if the female has i children (and zero otherwise), δ̃_1 = 1 if the female belongs to ethnic group A (and zero otherwise) and δ̃_2 = 1 if the female belongs to ethnic group B (and zero otherwise). The regressors in this example are x = (1, δ_1, ..., δ̃_2). Hence for a given cell (i,j) we have β'x_ij = η_ij = η + α_i + β_j. One condition that we usually impose when doing the estimation is Σ_{i=1}^3 α_i = 0 and β_1 + β_2 = 0. These conditions make the system identifiable: without them there exist other {α̃_i}, {β̃_j} and η̃ such that η_ij = η + α_i + β_j = η̃ + α̃_i + β̃_j.

Now let us understand what the above linear model means in terms of probabilities. Using the logistic link we have

π_ij = g^{−1}(β'x_ij) = exp(η + α_i + β_j) / Σ_{s,t} exp(η + α_s + β_t) = [ exp(α_i)/Σ_s exp(α_s) ] · [ exp(β_j)/Σ_t exp(β_t) ],

where π_ij denotes the probability of having i children and belonging to ethnic group j, and x_ij is a vector with ones in the appropriate places. What we observe is that the above model is multiplicative, that is π_ij = π_{i+}π_{+j}, where π_{i+} = Σ_j π_ij and π_{+j} = Σ_i π_ij. This means that by fitting the above model we are assuming independence between ethnicity and number of children. To model dependence we would use an interaction term in the model,

β'x = η + α_1δ_1 + α_2δ_2 + α_3δ_3 + β_1δ̃_1 + β_2δ̃_2 + Σ_{i,j} γ_ij δ_i δ̃_j,

hence η_ij = η + α_i + β_j + γ_ij. However, for R × C tables an interaction term means the model is saturated, i.e. the MLE of the probability π_ij is simply y_ij/N. But for R × C × L tables we can model interactions without the model becoming saturated. These interactions may have interesting interpretations in terms of the dependence structure between two variables. By using the analysis of deviance (which is effectively the log-likelihood ratio test) we can test whether certain interaction terms are significant; similar things were done for linear models. The distribution theory comes from what we derived in Sections 4 and 16.
Now, if we transform the above probabilities into Poisson means using λ_ij = Cπ_ij, then in the case of no interaction the mean of the Poisson at cell (i,j) is λ_ij = C exp(η + α_i + β_j).
In the above we have considered various methods for modelling the probabilities in the multinomial and Poisson distributions. In the theorem below we show that, so long as the probabilities and the Poisson means are linked in a specific way, the estimators of β will be identical.

Theorem Let us suppose that π_ij and λ_ij are defined by

π_ij = π_ij(β), λ_ij = γC(β)π_ij(β),

where γ and β are the only unknowns. Let

L_P(β,γ) = Σ_{i,j} [ y_ij log(γC(β)π_ij(β)) − γC(β)π_ij(β) ]
L_M(β) = Σ_{i,j} y_ij log π_ij(β)
L_F(β,γ) = N log(γC(β)) − γC(β),

which are the log-likelihoods for the Poisson and multinomial distributions without unnecessary constants (such as y_ij!). Define

(β̂_P, γ̂_P) = argmax L_P(β,γ), β̂_M = argmax L_M(β), γ̂_F = argmax_γ L_F(β̂_M, γ).

Then β̂_P = β̂_M and γ̂_P = γ̂_F = N/C(β̂_M).

PROOF. We first consider L_P(β,γ). Since Σ_{i,j} π_ij(β) = 1 we have

L_P(β,γ) = Σ_{i,j} y_ij log(γC(β)π_ij(β)) − γC(β) Σ_{i,j} π_ij(β)
= Σ_{i,j} y_ij log π_ij(β) + N log(γC(β)) − γC(β).

Now we consider the partial derivatives of L_P, to obtain

∂L_P/∂β = ∂L_M/∂β + γ (∂C(β)/∂β) [ N/(γC(β)) − 1 ] = 0
∂L_P/∂γ = N/γ − C(β) = 0.

Solving the above, β̂_P and γ̂_P satisfy

γ̂_P = N/C(β̂_P), ∂L_M/∂β |_{β=β̂_P} = 0. (18.1)
Now we consider the partial derivatives of L_M and L_F:

∂L_M/∂β = 0, ∂L_F/∂γ = N/γ − C(β) = 0. (18.2)

Comparing the estimators in (18.1) and (18.2), it is clear that the maximum likelihood estimators of β based on the Poisson and the multinomial distributions are the same.

Example Let us consider fitting the Poisson and the multinomial distributions to the data in a contingency table, where π_ij and λ_ij satisfy

λ_ij = exp(η + β'x_ij) and π_ij = exp(β'x_ij) / Σ_{s,t} exp(β'x_{s,t}).

Making a comparison with λ_ij(β) = γC(β)π_ij(β), we see that γ = exp(η) and C(β) = Σ_{s,t} exp(β'x_{s,t}). Then, by using the above theorem, the estimator of β is the parameter which maximises

Σ_{i,j} y_ij log[ exp(β'x_ij) / Σ_{s,t} exp(β'x_{s,t}) ],

and the estimator of γ is the parameter which maximises N log(exp(η)C(β̂)) − exp(η)C(β̂), which gives η̂ = log N − log Σ_{s,t} exp(β̂'x_{s,t}).

Overdispersion

The binomial and Poisson distributions have the disadvantage that they are determined by only one parameter (π in the case of the binomial and λ in the case of the Poisson). This can be a severe disadvantage when it comes to modelling certain types of behaviour in the data. A common type of behaviour in count data is overdispersion, in the sense that the variance appears to be larger than the model variance. For example, if we fit a Poisson to the data the mean is λ, but the variance is larger than λ. Often the data can suggest overdispersion. For example, we showed in Section that the Pearson residuals for the Poisson are

r_t = (Y_t − μ̂_t)/(φ^{1/2}V(μ̂_t)^{1/2}) = (Y_t − μ̂_t)/√μ̂_t.

If the model is correct, the residuals {r_t} should be close to a standard normal distribution. However, in the case of overdispersion it is likely that the estimated variance of r_t will be greater than one. Alternatively, if var(X_t)/E(X_t) depends on μ_t, then we should see some dependence between r_t and μ̂_t in a plot. More quantitatively, this may mean that a goodness-of-fit test gives rise to a large test statistic (small p-value), etc.
The modelling of overdispersion can be introduced in various ways. Below we discuss various ways of introducing overdispersion, mainly for Poisson models.

(i) Zero-inflated models. The number of zeros in count data can sometimes be more inflated than the Poisson or binomial distributions are capable of modelling (for example, if we model the number of times a child visits the dentist, we may observe that there is a large probability the child will not visit the dentist at all). To model this type of behaviour we can use the zero-inflated Poisson model, where

P(Y = 0) = (1−p) + p exp(−λ), P(Y = k) = p exp(−λ)λ^k/k! for k > 0.

We observe that the above is effectively a mixture model. It is straightforward to show that E(Y) = pλ and var(Y) = pλ(1 + λ(1−p)), hence

var(Y)/E(Y) = 1 + λ(1−p).

We observe that there is more dispersion here than for the classical Poisson, where var(X_t)/E(X_t) = 1.

(ii) Modelling overdispersion through moments. The above model is specific to boosting zeros. One can introduce overdispersion by simply modelling the moments. That is, define a pseudo-Poisson model in terms of its moments, where E(X_t) = λ and var(X_t) = λ(1+δ), δ ≥ 0. This method does not specify the distribution, it just gives conditions on the moments. We can do similar moment boosting for the binomial distribution too.

(iii) Modelling overdispersion with another distribution. Another method of introducing overdispersion is to include a latent (unobserved) variable ε. Let us assume that ε is a positive random variable with E(ε) = 1 and var(ε) = ξ. We suppose that the distribution of Y conditioned on ε is Poisson, i.e.

P(Y = k|ε) = (λε)^k exp(−λε)/k!.

To obtain the moments of Y we note that for any random variable Y we have

var(Y) = E(Y²) − (EY)² = E[ var(Y|ε) ] + var[ E(Y|ε) ],
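A short check of the zero-inflated Poisson moment formulas above (the mixing probability p and rate λ below are hypothetical):

```python
import math

def zip_pmf(k, p, lam):
    """Zero-inflated Poisson: with probability 1-p a structural zero,
    otherwise a Poisson(lam) draw."""
    base = p * math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return (1 - p) + base
    return base

p, lam = 0.7, 3.0
# mean and variance by truncated summation (the tail beyond k = 100 is negligible)
mean = sum(k * zip_pmf(k, p, lam) for k in range(100))
var = sum(k * k * zip_pmf(k, p, lam) for k in range(100)) - mean ** 2
```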
where var(Y|ε) = Σ_{k=0}^∞ k²P(Y=k|ε) − ( Σ_{k=0}^∞ kP(Y=k|ε) )² and E(Y|ε) = Σ_{k=0}^∞ kP(Y=k|ε). Applying the above to the conditional Poisson we have

var(Y) = E(λε) + var(λε) = λ + λ²ξ = λ(1+λξ) and E(Y) = E[ E(Y|ε) ] = λ.

The above gives an expression in terms of moments. If we want to derive the distribution of Y, we require the distribution of ε. This is normally hard to verify in practice, but for reasons of simple interpretation we often let ε have a Gamma distribution,

f(ε; ν, κ) = ν^κ ε^{κ−1} Γ(κ)^{−1} exp(−νε),

where ν = κ, hence E(ε) = 1 and var(ε) = 1/ν = ξ. Therefore, in the case that ε is Gamma with density f(ε; ν, ν) = ν^ν ε^{ν−1} Γ(ν)^{−1} exp(−νε), the distribution of Y is

P(Y = k) = ∫ P(Y = k|ε) f(ε; ν, ν) dε = ∫ [ (λε)^k exp(−λε)/k! ] ν^ν ε^{ν−1} Γ(ν)^{−1} exp(−νε) dε
= [ Γ(k+ν)/(Γ(ν)k!) ] (ν/(ν+λ))^ν (λ/(ν+λ))^k.

This is called a negative binomial, because in the case that ν is an integer it resembles a regular binomial (but can take infinitely many different outcomes). The negative binomial only belongs to the exponential family if ν is known and does not need to be estimated. Not all distributions on ε lead to explicit distributions of Y. The Gamma is popular because it leads to an explicit distribution for Y (often it is called the conjugate distribution). A similar model can also be defined to model overdispersion in proportion data, using a random variable whose conditional distribution is binomial (see page 512, Davison (2002)).

Parameter estimation in the presence of overdispersion

See also Section 10.6, Davison (2002). We now consider various methods for estimating the parameters. Some of the methods described below will be based on the estimating functions and derivations from Section , equation .

Let us suppose that {Y_t} are overdispersed random variables with regressors {x_t}, E(Y_t) = μ_t and g(μ_t) = β'x_t. The natural way to estimate the parameters β is to use a likelihood method. However, the moment-based modelling of the overdispersion does not have a model attached, so it is not possible to use a likelihood method, and the modelling of the overdispersion using,
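The Gamma-mixture (negative binomial) moment formulas E(Y) = λ and var(Y) = λ(1+λξ) can be checked by summing the pmf derived above; the values of λ and ξ below are hypothetical, with ν = 1/ξ.

```python
import math

def negbin_pmf(k, lam, nu):
    """P(Y = k) = Gamma(k+nu)/(Gamma(nu) k!) (nu/(nu+lam))^nu (lam/(nu+lam))^k,
    computed on the log scale for numerical stability."""
    logp = (math.lgamma(k + nu) - math.lgamma(nu) - math.lgamma(k + 1)
            + nu * math.log(nu / (nu + lam)) + k * math.log(lam / (nu + lam)))
    return math.exp(logp)

lam, xi = 2.0, 0.5
nu = 1.0 / xi
# moments by truncated summation (the geometric tail beyond k = 400 is negligible)
mean = sum(k * negbin_pmf(k, lam, nu) for k in range(400))
var = sum(k * k * negbin_pmf(k, lam, nu) for k in range(400)) - mean ** 2
```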
say, a Gamma distribution is based on an assumption that is hard to verify in practice (that the latent variable has a Gamma distribution). Hence, though sometimes it is possible to fit an overdispersed model using likelihood methods (for example the negative binomial), often this is undesirable. An alternative approach is to use moment-based/estimating-function methods, which are more robust to misspecification than likelihood methods. In the estimation we discuss below we will focus on the Poisson case, though it can easily be generalised to the non-Poisson case. Let us return to equation (17.5):

Σ_{t=1}^T [ (Y_t − μ_t)x_{tj} / (φV(μ_t)g'(μ_t)) ] = Σ_{t=1}^T [ (Y_t − μ_t)x_{tj} / (φV(μ_t)) ] (dμ_t/dη_t) = 0, 1 ≤ j ≤ p. (18.3)

In the case of the Poisson distribution with the log link, the above is

Σ_{t=1}^T ( Y_t − exp(β'x_t) )x_{tj} = 0, 1 ≤ j ≤ p. (18.4)

We recall that if {Y_t} are Poisson random variables with mean exp(β'x_t), then the limiting distribution of β̂ is

β̂ − β ≈ N_p( 0, (X'WX)^{−1} ),

where

I(β)_{jk} = −E( ∂²L/∂β_j∂β_k ) = −Σ_t E( d²l_t/dη_t² ) x_{tj}x_{tk} = (X'WX)_{jk}

and

W = diag( −E(∂²l_1/∂η_1²), ..., −E(∂²l_T/∂η_T²) ) = diag( exp(β'x_1), ..., exp(β'x_T) ).

However, as we mentioned in Section , equations (18.3) and (18.4) do not have to be treated as derivatives of a likelihood. They can be viewed as estimating functions, since they are only based on the first and second order moments of {Y_t}. Hence they can be used as the basis of an estimation scheme, even if they are not as efficient as the likelihood. In the overdispersion literature the estimating equations (functions) are often called the quasi-likelihood.

Example Let us suppose that {Y_t} are independent random variables with mean exp(β'x_t). We use the solution of the estimating equations

Σ_{t=1}^T g_j(Y_t; β) = Σ_{t=1}^T ( Y_t − exp(β'x_t) )x_{tj} = 0, 1 ≤ j ≤ p,

to estimate β. We now derive the asymptotic variance for two cases.
(i) Modelling overdispersion through moments. Let us suppose that E(Y_t) = exp(β'x_t) and var(Y_t) = (1+δ)exp(β'x_t), δ ≥ 0. Then, if the regularity conditions are satisfied, we can use the results above to obtain the limiting variance. We observe that

E( −∂/∂β Σ_t g(Y_t;β) ) = X' diag( exp(β'x_1), ..., exp(β'x_T) ) X
var( Σ_t g(Y_t;β) ) = (1+δ) X' diag( exp(β'x_1), ..., exp(β'x_T) ) X.

Hence the limiting variance is

(1+δ)(X'WX)^{−1} = (1+δ)[ X' diag( exp(β'x_1), ..., exp(β'x_T) ) X ]^{−1}.

Therefore, in the case that the variance is (1+δ)exp(β'x_t), the variance of the estimator obtained from the estimating equations Σ_t g(Y_t;β) is larger than for the regular Poisson model. But if δ is quite small, the difference is not much. We mention that, to obtain an estimator of the limiting variance, we need to estimate δ.

(ii) Modelling overdispersion when var(Y_t) = exp(β'x_t)(1+exp(β'x_t)ξ). We mention that in this case the estimating equation Σ_t g(Y_t;β) is not fully modelling the variance. In this case we have

E( −∂/∂β Σ_t g(Y_t;β) ) = X'WX and var( Σ_t g(Y_t;β) ) = X'W̃X,

where

W = diag( exp(β'x_1), ..., exp(β'x_T) )
W̃ = diag( exp(β'x_1)(1+ξexp(β'x_1)), ..., exp(β'x_T)(1+ξexp(β'x_T)) ).

Hence the limiting variance is (X'WX)^{−1}(X'W̃X)(X'WX)^{−1}. We mention that the estimating equation can be adapted to take the overdispersion into account in this case. In other words, we can use as an estimator of β the β which solves

Σ_{t=1}^T ( Y_t − exp(β'x_t) )x_{tj} / (1+ξexp(β'x_t)) = 0, 1 ≤ j ≤ p,

though we mention that we probably also have to estimate ξ when estimating β.
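A minimal sketch of solving the estimating equation (18.4) in the intercept-only case, where the solution is β̂ = log Ȳ whatever the true (possibly overdispersed) variance; the counts below are hypothetical. Overdispersion changes only the sampling variance of β̂, not the estimator itself.

```python
import math

def estimating_eq(beta, y):
    """Intercept-only Poisson estimating function: sum_t (y_t - exp(beta))."""
    return sum(yt - math.exp(beta) for yt in y)

def solve_bisection(y, lo=-10.0, hi=10.0):
    """The estimating function is strictly decreasing in beta, so bisection applies."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if estimating_eq(mid, y) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

y = [1.0, 4.0, 0.0, 7.0, 3.0]   # hypothetical, possibly overdispersed counts
beta_hat = solve_bisection(y)    # should equal log(mean(y))
```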
Chapter 19
Survival Analysis with explanatory variables

19.1 Survival analysis and explanatory variables

In this section we build on the introduction to survival analysis given in Section 12. Here we consider the case that some explanatory variables, such as gender, age etc., may have an influence on survival times. See also Section 10.8, Davison (2002).

We recall that in Section 12 the survival times {T_i} were iid random variables, which may or may not be observed. We observe Y_i = min(T_i, c_i) and the indicator variable δ_i, which tells us whether the individual is censored or not; that is, δ_i = 1 if Y_i = T_i (i.e. the ith individual was not censored) and δ_i = 0 otherwise. In this case, we showed that the likelihood with the censoring times {c_i} treated as deterministic is given in (13.4) as

L_n(θ) = Σ_{i=1}^n [ δ_i log f(Y_i;θ) + (1−δ_i) log F(Y_i;θ) ] = Σ_{i=1}^n δ_i log h(Y_i;θ) − Σ_{i=1}^n H(Y_i;θ),

where f(t;θ), F(t;θ) = P(T_i > t), h(t;θ) = f(t;θ)/F(t;θ) and H(t;θ) = ∫_0^t h(y;θ)dy = −log F(t;θ) denote the density, survival function, hazard function and cumulative hazard function respectively.

We now consider the case that the survival times {T_i} are not identically distributed but are determined by some regressors {x_i}. Furthermore, the survival times could be censored; hence we observe {(Y_i, δ_i, x_i)}, where Y_i = min(T_i, c_i). Let us suppose that T_i has the distribution specified by the hazard function h(t; x_i, β) (hence the hazard depends on both parameters and
explanatory variables x_i), and we want to analyse the dependence on x_i. It is straightforward to see that the log-likelihood of {(Y_i, δ_i, x_i)} is

L_n(β) = Σ_{i=1}^n δ_i log h(Y_i; x_i, β) − Σ_{i=1}^n H(Y_i; x_i, β).

There are two main approaches for modelling the hazard function h:

Proportional hazards (PH): the effect of x_i is to scale the hazard function up or down.
Accelerated life (AL): the effect of x_i is to speed up or slow down time.

We recall that from the hazard function we can obtain the density of T_i, though for survival data the hazard function is usually more descriptive. In the sections below we define the proportional hazards and accelerated life hazard functions and consider methods for estimating β.

The proportional hazards model

Proportional hazard functions are used widely in medical applications. Suppose the effect of x is summarised by a one-dimensional non-negative hazard ratio function ψ(x;β), sometimes called the risk score. That is,

h(t;x,β) = ψ(x;β)h_0(t),

where h_0(t) is a fully-specified baseline hazard. We choose the scale of measurement for x so that ψ(0;β) = 1, i.e. h_0(t) = h(t;0,β). It follows that

H(t;x,β) = ψ(x;β)H_0(t), F(t;x,β) = F_0(t)^{ψ(x;β)}, f(t;x,β) = ψ(x;β)F_0(t)^{ψ(x;β)−1}f_0(t).

Recall that in question HW5 we showed that if F(x) is a survival function, then F(x)^γ also defines a survival function, hence it corresponds to a well-defined density. The same is true of the proportional hazards function: by defining h(t;x,β) = ψ(x;β)h_0(t), where h_0 is a hazard function, we have that h(t;x,β) is also a viable hazard function. A common choice is ψ(x;β) = exp(β'x), with β to be estimated. This is called the exponential hazard ratio.
MLE for the PH model with exponential hazard ratio

The likelihood corresponding to h(t;x,β) and H(t;x,β) is

L_n(β) = Σ_{i=1}^n δ_i log( exp(β'x_i)h_0(Y_i) ) − Σ_{i=1}^n exp(β'x_i)H_0(Y_i)
= Σ_{i=1}^n δ_i [ β'x_i + log h_0(Y_i) ] − Σ_{i=1}^n exp(β'x_i)H_0(Y_i),

where the baseline hazards h_0 and H_0 are assumed known. The derivative of the above likelihood is

∂L_n(β)/∂β_j = Σ_{i=1}^n δ_i x_{ij} − Σ_{i=1}^n x_{ij} e^{β'x_i} H_0(Y_i) = 0, 1 ≤ j ≤ p.

In general there is no explicit solution for β̂, but there is in some special cases. For example, suppose the observations fall into k disjoint groups with x_{ij} = 1 if i is in group j and 0 otherwise. Let m_j be the number of uncensored observations in group j, that is m_j = Σ_i δ_i x_{ij}. Then the likelihood equations become

∂L_n(β)/∂β_j = m_j − Σ_i x_{ij} e^{β_j} H_0(Y_i) = 0,

hence the MLE of β_j is

β̂_j = log[ m_j / Σ_i x_{ij} H_0(Y_i) ].

Another case that can be solved explicitly is where there is a single explanatory variable x that takes only non-negative integer values; then ∂L_n(β)/∂β is just a polynomial in e^β and may be solvable. But in general we need to use numerical methods. The numerical methods can be simplified by rewriting the likelihood as a GLM log-likelihood plus an additional term which plays no role in the estimation. This means we can easily estimate β using existing statistical software. We observe that the log-likelihood can be written as

L_n(β) = Σ_{i=1}^n δ_i [ β'x_i + log h_0(Y_i) ] − Σ_{i=1}^n exp(β'x_i)H_0(Y_i)
= Σ_{i=1}^n δ_i [ β'x_i + log H_0(Y_i) ] − Σ_{i=1}^n exp(β'x_i)H_0(Y_i) + Σ_{i=1}^n δ_i log( h_0(Y_i)/H_0(Y_i) ).

Hence the parameter which maximises L_n(β) also maximises L̃_n(β), where

L̃_n(β) = Σ_{i=1}^n δ_i [ β'x_i + log H_0(Y_i) ] − Σ_{i=1}^n exp(β'x_i)H_0(Y_i).
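The explicit group-wise MLE can be sketched directly (the data below are hypothetical; note that the denominator sums H_0(Y_i) over all members of group j, censored or not, while the numerator counts only the uncensored members):

```python
import math

def group_beta_hat(delta, H0Y, groups, j):
    """MLE of beta_j when x_ij indicates membership of group j:
    beta_j = log( m_j / sum_{i in group j} H0(Y_i) ),
    where m_j is the number of uncensored observations in group j."""
    m_j = sum(d for d, g in zip(delta, groups) if g == j)
    denom = sum(h for h, g in zip(H0Y, groups) if g == j)
    return math.log(m_j / denom)

# hypothetical data: censoring indicators, baseline cumulative hazards, group labels
delta  = [1, 0, 1, 1, 0, 1]
H0Y    = [0.5, 1.2, 0.8, 0.3, 0.9, 1.1]
groups = [0, 0, 0, 1, 1, 1]

b0 = group_beta_hat(delta, H0Y, groups, 0)
b1 = group_beta_hat(delta, H0Y, groups, 1)
```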
In other words, the likelihoods L_n(β) and L̃_n(β) lead to the same estimators. This means that we can use L̃_n(β) as a means of estimating β. The interesting feature of L̃_n(β) is that it is the log-likelihood of a Poisson distribution where δ_i is the response variable (though in our case it only takes the values zero and one) with mean λ_i = exp(β'x_i)H_0(Y_i). Hence we can do the estimation of β within the GLM framework, where we use a Poisson log-likelihood with (δ_i, x_i) as the observations and regressors and model the mean λ_i as exp(β'x_i)H_0(Y_i). It is worth mentioning that the above estimation method is based on the assumption that the baseline hazard h_0 is known. This will not always be the case, and we may want to estimate β without placing any distributional assumptions on h_0. This is possible using a Kaplan–Meier semiparametric-type likelihood; the reader is referred to a textbook on survival analysis for further details.

Accelerated life model

An alternative method for modelling the influence of explanatory variables (regressors) on the response is to use the accelerated life model. An individual with explanatory variables x is assumed to experience time speeded up by a non-negative factor ξ(x), where we suppose ξ(0) = 1, i.e. x = 0 represents the baseline again. Thus:

F(t;x) = F_0(ξ(x)t), f(t;x) = ξ(x)f_0(ξ(x)t), h(t;x) = ξ(x)h_0(ξ(x)t).

If there were only a small number of possible values for ξ(x), either through x being very discrete (ordinal) or because of the assumed form for ξ, we could just take the unique values of ξ(x) as parameters and estimate these (the same can be done in the PH case). Except in the case mentioned above, we usually assume a parametric form for ξ and estimate the parameters. As with the PH model, a natural choice is ξ(x) = e^{β'x}. Popular choices for the baseline F_0 are the exponential, gamma, Weibull, log-normal and log-logistic.

MLE for the AL model with exponential speed-up

In this section we will assume that ξ(x) = exp(β'x). Hence F(t;x) = F_0(exp(β'x)t). There are various methods we can use to estimate β.
One possibility is to go the likelihood route,

L_n(β) = Σ_{i=1}^n δ_i [ β'x_i + log h_0( exp(β'x_i)Y_i ) ] − Σ_{i=1}^n H_0( exp(β'x_i)Y_i ),
where the baseline hazard function h_0 is known. But this would mean numerically maximising the likelihood through brute force, and to use such a method we would require a good initial value for β. To obtain a good initial value, we now consider an alternative method for estimating β. Let us define the transformed random variable W = log T + β'x. The distribution function of W is

P(W ≤ w) = P(log T ≤ w − β'x) = 1 − P(T > exp(w − β'x)) = 1 − F(e^{w−β'x}; x) = 1 − F_0(e^w).

Thus W has a distribution that is independent of x, and indeed is completely known if we assume the baseline is fully specified. Hence log T satisfies the linear model

log T_i = μ_0 − β'x_i + ε_i,

where E(W) = μ_0 and the ε_i are iid random variables with mean zero. Hence, if the observations have not been censored, we can estimate β by maximising the log-likelihood

Σ_{i=1}^n [ β'x_i + log f_0( exp(log T_i + β'x_i) ) ].

However, an even simpler method is to use classical least squares to estimate β. In other words, use the μ̂ and β̂ which minimise

Σ_{i=1}^n ( log T_i − μ + β'x_i )²

as estimators of μ_0 and β respectively. This gives us the best minimum variance linear unbiased estimator (MVLUE) of β, but it is worth mentioning that a likelihood-based estimator gives a smaller asymptotic variance. If there is censoring, there are more complicated algorithms for censored linear models, or we can use Newton–Raphson to solve the likelihood equations. Unlike the proportional hazards model, there is no connection between parameter estimation in accelerated life models and GLM.

The relationship between the PH and AL models

The survivor functions under the two models are

PH with hazard ratio function ψ: F(t;x) = F_0(t)^{ψ(x)}, and AL with speed-up function ξ: F(t;x) = F_0(ξ(x)t).

Let us suppose the baseline survival distribution in both cases is the Weibull with

F_0(t) = exp{ −(t/θ)^α }.
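The least-squares route can be sketched as follows; with noiseless (hypothetical) data generated from log T_i = μ_0 − β x_i, ordinary least squares recovers μ_0 and β exactly:

```python
def ols_line(x, z):
    """Least squares fit of z = a + b*x; returns (a_hat, b_hat)."""
    n = len(x)
    xbar = sum(x) / n
    zbar = sum(z) / n
    b = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z)) / \
        sum((xi - xbar) ** 2 for xi in x)
    return zbar - b * xbar, b

# hypothetical noiseless accelerated-life data: log T_i = mu0 - beta * x_i
mu0, beta = 1.0, 0.4
x = [0.0, 1.0, 2.0, 3.0]
logT = [mu0 - beta * xi for xi in x]

a_hat, b_hat = ols_line(x, logT)
beta_hat = -b_hat   # the slope estimates -beta
```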
Hence, using this distribution, the proportional hazards and accelerated life survival functions are

F_PH(t;x) = exp{ −(t/θ)^α ψ(x) } and F_AL(t;x) = exp{ −(tξ(x)/θ)^α }.

Comparing the above survival functions, we see that if ξ(x) = ψ(x)^{1/α}, then we have F_PH(t;x) = F_AL(t;x). In fact, it is quite easy to show that this is the only case where the two models coincide.

Goodness of fit

As in most cases of statistical modelling, we want to verify whether a model is appropriate for a certain data set. In the case of linear models we do this by considering the residual sum of squares, and for GLM we consider the deviance (see Section ). The notion of residual can be extended to survival data. We recall that residuals in general should be pivotal, or asymptotically pivotal, in the sense that their distribution should not depend on the unknown parameters. We now make a transformation of the survival data which is close to pivotal if the survival distribution and model are correctly specified.

Let us first consider the case that the data is not censored. Let T_i denote the survival time, with survival function F_i and cumulative hazard function H_i(t) = −log F_i(t) (later we will introduce its dependence on the explanatory variable x_i). Let us consider the distribution of H_i(T_i):

P(H_i(T_i) ≤ y) = P(−log F_i(T_i) ≤ y) = P(F_i(T_i) ≥ exp(−y)) = P(T_i ≤ F_i^{−1}(exp(−y))) = 1 − F_i(F_i^{−1}(exp(−y))) = 1 − exp(−y).

Hence the distribution of H_i(T_i) is exponential with mean one; in other words, it does not depend on any unknown parameters. Therefore, in the case of uncensored data, to check the adequacy of the model we can fit the survival models {F(t, x_i; β)} to the observations {T_i} and check whether the transformed data {H(T_i, x_i; β̂)} are close to iid exponentials. These are called the Cox–Snell residuals; they can be modified in the case of censoring.
Chapter 20

Nonparametric regression

20.1 A quick glimpse at Nonparametric Regression

Let us return to the linear models which motivated the GLM framework. We recall that in a linear model Y_t = β'x_t + ε_t. We can generalise the linear mean by assuming that Y_t = µ(β'x_t) + ε_t, where the function µ is known and our objective is to estimate β. The above models are known as parametric models. However, sometimes parametric models are not flexible enough; for example, often we do not know the function µ. An alternative, more flexible method is to work within a nonparametric framework.

In this section we will briefly consider methods for estimating the regression function g : [0,1] → R in the model

Y_i = g(x_i) + ε_i,

where {ε_i} are iid random variables with E(ε_i) = 0 and var(ε_i) = σ². One interpretation of the above model is that we observe a corrupted signal Y_i and we want to estimate the underlying signal. We do not assume that g has a parametric form (i.e. linear etc.), but we do assume that g is smooth, in the sense that the first r derivatives exist and are bounded (typically it is assumed that r = 2).

There are many different methods for estimating g. The most popular tend to be:

- Kernel-based methods.
- Local polynomial methods (the kernel method is a special case).
- Splines.
- Series expansions, such as Fourier series, wavelets etc.
- Penalised least squares (often called the roughness penalty approach).
- Wavelet thresholding methods (often called nonlinear estimation).

We now give a very brief tour of some of the methods described above. For a full treatment you should attend a course on nonparametric estimation. We briefly mention that lower bounds for the mean squared error of estimators of g can be obtained using techniques similar to those used in earlier sections.

Local linear regression

Let us suppose that g(x) does not have a known parametric form, but it is smooth, in the sense that its first two derivatives exist. Then by using a Taylor expansion we have

g(x) = g(x_0) + (x − x_0)g'(x_0) + O((x − x_0)²) ≈ g(x_0) + (x − x_0)g'(x_0),

if x and x_0 are close. In other words, g(x) may not have a parametric form, but for x in a neighbourhood of x_0 we can approximate g by a line. That is, g(x) ≈ a_0 + a_1(x − x_0), for some a_0 and a_1, in a neighbourhood of x_0 (i.e. [x_0 − h, x_0 + h]). We observe that the approximating coefficients (a_0, a_1) are only valid in a neighbourhood of x_0; therefore to estimate (a_0, a_1) we only use those regressors in a neighbourhood of x_0. We do this by using local least squares, rather than the global least squares which we would use for a regular linear model. Hence we define the local linear least squares criterion

L_T(x_0; a_0, a_1) = (1/(hT)) Σ_i K((x_i − x_0)/h) ( Y_i − a_0 − a_1(x_i − x_0) )²,

where K is a kernel, similar to the kernel used in nonparametric density estimation. We use

(â_0(x_0), â_1(x_0)) = arg min_{a_0, a_1} L_T(x_0; a_0, a_1)

as estimators of (a_0, a_1). We mention that explicit expressions for (â_0(x_0), â_1(x_0)) can easily be obtained. Moreover, for every x_0 we need to re-evaluate (â_0(x_0), â_1(x_0)), and the estimator of g(x_0) is ĝ(x_0) = â_0(x_0).
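The local linear estimator above can be sketched in a few lines. This is my own minimal implementation (a Gaussian kernel and simulated data, both my choices), not code from the notes:

```python
import numpy as np

# Minimal sketch of the local linear estimator: g_hat(x0) = a0_hat(x0)
# from a kernel-weighted least squares fit in a neighbourhood of x0.
def local_linear(x, y, x0, h):
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)          # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])  # design for a0 + a1*(x - x0)
    A = X.T @ (w[:, None] * X)                      # weighted normal equations
    b = X.T @ (w * y)
    a0, a1 = np.linalg.solve(A, b)
    return a0                                       # local intercept estimates g(x0)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 500)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=500)
print(local_linear(x, y, 0.25, 0.05))   # close to sin(pi/2) = 1
```

The whole curve is recovered by repeating the fit over a grid of x_0 values.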
Remark (The Nadaraya-Watson estimator) The local constant least squares estimator is a simplified version of the above estimator. In this case we locally approximate g by a constant, g(x) ≈ c_0. The local constant least squares criterion is

L_T(x_0; c) = (1/(hT)) Σ_i K((x_i − x_0)/h) (Y_i − c)².

We use as an estimator of g(x_0) the minimiser ĉ(x_0), where

ĉ(x_0) = arg min_c L_T(x_0; c) = Σ_i K((x_i − x_0)/h) Y_i / Σ_i K((x_i − x_0)/h).

The above estimator is usually called the Nadaraya-Watson estimator.

We mentioned above that the Nadaraya-Watson estimator is a simplified version of the local linear regression estimator. The reason that the local linear regression estimator is often preferred to the Nadaraya-Watson estimator is that if the second derivative of g exists, it can be shown that the local linear estimator â_0(x_0) has a smaller bias than ĉ(x_0). More precisely, straightforward calculations show that

E( â_0(x_0) − g(x_0) )² = O( h⁴ + 1/(hT) ),  whereas  E( ĉ(x_0) − g(x_0) )² = O( h² + 1/(hT) ).

We mention that as T grows we observe the corrupted function g more often in the neighbourhood of x_0, hence we are getting more information on the behaviour of the function at x_0. But to reduce the bias we must let the neighbourhood of localisation shrink: h → 0 as T → ∞.

Of course, we can generalise the local linear estimator to a local quadratic estimator. In this case, if the third derivative of g exists, then the bias of the local quadratic estimator will be smaller than the bias of the local linear estimator. But this gain only arises if the third derivative of g is bounded, which can be quite a strong assumption. Moreover, the more parameters we estimate (in local linear regression we estimate two, in local quadratic regression we estimate three), the larger the variance of the estimator. Therefore it is not clear whether we really gain by using higher order polynomials in estimation.

There are various methods for estimating the bandwidth h; these include cross-validation etc.
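The Nadaraya-Watson estimator in the remark above is just a kernel-weighted average, which makes the sketch even shorter; again the Gaussian kernel and the simulated data are my own choices:

```python
import numpy as np

# Minimal sketch of the Nadaraya-Watson (local constant) estimator:
# a kernel-weighted average of the responses near x0.
def nadaraya_watson(x, y, x0, h):
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 500)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=500)
print(nadaraya_watson(x, y, 0.25, 0.05))   # close to sin(pi/2) = 1
```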
See also Section 10.7 in Davison (2002).

Series expansion methods

An alternative method for estimating g is to write it as a series expansion and estimate the coefficients in the expansion. Define L²[0,1] as the set of all functions f : [0,1] → R such that ∫₀¹ f(x)² dx < ∞. Let {φ_k}_k be an orthonormal basis of L²[0,1], meaning that ∫₀¹ φ_{k₁}(x) φ_{k₂}(x) dx = 1 if k₁ = k₂ and zero otherwise. Suppose that g ∈ L²[0,1]; then we can represent g as

g(x) = Σ_{k=−∞}^{∞} a_k φ_k(x),  where  a_k = ∫₀¹ g(x) φ_k(x) dx.

An example of an orthonormal basis is the Fourier basis {exp(2πikx)}_k. Define the truncated function

g_r(x) = Σ_{k=−r}^{r} a_k φ_k(x).

It is clear that as r → ∞ the truncated function g_r(x) gets closer to the true function g(x). The difference g_r(x) − g(x) depends on two main factors: (i) the smoothness of the function g and (ii) the properties of the basis. For example, bounds for the difference between the truncated Fourier series expansion g_r(x) and g(x) are well known.

Recall that we observe Y_i = g(x_i) + ε_i. Under certain conditions on the design {x_i}, we can estimate the coefficient a_k with

â_{T,k} = (1/T) Σ_{i=1}^{T} Y_i φ_k(x_i)

and use

ĝ_{T,r}(x) = Σ_{k=−r}^{r} â_{T,k} φ_k(x)

as an estimator of g_r(x). Since g_r(x) is an approximation of g(x), ĝ_{T,r}(x) is an estimator of g(x). The quality (mean squared error) of this approximation depends on various factors:

- The larger r, the smaller the bias in the estimator, i.e. E(ĝ_{T,r}(x)) − g(x) → 0 as r → ∞.
- However, a large r leads to a large variance var(ĝ_{T,r}(x)), since we have to estimate a large number of coefficients.

From the above we see that, as with kernel estimators, there is a trade-off between the bias and the variance.

Penalised least squares estimators

An alternative method of estimation is penalised least squares. This is a standard least squares estimator of the function g, but to prevent overfitting the least squares criterion is penalised with a penalty that penalises rough functions, which tend to overfit the data. We use the criterion

Σ_i ( Y_i − h(x_i) )² + λ ∫ h''(x)² dx,
where λ is the penalty, which is chosen a priori, as the basis for the estimation. An estimator of g is the function ĝ_T(x) which minimises the above criterion. It is clear from the construction of ĝ_T that the second derivative of ĝ_T(x) must exist. Moreover, Green and Silverman (1994) show that the minimising function ĝ_T(x) can easily be obtained, and is a natural cubic spline (which can be represented in a B-spline basis).
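Returning to the series expansion method above, a minimal sketch of the estimator ĝ_{T,r}; the real cosine basis and all variable names are my own choices (the notes use the complex Fourier basis):

```python
import numpy as np

# Series estimator sketch: a_hat_k = (1/T) sum_i Y_i phi_k(x_i), then
# g_hat(x) = sum_k a_hat_k phi_k(x) with a truncated cosine basis on [0,1].
rng = np.random.default_rng(4)
T = 2000
x = (np.arange(T) + 0.5) / T                 # regular design on [0,1]
g = lambda u: np.cos(2 * np.pi * u) + 0.5
y = g(x) + 0.3 * rng.normal(size=T)

r = 5                                        # truncation level
phi = [np.ones(T)] + [np.sqrt(2) * np.cos(2 * np.pi * k * x) for k in range(1, r + 1)]
a_hat = np.array([np.mean(y * p) for p in phi])

def g_hat(u):
    basis = [1.0] + [np.sqrt(2) * np.cos(2 * np.pi * k * u) for k in range(1, r + 1)]
    return float(np.dot(a_hat, basis))

print(g_hat(0.0))    # close to g(0) = 1.5
```

Increasing r reduces the truncation bias but inflates the variance, which is exactly the trade-off described in the text.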
Chapter 21

A short review of Time Series

21.1 A very short blast of Time Series

Up until now, we have generally assumed that the observations {Y_t} are independent. In situations where {Y_t} is observed over time, the assumption of independence may be unrealistic, especially for observations which are close in observation time. If ignored, dependency in the data can cause many problems. The least bad scenario is that the parameter estimators will be consistent, but the standard errors may be wrong, leading to misleading CIs etc. A worse situation is that the model will be completely misspecified, leading to wrong conclusions. A good introduction to time series is Shumway and Stoffer (2006); more advanced texts are Priestley (1988), Brockwell and Davis (1991) and Pourahmadi.

For this reason, it is important to check for dependency and, if there is evidence of dependence, to model the dependence structure. Often (but not always) the dependency will manifest as correlation between observations. The covariance can be estimated using the empirical covariance at lag r, which is

ĉ_T(r) = (1/T) Σ_{t=1}^{T−|r|} Y_t Y_{t+|r|}.

Now if {Y_t} are independent, then ĉ_T(r) should be close to zero. Indeed, if {Y_t} are iid random variables, then

√T ρ̂_T(r) →_D N(0, 1),

where ρ̂_T(r) = ĉ_T(r)/ĉ_T(0) is the empirical correlation. On the other hand, if {Y_t} is a stationary time series, then ĉ_T(r) is an estimator of the covariance at lag r, which is c(r) = cov(Y_t, Y_{t+r}). Stationarity is an important (sometimes restrictive) assumption that is often used in time series; it basically means that the structure of the observations {Y_t} does not change over time. More precisely, the joint distributions of (Y_{t₁}, ..., Y_{t_k}) and (Y_{τ+t₁}, ..., Y_{τ+t_k}) are the same for all τ and sequences t₁, ..., t_k.
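The empirical covariance and correlation above can be sketched directly (a minimal illustration on iid data, where the √T-scaled correlation should look like a standard normal draw; names are mine):

```python
import numpy as np

# Empirical covariance at lag r (mean assumed zero, as in the display above).
def c_hat(y, r):
    T = len(y)
    r = abs(r)
    return np.sum(y[:T - r] * y[r:]) / T

rng = np.random.default_rng(6)
y = rng.normal(size=5000)                 # iid data, so correlations ~ 0
rho1 = c_hat(y, 1) / c_hat(y, 0)
print(rho1, np.sqrt(len(y)) * rho1)       # sqrt(T) * rho_hat is roughly N(0,1)
```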
There are several tests for non-correlation. Under the assumption of independence, ĉ_T(r) is an estimator of zero, hence many tests for non-correlation are based on ĉ_T(r). The most popular is the Ljung-Box test, which is an adaption of the Box-Pierce test, where we define the test statistic

S_T = T(T + 2) Σ_{r=1}^{h} ρ̂_T(r)² / (T − r).

Under the null of independence, using that √T ρ̂_T(r) →_D N(0, 1), we have that S_T →_D χ²_h. Hence the Ljung-Box test can be used to test for correlation in the observations. If there are correlations, then S_T will be approximately a non-central chi-squared, where the non-centrality parameter is a function of {√T ρ(r)}_{r=1}^{h}. This means we will tend to reject the null if the alternative is true. Adaptions of this test are used in time series to check for goodness of fit.

Remark It is important to mention that there exist dependent time series which are uncorrelated. One famous example is the GARCH model, commonly used to model financial time series: the time series is uncorrelated but dependent. This is a nonlinear time series and does not fit into the framework of linear time series models which is described in this section.

Let us suppose that we reject the null, so there is evidence to suggest that there is linear dependence. If we can assume that the time series is stationary, a commonly used time series model is the autoregressive (AR) process. The observations {Y_t} are said to come from an AR(p) model if they satisfy

Y_t = Σ_{j=1}^{p} a_j Y_{t−j} + ε_t,

where {ε_t} are iid random variables with E(ε_t) = 0 and var(ε_t) = σ². To ensure stationarity, some assumptions are required on the coefficients {a_j}.

Remark (Properties of the autoregressive process) The autoregressive process is said to be p-Markovian, in the sense that the distribution of Y_t given Y_{t−1}, ..., Y_{t−m}, where m > p, is the same as the distribution of Y_t given Y_{t−1}, ..., Y_{t−p}.
In other words, the best predictor of Y_t given the past only requires the observations {Y_{t−j}}_{j=1}^{p}; we cannot improve the prediction by using observations further in the past.

It is worth mentioning that the inverse of the variance/covariance matrix Σ = var(Y_1, ..., Y_T) has a very interesting structure. The elements of the inverse of any variance/covariance matrix give information about the conditional linear dependence between two random variables. Let ρ_{i,j} = (Σ^{−1})_{i,j}. Then it can be shown that, if ρ_{i,j} = 0, then the conditional covariance between Y_i
and Y_j given all the other random variables is zero, i.e. cov(Y_i, Y_j | Y_k; k ≠ i, j) = 0 (this is called the partial covariance). This is quite a useful result and is used widely. It also means that in the case of AR(p) processes the inverse matrix Σ^{−1} will be a band matrix (non-zero only along the diagonal and a few rows off the diagonal).

Recall that we encountered the AR(1) model earlier. To refresh our memory we will return to this example. Let us suppose p = 1 and {Y_t} satisfies the AR(1) model Y_t = aY_{t−1} + ε_t, where |a| < 1. Now, by iterating backwards,

Y_t = a(aY_{t−2} + ε_{t−1}) + ε_t = ... = Σ_{j=0}^{t−1} a^j ε_{t−j} + a^t Y_0,

so Y_t almost surely has the solution Y_t = Σ_{j=0}^{∞} a^j ε_{t−j}. Since the ε_t are iid random variables, we have

var(Y_t) = Σ_{j=0}^{∞} a^{2j} σ² = σ²/(1 − a²),  cov(Y_t, Y_{t+r}) = Σ_{j=0}^{∞} a^{2j+|r|} σ² = σ² a^{|r|}/(1 − a²).   (21.1)

Using conditional arguments, the log-likelihood of {Y_t}_{t=1}^{T} is

L_T(Y; θ) = log f(Y_1; θ) + Σ_{t=2}^{T} log f(Y_t | Y_{t−1}, ..., Y_1; θ)
          = log f_X(Y_1; θ) + Σ_{t=2}^{T} log f(Y_t | Y_{t−1}; θ)
          = log f_X(Y_1; θ) + Σ_{t=2}^{T} log f_ε(Y_t − aY_{t−1}),

where f_ε is the density of {ε_t} and f_X is the marginal density of Y_t. If the innovations ε_t are Gaussian then, since Y_t = Σ_{j=0}^{∞} a^j ε_{t−j} (i.e. Y_t is a sum of Gaussian random variables), Y_t is also Gaussian. In this case the log-likelihood is proportional to

L_T(a, σ) = −(1/2) log( σ²/(1 − a²) ) − Y_1² / ( 2σ²/(1 − a²) ) − (T − 1) log σ − (1/(2σ²)) Σ_{t=2}^{T} (Y_t − aY_{t−1})².

We use the parameters which maximise the above as estimators of σ and a. We note that if we remove the terms coming from the marginal density log f_X(Y_1; θ), we have the conditional log-likelihood

−(T − 1) log σ − (1/(2σ²)) Σ_{t=2}^{T} (Y_t − aY_{t−1})²,
which is often used to estimate the AR(1) parameter a, even if Y_t is non-Gaussian. The maximum of the above is the least squares estimator of a,

â_{LS,T} = Σ_{t=2}^{T} Y_t Y_{t−1} / Σ_{t=2}^{T} Y_{t−1}².

The same method can be used to construct estimators of AR(p) parameters.

It is worth remarking that if Y_t has a non-zero mean and satisfies Y_t = µ + aY_{t−1} + ε_t, then {Y_t} has the solution

Y_t = Σ_{j=0}^{∞} a^j (µ + ε_{t−j}) = µ/(1 − a) + Σ_{j=0}^{∞} a^j ε_{t−j}.

In the case that {ε_t} is Gaussian, the above model has log-likelihood proportional to

L_T(a, σ, µ) = −(1/2) log( σ²/(1 − a²) ) − ( Y_1 − µ/(1 − a) )² / ( 2σ²/(1 − a²) ) − (T − 1) log σ − (1/(2σ²)) Σ_{t=2}^{T} ( Y_t − µ/(1 − a) − a( Y_{t−1} − µ/(1 − a) ) )².

Usually we reparametrise the above set-up and define µ̃ = µ/(1 − a), so that the log-likelihood becomes

L_T(a, σ, µ̃) = −(1/2) log( σ²/(1 − a²) ) − ( Y_1 − µ̃ )² / ( 2σ²/(1 − a²) ) − (T − 1) log σ − (1/(2σ²)) Σ_{t=2}^{T} ( Y_t − µ̃ − a(Y_{t−1} − µ̃) )².

We then use the µ̃, σ² and a which maximise the above as the parameter estimators.

We observe that the variance of the sample mean Ȳ = (1/T) Σ_{t=1}^{T} Y_t is different from that of iid random variables. Using (21.1) we have

var(Ȳ) = (1/T²) Σ_{t,τ=1}^{T} cov(Y_t, Y_τ) = (1/T²) Σ_{t=1}^{T} var(Y_t) + (2/T²) Σ_{r=1}^{T−1} (T − r) cov(Y_t, Y_{t+r}) = σ²/( T(1 − a²) ) + (2/T) Σ_{r=1}^{T−1} (1 − r/T) σ² a^r/(1 − a²).

Under the assumption that Ȳ is asymptotically normal (this is true, but it is not trivial to show), we can then construct CIs for the mean based on the var(Ȳ) derived above. We observe that, due to the dependence, the CI will be different from the CI constructed under the assumption of independence. It is worth noting that the limiting variance is

T var(Ȳ) → σ²/(1 − a²) + 2 Σ_{r=1}^{∞} σ² a^r/(1 − a²) = σ²/(1 − a)²,  as T → ∞.
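The least squares estimator and the sample-mean variance formula above can both be sketched numerically (a minimal illustration on simulated data; all names are mine):

```python
import numpy as np

# Estimate the AR(1) parameter by least squares, and check the limiting
# variance formula T * var(Ybar) -> sigma^2 / (1 - a)^2 numerically.
rng = np.random.default_rng(8)
T, a_true, sigma = 5000, 0.5, 1.0

y = np.zeros(T)
for t in range(1, T):
    y[t] = a_true * y[t - 1] + sigma * rng.normal()

a_ls = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)   # least squares estimator
print(a_ls)                                           # close to 0.5

# Finite-T variance of the sample mean from the display above, vs its limit.
r = np.arange(1, T)
var_ybar = sigma**2 / (T * (1 - a_true**2)) \
    + (2 / T) * np.sum((1 - r / T) * sigma**2 * a_true**r / (1 - a_true**2))
print(T * var_ybar, sigma**2 / (1 - a_true) ** 2)     # both close to 4.0
```

With a = 0.5 the limiting variance is four times the iid value σ², so a CI for the mean that ignores the dependence would be far too narrow.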
Remark (The Yule-Walker estimator) The above gives the least squares estimator of the AR(1) parameter. There is an alternative method of estimation, called the Yule-Walker estimator. Asymptotically both methods are the same, but the Yule-Walker estimator can have finite sample properties which are often more desirable than those of the least squares estimator. We now describe the Yule-Walker estimator. The AR(1) model is Y_t = aY_{t−1} + ε_t. Multiplying both sides of this equation by Y_{t−1} and taking expectations (assuming this is allowed) leads to

E(Y_t Y_{t−1}) = a E(Y_{t−1}²) + E(ε_t Y_{t−1}) = a E(Y_{t−1}²),

since E(ε_t Y_{t−1}) = 0. Hence E(Y_t Y_{t−1}) − a E(Y_{t−1}²) = 0. By estimating E(Y_t Y_{t−1}) and E(Y_{t−1}²) with

ĉ_T(1) = (1/T) Σ_{t=1}^{T−1} Y_t Y_{t+1}  and  ĉ_T(0) = (1/T) Σ_{t=1}^{T} Y_t²,

we have the almost unbiased estimating equation G_T(a) = ĉ_T(1) − a ĉ_T(0) = 0. The Yule-Walker estimator is the solution of this equation,

â_{YW,T} = ĉ_T(1) / ĉ_T(0) = Σ_{t=1}^{T−1} Y_t Y_{t+1} / Σ_{t=1}^{T} Y_t².

An advantage of the Yule-Walker estimator over the least squares estimator is that |â_{YW,T}| < 1, hence it always leads to a viable parameter estimate (recall that for {Y_t} to be a stationary time series we require that |a| < 1). On the other hand, it is not guaranteed that in finite samples |â_{LS,T}| < 1; however, asymptotically â_{LS,T} →_P a, where |a| < 1.

A lot of time series analysis is based on discrete-time series, for example the autoregressive model described above. It is worth concluding this section by discussing, briefly, continuous-time series. Continuous-time models such as the Black-Scholes equation are often used to model financial data (so-called high frequency data). Another application where continuous-time series arise is in the nonparametric, functional data literature. Let us suppose that {X(t); t ∈ [a, b]} is a continuous time process observed on the interval [a, b].
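A minimal sketch comparing the two estimators from the remark above on the same simulated AR(1) series (the simulation set-up is my own):

```python
import numpy as np

# Least squares vs Yule-Walker estimators of the AR(1) parameter.
rng = np.random.default_rng(9)
T, a_true = 5000, 0.8

y = np.zeros(T)
for t in range(1, T):
    y[t] = a_true * y[t - 1] + rng.normal()

a_ls = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)   # least squares
a_yw = np.sum(y[:-1] * y[1:]) / np.sum(y ** 2)        # c_hat(1) / c_hat(0)
print(a_ls, a_yw)    # both close to 0.8
```

The two estimates differ only in the denominator, yet the Yule-Walker denominator includes every Y_t², which is what forces |â_YW| < 1 in every finite sample.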
Let cu, v = covxu, Xv, the covariance can be estimated if we observe several realisations of {Xu}, for example, in longitudual data, where we observe the iid {X i u} and estimate cu,v with ĉ T u,v = 1 T T X iux i v we have assumed EXu = 0. We are not assuming that 221
{X(u)} is stationary. As in principal component analysis (PCA), it is well known that X(u) and c(u, v) admit the representations

c(u, v) = Σ_{k=1}^{∞} λ_k e_k(u) e_k(v),  X(u) = Σ_{k=1}^{∞} ξ_k e_k(u),   (21.2)

where {e_k(u)} are orthogonal eigenfunctions, the λ_k are non-negative eigenvalues with λ_k → 0 as k → ∞, and {ξ_k} are uncorrelated random variables, where ξ_k = ∫_a^b X(u) e_k(u) du. Equation (21.2) can be used as a means of modelling c(u, v) by imposing an orthogonal basis. Moreover, often the eigenfunctions {e_k(u)} can be used as a way of describing X(u) in terms of its main features. See Ramsay and Silverman (2002) and the related literature.
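A minimal numerical sketch of (21.2) on a grid (the simulated eigenfunctions, eigenvalues and all names are my own): estimate c(u, v) from iid realisations and recover the leading eigenvalues by an eigendecomposition of the discretised covariance.

```python
import numpy as np

# Discretised functional PCA sketch: estimate c(u,v) from n iid curves
# observed on an m-point grid, then eigendecompose.
rng = np.random.default_rng(10)
m, n = 50, 400
u = np.linspace(0, 1, m)

# Simulate X_i(u) = xi_1 e_1(u) + xi_2 e_2(u) with known eigenfunctions
# e_1, e_2 and eigenvalues 4 and 1 (var(xi_k) = lambda_k).
e1 = np.sqrt(2) * np.sin(np.pi * u)
e2 = np.sqrt(2) * np.sin(2 * np.pi * u)
xi = rng.normal(size=(n, 2)) * np.sqrt([4.0, 1.0])
X = xi @ np.vstack([e1, e2])              # n x m data matrix, mean zero

c_hat = X.T @ X / n                       # c_hat(u,v) = (1/n) sum X_i(u) X_i(v)
evals, evecs = np.linalg.eigh(c_hat)
lam = evals[::-1][:2] / m                 # rescale by the grid size 1/m
print(lam)                                # roughly [4, 1]
```

The columns of evecs corresponding to the largest eigenvalues approximate the eigenfunctions e_k(u) up to sign and the grid normalisation.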
E3: PROBABILITY AND STATISTICS lecture notes
E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................
LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: [email protected] Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
SF2940: Probability theory Lecture 8: Multivariate Normal Distribution
SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2015 Timo Koski Matematisk statistik 24.09.2015 1 / 1 Learning outcomes Random vectors, mean vector, covariance matrix,
The Heat Equation. Lectures INF2320 p. 1/88
The Heat Equation Lectures INF232 p. 1/88 Lectures INF232 p. 2/88 The Heat Equation We study the heat equation: u t = u xx for x (,1), t >, (1) u(,t) = u(1,t) = for t >, (2) u(x,) = f(x) for x (,1), (3)
Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)
Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
An extension of the factoring likelihood approach for non-monotone missing data
An extension of the factoring likelihood approach for non-monotone missing data Jae Kwang Kim Dong Wan Shin January 14, 2010 ABSTRACT We address the problem of parameter estimation in multivariate distributions
Notes from Week 1: Algorithms for sequential prediction
CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking
CITY UNIVERSITY LONDON. BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION
No: CITY UNIVERSITY LONDON BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION ENGINEERING MATHEMATICS 2 (resit) EX2005 Date: August
Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University
Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision
APPLIED MATHEMATICS ADVANCED LEVEL
APPLIED MATHEMATICS ADVANCED LEVEL INTRODUCTION This syllabus serves to examine candidates knowledge and skills in introductory mathematical and statistical methods, and their applications. For applications
The equivalence of logistic regression and maximum entropy models
The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/
9.2 Summation Notation
9. Summation Notation 66 9. Summation Notation In the previous section, we introduced sequences and now we shall present notation and theorems concerning the sum of terms of a sequence. We begin with a
Introduction to Detection Theory
Introduction to Detection Theory Reading: Ch. 3 in Kay-II. Notes by Prof. Don Johnson on detection theory, see http://www.ece.rice.edu/~dhj/courses/elec531/notes5.pdf. Ch. 10 in Wasserman. EE 527, Detection
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
4: SINGLE-PERIOD MARKET MODELS
4: SINGLE-PERIOD MARKET MODELS Ben Goldys and Marek Rutkowski School of Mathematics and Statistics University of Sydney Semester 2, 2015 B. Goldys and M. Rutkowski (USydney) Slides 4: Single-Period Market
Simple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
MAS2317/3317. Introduction to Bayesian Statistics. More revision material
MAS2317/3317 Introduction to Bayesian Statistics More revision material Dr. Lee Fawcett, 2014 2015 1 Section A style questions 1. Describe briefly the frequency, classical and Bayesian interpretations
THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING
THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING 1. Introduction The Black-Scholes theory, which is the main subject of this course and its sequel, is based on the Efficient Market Hypothesis, that arbitrages
t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).
1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction
4.5 Linear Dependence and Linear Independence
4.5 Linear Dependence and Linear Independence 267 32. {v 1, v 2 }, where v 1, v 2 are collinear vectors in R 3. 33. Prove that if S and S are subsets of a vector space V such that S is a subset of S, then
The Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
Quotient Rings and Field Extensions
Chapter 5 Quotient Rings and Field Extensions In this chapter we describe a method for producing field extension of a given field. If F is a field, then a field extension is a field K that contains F.
5. Continuous Random Variables
5. Continuous Random Variables Continuous random variables can take any value in an interval. They are used to model physical characteristics such as time, length, position, etc. Examples (i) Let X be
Introduction to Probability
Introduction to Probability EE 179, Lecture 15, Handout #24 Probability theory gives a mathematical characterization for experiments with random outcomes. coin toss life of lightbulb binary data sequence
The sample space for a pair of die rolls is the set. The sample space for a random number between 0 and 1 is the interval [0, 1].
Probability Theory Probability Spaces and Events Consider a random experiment with several possible outcomes. For example, we might roll a pair of dice, flip a coin three times, or choose a random real
1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let
Copyright c 2009 by Karl Sigman 1 Stopping Times 1.1 Stopping Times: Definition Given a stochastic process X = {X n : n 0}, a random time τ is a discrete random variable on the same probability space as
Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes
Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes Yong Bao a, Aman Ullah b, Yun Wang c, and Jun Yu d a Purdue University, IN, USA b University of California, Riverside, CA, USA
Chapter 4, Arithmetic in F [x] Polynomial arithmetic and the division algorithm.
Chapter 4, Arithmetic in F [x] Polynomial arithmetic and the division algorithm. We begin by defining the ring of polynomials with coefficients in a ring R. After some preliminary results, we specialize
Big Data - Lecture 1 Optimization reminders
Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics
1 Teaching notes on GMM 1.
Bent E. Sørensen January 23, 2007 1 Teaching notes on GMM 1. Generalized Method of Moment (GMM) estimation is one of two developments in econometrics in the 80ies that revolutionized empirical work in
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
DRAFT. Algebra 1 EOC Item Specifications
DRAFT Algebra 1 EOC Item Specifications The draft Florida Standards Assessment (FSA) Test Item Specifications (Specifications) are based upon the Florida Standards and the Florida Course Descriptions as
