18 Generalised Linear Models


Generalised linear models (GLMs) are a generalisation of ordinary least squares regression. See also Davison, Section , Green (1984) and Dobson and Barnett (2008). To motivate the GLM approach let us briefly review linear models.

18.1 An overview of linear models

Let us consider the two competing nested linear models

Restricted model: $Y_t = \beta_0 + \sum_{j=1}^{q}\beta_j x_{tj} + \varepsilon_t$

Full model: $Y_t = \beta_0 + \sum_{j=1}^{q}\beta_j x_{tj} + \sum_{j=q+1}^{p}\beta_j x_{tj} + \varepsilon_t \qquad (87)$

where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$. Let us suppose that we observe $\{(Y_t, x_t)\}_{t=1}^{T}$, where $\{Y_t\}$ are normal. The classical method for testing $H_0$: restricted model against $H_A$: full model is the F-test (ANOVA). That is, let $S_R^2$ be the residual sum of squares under the null and $S_F^2$ the residual sum of squares under the alternative,

$S_F^2 = \sum_{t=1}^{T}\Big(Y_t - \sum_{j=1}^{p}\hat\beta_j^F x_{tj}\Big)^2, \qquad S_R^2 = \sum_{t=1}^{T}\Big(Y_t - \sum_{j=1}^{q}\hat\beta_j^R x_{tj}\Big)^2,$

and let $\hat\sigma_F^2 = S_F^2/(T-p)$. Then the F-statistic is

$F = \frac{(S_R^2 - S_F^2)/(p-q)}{\hat\sigma_F^2},$

and under the null $F \sim F_{p-q,\,T-p}$. Moreover, if the sample size is large, $(p-q)F \approx \chi^2_{p-q}$. We recall that the residuals of the full model are

$r_t = Y_t - \hat\beta_0 - \sum_{j=1}^{q}\hat\beta_j x_{tj} - \sum_{j=q+1}^{p}\hat\beta_j x_{tj},$

and the residual sum of squares $S_F^2$ is used to measure how well the linear model fits the data (see Chapter 8 and STAT612 notes).

The F-test and ANOVA are designed specifically for linear models. In this section the aim is to generalise model specification, estimation, testing and residuals to a larger class of models. To generalise we will be using a log-likelihood framework. To see how this fits in with linear regression, let us now see how ANOVA and the log-likelihood ratio test are related. Suppose that $\sigma^2$ is known; then the log-likelihood ratio statistic for the above hypothesis is

$\frac{1}{\sigma^2}\big(S_R^2 - S_F^2\big) \sim \chi^2_{p-q},$

where we note that since $\{\varepsilon_t\}$ is Gaussian, this is the exact distribution and not an asymptotic result. In the case that $\sigma^2$ is unknown and has to be replaced by its estimator $\hat\sigma_F^2$, we can either use the approximation

$\frac{1}{\hat\sigma_F^2}\big(S_R^2 - S_F^2\big) \approx \chi^2_{p-q}$

or the exact distribution

$\frac{(S_R^2 - S_F^2)/(p-q)}{\hat\sigma_F^2} \sim F_{p-q,\,T-p},$

which returns us to the F-statistic. On the other hand, if the variance $\sigma^2$ is unknown we can go the log-likelihood ratio route. The log-likelihood ratio test reduces to

$T\log\frac{S_R^2}{S_F^2} = T\log\Big(1 + \frac{S_R^2 - S_F^2}{S_F^2}\Big) \approx \chi^2_{p-q}.$

We recall that by using the expansion $\log(1+x) = x + O(x^2)$ we obtain

$\log\frac{S_R^2}{S_F^2} = \log\Big(1 + \frac{S_R^2 - S_F^2}{S_F^2}\Big) = \frac{S_R^2 - S_F^2}{S_F^2} + o_p(1).$

Now, dividing by $(p-q)$ and multiplying by $(T-p)$, we have

$\frac{(T-p)}{(p-q)}\log\frac{S_R^2}{S_F^2} = \frac{(T-p)}{(p-q)}\log\Big(1 + \frac{S_R^2 - S_F^2}{S_F^2}\Big) = \frac{(S_R^2 - S_F^2)/(p-q)}{\hat\sigma_F^2} + o_p(1) = F + o_p(1).$

Hence we have transformed the log-likelihood ratio test into the F-test, which we discussed at the start of this section. The ANOVA and log-likelihood methods are asymptotically equivalent. In the case that $\{\varepsilon_t\}$ are non-Gaussian, but the model is linear with iid errors, the above results also hold. However, in the case that the regressors have a nonlinear influence on the response and/or the response is not normal, we need to take an alternative approach. Throughout this section we will encounter such models. We will start by focussing on the following two problems:

(i) How to model the relationship between the response and the regressors when the response is non-Gaussian and the model is nonlinear.

(ii) How to generalise ANOVA to nonlinear models.

18.2 Motivation

Let us suppose $\{Y_t\}$ are independent random variables where it is believed that the regressors $x_t$ ($x_t$ is a $p$-dimensional vector) have an influence on $\{Y_t\}$. Let us suppose that $Y_t$ is a binary random variable taking either zero or one, with $E(Y_t) = P(Y_t = 1) = \pi_t$. How do we model the relationship between $Y_t$ and $x_t$? A simple approach is to use a linear model, i.e. let $E(Y_t) = \beta' x_t$. But a major problem with this approach is that $E(Y_t)$ is a probability, and for many values of $\beta$, $\beta' x_t$ will lie outside the unit interval, hence a linear model is not meaningful. However, we can make a nonlinear transformation which maps the linear predictor into the unit interval. For example, let

$E(Y_t) = \pi_t = \frac{\exp(\beta' x_t)}{1 + \exp(\beta' x_t)};$

this transformation lies between zero and one. Hence we could just use nonlinear regression to estimate the parameters. That is, rewrite the model as $Y_t = \frac{\exp(\beta' x_t)}{1+\exp(\beta' x_t)} + \varepsilon_t$, and use the estimator

$\hat\beta_T = \arg\min_\beta \sum_{t=1}^{T}\Big(Y_t - \frac{\exp(\beta' x_t)}{1+\exp(\beta' x_t)}\Big)^2$

as an estimator of $\beta$. However, this approach also has its drawbacks. One obvious drawback is that the variance $\mathrm{var}(\varepsilon_t) = \mathrm{var}(Y_t) = \pi_t(1-\pi_t)$; hence the variance depends on the parameters. Hence, alternative methods are required.

The GLM approach is a general framework for a wide class of distributions. We recall that in Section 3 we considered maximum likelihood estimation for iid random variables which come from the natural exponential family. Distributions in this family include the normal, binary,

binomial and Poisson, amongst others. We recall that the natural exponential family has the form

$f(y; \theta) = \exp\big(y\theta - \kappa(\theta) + c(y)\big),$

where $\kappa(\theta) = b(\eta^{-1}(\theta))$. To be a little more general we will suppose that the distribution can be written as

$f(y; \theta, \phi) = \exp\Big(\frac{y\theta - \kappa(\theta)}{\phi} + c(y, \phi)\Big), \qquad (88)$

where $\phi$ is a nuisance parameter (called the dispersion parameter; it plays the role of the variance in linear models) and $\theta$ is the parameter of interest. We recall that examples of exponential models include:

(i) The exponential distribution is already in natural exponential form with $\theta = -\lambda$ and $\phi = 1$. The log density is $\log f(y; \theta) = -\lambda y + \log\lambda$.

(ii) For the binomial distribution we let $\theta = \log(\frac{\pi}{1-\pi})$ and $\phi = 1$; since $\log(\frac{\pi}{1-\pi})$ is invertible this gives

$\log f(y; \theta) = y\theta - n\log\big(1 + \exp(\theta)\big) + \log\binom{n}{y}.$

(iii) For the normal distribution we have that

$\log f(y; \mu, \sigma^2) = -\frac{(y-\mu)^2}{2\sigma^2} - \log\sigma - \frac{1}{2}\log(2\pi) = \frac{\mu y - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2).$

Hence $\phi = \sigma^2$, $\theta = \mu$, $\kappa(\theta) = \theta^2/2$ and $c(y, \phi) = -y^2/(2\phi) - \frac{1}{2}\log(2\pi\phi)$.

(iv) The Poisson log density can be written as

$\log f(y; \mu) = y\log\mu - \mu - \log y!$

Hence $\theta = \log\mu$, $\kappa(\theta) = \exp(\theta)$ and $c(y) = -\log y!$.

(v) Other members of this family include the Gamma, Beta, Multinomial and inverse Gaussian.

Remark 18.1 (Moments) Using Lemma 3.1 (see Section 3) we have $E(Y) = \kappa'(\theta)$ and $\mathrm{var}(Y) = \phi\,\kappa''(\theta)$.

GLM is a method which generalises the methods of linear models to the exponential family (recall that the normal model is a subclass of the exponential family). In the GLM setting it is usually assumed that the response variables $\{Y_t\}$ have a density which comes from the natural exponential family, with log density

$\log f(y_t; \theta_t) = \frac{y_t\theta_t - \kappa(\theta_t)}{\phi} + c(y_t, \phi),$

where the parameter $\theta_t$ depends on the regressors. The regressors influence the response through a linear predictor $\eta_t = \beta' x_t$ and a link function, which connects $\beta' x_t$ to the mean $E(Y_t) = \mu(\theta_t) = \kappa'(\theta_t)$. More precisely, there is a monotonic link function $g(\cdot)$, which is defined such that $g(\mu(\theta_t)) = \eta_t = \beta' x_t$. To summarise:

$\kappa'(\theta_t) = g^{-1}(\eta_t) = \mu(\theta_t), \qquad \theta_t = \mu^{-1}(g^{-1}(\eta_t)) =: \theta(\eta_t), \qquad \mathrm{var}(Y_t) = \phi\,\frac{d\mu(\theta_t)}{d\theta_t}. \qquad (89)$

The log-likelihood function of $\{Y_t\}$ is

$L_T(\beta) = \sum_{t=1}^{T}\Big(\frac{Y_t\theta(\eta_t) - \kappa(\theta(\eta_t))}{\phi} + c(Y_t, \phi)\Big).$

The choice of link function is rather subjective. One of the most popular is the canonical link, which we define below.

Definition 18.1 (The canonical link function) The canonical link function is obtained by letting $\eta_t = \theta_t$. Hence $\mu_t = \kappa'(\eta_t)$ and the link satisfies $g(\kappa'(\theta_t)) = g(\kappa'(\eta_t)) = \eta_t$, i.e. $g = (\kappa')^{-1}$.

The canonical link is usually used because it makes the calculations simple. We observe that with the canonical link the log-likelihood of $\{Y_t\}$ is

$L_T(\beta) = \sum_{t=1}^{T}\Big(\frac{Y_t\beta' x_t - \kappa(\beta' x_t)}{\phi} + c(Y_t, \phi)\Big),$

which we will show is relatively simple to maximise.

Example 18.1 (The log-likelihood and canonical link function) (i) The canonical link for the exponential $f(y_t; \lambda_t) = \lambda_t\exp(-\lambda_t y_t)$ is $\theta_t = -\lambda_t = \beta' x_t$, hence $\lambda_t = -\beta' x_t$. The log-likelihood is

$\sum_{t=1}^{T}\big(Y_t\beta' x_t + \log(-\beta' x_t)\big).$

(ii) The canonical link for the binomial is $\theta_t = \beta' x_t = \log(\frac{\pi_t}{1-\pi_t})$, hence $\pi_t = \frac{\exp(\beta' x_t)}{1+\exp(\beta' x_t)}$. The log-likelihood is

$\sum_{t=1}^{T}\Big(Y_t\beta' x_t - n_t\log\big(1 + \exp(\beta' x_t)\big) + \log\binom{n_t}{Y_t}\Big).$

(iii) The canonical link for the normal is $\theta_t = \beta' x_t = \mu_t$. The log-likelihood is

$-\sum_{t=1}^{T}\frac{(Y_t - \beta' x_t)^2}{2\sigma^2} - T\log\sigma - \frac{T}{2}\log(2\pi),$

which is the usual least squares criterion. If the canonical link were not used, we would be in the nonlinear least squares setting; that is, the log-likelihood is

$-\sum_{t=1}^{T}\frac{(Y_t - g^{-1}(\beta' x_t))^2}{2\sigma^2} - T\log\sigma - \frac{T}{2}\log(2\pi).$

(iv) The canonical link for the Poisson is $\theta_t = \beta' x_t = \log\lambda_t$, hence $\lambda_t = \exp(\beta' x_t)$. The log-likelihood is

$\sum_{t=1}^{T}\big(Y_t\beta' x_t - \exp(\beta' x_t) - \log Y_t!\big).$

However, as mentioned above, the canonical link is simply used for its mathematical simplicity. There exist other links, which can often be more suitable.

Remark 18.2 (Link functions for the Binomial) We recall that the link function is defined as a monotonic function $g$, where $\eta_t = \beta' x_t = g(\mu_t)$. The choice of link is up to you. A popular choice is to let $g^{-1}$ be a well-known distribution function. The motivation for this is that for the binomial distribution $\mu_t = n_t\pi_t$ (where $\pi_t$ is the probability of a "success"). Clearly $0 \le \pi_t \le 1$, hence using a distribution function (or survival function) for $g^{-1}$ makes sense. Examples include:

(i) The logistic link; this is the canonical link function, where $\beta' x_t = g(\mu_t) = \log(\frac{\pi_t}{1-\pi_t}) = \log(\frac{\mu_t}{n_t - \mu_t})$.

(ii) The probit link, where $\pi_t = \Phi(\beta' x_t)$, $\Phi$ is the standard normal distribution function, and the link function is $\beta' x_t = g(\mu_t) = \Phi^{-1}(\mu_t/n_t)$.

(iii) The extreme value link, where the distribution function is $F(x) = 1 - \exp(-\exp(x))$. Hence in this case the link function is $\beta' x_t = g(\mu_t) = \log(-\log(1 - \mu_t/n_t))$.
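The three binomial links above are easy to write down directly. The following sketch (ours, not from the notes; it uses `math.erf` for the normal distribution function and a bisection of our own devising for $\Phi^{-1}$, so no external libraries are needed) checks that each link $g$ and its inverse mean function $g^{-1}$ are mutual inverses on $(0,1)$, taking $n_t = 1$ for simplicity.

```python
import math

def logit(p):            # canonical link: eta = log(p/(1-p))
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    return math.exp(eta) / (1.0 + math.exp(eta))

def probit(p):           # eta = Phi^{-1}(p), via bisection on Phi
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(100):                 # Phi is monotone, so bisection works
        mid = 0.5 * (lo + hi)
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def inv_probit(eta):
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def cloglog(p):          # extreme-value link: eta = log(-log(1-p))
    return math.log(-math.log(1.0 - p))

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

# Each inverse mean function maps the real line into (0, 1):
for g, ginv in [(logit, inv_logit), (probit, inv_probit), (cloglog, inv_cloglog)]:
    for p in (0.1, 0.5, 0.9):
        assert abs(ginv(g(p)) - p) < 1e-6
```

All three inverse links are increasing, but they differ in their tails: the complementary log-log link is asymmetric, which is why it arises from the extreme value distribution.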

18.3 Estimating the parameters in a GLM

The score function for GLM

The score function for generalised linear models has a very interesting form, which we will now derive. From now on, we will suppose that $\phi_t = \phi$ for all $t$, and that $\phi$ is known (though even in the case that it is unknown, it will not change the estimation scheme used). Much of the theory remains true without this restriction, but this makes the derivations a bit cleaner, and is enough for all the models we will encounter.

With this substitution, recall that the log-likelihood is

$L_T(\beta) = \sum_{t=1}^{T}\Big(\frac{Y_t\theta_t - \kappa(\theta_t)}{\phi} + c(Y_t, \phi)\Big) = \sum_{t=1}^{T}\ell_t(\beta),$

where

$\ell_t(\beta) = \frac{Y_t\theta_t - \kappa(\theta_t)}{\phi} + c(Y_t, \phi)$

and $\theta_t$ is a function of $\beta$ through the relationship $g(\kappa'(\theta_t)) = \eta_t = \beta' x_t$. For the MLE of $\beta$, we need to solve the likelihood equations

$\frac{\partial L_T}{\partial\beta_j} = \sum_{t=1}^{T}\frac{\partial\ell_t}{\partial\beta_j} = 0 \qquad \text{for all } j = 1, \ldots, p.$

To obtain an interesting expression for the above, recall that $\mathrm{var}(Y_t) = \phi\,\kappa''(\theta_t)$ and $\eta_t = g(\mu_t)$, and let $\kappa''(\theta_t) = V(\mu_t)$. Since $V(\mu_t) = \frac{d\mu_t}{d\theta_t}$, inverting the derivative we have $\frac{d\theta_t}{d\mu_t} = 1/V(\mu_t)$. Furthermore, since $\frac{d\eta_t}{d\mu_t} = g'(\mu_t)$, inverting the derivative we have $\frac{d\mu_t}{d\eta_t} = 1/g'(\mu_t)$. By the chain rule for differentiation, and using the above, we have

$\frac{\partial\ell_t}{\partial\beta_j} = \frac{d\ell_t}{d\eta_t}\frac{\partial\eta_t}{\partial\beta_j} = \frac{d\ell_t}{d\theta_t}\frac{d\theta_t}{d\mu_t}\frac{d\mu_t}{d\eta_t}\frac{\partial\eta_t}{\partial\beta_j} = \frac{Y_t - \kappa'(\theta_t)}{\phi}\cdot\frac{1}{V(\mu_t)}\cdot\frac{1}{g'(\mu_t)}\cdot x_{tj} = \frac{(Y_t - \mu_t)\,x_{tj}}{\phi\,V(\mu_t)\,g'(\mu_t)}. \qquad (90)$

Thus the likelihood equations we have to solve for the MLE of $\beta$ are

$\sum_{t=1}^{T}\frac{(Y_t - \mu_t)\,x_{tj}}{V(\mu_t)\,g'(\mu_t)} = \sum_{t=1}^{T}\frac{(Y_t - g^{-1}(\beta' x_t))\,x_{tj}}{V(g^{-1}(\beta' x_t))\,g'(\mu_t)} = 0, \qquad 1 \le j \le p, \qquad (91)$

(since $\mu_t = g^{-1}(\beta' x_t)$). We observe that the above set of equations can be considered as estimating functions (see Section 17), since the expectation of each equation is equal to zero. Moreover, it has a very similar structure to the normal equations arising in ordinary least squares.

Example 18.2 (i) Normal $\{Y_t\}$ with mean $\mu_t = \beta' x_t$. Here, we have $g(\mu_t) = \mu_t = \beta' x_t$, so $g'(\mu_t) = \frac{dg(\mu_t)}{d\mu_t} = 1$; also $V(\mu_t) = 1$ and $\phi = \sigma^2$, so the equations become

$\frac{1}{\sigma^2}\sum_{t=1}^{T}(Y_t - \beta' x_t)x_{tj} = 0.$

Ignoring the factor $\sigma^2$, the left-hand side is the $j$th element of the vector $X^T(Y - X\beta)$, so the equations reduce to the normal equations of least squares: $X^T(Y - X\beta) = 0$, or equivalently $X^T X\beta = X^T Y$.

(ii) Poisson $\{Y_t\}$ with log link, hence $\log\mu_t = \beta' x_t$ (so $g(\mu_t) = \log\mu_t$). This time, $g'(\mu_t) = 1/\mu_t$, $\mathrm{var}(Y_t) = V(\mu_t) = \mu_t$ and $\phi = 1$. Substituting $\mu_t = \exp(\beta' x_t)$, the equations become

$\sum_{t=1}^{T}\big(Y_t - e^{\beta' x_t}\big)x_{tj} = 0,$

which are not linear in $\beta$, and no explicit solution is possible in general.

The Newton-Raphson scheme

See also Section , and Green (1984). It is clear from the examples above that usually there does not exist a simple solution for the likelihood estimator of $\beta$. However, we can use the Newton-Raphson scheme to estimate $\beta$. We will derive an interesting expression for the iterative scheme. Other than the expression being useful for implementation, it also highlights the estimator's connection to weighted least squares.
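Before turning to Newton-Raphson, it may help to see the object it works on. The sketch below (ours; the data and function name are made up for illustration) evaluates the Poisson score $\sum_t (Y_t - e^{\beta' x_t})x_t$ from Example 18.2(ii), the vector whose root the iterative scheme will seek.

```python
import math

# Made-up design matrix (intercept + one regressor) and Poisson counts.
X = [[1.0, 0.0],
     [1.0, 1.0]]
Y = [1.0, 2.0]

def poisson_score(beta):
    """Score vector sum_t (Y_t - exp(beta'x_t)) x_t for the Poisson log link."""
    score = [0.0, 0.0]
    for xt, yt in zip(X, Y):
        mu = math.exp(sum(b * x for b, x in zip(beta, xt)))
        for j in range(2):
            score[j] += (yt - mu) * xt[j]
    return score

# At beta = (0, 0) every mu_t = 1, so the score is nonzero:
print(poisson_score([0.0, 0.0]))  # -> [1.0, 1.0]
```

At the MLE (here $\beta_0 = 0$, $\beta_1 = \log 2$, since the two-observation model is saturated) the score vanishes, which is exactly the condition (91).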

We recall that the Newton-Raphson scheme consists of the iteration

$\beta^{(i+1)} = \beta^{(i)} - (H^{(i)})^{-1}u^{(i)},$

where the $p \times 1$ gradient vector $u^{(i)}$ is

$u^{(i)} = \Big(\frac{\partial L_T}{\partial\beta_1}, \ldots, \frac{\partial L_T}{\partial\beta_p}\Big)'\Big|_{\beta = \beta^{(i)}}$

and the $p \times p$ Hessian matrix $H^{(i)}$ is given by

$H^{(i)}_{jk} = \frac{\partial^2 L_T(\beta)}{\partial\beta_j\partial\beta_k}\Big|_{\beta = \beta^{(i)}} \qquad \text{for } 1 \le j, k \le p,$

both $u^{(i)}$ and $H^{(i)}$ being evaluated at the current estimate $\beta^{(i)}$. By using (90), the score function at the $i$th iteration is

$u^{(i)}_j = \frac{\partial L_T(\beta)}{\partial\beta_j}\Big|_{\beta = \beta^{(i)}} = \sum_{t=1}^{T}\frac{\partial\ell_t}{\partial\beta_j}\Big|_{\beta = \beta^{(i)}} = \sum_{t=1}^{T}\frac{d\ell_t}{d\eta_t}\Big|_{\beta = \beta^{(i)}}x_{tj}.$

The Hessian at the $i$th iteration is

$H^{(i)}_{jk} = \frac{\partial^2 L_T(\beta)}{\partial\beta_j\partial\beta_k}\Big|_{\beta = \beta^{(i)}} = \sum_{t=1}^{T}\frac{d^2\ell_t}{d\eta_t^2}\Big|_{\beta = \beta^{(i)}}x_{tj}x_{tk}. \qquad (92)$

Thus if we write $s^{(i)}$ for the $T \times 1$ vector with elements

$s^{(i)}_t = \frac{d\ell_t}{d\eta_t}\Big|_{\beta = \beta^{(i)}}$

and $W^{(i)}$ for the diagonal $T \times T$ matrix with

$W^{(i)}_{tt} = -\frac{d^2\ell_t}{d\eta_t^2}\Big|_{\beta = \beta^{(i)}},$

we have $u^{(i)} = X^T s^{(i)}$ and $H^{(i)} = -X^T W^{(i)}X$, and the Newton-Raphson iteration is

$\beta^{(i+1)} = \beta^{(i)} - (H^{(i)})^{-1}u^{(i)} = \beta^{(i)} + (X^T W^{(i)}X)^{-1}X^T s^{(i)}.$

Fisher scoring for GLMs

Typically, partly for reasons of tradition, we use a modification of this in fitting statistical models. The matrix $W^{(i)}$ is replaced by $\bar W^{(i)}$, another diagonal $T \times T$ matrix with

$\bar W^{(i)}_{tt} = E\big(W^{(i)}_{tt}\,\big|\,\beta^{(i)}\big) = -E\Big(\frac{d^2\ell_t}{d\eta_t^2}\,\Big|\,\beta^{(i)}\Big).$

One reason for taking the expectation is that $\bar W^{(i)}_{tt}$ is sure to have non-negative (typically positive) diagonal entries, since (see Section 1)

$-E\Big(\frac{d^2\ell_t}{d\eta_t^2}\,\Big|\,\beta^{(i)}\Big) = \mathrm{var}\Big(\frac{d\ell_t}{d\eta_t}\,\Big|\,\beta^{(i)}\Big),$

so that $\bar W^{(i)} = \mathrm{var}(s^{(i)}\,|\,\beta^{(i)})$, and the matrix is non-negative definite. Using the Fisher score function the iteration becomes

$\beta^{(i+1)} = \beta^{(i)} + (X^T\bar W^{(i)}X)^{-1}X^T s^{(i)}.$

Iteratively reweighted least squares

The iteration

$\beta^{(i+1)} = \beta^{(i)} + (X^T\bar W^{(i)}X)^{-1}X^T s^{(i)} \qquad (93)$

is formally similar to the familiar solution for least squares estimates in linear models, $\hat\beta = (X^T X)^{-1}X^T Y$, or more particularly the related weighted least squares estimate, $\hat\beta = (X^T WX)^{-1}X^T WY$. In fact, (93) can be rearranged to have exactly this form. Algebraic manipulation gives

$\beta^{(i)} = (X^T\bar W^{(i)}X)^{-1}X^T\bar W^{(i)}X\beta^{(i)}$

and

$(X^T\bar W^{(i)}X)^{-1}X^T s^{(i)} = (X^T\bar W^{(i)}X)^{-1}X^T\bar W^{(i)}(\bar W^{(i)})^{-1}s^{(i)}.$

Therefore, substituting the above into (93) gives

$\beta^{(i+1)} = (X^T\bar W^{(i)}X)^{-1}X^T\bar W^{(i)}\big(X\beta^{(i)} + (\bar W^{(i)})^{-1}s^{(i)}\big) =: (X^T\bar W^{(i)}X)^{-1}X^T\bar W^{(i)}Z^{(i)},$

where $Z^{(i)} = X\beta^{(i)} + (\bar W^{(i)})^{-1}s^{(i)}$.
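The rearranged update can be implemented in a few lines. The sketch below (ours, assuming NumPy is available; it is not a definitive implementation) specialises to Poisson regression with the canonical log link, where $\phi = 1$, $s_t = Y_t - \mu_t$ and $\bar W_{tt} = \mu_t$.

```python
import numpy as np

def irls_poisson(X, Y, n_iter=25):
    """Fisher scoring / IRLS for Poisson regression with the canonical log link.

    For this model s_t = Y_t - mu_t and Wbar_tt = mu_t, so the working
    response is z_t = eta_t + (Y_t - mu_t)/mu_t, and each step solves the
    weighted least squares system (X' Wbar X) beta = X' Wbar z.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)                 # inverse link
        z = eta + (Y - mu) / mu          # working response Z^{(i)}
        W = mu                           # diagonal of Wbar^{(i)}
        XtW = X.T * W
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# Intercept-only model: the fixed point is log of the sample mean.
Y = np.array([1.0, 2.0, 3.0])
X = np.ones((3, 1))
beta_hat = irls_poisson(X, Y)
```

For the canonical link, Fisher scoring coincides with Newton-Raphson, so convergence is quadratic near the solution; in the intercept-only example the iteration settles at $\hat\beta_0 = \log\bar Y$.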

The rearranged iteration is of interest because it has the form of weighted least squares. More precisely, it has the form of a weighted least squares regression of $Z^{(i)}$ on $X$ with the diagonal weight matrix $\bar W^{(i)}$. That is, letting $z^{(i)}_t$ denote the $t$th element of the vector $Z^{(i)}$, $\beta^{(i+1)}$ minimises the following weighted least squares criterion:

$\sum_{t=1}^{T}\bar W^{(i)}_{tt}\big(z^{(i)}_t - \beta' x_t\big)^2.$

Of course, in reality $\bar W^{(i)}_{tt}$ and $z^{(i)}_t$ are functions of $\beta^{(i)}$, hence the above is often called iteratively reweighted least squares.

Estimating the dispersion parameter

Recall that in the linear model case, the variance $\sigma^2$ did not affect the estimation of $\beta$. In the general GLM case, continuing to assume that $\phi_t = \phi$, we have

$s_t = \frac{d\ell_t}{d\eta_t} = \frac{d\ell_t}{d\theta_t}\frac{d\theta_t}{d\mu_t}\frac{d\mu_t}{d\eta_t} = \frac{Y_t - \mu_t}{\phi\,V(\mu_t)\,g'(\mu_t)}$

and

$\bar W_{tt} = \mathrm{var}(s_t) = \frac{\mathrm{var}(Y_t)}{\{\phi\,V(\mu_t)\,g'(\mu_t)\}^2} = \frac{\phi\,V(\mu_t)}{\{\phi\,V(\mu_t)\,g'(\mu_t)\}^2} = \frac{1}{\phi\,V(\mu_t)\,(g'(\mu_t))^2},$

so that $1/\phi$ appears as a scale factor in $\bar W$ and $s$, but otherwise does not appear in the estimating equations or iteration for $\beta$. Hence $\phi$ does not play a role in the estimation of $\beta$. As in the normal/linear case, (a) we are less interested in $\phi$, and (b) we see that $\phi$ can be estimated separately from $\beta$. Recall that $\mathrm{var}(Y_t) = \phi V(\mu_t)$, thus $E\big((Y_t - \mu_t)^2\big) = \phi V(\mu_t)$. We can use this to suggest a simple estimator of $\phi$:

$\hat\phi = \frac{1}{T-p}\sum_{t=1}^{T}\frac{(Y_t - \hat\mu_t)^2}{V(\hat\mu_t)} = \frac{1}{T-p}\sum_{t=1}^{T}\frac{(Y_t - \hat\mu_t)^2}{\kappa''(\hat\theta_t)},$

where $\hat\mu_t = g^{-1}(\hat\beta' x_t)$ and $\hat\theta_t = \mu^{-1}(g^{-1}(\hat\beta' x_t))$. Recall that the above resembles estimators of the residual variance. Indeed, it can be argued that the distribution of $(T-p)\hat\phi/\phi$ is close to $\chi^2_{T-p}$.
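Once the fitted means and the variance function are in hand, the estimator $\hat\phi$ is a one-liner. A sketch of ours (the function name and toy numbers are purely illustrative):

```python
def dispersion_estimate(Y, mu_hat, V, p):
    """Pearson-type estimator: phi_hat = sum_t (Y_t - mu_t)^2 / V(mu_t) / (T - p)."""
    T = len(Y)
    return sum((y - m) ** 2 / V(m) for y, m in zip(Y, mu_hat)) / (T - p)

# Toy example with variance function V(mu) = mu (Poisson-type) and p = 1 parameter:
phi_hat = dispersion_estimate([1.0, 3.0], [2.0, 2.0], lambda m: m, 1)
print(phi_hat)  # (0.5 + 0.5) / (2 - 1) = 1.0
```

A value of $\hat\phi$ well above 1 for a Poisson or binomial fit is the usual symptom of overdispersion, taken up in the next section.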

Remark 18.3 We mention that a slight generalisation of the estimator $\hat\phi$ above is when the dispersion parameter satisfies $\phi_t = \phi a_t$, where $a_t$ is known. In this case, an estimator of $\phi$ is

$\hat\phi = \frac{1}{T-p}\sum_{t=1}^{T}\frac{(Y_t - \hat\mu_t)^2}{a_t V(\hat\mu_t)} = \frac{1}{T-p}\sum_{t=1}^{T}\frac{(Y_t - \hat\mu_t)^2}{a_t\,\kappa''(\hat\theta_t)}.$

18.4 Deviance, scaled deviance and residual deviance

The scaled deviance

Instead of minimising the sum of squares (which is done for linear models) we have been maximising a log-likelihood $L_T(\beta)$. Furthermore, we recall that

$S(\hat\beta) = \sum_{t=1}^{T}r_t^2 = \sum_{t=1}^{T}\Big(Y_t - \hat\beta_0 - \sum_{j=1}^{q}\hat\beta_j x_{tj} - \sum_{j=q+1}^{p}\hat\beta_j x_{tj}\Big)^2$

is a numerical summary of how well the linear model fitted; $S(\hat\beta) = 0$ means a perfect fit. In this section we will define the equivalent of residuals and squared residuals for GLMs.

What is the best we can do in fitting a GLM? Recall $\ell_t = \frac{Y_t\theta_t - \kappa(\theta_t)}{\phi} + c(Y_t, \phi)$, so

$\frac{d\ell_t}{d\theta_t} = 0 \iff \frac{Y_t - \kappa'(\theta_t)}{\phi} = 0.$

A model that achieves this equality for all $t$ is called saturated (the same terminology is used for linear models). In other words, we need one free parameter for each observation (the alternative, $g(\kappa'(\theta_t)) = \beta' x_t$, is a restriction). Denote the corresponding $\theta_t$ by $\tilde\theta_t$, i.e. the solution of $\kappa'(\tilde\theta_t) = Y_t$. Now consider the difference

$2\{\ell_t(\tilde\theta_t) - \ell_t(\theta_t)\} = \frac{2}{\phi}\{Y_t(\tilde\theta_t - \theta_t) - \kappa(\tilde\theta_t) + \kappa(\theta_t)\}.$

Maximising the likelihood is the same as minimising this quantity, which is always non-negative, and is 0 only if there is a perfect fit to the $t$th observation. This is analogous to linear models, where maximising the normal likelihood is the same as minimising the least squares criterion (which is zero when the fit is perfect). Thus $L_T(\tilde\theta) = \sum_{t=1}^{T}\ell_t(\tilde\theta_t)$ provides a baseline value for the log-likelihood.
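To make the quantity $2\{\ell_t(\tilde\theta_t) - \ell_t(\theta_t)\}$ concrete: for the Poisson with $\phi = 1$ we have $\kappa(\theta) = e^\theta$ and $\tilde\theta_t = \log Y_t$, so it equals $2\{Y_t\log(Y_t/\mu_t) - (Y_t - \mu_t)\}$. The sketch below (ours; it also computes the Pearson residual $(Y_t - \mu_t)/\sqrt{V(\mu_t)}$ for comparison) evaluates this for single observations.

```python
import math

def poisson_unit_deviance(y, mu):
    """2{l_t(theta_tilde) - l_t(theta)} for the Poisson (y > 0):
    2{y log(y/mu) - (y - mu)}."""
    return 2.0 * (y * math.log(y / mu) - (y - mu))

def poisson_pearson(y, mu):
    """(y - mu)/sqrt(V(mu)) with V(mu) = mu for the Poisson."""
    return (y - mu) / math.sqrt(mu)

# A perfect fit (mu = y) gives a zero contribution:
print(poisson_unit_deviance(2.0, 2.0))  # -> 0.0

# Otherwise the contribution is strictly positive, whichever side mu is on:
d_under = poisson_unit_deviance(1.0, 2.0)
d_over = poisson_unit_deviance(3.0, 2.0)
```

The signed square root of this unit deviance (with the sign of $Y_t - \mu_t$) is the deviance residual defined later in this section.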

Example 18.3 (The normal linear model) $\kappa(\theta_t) = \frac{1}{2}\theta_t^2$, $\kappa'(\theta_t) = \theta_t = \mu_t$, $\tilde\theta_t = Y_t$ and $\phi = \sigma^2$, so

$2\{\ell_t(\tilde\theta_t) - \ell_t(\theta_t)\} = \frac{2}{\sigma^2}\big\{Y_t(Y_t - \mu_t) - \tfrac{1}{2}Y_t^2 + \tfrac{1}{2}\mu_t^2\big\} = (Y_t - \mu_t)^2/\sigma^2.$

Hence $2\{\ell_t(\tilde\theta_t) - \ell_t(\theta_t)\}$ reduces to the classical squared residual in the normal case.

In general, let

$D_t = 2\{Y_t(\tilde\theta_t - \hat\theta_t) - \kappa(\tilde\theta_t) + \kappa(\hat\theta_t)\}.$

We call $D = \sum_{t=1}^{T}D_t$ the deviance of the model. If $\phi$ is present,

$\frac{D}{\phi} = 2\{L_T(\tilde\theta) - L_T(\hat\theta)\}$

is the scaled deviance. Thus the residual deviance plays the same role for GLMs as the residual sum of squares does for linear models.

Interpreting $D_t$

We will now show that

$D_t = 2\{Y_t(\tilde\theta_t - \hat\theta_t) - \kappa(\tilde\theta_t) + \kappa(\hat\theta_t)\} \approx \frac{(Y_t - \hat\mu_t)^2}{V(\hat\mu_t)}.$

To show the above we require expressions for $Y_t(\tilde\theta_t - \hat\theta_t)$ and $\kappa(\tilde\theta_t) - \kappa(\hat\theta_t)$. We use Taylor's theorem to expand $\kappa$ and $\kappa'$ about $\hat\theta_t$, obtaining

$\kappa(\tilde\theta_t) \approx \kappa(\hat\theta_t) + (\tilde\theta_t - \hat\theta_t)\kappa'(\hat\theta_t) + \tfrac{1}{2}(\tilde\theta_t - \hat\theta_t)^2\kappa''(\hat\theta_t) \qquad (94)$

and

$\kappa'(\tilde\theta_t) \approx \kappa'(\hat\theta_t) + (\tilde\theta_t - \hat\theta_t)\kappa''(\hat\theta_t). \qquad (95)$

But $\kappa'(\tilde\theta_t) = Y_t$, $\kappa'(\hat\theta_t) = \hat\mu_t$ and $\kappa''(\hat\theta_t) = V(\hat\mu_t)$, so (94) becomes

$\kappa(\tilde\theta_t) - \kappa(\hat\theta_t) \approx (\tilde\theta_t - \hat\theta_t)\hat\mu_t + \tfrac{1}{2}(\tilde\theta_t - \hat\theta_t)^2 V(\hat\mu_t) \qquad (96)$

and (95) becomes

$Y_t - \hat\mu_t \approx (\tilde\theta_t - \hat\theta_t)V(\hat\mu_t). \qquad (97)$

Now substituting (96) and (97) into $D_t$ gives

$D_t = 2\{Y_t(\tilde\theta_t - \hat\theta_t) - \kappa(\tilde\theta_t) + \kappa(\hat\theta_t)\} \approx 2\big\{(Y_t - \hat\mu_t)(\tilde\theta_t - \hat\theta_t) - \tfrac{1}{2}(\tilde\theta_t - \hat\theta_t)^2 V(\hat\mu_t)\big\} = (\tilde\theta_t - \hat\theta_t)^2 V(\hat\mu_t) \approx \frac{(Y_t - \hat\mu_t)^2}{V(\hat\mu_t)}.$

Recalling that $\mathrm{var}(Y_t) = \phi V(\mu_t)$ and $E(Y_t) = \mu_t$, $D_t/\phi$ is like a standardised squared residual. The signed square root of this approximation,

$\mathrm{sign}(Y_t - \hat\mu_t)\sqrt{\frac{(Y_t - \hat\mu_t)^2}{V(\hat\mu_t)}} = \frac{Y_t - \hat\mu_t}{\sqrt{V(\hat\mu_t)}},$

is called a Pearson residual. The distribution theory for this is very approximate, but a rule of thumb is that if the model fits, the scaled deviance $D/\phi$ (or in practice $D/\hat\phi$) $\approx \chi^2_{T-p}$.

Deviance residuals

The analogy with the normal example can be taken further. The square roots of the individual terms in the residual sum of squares are the residuals $Y_t - \hat\beta' x_t$. We use the square roots of the individual terms in the deviance in the same way. However, the classical residuals can be both negative and positive, and the deviance residuals should behave in a similar way. But what sign should be used? The most obvious solution is to use

$r_t = \begin{cases} -\sqrt{D_t} & \text{if } Y_t - \hat\mu_t < 0, \\ +\sqrt{D_t} & \text{if } Y_t - \hat\mu_t \ge 0. \end{cases}$

We call the quantities $\{r_t\}$ the deviance residuals.

Diagnostic plots. We recall that for linear models we would often plot the residuals against the regressors to check whether a linear model is appropriate or not. One can make useful diagnostic plots which have exactly the same form as for linear models, except that deviance residuals are used instead of ordinary residuals, and linear predictor values instead of fitted values.

Limiting distributions and standard errors of estimators

In the majority of examples we have considered in the previous sections (see, for example, Sections 1 and 4) we observed iid $\{Y_t\}$ with distribution $f(\cdot; \theta)$. We showed that

$\hat\theta_T \approx N\Big(\theta_0, \frac{1}{T\,I(\theta)}\Big),$

where $I(\theta) = -\int\frac{\partial^2\log f(x;\theta)}{\partial\theta^2}f(x;\theta)\,dx$ ($I(\theta)$ is Fisher's information). However, this result was based on the observations being iid. In the more general setting where $\{Y_t\}$ are independent but not identically distributed we have that

$\hat\beta \approx N_p\big(\beta, (I(\beta))^{-1}\big),$

where now $I(\beta)$ is a $p \times p$ matrix (of the entire sample); using equation (92) we have

$(I(\beta))_{jk} = -E\Big(\frac{\partial^2 L_T}{\partial\beta_j\partial\beta_k}\Big) = -\sum_{t=1}^{T}E\Big(\frac{d^2\ell_t}{d\eta_t^2}\Big)x_{tj}x_{tk} = (X^T\bar WX)_{jk}.$

Thus for large samples we have

$\hat\beta \approx N_p\big(\beta, (X^T\bar WX)^{-1}\big),$

where $\bar W$ is evaluated at the MLE $\hat\beta$.

The analysis of deviance

How can we test hypotheses about models, and in particular decide which explanatory variables to include? The two closely related methods we will consider below are the log-likelihood ratio test and an analogue of the analysis of variance (ANOVA), called the analysis of deviance. Let us concentrate on the simplest case, of testing a full vs. a reduced model. Partition the model matrix $X$ and the parameter vector $\beta$ as

$X = (X_1, X_2), \qquad \beta = \begin{pmatrix}\beta_1 \\ \beta_2\end{pmatrix},$

where $X_1$ is $T \times q$ and $\beta_1$ is $q \times 1$ (this is analogous to equation (87) for linear models). The full model is $\eta = X\beta = X_1\beta_1 + X_2\beta_2$ and the reduced model is $\eta = X_1\beta_1$. We wish to test

$H_0: \beta_2 = 0,$

i.e. that the reduced model is adequate for the data. Define the scaled deviances for the reduced and full models,

$\frac{D_R}{\phi} = 2\Big\{L_T(\tilde\theta) - \sup_{\beta:\,\beta_2 = 0}L_T(\beta)\Big\} \quad\text{and}\quad \frac{D_F}{\phi} = 2\Big\{L_T(\tilde\theta) - \sup_{\beta_1,\beta_2}L_T(\beta)\Big\},$

where we recall that $L_T(\tilde\theta) = \sum_{t=1}^{T}\ell_t(\tilde\theta_t)$ is the likelihood of the saturated model defined earlier in this section. Taking differences we have

$\frac{D_R - D_F}{\phi} = 2\Big\{\sup_{\beta_1,\beta_2}L_T(\beta) - \sup_{\beta:\,\beta_2 = 0}L_T(\beta)\Big\},$

which is the likelihood ratio statistic. The results in Theorem 6.1, equation (26) (the log-likelihood ratio test for a composite hypothesis) also hold for observations which are not identically distributed. Hence, using a generalised version of Theorem 6.1, we have

$\frac{D_R - D_F}{\phi} = 2\Big\{\sup_{\beta_1,\beta_2}L_T(\beta) - \sup_{\beta:\,\beta_2 = 0}L_T(\beta)\Big\} \approx \chi^2_{p-q}.$

So we can conduct a test of the adequacy of the reduced model by referring $(D_R - D_F)/\phi$ to a $\chi^2_{p-q}$ distribution, and rejecting $H_0$ if the statistic is too large (p-value too small). If $\phi$ is not present in the model, then we are good to go. If $\phi$ is present (and unknown), we estimate it with

$\hat\phi = \frac{1}{T-p}\sum_{t=1}^{T}\frac{(Y_t - \hat\mu_t)^2}{V(\hat\mu_t)}$

(see the estimator given earlier). We can then continue to use the $\chi^2_{p-q}$ distribution for $(D_R - D_F)/\hat\phi$, or we can use the statistic

$\frac{(D_R - D_F)/(p-q)}{D_F/(T-p)}$

against an $F_{p-q,\,T-p}$ distribution, as in the normal case (compare with Section 18.1).

18.5 Examples

Example 18.4 Question: Suppose that $\{Y_t\}$ are independent random variables from the canonical exponential family, whose logarithm satisfies

$\log f(y; \theta_t) = \frac{y\theta_t - \kappa(\theta_t)}{\phi} + c(y; \phi),$

where $\phi$ is the dispersion parameter. Let $E(Y_t) = \mu_t$ and let $\eta_t = \beta' x_t = \theta_t$ (hence the canonical link is used), where $x_t$ are regressors which influence $Y_t$. [14]

(a) (i) Obtain the log-likelihood of $\{(Y_t, x_t)\}_{t=1}^{T}$.

(ii) Denoting the log-likelihood of $\{(Y_t, x_t)\}_{t=1}^{T}$ by $L_T(\beta)$, show that

$\frac{\partial L_T}{\partial\beta_j} = \frac{1}{\phi}\sum_{t=1}^{T}(Y_t - \mu_t)x_{tj} \quad\text{and}\quad \frac{\partial^2 L_T}{\partial\beta_k\partial\beta_j} = -\frac{1}{\phi}\sum_{t=1}^{T}\kappa''(\theta_t)x_{tj}x_{tk}.$

(b) Let $Y_t$ have a Gamma distribution, where the log density has the form

$\log f(y_t; \mu_t) = \frac{-Y_t/\mu_t - \log\mu_t}{\nu^{-1}} + \nu\log\nu - \log\Gamma(\nu) + (\nu - 1)\log Y_t,$

with $E(Y_t) = \mu_t$, $\mathrm{var}(Y_t) = \mu_t^2/\nu$ and $\eta_t = \beta' x_t = g(\mu_t)$.

(i) What is the canonical link function for the Gamma distribution? Write down the corresponding log-likelihood of $\{(Y_t, x_t)\}_{t=1}^{T}$.

(ii) Suppose that $\eta_t = \beta' x_t = \beta_0 + \beta_1 x_{t1}$. Denote the log-likelihood by $L_T(\beta_0, \beta_1)$. What are the first and second derivatives of $L_T(\beta_0, \beta_1)$?

(iii) Evaluate the Fisher information matrix at $\beta_0$ and $\beta_1 = 0$.

(iv) Using your answers in (ii), (iii) and the MLE of $\beta_0$ with $\beta_1 = 0$, derive the score test for testing $H_0: \beta_1 = 0$ against $H_A: \beta_1 \ne 0$.

Solution

(a) (i) The general log-likelihood for $\{(Y_t, x_t)\}$ with the canonical link function is

$L_T(\beta, \phi) = \sum_{t=1}^{T}\Big(\frac{Y_t\beta' x_t - \kappa(\beta' x_t)}{\phi} + c(Y_t, \phi)\Big).$

(ii) In the differentiation use that $\kappa'(\theta_t) = \kappa'(\beta' x_t) = \mu_t$.

(b) (i) For the Gamma distribution the canonical link is $\theta_t = \eta_t = -1/\mu_t = \beta' x_t$. Thus the log-likelihood is

$L_T(\beta) = \frac{1}{\nu^{-1}}\sum_{t=1}^{T}\big(Y_t\beta' x_t - \log(-1/\beta' x_t)\big) + \sum_{t=1}^{T}c(\nu^{-1}, Y_t),$

where $c(\cdot)$ can be evaluated.

(ii) Use part (a)(ii) above to give

$\frac{\partial L_T(\beta)}{\partial\beta_j} = \nu\sum_{t=1}^{T}\Big(Y_t + \frac{1}{\beta' x_t}\Big)x_{tj}, \qquad \frac{\partial^2 L_T(\beta)}{\partial\beta_i\partial\beta_j} = -\nu\sum_{t=1}^{T}\frac{x_{ti}x_{tj}}{(\beta' x_t)^2}.$

(iii) Take the expectation of the above at a general $\beta_0$ and $\beta_1 = 0$.

(iv) Using the above information, use the Wald test, score test or log-likelihood ratio test.

Example 18.5 Question: It is a belief amongst farmers that the age of a hen has a negative influence on the number of eggs she lays and on the quality of the eggs. To investigate this, $m$ hens were randomly sampled. On a given day, the total number of eggs and the number of bad eggs that each of the hens lays is recorded. Let $N_i$ denote the total number of eggs hen $i$ lays, $Y_i$ denote the number of bad eggs the hen lays, and $x_i$ denote the age of hen $i$.

It is known that the number of eggs a hen lays follows a Poisson distribution and the quality (whether it is good or bad) of a given egg is an independent event (independent of the other eggs). Let $N_i$ be a Poisson random variable with mean $\lambda_i$, where we model $\lambda_i = \exp(\alpha_0 + \gamma_1 x_i)$, and let $\pi_i$ denote the probability that hen $i$ lays a bad egg, where we model $\pi_i$ with

$\pi_i = \frac{\exp(\beta_0 + \gamma_1 x_i)}{1 + \exp(\beta_0 + \gamma_1 x_i)}.$

Suppose that $(\alpha_0, \beta_0, \gamma_1)$ are unknown parameters.

(a) Obtain the likelihood of $\{(N_i, Y_i)\}_{i=1}^{m}$.

(b) Obtain the estimating function (score) of the likelihood and the information matrix.

(c) Obtain an iterative algorithm for estimating the unknown parameters.

(d) For given $\alpha_0, \beta_0, \gamma_1$, evaluate the average number of bad eggs a 4-year-old hen will lay in one day.

(e) Describe in detail a method for testing $H_0: \gamma_1 = 0$ against $H_A: \gamma_1 \ne 0$.

Solution

(a) Since the canonical links are being used, the log-likelihood function is

$L_m(\alpha_0, \beta_0, \gamma_1) = L_m(Y\,|\,N) + L_m(N) = \sum_{i=1}^{m}\Big(Y_i\beta' x_i - N_i\log\big(1 + \exp(\beta' x_i)\big) + \log\binom{N_i}{Y_i}\Big) + \sum_{i=1}^{m}\big(N_i\alpha' x_i - \exp(\alpha' x_i) - \log N_i!\big),$

where $\alpha = (\alpha_0, \gamma_1)'$, $\beta = (\beta_0, \gamma_1)'$ and $x_i = (1, x_i)'$.

(b) We know that if the canonical link is used, the score is

$\frac{\partial L}{\partial\beta_j} = \frac{1}{\phi}\sum_{i=1}^{m}\big(Y_i - \kappa'(\beta' x_i)\big)x_{ij} = \sum_{i=1}^{m}(Y_i - \mu_i)x_{ij}$

and the second derivative is

$\frac{\partial^2 L}{\partial\beta_j\partial\beta_k} = -\frac{1}{\phi}\sum_{i=1}^{m}\kappa''(\beta' x_i)x_{ij}x_{ik} = -\sum_{i=1}^{m}\mathrm{var}(Y_i)x_{ij}x_{ik}.$

Using the above, for this question the score is

$\frac{\partial L_m}{\partial\alpha_0} = \sum_{i=1}^{m}(N_i - \lambda_i), \qquad \frac{\partial L_m}{\partial\beta_0} = \sum_{i=1}^{m}(Y_i - N_i\pi_i), \qquad \frac{\partial L_m}{\partial\gamma_1} = \sum_{i=1}^{m}\big(N_i - \lambda_i + Y_i - N_i\pi_i\big)x_i.$

The second derivatives are

$\frac{\partial^2 L_m}{\partial\alpha_0^2} = -\sum_{i=1}^{m}\lambda_i, \qquad \frac{\partial^2 L_m}{\partial\beta_0^2} = -\sum_{i=1}^{m}N_i\pi_i(1-\pi_i), \qquad \frac{\partial^2 L_m}{\partial\gamma_1^2} = -\sum_{i=1}^{m}\big(\lambda_i + N_i\pi_i(1-\pi_i)\big)x_i^2,$

$\frac{\partial^2 L_m}{\partial\alpha_0\partial\gamma_1} = -\sum_{i=1}^{m}\lambda_i x_i, \qquad \frac{\partial^2 L_m}{\partial\beta_0\partial\gamma_1} = -\sum_{i=1}^{m}N_i\pi_i(1-\pi_i)x_i, \qquad \frac{\partial^2 L_m}{\partial\alpha_0\partial\beta_0} = 0.$

Observing that $E(N_i) = \lambda_i$, the information matrix is

$I(\theta) = \begin{pmatrix} \sum_i\lambda_i & 0 & \sum_i\lambda_i x_i \\ 0 & \sum_i\lambda_i\pi_i(1-\pi_i) & \sum_i\lambda_i\pi_i(1-\pi_i)x_i \\ \sum_i\lambda_i x_i & \sum_i\lambda_i\pi_i(1-\pi_i)x_i & \sum_i\big(\lambda_i + \lambda_i\pi_i(1-\pi_i)\big)x_i^2 \end{pmatrix}.$

(c) We can estimate $\theta = (\alpha_0, \beta_0, \gamma_1)'$ using Newton-Raphson with Fisher scoring, that is,

$\theta^{(k+1)} = \theta^{(k)} + I(\theta^{(k)})^{-1}S(\theta^{(k)}),$

where $S(\theta)$ is the score vector

$S(\theta) = \Big(\sum_{i=1}^{m}(N_i - \lambda_i),\; \sum_{i=1}^{m}(Y_i - N_i\pi_i),\; \sum_{i=1}^{m}(N_i - \lambda_i + Y_i - N_i\pi_i)x_i\Big)'.$

(d) We note that given the regressor $x_i = 4$, the average number of bad eggs will be

$E(Y_i) = E\big(E(Y_i\,|\,N_i)\big) = E(N_i\pi_i) = \lambda_i\pi_i = \exp(\alpha_0 + \gamma_1 x_i)\,\frac{\exp(\beta_0 + \gamma_1 x_i)}{1 + \exp(\beta_0 + \gamma_1 x_i)}.$

(e) Give either the log-likelihood ratio test, the score test or the Wald test.
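As a quick sketch of part (d) (ours; the parameter values below are purely illustrative, not from the question), the expected number of bad eggs $\lambda_i\pi_i$ at age $x_i = 4$ is a direct computation:

```python
import math

def expected_bad_eggs(alpha0, beta0, gamma1, x):
    """E(Y_i) = lambda_i * pi_i for the hen model at age x."""
    lam = math.exp(alpha0 + gamma1 * x)                                   # Poisson mean
    pi = math.exp(beta0 + gamma1 * x) / (1.0 + math.exp(beta0 + gamma1 * x))
    return lam * pi

# Illustrative values alpha0 = 1.0, beta0 = -1.0, gamma1 = -0.1, age 4:
val = expected_bad_eggs(1.0, -1.0, -0.1, 4.0)
```

With a negative $\gamma_1$, both $\lambda_i$ and $\pi_i$ decrease with age, so under this model an older hen lays fewer eggs but also, perhaps counter to the farmers' belief, fewer bad ones.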

19 Proportion data, Count data and Overdispersion

See Sections 10.4 and 10.5 of Davison (2002) and Agresti (1996) for a thorough theoretical account, and Simonoff (2003) for a very interesting applied perspective.

In the previous section we generalised the linear model framework to the exponential family. GLMs are often used for modelling count data; in these cases usually the binomial, Poisson or multinomial distributions are used. Hence we observe $(Y_t, x_t)$ for $1 \le t \le T$. Types of data and the corresponding distributions:

Distribution   Regressors   Response variables
Binomial       $x_t$        $Y_t$, or $(Y_t, N - Y_t) = (Y_{t1}, Y_{t2})$
Poisson        $x_t$        $Y_t$
Multinomial    $x_t$        $Y_t = (Y_{t1}, \ldots, Y_{tm})$ with $\sum_i Y_{ti} = N$

Distribution   Probabilities
Binomial       $P(Y_{t1} = k, Y_{t2} = N - k) = \binom{N}{k}\,\pi(\beta' x_t)^k\,(1 - \pi(\beta' x_t))^{N-k}$
Poisson        $P(Y_t = k) = \lambda(\beta' x_t)^k \exp(-\lambda(\beta' x_t))/k!$
Multinomial    $P(Y_{t1} = k_1, \ldots, Y_{tm} = k_m) = \binom{N}{k_1, \ldots, k_m}\,\pi_1(\beta' x_t)^{k_1}\cdots\pi_m(\beta' x_t)^{k_m}$

From the data we are able to select the distribution we want to fit, and thus estimate $\beta$. In this section we will mainly be dealing with count data where the regressors tend to be ordinal (not continuous regressors). This type of data normally comes in the form of a contingency table. One of the most common types of contingency table is the two-by-two table, and we consider this in the section below.

19.1 Two by Two Tables

Consider the following $2 \times 2$ contingency table of colour preference by gender:

        Male   Female   Total
Blue
Pink
Total

Given the above table, it is natural to ask whether there is an association between gender and colour preference. The standard method is to test for independence. However, we could also pose the question in a different way: are the proportions of females who like blue the same


LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

More information

Computer exercise 4 Poisson Regression

Computer exercise 4 Poisson Regression Chalmers-University of Gothenburg Department of Mathematical Sciences Probability, Statistics and Risk MVE300 Computer exercise 4 Poisson Regression When dealing with two or more variables, the functional

More information

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance. Chapter 6: Behavioural models Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

More information

Lecture 6: Poisson regression

Lecture 6: Poisson regression Lecture 6: Poisson regression Claudia Czado TU München c (Claudia Czado, TU Munich) ZFS/IMS Göttingen 2004 0 Overview Introduction EDA for Poisson regression Estimation and testing in Poisson regression

More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

GLM I An Introduction to Generalized Linear Models

GLM I An Introduction to Generalized Linear Models GLM I An Introduction to Generalized Linear Models CAS Ratemaking and Product Management Seminar March 2009 Presented by: Tanya D. Havlicek, Actuarial Assistant 0 ANTITRUST Notice The Casualty Actuarial

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for

More information

GENERALIZED LINEAR MODELS IN VEHICLE INSURANCE

GENERALIZED LINEAR MODELS IN VEHICLE INSURANCE ACTA UNIVERSITATIS AGRICULTURAE ET SILVICULTURAE MENDELIANAE BRUNENSIS Volume 62 41 Number 2, 2014 http://dx.doi.org/10.11118/actaun201462020383 GENERALIZED LINEAR MODELS IN VEHICLE INSURANCE Silvie Kafková

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

More information

Lecture 14: GLM Estimation and Logistic Regression

Lecture 14: GLM Estimation and Logistic Regression Lecture 14: GLM Estimation and Logistic Regression Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South

More information

Advanced statistical inference. Suhasini Subba Rao Email: suhasini.subbarao@stat.tamu.edu

Advanced statistical inference. Suhasini Subba Rao Email: suhasini.subbarao@stat.tamu.edu Advanced statistical inference Suhasini Subba Rao Email: suhasini.subbarao@stat.tamu.edu August 1, 2012 2 Chapter 1 Basic Inference 1.1 A review of results in statistical inference In this section, we

More information

Univariate Regression

Univariate Regression Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

More information

Factorial experimental designs and generalized linear models

Factorial experimental designs and generalized linear models Statistics & Operations Research Transactions SORT 29 (2) July-December 2005, 249-268 ISSN: 1696-2281 www.idescat.net/sort Statistics & Operations Research c Institut d Estadística de Transactions Catalunya

More information

Multinomial and Ordinal Logistic Regression

Multinomial and Ordinal Logistic Regression Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the

More information

The equivalence of logistic regression and maximum entropy models

The equivalence of logistic regression and maximum entropy models The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/

More information

Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance. Chapter 2: Statistical models of default Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

More information

15.1 The Structure of Generalized Linear Models

15.1 The Structure of Generalized Linear Models 15 Generalized Linear Models Due originally to Nelder and Wedderburn (1972), generalized linear models are a remarkable synthesis and extension of familiar regression models such as the linear models described

More information

Regression Models for Time Series Analysis

Regression Models for Time Series Analysis Regression Models for Time Series Analysis Benjamin Kedem 1 and Konstantinos Fokianos 2 1 University of Maryland, College Park, MD 2 University of Cyprus, Nicosia, Cyprus Wiley, New York, 2002 1 Cox (1975).

More information

13. Poisson Regression Analysis

13. Poisson Regression Analysis 136 Poisson Regression Analysis 13. Poisson Regression Analysis We have so far considered situations where the outcome variable is numeric and Normally distributed, or binary. In clinical work one often

More information

Trend and Seasonal Components

Trend and Seasonal Components Chapter 2 Trend and Seasonal Components If the plot of a TS reveals an increase of the seasonal and noise fluctuations with the level of the process then some transformation may be necessary before doing

More information

The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series.

The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Cointegration The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Economic theory, however, often implies equilibrium

More information

NON-LIFE INSURANCE PRICING USING THE GENERALIZED ADDITIVE MODEL, SMOOTHING SPLINES AND L-CURVES

NON-LIFE INSURANCE PRICING USING THE GENERALIZED ADDITIVE MODEL, SMOOTHING SPLINES AND L-CURVES NON-LIFE INSURANCE PRICING USING THE GENERALIZED ADDITIVE MODEL, SMOOTHING SPLINES AND L-CURVES Kivan Kaivanipour A thesis submitted for the degree of Master of Science in Engineering Physics Department

More information

Nominal and ordinal logistic regression

Nominal and ordinal logistic regression Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome

More information

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Introduction 2. A General Formulation 3. Truncated Normal Hurdle Model 4. Lognormal

More information

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

More information

MATHEMATICAL METHODS OF STATISTICS

MATHEMATICAL METHODS OF STATISTICS MATHEMATICAL METHODS OF STATISTICS By HARALD CRAMER TROFESSOK IN THE UNIVERSITY OF STOCKHOLM Princeton PRINCETON UNIVERSITY PRESS 1946 TABLE OF CONTENTS. First Part. MATHEMATICAL INTRODUCTION. CHAPTERS

More information

Linear Models for Continuous Data

Linear Models for Continuous Data Chapter 2 Linear Models for Continuous Data The starting point in our exploration of statistical models in social research will be the classical linear model. Stops along the way include multiple linear

More information

Logistic Regression (a type of Generalized Linear Model)

Logistic Regression (a type of Generalized Linear Model) Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge

More information

University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014.

University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014. University of Ljubljana Doctoral Programme in Statistics ethodology of Statistical Research Written examination February 14 th, 2014 Name and surname: ID number: Instructions Read carefully the wording

More information

Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, and Discrete Changes

Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, and Discrete Changes Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, Discrete Changes JunXuJ.ScottLong Indiana University August 22, 2005 The paper provides technical details on

More information

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne Applied Statistics J. Blanchet and J. Wadsworth Institute of Mathematics, Analysis, and Applications EPF Lausanne An MSc Course for Applied Mathematicians, Fall 2012 Outline 1 Model Comparison 2 Model

More information

Master s Theory Exam Spring 2006

Master s Theory Exam Spring 2006 Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem

More information

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

More information

Module 4 - Multiple Logistic Regression

Module 4 - Multiple Logistic Regression Module 4 - Multiple Logistic Regression Objectives Understand the principles and theory underlying logistic regression Understand proportions, probabilities, odds, odds ratios, logits and exponents Be

More information

(Quasi-)Newton methods

(Quasi-)Newton methods (Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

More information

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

More information

Simple Linear Regression Inference

Simple Linear Regression Inference Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

More information

Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

1 Prior Probability and Posterior Probability

1 Prior Probability and Posterior Probability Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which

More information

Automated Biosurveillance Data from England and Wales, 1991 2011

Automated Biosurveillance Data from England and Wales, 1991 2011 Article DOI: http://dx.doi.org/10.3201/eid1901.120493 Automated Biosurveillance Data from England and Wales, 1991 2011 Technical Appendix This online appendix provides technical details of statistical

More information

Detekce změn v autoregresních posloupnostech

Detekce změn v autoregresních posloupnostech Nové Hrady 2012 Outline 1 Introduction 2 3 4 Change point problem (retrospective) The data Y 1,..., Y n follow a statistical model, which may change once or several times during the observation period

More information

LOGISTIC REGRESSION ANALYSIS

LOGISTIC REGRESSION ANALYSIS LOGISTIC REGRESSION ANALYSIS C. Mitchell Dayton Department of Measurement, Statistics & Evaluation Room 1230D Benjamin Building University of Maryland September 1992 1. Introduction and Model Logistic

More information

STAT 830 Convergence in Distribution

STAT 830 Convergence in Distribution STAT 830 Convergence in Distribution Richard Lockhart Simon Fraser University STAT 830 Fall 2011 Richard Lockhart (Simon Fraser University) STAT 830 Convergence in Distribution STAT 830 Fall 2011 1 / 31

More information

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015.

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015. Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment -3, Probability and Statistics, March 05. Due:-March 5, 05.. Show that the function 0 for x < x+ F (x) = 4 for x < for x

More information

Lecture Notes 1. Brief Review of Basic Probability

Lecture Notes 1. Brief Review of Basic Probability Probability Review Lecture Notes Brief Review of Basic Probability I assume you know basic probability. Chapters -3 are a review. I will assume you have read and understood Chapters -3. Here is a very

More information

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

More information

Standard errors of marginal effects in the heteroskedastic probit model

Standard errors of marginal effects in the heteroskedastic probit model Standard errors of marginal effects in the heteroskedastic probit model Thomas Cornelißen Discussion Paper No. 320 August 2005 ISSN: 0949 9962 Abstract In non-linear regression models, such as the heteroskedastic

More information

Profitable Strategies in Horse Race Betting Markets

Profitable Strategies in Horse Race Betting Markets University of Melbourne Department of Mathematics and Statistics Master Thesis Profitable Strategies in Horse Race Betting Markets Author: Alexander Davis Supervisor: Dr. Owen Jones October 18, 2013 Abstract:

More information

Estimating an ARMA Process

Estimating an ARMA Process Statistics 910, #12 1 Overview Estimating an ARMA Process 1. Main ideas 2. Fitting autoregressions 3. Fitting with moving average components 4. Standard errors 5. Examples 6. Appendix: Simple estimators

More information

MULTIPLE REGRESSION WITH CATEGORICAL DATA

MULTIPLE REGRESSION WITH CATEGORICAL DATA DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting

More information

Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

More information

Logistic regression modeling the probability of success

Logistic regression modeling the probability of success Logistic regression modeling the probability of success Regression models are usually thought of as only being appropriate for target variables that are continuous Is there any situation where we might

More information

Joint models for classification and comparison of mortality in different countries.

Joint models for classification and comparison of mortality in different countries. Joint models for classification and comparison of mortality in different countries. Viani D. Biatat 1 and Iain D. Currie 1 1 Department of Actuarial Mathematics and Statistics, and the Maxwell Institute

More information

Multivariate Normal Distribution

Multivariate Normal Distribution Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

More information

Econometrics Simple Linear Regression

Econometrics Simple Linear Regression Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight

More information

Nonlinear Regression Functions. SW Ch 8 1/54/

Nonlinear Regression Functions. SW Ch 8 1/54/ Nonlinear Regression Functions SW Ch 8 1/54/ The TestScore STR relation looks linear (maybe) SW Ch 8 2/54/ But the TestScore Income relation looks nonlinear... SW Ch 8 3/54/ Nonlinear Regression General

More information

**BEGINNING OF EXAMINATION** The annual number of claims for an insured has probability function: , 0 < q < 1.

**BEGINNING OF EXAMINATION** The annual number of claims for an insured has probability function: , 0 < q < 1. **BEGINNING OF EXAMINATION** 1. You are given: (i) The annual number of claims for an insured has probability function: 3 p x q q x x ( ) = ( 1 ) 3 x, x = 0,1,, 3 (ii) The prior density is π ( q) = q,

More information

Ordinal Regression. Chapter

Ordinal Regression. Chapter Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

More information

Simple linear regression

Simple linear regression Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

More information

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.

More information

A Review of Cross Sectional Regression for Financial Data You should already know this material from previous study

A Review of Cross Sectional Regression for Financial Data You should already know this material from previous study A Review of Cross Sectional Regression for Financial Data You should already know this material from previous study But I will offer a review, with a focus on issues which arise in finance 1 TYPES OF FINANCIAL

More information

Principle of Data Reduction

Principle of Data Reduction Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

More information

Local classification and local likelihoods

Local classification and local likelihoods Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

More information

1 The Brownian bridge construction

1 The Brownian bridge construction The Brownian bridge construction The Brownian bridge construction is a way to build a Brownian motion path by successively adding finer scale detail. This construction leads to a relatively easy proof

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of

More information

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance 2 Making Connections: The Two-Sample t-test, Regression, and ANOVA In theory, there s no difference between theory and practice. In practice, there is. Yogi Berra 1 Statistics courses often teach the two-sample

More information

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition)

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) Abstract Indirect inference is a simulation-based method for estimating the parameters of economic models. Its

More information

Multivariate Analysis of Variance (MANOVA): I. Theory

Multivariate Analysis of Variance (MANOVA): I. Theory Gregory Carey, 1998 MANOVA: I - 1 Multivariate Analysis of Variance (MANOVA): I. Theory Introduction The purpose of a t test is to assess the likelihood that the means for two groups are sampled from the

More information

1 Solving LPs: The Simplex Algorithm of George Dantzig

1 Solving LPs: The Simplex Algorithm of George Dantzig Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

More information

1 Another method of estimation: least squares

1 Another method of estimation: least squares 1 Another method of estimation: least squares erm: -estim.tex, Dec8, 009: 6 p.m. (draft - typos/writos likely exist) Corrections, comments, suggestions welcome. 1.1 Least squares in general Assume Y i

More information

1 Sufficient statistics

1 Sufficient statistics 1 Sufficient statistics A statistic is a function T = rx 1, X 2,, X n of the random sample X 1, X 2,, X n. Examples are X n = 1 n s 2 = = X i, 1 n 1 the sample mean X i X n 2, the sample variance T 1 =

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

More information

Two Correlated Proportions (McNemar Test)

Two Correlated Proportions (McNemar Test) Chapter 50 Two Correlated Proportions (Mcemar Test) Introduction This procedure computes confidence intervals and hypothesis tests for the comparison of the marginal frequencies of two factors (each with

More information

Chapter 4: Statistical Hypothesis Testing

Chapter 4: Statistical Hypothesis Testing Chapter 4: Statistical Hypothesis Testing Christophe Hurlin November 20, 2015 Christophe Hurlin () Advanced Econometrics - Master ESA November 20, 2015 1 / 225 Section 1 Introduction Christophe Hurlin

More information

PREDICTIVE DISTRIBUTIONS OF OUTSTANDING LIABILITIES IN GENERAL INSURANCE

PREDICTIVE DISTRIBUTIONS OF OUTSTANDING LIABILITIES IN GENERAL INSURANCE PREDICTIVE DISTRIBUTIONS OF OUTSTANDING LIABILITIES IN GENERAL INSURANCE BY P.D. ENGLAND AND R.J. VERRALL ABSTRACT This paper extends the methods introduced in England & Verrall (00), and shows how predictive

More information

BayesX - Software for Bayesian Inference in Structured Additive Regression

BayesX - Software for Bayesian Inference in Structured Additive Regression BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich

More information

CURVE FITTING LEAST SQUARES APPROXIMATION

CURVE FITTING LEAST SQUARES APPROXIMATION CURVE FITTING LEAST SQUARES APPROXIMATION Data analysis and curve fitting: Imagine that we are studying a physical system involving two quantities: x and y Also suppose that we expect a linear relationship

More information

Chapter 3: The Multiple Linear Regression Model

Chapter 3: The Multiple Linear Regression Model Chapter 3: The Multiple Linear Regression Model Advanced Econometrics - HEC Lausanne Christophe Hurlin University of Orléans November 23, 2013 Christophe Hurlin (University of Orléans) Advanced Econometrics

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information