Lecture 6: More on the Exponential Family (Text Section 3.3)


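Recall that, when $\theta$ is 1-dimensional and the range of $Y$ does not depend on $\theta$, a distribution that can be written in the form

$$f_Y(y; \theta) = \exp\{a(y)b(\theta) + c(\theta) + d(y)\}$$

belongs to the 1-parameter exponential family of distributions.

Properties of the exponential family

We know that $\int f(y; \theta)\,dy = 1$ (replacing the integral with a summation if $Y$ is discrete). Differentiating both sides w.r.t. $\theta$, and switching the order of integration and differentiation, we find that

$$\frac{d}{d\theta} \int f(y; \theta)\,dy = 0 \quad\Longrightarrow\quad \int \frac{d}{d\theta} f(y; \theta)\,dy = 0.$$

By the same argument,

$$\int \frac{d^2}{d\theta^2} f(y; \theta)\,dy = 0.$$

For the exponential family,

$$\begin{aligned}
\frac{d}{d\theta} f(y; \theta) &= [a(y)b'(\theta) + c'(\theta)]\,f(y; \theta) \\
\int \frac{d}{d\theta} f(y; \theta)\,dy &= \int [a(y)b'(\theta) + c'(\theta)]\,f(y; \theta)\,dy \\
0 &= b'(\theta) \int a(y) f(y; \theta)\,dy + c'(\theta) \int f(y; \theta)\,dy \\
0 &= b'(\theta)\,E[a(Y)] + c'(\theta) \\
E[a(Y)] &= -\frac{c'(\theta)}{b'(\theta)}.
\end{aligned}$$

Similarly,

$$\begin{aligned}
\frac{d^2}{d\theta^2} f(y; \theta) &= [a(y)b''(\theta) + c''(\theta)]\,f(y; \theta) + [a(y)b'(\theta) + c'(\theta)]^2 f(y; \theta) \\
\int \frac{d^2}{d\theta^2} f(y; \theta)\,dy &= \int [a(y)b''(\theta) + c''(\theta)]\,f(y; \theta)\,dy + \int [a(y)b'(\theta) + c'(\theta)]^2 f(y; \theta)\,dy \\
0 &= b''(\theta)\,E[a(Y)] + c''(\theta) + [b'(\theta)]^2 \int \left[a(y) + \frac{c'(\theta)}{b'(\theta)}\right]^2 f(y; \theta)\,dy
\end{aligned}$$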

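$$\begin{aligned}
0 &= -\frac{b''(\theta)c'(\theta)}{b'(\theta)} + c''(\theta) + [b'(\theta)]^2 \int \{a(y) - E[a(Y)]\}^2 f(y; \theta)\,dy \\
0 &= -\frac{b''(\theta)c'(\theta)}{b'(\theta)} + c''(\theta) + [b'(\theta)]^2\,\mathrm{Var}[a(Y)] \\
\mathrm{Var}[a(Y)] &= \frac{b''(\theta)c'(\theta) - c''(\theta)b'(\theta)}{[b'(\theta)]^3}.
\end{aligned}$$

If the distribution is written in canonical form, $a(y) = y$. In this case, the above expressions lead to simple ways of computing the mean and variance of $Y$.

Example (Poisson distribution with mean $\mu$, cont.): Recall that the canonical form of the Poisson distribution is

$$f_Y(y; \mu) = \exp\{y \log \mu - \mu - \log y!\}$$

with $a(y) = y$, $b(\mu) = \log \mu$, $c(\mu) = -\mu$, and $d(y) = -\log y!$. We can immediately calculate

$$E[Y] = -\frac{c'(\mu)}{b'(\mu)} = \frac{1}{1/\mu} = \mu$$

and

$$\mathrm{Var}[Y] = \frac{b''(\mu)c'(\mu) - c''(\mu)b'(\mu)}{[b'(\mu)]^3} = \frac{\left(-\frac{1}{\mu^2}\right)(-1) - 0}{1/\mu^3} = \mu.$$

The equality of the mean and variance under the Poisson assumption is one of the fundamental properties of this distribution. Count data which have variance greater than their mean are referred to as overdispersed relative to the Poisson distribution. We will be studying this case later on in the course.

Example ($N(\mu, \sigma^2)$ distribution, $\sigma^2$ known, cont.): Recall that the canonical form of the $N(\mu, \sigma^2)$ distribution is

$$f_Y(y; \mu, \sigma^2) = \exp\left\{ y\left(\frac{\mu}{\sigma^2}\right) - \frac{\mu^2}{2\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right\}$$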

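with $a(y) = y$, $b(\mu) = \mu/\sigma^2$, $c(\mu) = -\mu^2/(2\sigma^2)$, and $d(y) = -y^2/(2\sigma^2) - \frac{1}{2}\log(2\pi\sigma^2)$. We then have

$$E[Y] = -\frac{c'(\mu)}{b'(\mu)} = \frac{\mu/\sigma^2}{1/\sigma^2} = \mu$$

and

$$\mathrm{Var}[Y] = \frac{b''(\mu)c'(\mu) - c''(\mu)b'(\mu)}{[b'(\mu)]^3} = \frac{0 + \frac{1}{\sigma^2}\cdot\frac{1}{\sigma^2}}{1/\sigma^6} = \sigma^2.$$

Likelihood and Derivatives

In the study of GLMs, we will use the likelihood (and its derivatives) of distributions in the exponential family. The likelihood based on the observation $y$ (where $y$ is a realization of the $n$-dimensional random vector $Y$) is

$$L(\theta; y) = \prod_{i=1}^{n} \exp\{a(y_i)b(\theta) + c(\theta) + d(y_i)\},$$

so

$$\log L(\theta; y) = \sum_{i=1}^{n} [a(y_i)b(\theta) + c(\theta) + d(y_i)] = b(\theta)\sum_{i=1}^{n} a(y_i) + n\,c(\theta) + \sum_{i=1}^{n} d(y_i).$$

The score function is

$$U \equiv \frac{\partial}{\partial\theta} \log L(\theta; y) = b'(\theta)\sum_{i=1}^{n} a(y_i) + n\,c'(\theta).$$

Instead of writing $U$ as a function of the observed value $y$, we can think of it as a function of the random vector $Y$. In this case, $U$ is random as well. Note that

$$E[U] = E\left[b'(\theta)\sum_{i=1}^{n} a(Y_i) + n\,c'(\theta)\right] = n\,b'(\theta)\left(-\frac{c'(\theta)}{b'(\theta)}\right) + n\,c'(\theta) = 0.$$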
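As a sanity check on the formulas for $E[a(Y)]$ and $\mathrm{Var}[a(Y)]$ and on the zero-mean property of the score, the following minimal sketch evaluates them for the Poisson example and compares the results with direct sums over the probability mass function. The function names, the truncation point `y_max`, and the choice $\mu = 3.5$, $n = 5$ are illustrative assumptions, not part of the notes.

```python
import math


def poisson_pmf(y, mu):
    """Poisson pmf in the form exp{y log(mu) - mu - log(y!)}."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))


def formula_mean_var(b1, b2, c1, c2):
    """Mean and variance of a(Y) from the derivatives b', b'', c', c''."""
    mean = -c1 / b1
    var = (b2 * c1 - c2 * b1) / b1 ** 3
    return mean, var


def poisson_checks(mu, n=5, y_max=200):
    """Compare the formulas with truncated sums over y = 0, ..., y_max.

    y_max is an assumed truncation point; the neglected tail mass is
    negligible for moderate mu.
    """
    # Derivatives of b(mu) = log(mu) and c(mu) = -mu.
    b1, b2 = 1.0 / mu, -1.0 / mu ** 2
    c1, c2 = -1.0, 0.0

    mean_formula, var_formula = formula_mean_var(b1, b2, c1, c2)

    probs = [poisson_pmf(y, mu) for y in range(y_max + 1)]
    mean_direct = sum(y * p for y, p in enumerate(probs))
    var_direct = sum((y - mean_direct) ** 2 * p for y, p in enumerate(probs))

    # Expected score for n iid observations: n * E[b'(mu) a(Y) + c'(mu)].
    score_mean = n * sum((b1 * y + c1) * p for y, p in enumerate(probs))

    return {
        "mean_formula": mean_formula,
        "var_formula": var_formula,
        "mean_direct": mean_direct,
        "var_direct": var_direct,
        "score_mean": score_mean,
    }


if __name__ == "__main__":
    for name, value in poisson_checks(3.5).items():
        print(f"{name}: {value:.6f}")
```

Up to floating-point and truncation error, the formula-based and direct values should agree, and the expected score should be numerically zero.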

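The variance of $U$ is called the information, and will be denoted by $J$. Because the $Y_i$ are independent, we can compute

$$J = \mathrm{Var}\left[b'(\theta)\sum_{i=1}^{n} a(Y_i) + n\,c'(\theta)\right] = [b'(\theta)]^2\, n\, \frac{b''(\theta)c'(\theta) - c''(\theta)b'(\theta)}{[b'(\theta)]^3} = n\left(\frac{b''(\theta)c'(\theta)}{b'(\theta)} - c''(\theta)\right).$$

A useful property is that

$$J = E[U^2] = -E[U'].$$

Proof: Since $E[U] = 0$, we have $J = \mathrm{Var}[U] = E[U^2]$. For the second equality,

$$\begin{aligned}
U' &= b''(\theta)\sum_{i=1}^{n} a(y_i) + n\,c''(\theta) \\
E[U'] &= b''(\theta)\, n\left(-\frac{c'(\theta)}{b'(\theta)}\right) + n\,c''(\theta) \\
-E[U'] &= J.
\end{aligned}$$

Motivation for GLMs

In the linear regression model, we are used to thinking of modelling the observations directly, i.e.

$$Y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \epsilon_i,$$

where the $\epsilon_i$'s are iid with mean 0 and common variance $\sigma^2$.

Equivalently, we can think of modelling the expected value of each observation, $E[Y_i] = \mu_i$, while assuming a distribution for $Y_i$. In particular, we can write

$$Y_i \sim N(\mu_i, \sigma^2), \quad \text{where} \quad \mu_i = E[Y_i] = \sum_{j=1}^{p} x_{ij}\beta_j.$$

This is the approach that we take in the GLM framework.

The advantage of modelling $\mu_i$ rather than $Y_i$ is flexibility: we can use this idea to model non-normally distributed data. For example, $\mu_i$ typically takes values in an interval, whereas $Y_i$ may be discrete. (In the count data case, $\mu_i \in [0, \infty)$, while $Y_i \in \{0, 1, 2, \ldots\}$. In the binary data case, $\mu_i \in [0, 1]$, while $Y_i \in \{0, 1\}$.) It's easy to model the continuous quantity $\mu_i$ in terms of covariates, e.g.

$$\mu_i = \sum_{j=1}^{p} x_{ij}\beta_j.$$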

However, it would be much more challenging to model discrete observations using the model

$$Y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \epsilon_i,$$

since $\epsilon_i$ would need to depend on $\sum_{j=1}^{p} x_{ij}\beta_j$ in order for $Y_i$ to take on only discrete values. Therefore, we would not be able to use a simple, easily interpreted structure such as $\epsilon_i \sim N(0, \sigma^2)$ in this case.

For example, consider the case where $Y_i$ is binary, and we wish to model the relationship between $Y_i$ and a single predictor variable, $x_i$. Let's start with the preliminary model where $E[Y_i] = \beta_0 + \beta_1 x_i$, and assume that $\beta_0 + \beta_1 x_i \equiv \mu_i \in (0, 1)$ for all $i$. If we further assume that $Y_i = \mu_i + \epsilon_i$, we require that

$$\epsilon_i = \begin{cases} -\mu_i, & \text{if } Y_i = 0 \\ 1 - \mu_i, & \text{if } Y_i = 1. \end{cases}$$

In other words, $\epsilon_i$ must depend on $\mu_i$ (and hence $\beta_0$ and $\beta_1$). Thus, the concept of mean zero, common variance errors does not apply outside of the setting where responses are normally distributed.
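The dependence of $\epsilon_i$ on $\mu_i$ in the binary example can be made concrete with a short sketch. It assumes $P(Y_i = 1) = \mu_i$, which follows from $E[Y_i] = \mu_i$ for a binary response, and tabulates the two possible error values together with their mean and variance. The function name and the chosen values of $\mu_i$ are illustrative; the variance $\mu_i(1 - \mu_i)$ is the standard Bernoulli result rather than something stated in the notes.

```python
def binary_error_distribution(mu):
    """Describe eps in Y = mu + eps for binary Y with P(Y = 1) = mu."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")

    # The only two values eps can take, one for each outcome of Y.
    eps_if_y0 = -mu
    eps_if_y1 = 1.0 - mu

    # Probabilities of the two outcomes.
    p_y0 = 1.0 - mu
    p_y1 = mu

    mean = p_y0 * eps_if_y0 + p_y1 * eps_if_y1
    variance = p_y0 * eps_if_y0 ** 2 + p_y1 * eps_if_y1 ** 2 - mean ** 2

    return {
        "values": (eps_if_y0, eps_if_y1),
        "mean": mean,
        "variance": variance,
    }


if __name__ == "__main__":
    for mu in (0.1, 0.3, 0.5, 0.9):
        info = binary_error_distribution(mu)
        print(f"mu = {mu}: eps in {info['values']}, "
              f"mean = {info['mean']:.3f}, variance = {info['variance']:.3f}")
```

Under these assumptions the error has mean zero for every $\mu_i$, but both its support and its variance change with $\mu_i$, so a single common-variance error distribution cannot describe all observations.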
