Badr Missaoui
Logistic Regression
Outline: Generalized linear models, Deviance, Logistic regression.
All models we have seen so far deal with continuous outcome variables with no restriction on their expectations, and (most) have assumed that mean and variance are unrelated (i.e. the variance is constant). Many outcomes of interest do not satisfy this, for example binary outcomes and Poisson count outcomes. A Generalized Linear Model (GLM) is a model with two ingredients: a link function and a variance function. The link function relates the means of the observations to the predictors (linearization); the variance function relates the means to the variances.
The data involve 462 males between the ages of 15 and 64. The outcome Y is the presence (Y = 1) or absence (Y = 0) of heart disease.

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)    -5.9207616  1.3265724  -4.463 8.07e-06 ***
sbp             0.0076602  0.0058574   1.308 0.190942
tobacco         0.0777962  0.0266602   2.918 0.003522 **
ldl             0.1701708  0.0597998   2.846 0.004432 **
adiposity       0.0209609  0.0294496   0.712 0.476617
famhistpresent  0.9385467  0.2287202   4.103 4.07e-05 ***
typea           0.0376529  0.0124706   3.019 0.002533 **
obesity        -0.0661926  0.0443180  -1.494 0.135285
alcohol         0.0004222  0.0045053   0.094 0.925346
age             0.0441808  0.0121784   3.628 0.000286 ***
Motivation
Classical linear model: Y = Xβ + ε where ε ∼ N(0, σ²). That means Y ∼ N(Xβ, σ²).
In the GLM, we instead specify that Y ∼ P(Xβ) for some family of distributions P.
We write the GLM as

E(Y_i) = µ_i   and   η_i = g(µ_i) = X_i β,

where g is called the link function and the distribution of Y_i belongs to an exponential family.
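In R, the exponential family and the link function are chosen together through the family argument of glm. A minimal sketch, assuming a hypothetical data frame dat with response y and predictor x:

fit_pois  <- glm(y ~ x, family = poisson(link = "log"),    data = dat)  # Poisson counts, log link
fit_binom <- glm(y ~ x, family = binomial(link = "logit"), data = dat)  # binary outcome, logit link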
The exponential family density is specified by two components, the canonical parameter θ and the dispersion parameter φ. Let Y = (Y_i)_{i=1,…,n} be a sequence of random variables. Y_i has an exponential-family density if

f_{Y_i}(y_i; θ_i, φ) = exp{ [y_i θ_i − b(θ_i)] / a_i(φ) + c(y_i, φ) },

where the functions b and c are specific to each distribution and a_i(φ) = φ/w_i.
Law          Density                                         µ     σ²
B(m, p)      (m choose y) p^y (1 − p)^{m−y},  y = 0, …, m    mp    mp(1 − p)
P(µ)         µ^y e^{−µ} / y!,  y = 0, 1, …                   µ     µ
N(µ, σ²)     (1/(σ√(2π))) exp{ −(y − µ)²/(2σ²) }             µ     σ²
IG(µ, λ)     √(λ/(2πy³)) exp{ −λ(y − µ)²/(2µ²y) }            µ     µ³/λ
We write l(y; θ, φ) = log f(y; θ, φ) for the log-likelihood function of Y. Using the facts that

E(∂l/∂θ) = 0   and   Var(∂l/∂θ) = −E(∂²l/∂θ²),

we have

E(y) = b′(θ)   and   Var(y) = b″(θ) a(φ).
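For completeness, a short derivation of these identities from the density above (a sketch; the regularity conditions that allow differentiating under the integral are assumed):

\[
\frac{\partial l}{\partial \theta} = \frac{y - b'(\theta)}{a(\phi)}, \qquad
E\!\left(\frac{\partial l}{\partial \theta}\right) = 0 \;\Rightarrow\; E(y) = b'(\theta),
\]
\[
\frac{\partial^2 l}{\partial \theta^2} = -\frac{b''(\theta)}{a(\phi)}, \qquad
\mathrm{Var}\!\left(\frac{\partial l}{\partial \theta}\right) = \frac{\mathrm{Var}(y)}{a(\phi)^2}
= \frac{b''(\theta)}{a(\phi)} \;\Rightarrow\; \mathrm{Var}(y) = b''(\theta)\, a(\phi).
\]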
Gaussian case:

f(y; θ, φ) = (1/(σ√(2π))) exp{ −(y − µ)²/(2σ²) }
           = exp{ (yµ − µ²/2)/σ² − (1/2)(y²/σ² + log(2πσ²)) }

We can write θ = µ, φ = σ², a(φ) = φ, b(θ) = θ²/2 and c(y, φ) = −(1/2)(y²/σ² + log(2πσ²)).

Binomial case:

f(y; θ, φ) = (n choose y) µ^y (1 − µ)^{n−y}
           = exp{ y log(µ/(1 − µ)) + n log(1 − µ) + log (n choose y) }

We can write θ = log(µ/(1 − µ)), b(θ) = −n log(1 − µ) = n log(1 + e^θ) and c(y, φ) = log (n choose y).
Recall that in ordinary linear models, the MLE of β satisfies ˆβ = (X^T X)^{−1} X^T Y if X has full rank. In a GLM, the MLE ˆβ does not in general have a closed form and is computed approximately via iteratively reweighted least squares.
For n observations, the log-likelihood function is

L(β) = Σ_{i=1}^n l(y_i; θ_i, φ).

By the chain rule,

∂l_i/∂β_j = (∂l_i/∂θ_i)(∂θ_i/∂µ_i)(∂µ_i/∂η_i)(∂η_i/∂β_j)
          = [(y_i − µ_i)/(φ/w_i)] · [1/b″(θ_i)] · [1/g′(µ_i)] · x_ij.

The likelihood equations are

∂L/∂β_j = Σ_{i=1}^n x_ij (∂µ_i/∂η_i) (y_i − µ_i)/Var(y_i) = 0,   j = 1, …, p.

Put

W = diag{ g′(µ_i)² Var(y_i) }_{i=1,…,n}   and   ∂η/∂µ = diag{ ∂η_i/∂µ_i }_{i=1,…,n} = diag{ g′(µ_i) }_{i=1,…,n}.
In matrix form, these likelihood equations are

X^T W^{−1} (∂η/∂µ) (y − µ) = 0.

These equations are non-linear in β and require an iterative method (e.g. Newton–Raphson / Fisher scoring). The Fisher information matrix is

I = X^T W^{−1} X,

with general term

[I]_{jk} = −E( ∂²L(β)/∂β_j ∂β_k ) = Σ_{i=1}^n [ x_ij x_ik / Var(y_i) ] (∂µ_i/∂η_i)².
Let ˆµ_0 = Y be the initial estimate. Then set ˆη_0 = g(ˆµ_0) and form the adjusted variable

Z_0 = ˆη_0 + (Y − ˆµ_0) (∂η/∂µ)|_{µ = ˆµ_0}.

Calculate ˆβ_1 by the weighted least squares regression of Z_0 on X, that is,

ˆβ_1 = argmin_β (Z_0 − Xβ)^T W_0^{−1} (Z_0 − Xβ),

so

ˆβ_1 = (X^T W_0^{−1} X)^{−1} X^T W_0^{−1} Z_0.

Set ˆη_1 = X ˆβ_1 and ˆµ_1 = g^{−1}(ˆη_1). Repeat until the changes in ˆβ_m are sufficiently small.
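The adjusted-variable construction can be checked directly in R: at convergence, the weighted least squares regression of Z on X with the final working weights reproduces the GLM coefficients. A minimal sketch, assuming fit is a hypothetical fitted glm object with no offset (R's working weights correspond to 1/W_i in the notation above):

z <- fit$linear.predictors + residuals(fit, type = "working")  # adjusted variable Z at convergence
w <- fit$weights                                               # working weights from the final IWLS step (1/W_i above)
coef(lm(z ~ 0 + model.matrix(fit), weights = w))               # reproduces coef(fit)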
Estimation
In theory, ˆβ_m → ˆβ as m → ∞, but in practice the algorithm may fail to converge. Under some conditions,

ˆβ ≈ N(β, I^{−1}(β)).

In practice, the asymptotic covariance matrix of ˆβ is estimated by φ (X^T W_m^{−1} X)^{−1}, where W_m is the weight matrix from the m-th (final) iteration. If φ is unknown, it is estimated by

ˆφ = [1/(n − p)] Σ_{i=1}^n w_i (y_i − ˆµ_i)² / V(ˆµ_i),

where V(ˆµ_i) = Var(y_i)/a_i(φ) = w_i Var(y_i)/φ.
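In R this is the Pearson estimate of the dispersion, which can be computed from a fitted glm object (fit is a hypothetical name):

phi_hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)  # Pearson estimate of phi
V_beta  <- phi_hat * summary(fit)$cov.unscaled                         # estimated covariance matrix of beta-hat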
Confidence interval

CI_α(β_j) = [ ˆβ_j − u_{1−α/2} (1/√n) ˆσ_{β_j} ; ˆβ_j + u_{1−α/2} (1/√n) ˆσ_{β_j} ],

where u_{1−α/2} is the 1 − α/2 quantile of N(0, 1) and ˆσ²_{β_j} = [ ((1/n) I(ˆβ))^{−1} ]_{jj}.

To test the hypothesis H_0 : β_j = 0 against H_1 : β_j ≠ 0, we use

ˆβ_j / √( [ φ (X^T W_m^{−1} X)^{−1} ]_{jj} ) ≈ N(0, 1),

and, if φ is unknown,

ˆβ_j / √( [ ˆφ (X^T W_m^{−1} X)^{−1} ]_{jj} ) ≈ t_{n−p}.
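In R, Wald intervals and test statistics of this form can be read off a fitted model. A minimal sketch, with fit a hypothetical fitted glm object:

se <- sqrt(diag(vcov(fit)))                       # estimated standard errors of the coefficients
ci <- cbind(lower = coef(fit) - qnorm(0.975) * se,
            upper = coef(fit) + qnorm(0.975) * se)  # 95% Wald confidence intervals
z_stat <- coef(fit) / se                          # compared with N(0,1), or t_{n-p} if phi is estimated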
Goodness-of-Fit
H_0 : the true model is M   versus   H_1 : the true model is M_sat.
The likelihood ratio statistic for this hypothesis is called the deviance. For any submodel M,

dev(M) = 2(ˆl_sat − ˆl_M).

Under H_0, dev(M) ≈ χ²_{p_sat − p}.
Goodness-of-Fit
The scaled deviance for a GLM is

D(y, ˆµ) = 2 [ l(ˆµ_sat, φ; y) − l(ˆµ, φ; y) ]
         = Σ_{i=1}^n 2 w_i { y_i [θ(ˆµ_i^sat) − θ(ˆµ_i)] − b(θ(ˆµ_i^sat)) + b(θ(ˆµ_i)) } / φ
         = Σ_{i=1}^n D*(y_i; ˆµ_i) / φ
         = D*(y; ˆµ) / φ,

where D*(y; ˆµ) denotes the (unscaled) deviance.
Tests
We use the deviance to compare two nested models having p_1 and p_2 parameters respectively, where p_1 < p_2. Let ˆµ_1 and ˆµ_2 denote the corresponding MLEs. If φ is known,

D(y, ˆµ_1) − D(y, ˆµ_2) ≈ χ²_{p_2 − p_1}.

If φ is unknown, we reject the smaller model at level α when

[ D*(y, ˆµ_1) − D*(y, ˆµ_2) ] / [ (p_2 − p_1) ˆφ ]  >  F_{1−α, p_2−p_1, n−p_2}.
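These comparisons are what R's anova method for glm objects computes. A minimal sketch, assuming fit_small and fit_big are hypothetical nested glm fits:

anova(fit_small, fit_big, test = "Chisq")  # chi-squared test of the deviance difference (phi known, e.g. binomial, Poisson)
anova(fit_small, fit_big, test = "F")      # F test when phi is estimated (e.g. quasi families)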
Goodness-of-Fit
The deviance residuals for a given model are

d_i = sign(y_i − ˆµ_i) √( D*(y_i; ˆµ_i) ).

A poorly fitting point will make a large contribution to the deviance, so d_i will be large.
Diagnostics
The Pearson residuals are defined by

r_i = (y_i − ˆµ_i) / √( (1 − h_ii) V(ˆµ_i) ),

where h_ii is the i-th diagonal element of

H = X (X^T W_m^{−1} X)^{−1} X^T W_m^{−1}.

The (standardized) deviance residuals are

ˆε_i = sign(y_i − ˆµ_i) √( D*(y_i; ˆµ_i) / (1 − h_ii) ).
Diagnostics
The Anscombe residuals are defined as a transformation of the Pearson residuals:

r_i^A = [ t(y_i) − t(ˆµ_i) ] / [ t′(ˆµ_i) √( φ V(ˆµ_i) (1 − h_ii) ) ].

The aim in introducing the function t is to make the residuals as Gaussian as possible. We take

t(x) = ∫_0^x V(µ)^{−1/3} dµ.
Diagnostics
Influential points are detected using Cook's distance

C_i = (1/p) (ˆβ_(i) − ˆβ)^T X^T W_m^{−1} X (ˆβ_(i) − ˆβ) ≈ r_i² h_ii / [ p (1 − h_ii)² ].

Leverage points: if h_ii > 2p/n (or the stricter h_ii > 3p/n), we consider the i-th point a high-leverage point and a potential outlier.
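In R these diagnostics are available directly from a fitted glm object. A minimal sketch, with fit a hypothetical fit:

r_p <- rstandard(fit, type = "pearson")    # standardized Pearson residuals
r_d <- rstandard(fit, type = "deviance")   # standardized deviance residuals
h   <- hatvalues(fit)                      # leverages h_ii
cd  <- cooks.distance(fit)                 # Cook's distances
p <- length(coef(fit)); n <- nobs(fit)
which(h > 2 * p / n)                       # flag high-leverage points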
Model Selection
Model selection can be done using the AIC and BIC. Forward, backward and stepwise approaches can be used (see the sketch below).
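A minimal sketch, assuming fit is a hypothetical full glm fit; the backward-elimination trace shown later in these notes comes from a call of this form:

step(fit, direction = "backward")                  # AIC-based backward elimination
step(fit, direction = "both", k = log(nobs(fit)))  # stepwise search with the BIC penalty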
Logistic regression
Logistic regression is a generalization of regression that is used when the outcome Y is binary (0 or 1). As an example, we assume that

P(Y_i = 1 | X_i) = e^{β_0 + β_1 X_i} / (1 + e^{β_0 + β_1 X_i}).

Note that E(Y_i | X_i) = P(Y_i = 1 | X_i).
Logistic regression
Define the logit function

logit(z) = log( z / (1 − z) ).

We can write

logit(π_i) = β_0 + β_1 X_i,   where π_i = P(Y_i = 1 | X_i).

The extension to several covariates is

logit(π_i) = β_0 + Σ_{j=1}^p β_j x_ij.
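In R, the logit and its inverse are available as qlogis and plogis, a quick way to move between probabilities and the linear predictor (illustrative values only):

qlogis(0.25)     # logit(0.25) = log(0.25 / 0.75) = -1.0986...
plogis(-1.0986)  # inverse logit, back to approximately 0.25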
How do we estimate the parameters? Can be fit using maximum likelihood. The likelihood function is

L(β) = Π_{i=1}^n f(y_i | X_i; β) = Π_{i=1}^n π_i^{y_i} (1 − π_i)^{1 − y_i}.

The estimator ˆβ has to be found numerically.
Usually, we use iteratively reweighted least squares (a code sketch follows below):
- First set starting values β^(0).
- Compute ˆπ_i = e^{X_i β^(k)} / (1 + e^{X_i β^(k)}).
- Define the weight matrix W whose i-th diagonal entry is ˆπ_i (1 − ˆπ_i).
- Define the adjusted response vector Z = X β^(k) + W^{−1} (Y − ˆπ).
- Take β^(k+1) = (X^T W X)^{−1} X^T W Z, which is the weighted linear regression of Z on X.
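A minimal sketch of this iteration in R, assuming X is a design matrix that already contains an intercept column and Y is a 0/1 response vector (both hypothetical):

logit_irls <- function(X, Y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))                      # starting values beta^(0)
  for (k in seq_len(maxit)) {
    pi_hat <- plogis(drop(X %*% beta))         # e^{X beta} / (1 + e^{X beta})
    w <- pi_hat * (1 - pi_hat)                 # diagonal of W
    Z <- drop(X %*% beta) + (Y - pi_hat) / w   # adjusted response
    beta_new <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * Z)))  # weighted least squares step
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}
# Should agree with coef(glm(Y ~ X - 1, family = binomial)) at convergence.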
Model selection and diagnostics
Diagnostics: the Pearson residuals

(Y_i − ˆπ_i) / √( ˆπ_i (1 − ˆπ_i) )

and the deviance residuals

sign(Y_i − ˆπ_i) √( 2 [ Y_i log(Y_i / ˆπ_i) + (1 − Y_i) log( (1 − Y_i) / (1 − ˆπ_i) ) ] ).
To fit this model, we use the glm command.

Call:
glm(formula = chd ~ ., family = binomial, data = SAheart)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8320  -0.8250  -0.4354   0.8747   2.5503

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)    -5.9207616  1.3265724  -4.463 8.07e-06 ***
row.names      -0.0008844  0.0008950  -0.988 0.323042
sbp             0.0076602  0.0058574   1.308 0.190942
tobacco         0.0777962  0.0266602   2.918 0.003522 **
ldl             0.1701708  0.0597998   2.846 0.004432 **
adiposity       0.0209609  0.0294496   0.712 0.476617
famhistpresent  0.9385467  0.2287202   4.103 4.07e-05 ***
typea           0.0376529  0.0124706   3.019 0.002533 **
obesity        -0.0661926  0.0443180  -1.494 0.135285
alcohol         0.0004222  0.0045053   0.094 0.925346
age             0.0441808  0.0121784   3.628 0.000286 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 596.11  on 461  degrees of freedom
Residual deviance: 471.16  on 451  degrees of freedom
AIC: 493.16

Number of Fisher Scoring iterations: 5
To select a model, we use the step command.

Start:  AIC=493.16
chd ~ row.names + sbp + tobacco + ldl + adiposity + famhist + typea + obesity + alcohol + age

             Df Deviance    AIC
- alcohol     1   471.17 491.17
- adiposity   1   471.67 491.67
- row.names   1   472.14 492.14
- sbp         1   472.88 492.88
<none>            471.16 493.16
- obesity     1   473.47 493.47
- ldl         1   479.65 499.65
- tobacco     1   480.27 500.27
- typea       1   480.75 500.75
- age         1   484.76 504.76
- famhist     1   488.29 508.29

etc...

Step:  AIC=487.69
chd ~ tobacco + ldl + famhist + typea + age

          Df Deviance    AIC
<none>         475.69 487.69
- ldl      1   484.71 494.71
- typea    1   485.44 495.44
- tobacco  1   486.03 496.03
- famhist  1   492.09 502.09
- age      1   502.38 512.38
Suppose Y_i ∼ Binomial(n_i, π_i). We can fit the logistic model as before:

logit(π_i) = X_i β.

Pearson residuals:

r_i = (Y_i − n_i ˆπ_i) / √( n_i ˆπ_i (1 − ˆπ_i) ).

Deviance residuals:

d_i = sign(Y_i − ˆY_i) √( 2 [ Y_i log(Y_i / ˆµ_i) + (n_i − Y_i) log( (n_i − Y_i) / (n_i − ˆµ_i) ) ] ),

where ˆµ_i = n_i ˆπ_i.
Goodness-of-Fit test
The Pearson statistic and the deviance,

χ² = Σ_i r_i²   and   D = Σ_i d_i²,

both have an approximate χ²_{n−p} distribution if the model is correct.
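In R both statistics can be computed from a fitted grouped-binomial glm object (here a hypothetical fit) and compared with the χ²_{n−p} distribution:

X2 <- sum(residuals(fit, type = "pearson")^2)  # Pearson chi-squared statistic
D  <- deviance(fit)                            # deviance
1 - pchisq(c(X2, D), df.residual(fit))         # goodness-of-fit p-values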
To fit this model, we use the glm command.

Call:
glm(formula = cbind(y, n - y) ~ x, family = binomial)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.70832  -0.29814   0.02996   0.64070   0.91132

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.73119    1.83018  -8.049 8.35e-16 ***
x             0.24785    0.03031   8.178 2.89e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 137.7204  on 7  degrees of freedom
Residual deviance:   2.6558  on 6  degrees of freedom
AIC: 28.233

Number of Fisher Scoring iterations: 4
To test the correctness of the model:

> pvalue = 1 - pchisq(out$dev, out$df.residual)
> print(pvalue)
[1] 0.8506433
> r = resid(out, type = "deviance")
> p = out$linear.predictors
> plot(p, r, pch = 19, xlab = "linear predictor", ylab = "deviance residuals")
> print(sum(r^2))
[1] 2.655771
> cooks.distance(out)
           1            2            3            4            5
0.0004817501 0.3596628502 0.0248918197 0.1034462077 0.0242941942
           6            7            8
0.0688081629 0.0014847981 0.0309767612

Note that the squared deviance residuals give back the deviance statistic, and the p-value is large, indicating no evidence of lack of fit.