VI. Introduction to Logistic Regression

We turn our attention now to modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models unifies a family of regression models that includes (but is not limited to) logistic, Poisson, and linear regression.
Linear Regression: Modeling the Mean

Recall that linear regression involves modeling the mean of some outcome variable as a function of one or more explanatory variables. That is, we have a sample Y_1, ..., Y_n of independent measures, where the i-th subject in our sample has p explanatory variables x_i1, x_i2, ..., x_ip, and E(Y_i) = μ_i. The linear regression model specifies that

    Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip + ε_i,  where ε_i ~ N(0, σ²), for i = 1, ..., n.

Then E(Y_i | x_i1, x_i2, ..., x_ip) = β_0 + β_1 x_i1 + ... + β_p x_ip, and Var(Y_i) = σ². This model assumes that the mean of the outcome variable changes linearly with respect to the explanatory variables.
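As a quick numerical illustration of this model (a Python sketch, not part of the course's SAS workflow; the sample size, coefficients, and error variance below are arbitrary choices for demonstration), we can simulate data from Y_i = β_0 + β_1 x_i + ε_i and recover the coefficients by least squares:

```python
import numpy as np

# Simulate from the linear model Y_i = b0 + b1*x_i + eps_i, eps_i ~ N(0, sigma^2).
# All numeric values here are hypothetical, chosen only for illustration.
rng = np.random.default_rng(0)
n, b0, b1, sigma = 500, 2.0, 3.0, 1.0
x = rng.uniform(0, 1, n)
y = b0 + b1 * x + rng.normal(0, sigma, n)

# Least-squares fit: design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates should be close to (2.0, 3.0)
```

The fitted coefficients estimate how the mean of Y changes with x, which is exactly what E(Y_i | x_i) = β_0 + β_1 x_i asserts.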
The Three Components of a Generalized Linear Model

Whereas with linear regression we model the mean of the outcome variable directly, a generalized linear model is constructed to model the effects of the covariates on a function of the mean. A generalized linear model hence comprises three components:
1. The random component, which specifies the distribution of the outcome variable.
2. The systematic component, which represents a function of the covariates that will link to the outcome variable.
3. The link function, which determines how the mean of the outcome variable relates to the covariates.
Generalized Linear Models for Binary Data

We have a sample Y_1, ..., Y_n of independent binary outcome measurements. The i-th subject in our sample has p explanatory variables x_i1, x_i2, ..., x_ip. Suppose that P(Y_i = 1) = π_i and P(Y_i = 0) = 1 − π_i. Hence, E(Y_i) = π_i. The random component in this case is clearly binomial. For the purpose of this class, we will always assume that the systematic component is simply a linear combination of the covariates, β_0 + β_1 x_i1 + ... + β_p x_ip. The remaining question is: how do we model π_i as a function of the covariates (i.e., what is the link function)?
Link Functions for the Binomial Distribution

Suppose that we assume that E(Y_i) = π_i = β_0 + β_1 x_i1 + ... + β_p x_ip. We call this the identity link. Does this model have any practical shortcomings? Since the systematic component can take on any real value, we often prefer a link that constrains π_i to the interval between 0 and 1. The so-called logistic (also called the logit or log-odds) link, g(π_i) = log[π_i / (1 − π_i)], accomplishes this. There are other links (e.g., the probit), but we will focus mainly on the logit.
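To see why the logit link solves the range problem, note that its inverse maps any real-valued linear predictor back into (0, 1). A small Python sketch (for illustration only; not part of the course's SAS workflow):

```python
import math

def logit(p):
    """Logit (log-odds) link: maps a probability p in (0,1) to the real line."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse logit: maps any real linear predictor eta back into (0,1)."""
    return 1 / (1 + math.exp(-eta))

# No matter how extreme the linear predictor, the fitted probability
# stays strictly between 0 and 1:
print(inv_logit(-10), inv_logit(0), inv_logit(10))
```

With the identity link, by contrast, a fitted mean could fall outside [0, 1] and would not be interpretable as a probability.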
Example VI.A

The Cache County Memory Study (CCMS) has followed approximately 5100 elderly men and women continually since 1995. The project's focus has been to better understand genetic and environmental modifiers of dementia risk (particularly Alzheimer's disease, AD) and cognitive health. The ε4 allele of the APOE gene is a reported risk factor for AD. The data in the following table summarize findings (thus far) from the CCMS relative to APOE and AD.

                   AD     No AD    Total
  ≥1 ε4 allele    308      1296     1604
  No ε4           262      3096     3358
  Total           570      4392     4962
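Before fitting any model, it is worth computing the sample odds ratio directly from this 2 x 2 table. A quick Python check (illustrative only; the course does this in SAS):

```python
# Cell counts from the CCMS 2x2 table in the text.
a, b = 308, 1296   # >=1 e4 allele: AD, no AD
c, d = 262, 3096   # no e4 allele:  AD, no AD

# Sample odds ratio: (odds of AD among carriers) / (odds among non-carriers).
odds_ratio = (a * d) / (b * c)
print(round(odds_ratio, 4))  # -> 2.8083
```

Carriers of at least one ε4 allele have roughly 2.8 times the odds of an AD diagnosis compared with non-carriers.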
Example VI.A (cont'd)

Consider a generalized linear model in this case. Note that these data can be viewed as a sample of outcome measures Y_1, ..., Y_4962, where Y_i = 1 if the i-th individual was diagnosed with AD, and Y_i = 0 otherwise. Moreover, we have a single covariate X_i, which is 1 if the i-th individual has at least one copy of the APOE ε4 allele, and 0 otherwise. The model looks something like this:

    logit(π_i) = log[π_i / (1 − π_i)] = β_0 + β_1 X_i
Example VI.A (cont'd)

First, how do we interpret the coefficients of this model?
What is logit(π_i | X_i = 1)?
What is logit(π_i | X_i = 0)?
What is the log odds ratio with respect to AD risk, comparing those with at least one ε4 allele to those with no ε4?

The regression coefficient of a binary covariate in a logistic regression model represents the log odds ratio comparing the group identified as 1 to the group identified as 0. More generally, the coefficient of any arbitrary covariate in a logistic regression model represents the log odds ratio for subjects who differ by one unit with respect to the covariate.
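The algebra behind that interpretation can be traced in a few lines (a Python sketch with hypothetical coefficient values, used only to make the cancellation explicit):

```python
import math

# Under logit(pi) = b0 + b1*X with binary X:
#   logit(pi | X = 1) = b0 + b1
#   logit(pi | X = 0) = b0
# so the log odds ratio comparing X=1 to X=0 is (b0 + b1) - b0 = b1.
b0, b1 = -2.5, 1.0          # hypothetical values for illustration

log_odds_x1 = b0 + b1        # log odds when X = 1
log_odds_x0 = b0             # log odds when X = 0
log_odds_ratio = log_odds_x1 - log_odds_x0
print(log_odds_ratio == b1)  # -> True: the coefficient IS the log odds ratio
print(math.exp(b1))          # exponentiating gives the odds ratio itself
```

Exponentiating β_1 therefore converts the coefficient into an odds ratio for a one-unit difference in the covariate.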
Fitting the Generalized Linear Model for the Alzheimer's Data in SAS

We obtain parameter estimates for a generalized linear model using the method of maximum likelihood; these estimates typically cannot be computed in closed form. The following SAS program shows how to read the AD data into SAS, and then obtain a fit for the regression coefficients in the model of Example VI.A.

options ls=79 nodate;
data;
  input e4 ad count;
  cards;
1 1 308
1 0 1296
0 1 262
0 0 3096
;
proc genmod descending;
  model ad = e4 / dist=bin link=logit type3;
  weight count;
run;
proc sort;
  by descending ad descending e4;
run;
proc freq order=data;
  tables e4*ad / chisq relrisk;
  weight count;
run;
SAS Output (Stat 5100 Linear Regression and Time Series)

The FREQ Procedure

Table of e4 by ad

e4          ad
Frequency |
Percent   |
Row Pct   |
Col Pct   |       1 |       0 |  Total
----------+---------+---------+-------
        1 |     308 |    1296 |   1604
          |    6.21 |   26.12 |  32.33
          |   19.20 |   80.80 |
          |   54.04 |   29.51 |
----------+---------+---------+-------
        0 |     262 |    3096 |   3358
          |    5.28 |   62.39 |  67.67
          |    7.80 |   92.20 |
          |   45.96 |   70.49 |
----------+---------+---------+-------
Total           570      4392     4962
              11.49     88.51   100.00

Statistics for Table of e4 by ad

Statistic                      DF       Value      Prob
--------------------------------------------------------
Chi-Square                      1    138.7375    <.0001
Likelihood Ratio Chi-Square     1    129.9802    <.0001
Continuity Adj. Chi-Square      1    137.6186    <.0001
Mantel-Haenszel Chi-Square      1    138.7095    <.0001
Phi Coefficient                        0.1672
Contingency Coefficient                0.1649
Cramer's V                             0.1672
SAS Output (cont'd)

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        308
Left-sided Pr <= F           1.0000
Right-sided Pr >= F       3.366E-30
Table Probability (P)     6.085E-30
Two-sided Pr <= P         4.889E-30

Estimates of the Relative Risk (Row1/Row2)

Type of Study                  Value    95% Confidence Limits
--------------------------------------------------------------
Case-Control (Odds Ratio)     2.8083      2.3527      3.3522
Cohort (Col1 Risk)            2.4611      2.1106      2.8697
Cohort (Col2 Risk)            0.8764      0.8540      0.8993
SAS Output (cont'd)

The GENMOD Procedure

Model Information
Data Set               WORK.DATA1
Distribution             Binomial
Link Function               Logit
Dependent Variable             ad
Scale Weight Variable       count

Number of Observations Read       4
Number of Observations Used       4
Sum of Weights                 4962
Number of Events                  2
Number of Trials                  4

Response Profile
Ordered              Total
Value     ad     Frequency
1          1           570
2          0          4392

PROC GENMOD is modeling the probability that ad='1'.

Criteria For Assessing Goodness Of Fit
Criterion             DF        Value     Value/DF
Deviance               2    3408.7579    1704.3790
Scaled Deviance        2    3408.7579    1704.3790
Pearson Chi-Square     2    4962.0000    2481.0000
Scaled Pearson X2      2    4962.0000    2481.0000
Log Likelihood              -1704.3790

Algorithm converged.
SAS Output (cont'd)

Analysis Of Parameter Estimates

                        Standard      Wald 95%            Chi-
Parameter  DF  Estimate    Error   Confidence Limits    Square   Pr > ChiSq
Intercept   1   -2.4695   0.0643   -2.5956   -2.3434   1473.15       <.0001
e4          1    1.0326   0.0903    0.8556    1.2096    130.69       <.0001
Scale       0    1.0000   0.0000    1.0000    1.0000
Example VI.A (cont'd)

The SAS output indicates that β̂_0 = -2.4695 and β̂_1 = 1.0326. According to the fit of the regression model:
What are the estimated log odds of AD for someone with no APOE ε4 allele?
What are the estimated log odds for someone with at least one high-risk allele?
What is the estimated log odds ratio of AD risk comparing ε4 carriers to non-carriers? How does this estimate compare with the sample odds ratio computed using the data in the 2 x 2 table?
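These questions can be checked by hand, because with a single binary covariate the logistic model is saturated: the ML estimates equal functions of the raw 2 x 2 cell counts. A Python verification (illustrative only; the course obtains these numbers from PROC GENMOD):

```python
import math

# CCMS 2x2 cell counts from the text.
a, b = 308, 1296   # >=1 e4 allele: AD, no AD
c, d = 262, 3096   # no e4 allele:  AD, no AD

# With one binary covariate the fit is saturated, so:
#   b0_hat = log odds of AD among non-carriers (X = 0)
#   b1_hat = sample log odds ratio
b0_hat = math.log(c / d)
b1_hat = math.log((a * d) / (b * c))
print(round(b0_hat, 4), round(b1_hat, 4))  # -> -2.4695 1.0326
```

These match the PROC GENMOD estimates exactly, and exp(β̂_1) reproduces the sample odds ratio 2.8083 reported by PROC FREQ.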