VI. Introduction to Logistic Regression

We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models unifies a family of regression models that includes (but is not limited to) logistic, Poisson, and linear regression.
Linear Regression: Modeling the Mean

Recall that linear regression involves modeling the mean of some outcome variable as a function of one or more explanatory variables. That is, we have a sample $Y_1, \dots, Y_n$ of independent measures, where the $i$th subject in our sample has $p$ explanatory variables $x_{i1}, x_{i2}, \dots, x_{ip}$, and $E(Y_i) = \mu_i$. The linear regression model specifies that

$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$, for $i = 1, \dots, n$.

Then $E(Y_i \mid x_{i1}, x_{i2}, \dots, x_{ip}) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$, and $\mathrm{Var}(Y_i) = \sigma^2$. This model assumes that the mean of the outcome variable changes linearly with respect to the explanatory variables.
The Three Components of a Generalized Linear Model

Whereas with linear regression, we model the mean of the outcome variable directly, a generalized linear model is constructed to model the effects of the covariates on a function of the mean. There are hence three parts or components that comprise a generalized linear model:

1. The random component, which specifies the distribution of the outcome variable.
2. The systematic component, which represents a function of the covariates that will link to the outcome variable.
3. The link function, which determines how the mean of the outcome variable relates to the covariates.
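As a point of reference, ordinary linear regression can itself be written in these three components. The sketch below uses $\eta_i$ for the systematic component and $g$ for the link function; both symbols are introduced here for illustration.

```latex
\begin{align*}
\text{Random component:}     \quad & Y_i \sim N(\mu_i, \sigma^2), \\
\text{Systematic component:} \quad & \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \\
\text{Link function:}        \quad & g(\mu_i) = \mu_i = \eta_i \quad \text{(identity link)}.
\end{align*}
```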
Generalized Linear Models for Binary Data

We have a sample $Y_1, \dots, Y_n$ of independent binary outcome measurements. The $i$th subject in our sample has $p$ explanatory variables $x_{i1}, x_{i2}, \dots, x_{ip}$. Suppose that $P(Y_i = 1) = \pi_i$ and $P(Y_i = 0) = 1 - \pi_i$. Hence, $E(Y_i) = \pi_i$.

The random component in this case is clearly binomial. For the purpose of this class, we will always assume that the systematic component is simply a linear combination of the covariates, or $\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$.

The remaining question is: how do we model $\pi_i$ as a function of the covariates (i.e., what is the link function)?
Link Functions for the Binomial Distribution

Suppose that we assume that $E(Y_i) = \pi_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$. We call this the identity link. Does this model have any practical shortcomings?

Since the systematic component can take on any real value, we often prefer a link that constrains $\pi_i$ to the interval between 0 and 1. The so-called logistic (also called the logit or log-odds) link

$g(\pi_i) = \log[\pi_i / (1 - \pi_i)]$

accomplishes this. There are other links (e.g., the probit), but we will focus mainly on the logit.
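Solving the logit link for $\pi_i$ shows why it keeps the probability in range. In this short derivation, $\eta_i$ is shorthand (not used in the original slides) for the systematic component.

```latex
\begin{align*}
\log\!\left[\frac{\pi_i}{1-\pi_i}\right] &= \eta_i,
  \qquad \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \\
\frac{\pi_i}{1-\pi_i} &= e^{\eta_i}, \\
\pi_i &= \frac{e^{\eta_i}}{1+e^{\eta_i}} = \frac{1}{1+e^{-\eta_i}} .
\end{align*}
```

Because $e^{\eta_i} > 0$ for every real $\eta_i$, the last expression always satisfies $0 < \pi_i < 1$. Under the identity link, by contrast, nothing prevents $\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$ from falling below 0 or above 1.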
Example VI.A

The Cache County Memory Study (CCMS) has followed approximately 5100 elderly men and women continually since [year missing]. The project's focus has been to better understand genetic and environmental modifiers of dementia risk, particularly Alzheimer's disease (AD), and cognitive health. The ε4 allele of the APOE gene is a reported risk factor for AD. The data in the following table summarize findings (thus far) from the CCMS relative to APOE and AD.

                  AD      No AD     Total
≥1 ε4 Allele
No ε4
Total
Example VI.A

Consider a generalized linear model in this case. Note that these data can be viewed as a sample of outcome measures $Y_1, \dots, Y_{4962}$, where $Y_i = 1$ if the $i$th individual was diagnosed with AD, and $Y_i = 0$ otherwise. Moreover, we have a single covariate $X_i$, which is 1 if the $i$th individual has at least one copy of the APOE ε4 allele, and 0 otherwise. The model looks something like this:

$\mathrm{logit}(\pi_i) = \log[\pi_i / (1 - \pi_i)] = \beta_0 + \beta_1 X_i$
Example VI.A (cont'd)

First, how do we interpret the coefficients of this model? What is $\mathrm{logit}(\pi_i \mid X_i = 1)$? What is $\mathrm{logit}(\pi_i \mid X_i = 0)$? What is the log odds ratio with respect to AD risk, comparing those with at least one ε4 to those with no ε4?

The regression coefficient of a binary covariate in a logistic regression model represents the log odds ratio comparing the group identified as 1 to the group identified as 0. More generally, the coefficient of any arbitrary covariate in a logistic regression model represents the log odds ratio for subjects who differ by one unit with respect to the covariate.
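The interpretation of $\beta_1$ as a log odds ratio follows by substituting the two values of $X_i$ into the model $\mathrm{logit}(\pi_i) = \beta_0 + \beta_1 X_i$. A short worked version, where OR is an abbreviation introduced here for the odds ratio:

```latex
\begin{align*}
\mathrm{logit}(\pi_i \mid X_i = 0) &= \beta_0, \\
\mathrm{logit}(\pi_i \mid X_i = 1) &= \beta_0 + \beta_1, \\
\log(\mathrm{OR}) &= \mathrm{logit}(\pi_i \mid X_i = 1) - \mathrm{logit}(\pi_i \mid X_i = 0) = \beta_1,
\qquad \mathrm{OR} = e^{\beta_1}.
\end{align*}
```

The same subtraction applied to a covariate value $x + 1$ versus $x$ gives $\beta_1$ again, which is the "one unit" statement for an arbitrary covariate.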
Fitting the Generalized Linear Model for the Alzheimer's Data in SAS

We obtain parameter estimates for a generalized linear model using the method of maximum likelihood; these estimates typically cannot be computed in closed form. The following SAS program shows how to read the AD data into SAS, and then obtain a fit for the regression coefficients in the model of Example VI.A.

    options ls=79 nodate;
    data;
      input e4 ad count;
      cards;
    [four data lines (e4 ad count) missing]
    ;
    proc genmod descending;
      model ad=e4 / dis=bin link=logit type3;
      weight count;
    run;
    proc sort;
      by descending ad descending e4;
    run;
    proc freq order=data;
      tables e4*ad / chisq relrisk;
      weight count;
    run;
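To give a feel for what maximum likelihood fitting involves, here is a minimal Python sketch of Newton-Raphson iterations for the model $\mathrm{logit}(\pi_i) = \beta_0 + \beta_1 X_i$ on grouped data. It is not a description of PROC GENMOD's internal algorithm. The counts in `rows` are hypothetical (they are not the CCMS data), and the function name `fit_logistic` is made up for this illustration.

```python
import math

# Hypothetical grouped data (NOT the CCMS counts): (x, y, count) rows laid out
# like the SAS input lines "e4 ad count".
rows = [(1, 1, 30), (1, 0, 70), (0, 1, 20), (0, 0, 180)]

def fit_logistic(rows, iterations=25):
    """Newton-Raphson for logit(pi) = b0 + b1*x with frequency counts."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]].
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in rows:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # inverse logit
            g0 += w * (y - p)
            g1 += w * (y - p) * x
            v = w * p * (1.0 - p)
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g, using the explicit 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0_hat, b1_hat = fit_logistic(rows)

# Sample log odds ratio from the hypothetical 2 x 2 table, for comparison.
log_or = math.log((30 * 180) / (70 * 20))
print(b0_hat, b1_hat, log_or)
```

With a single binary covariate the model has one parameter per group, so the iterations settle on values that can also be written directly from the table; the printed slope should agree with the sample log odds ratio. With continuous or multiple covariates no such shortcut is generally available, which is why iterative fitting is needed.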
SAS Output

Stat 5100 Linear Regression and Time Series

The FREQ Procedure

Table of e4 by ad
(each cell lists Frequency, Percent, Row Pct, Col Pct; columns are ad = 1, ad = 0, Total; cell values missing)

Statistics for Table of e4 by ad

Statistic                       DF    Value    Prob
Chi-Square                                     <.0001
Likelihood Ratio Chi-Square                    <.0001
Continuity Adj. Chi-Square                     <.0001
Mantel-Haenszel Chi-Square                     <.0001
Phi Coefficient
Contingency Coefficient
Cramer's V
SAS Output (cont'd)

Fisher's Exact Test

Cell (1,1) Frequency (F)        308
Left-sided Pr <= F
Right-sided Pr >= F             3.366E-30
Table Probability (P)           6.085E-30
Two-sided Pr <= P               4.889E-30

Estimates of the Relative Risk (Row1/Row2)

Type of Study                   Value    95% Confidence Limits
Case-Control (Odds Ratio)
Cohort (Col1 Risk)
Cohort (Col2 Risk)
SAS Output (cont'd)

The GENMOD Procedure

Model Information

Data Set                        WORK.DATA1
Distribution                    Binomial
Link Function                   Logit
Dependent Variable              ad
Scale Weight Variable           count

Number of Observations Read     4
Number of Observations Used     4
Sum of Weights                  4962
Number of Events                2
Number of Trials                4

Response Profile

Ordered Value    ad    Total Frequency
(values missing)

PROC GENMOD is modeling the probability that ad='1'.

Criteria For Assessing Goodness Of Fit

Criterion               DF    Value    Value/DF
Deviance
Scaled Deviance
Pearson Chi-Square
Scaled Pearson X2
Log Likelihood

Algorithm converged.
SAS Output (cont'd)

Analysis Of Parameter Estimates

Parameter    DF    Estimate    Standard Error    Wald 95% Confidence Limits    Chi-Square    Pr > ChiSq
Intercept                                                                                    <.0001
e4                                                                                           <.0001
Scale
Example VI.A (cont'd)

The SAS output indicates that $\hat{\beta}_0 =$ [value missing] and $\hat{\beta}_1 =$ [value missing]. According to the fit of the regression model, what are the estimated log odds of AD for someone with no APOE ε4 allele? What are the estimated log odds for someone with at least one high-risk allele? What is the estimated log odds ratio of AD risk comparing ε4 carriers to noncarriers? How does this estimate compare with the sample odds ratio computed using the data in the 2 x 2 table?
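For the last comparison, it may help to write the sample odds ratio in terms of generic cell counts. The labels $a$, $b$, $c$, $d$ below are placeholders introduced here, not the CCMS counts: $a$ and $b$ are ε4 carriers with and without AD, and $c$ and $d$ are noncarriers with and without AD.

```latex
\begin{align*}
\widehat{\mathrm{OR}} &= \frac{a/b}{c/d} = \frac{ad}{bc}, \\
\hat{\beta}_0 &= \log\!\left(\frac{c}{d}\right)
  && \text{(estimated log odds of AD, noncarriers)}, \\
\hat{\beta}_0 + \hat{\beta}_1 &= \log\!\left(\frac{a}{b}\right)
  && \text{(estimated log odds of AD, carriers)}, \\
\hat{\beta}_1 &= \log\!\left(\frac{ad}{bc}\right) = \log \widehat{\mathrm{OR}} .
\end{align*}
```

These identities hold because the model has one parameter for each of the two groups, so the maximum likelihood fit reproduces the observed proportion of AD in each group.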