Unit 12: Logistic Regression
Supplementary Chapter 14 in IPS on CD (Chap. 16, 5th ed.)
- Logistic regression generalizes methods for 2-way tables
- Adds the capability of studying several predictors, but is limited to binary response variables
- Similar in intent to linear regression, but the details are different
- A method for estimating the joint association between several predictors and a response variable
- Typically useful in some class projects
Betting in a fair game
- An American roulette wheel has 38 slots: 1, 2, 3, ..., 36, 0, 00
- If you place a $1 bet on 00 for a single spin of the wheel, you have a 1/38 chance of winning: 1 way to win, 37 ways to lose
- Equivalently, the casino has 37 ways to win, 1 way to lose
- The odds of winning are 37 to 1 for the house, 1 to 37 for you
Betting in roulette...
- For the game to be fair:
  - The casino keeps your $1 if 00 does not come up
  - The casino pays $37 if 00 comes up, and you keep your bet
  - If X is your winnings from a $1 bet, E(X) = -1(37/38) + 37(1/38) = 0
- Casinos stay in business by paying out only 35 to 1; this ensures that roulette is not a fair game. In this case E(X) = -1(37/38) + 35(1/38) = -(2/38) = -0.053
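The expected-value calculation above can be sketched in Python (the function name and default are illustrative, not part of the slides):

```python
def expected_winnings(payout, p_win=1/38):
    """E(X) for a $1 bet that pays `payout` to 1 with win probability p_win."""
    # Lose the $1 stake with probability (1 - p_win); win `payout` with p_win.
    return -1 * (1 - p_win) + payout * p_win

fair = expected_winnings(37)    # 37-to-1 payout: a fair game, E(X) = 0
actual = expected_winnings(35)  # 35-to-1 payout: E(X) = -2/38, the house edge
```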
Converting probabilities to odds and log(odds)
- In a game of chance, the odds of winning equal the ratio of money that should be bet by the two players
- In roulette, the odds of your winning is the ratio of the probability of winning to the probability of losing: p/(1-p) = (1/38)/(37/38) = 1/37
- Typically, odds are given to show the ratio of the payout: 37 to 1 in this case
- The values of an odds range from 0 to +∞; think of probabilities 0.01, 0.001, 0.0001, 0.99, 0.999, 0.9999, etc.
- We will use a transformation of odds to log(odds); the values of log(odds) range from -∞ to +∞
Odds vs log(odds): transformations of p
Why consider such a transformation? Answer: it transforms a variable bounded by 0 < p < 1 into a quantitative variable ranging from -∞ to +∞. It is a simple algebraic operation to go back and forth between probabilities and log(odds).

  p     Odds p/(1-p)   Log odds log(p/(1-p))
  0.0   0              -∞
  0.1   0.111          -2.20
  0.2   0.250          -1.38
  0.3   0.429          -0.85
  0.4   0.667          -0.40
  0.5   1.000           0
  0.6   1.500           0.40
  0.7   2.333           0.85
  0.8   4.000           1.38
  0.9   9.000           2.20
  1.0   +∞              +∞
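The back-and-forth conversions in the table can be written as small Python helpers (function names are illustrative; the table uses the natural log):

```python
import math

def odds(p):
    """Convert a probability 0 < p < 1 to an odds."""
    return p / (1 - p)

def log_odds(p):
    """Convert a probability to the natural log of the odds."""
    return math.log(odds(p))

def inverse_logit(lo):
    """Convert a log(odds) back to a probability."""
    return 1 / (1 + math.exp(-lo))

# Reproduce rows of the table above
round(odds(0.3), 3)      # 0.429
round(log_odds(0.9), 2)  # 2.2
```

Going from log(odds) back to p with `inverse_logit` recovers the original probability, which is the "simple algebraic operation" the slide refers to.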
Computing odds in data, an example
The example on the next slide is very similar to IPS Example 8.1 (5th and 6th editions), but the numbers are from the 5th ed. Be careful when reading the example in the 6th ed.
Example: binge drinking survey

  Binger   Men            Women           Total
  Yes      1630  (22.7%)  1684  (17.0%)    3314
  No       5550  (77.3%)  8232  (83.0%)   13782
  Total    7180  (100%)   9916  (100%)    17096
Logistic Regression
Idea behind logistic regression:
- Let p̂_M be the proportion of men who are binge drinkers; log(p̂_M/(1-p̂_M)) is the log odds for men.
- Let p̂_F be the proportion of women who are binge drinkers; log(p̂_F/(1-p̂_F)) is the log odds for women.
- The ratio of the odds (called the odds ratio) of men to women being binge drinkers is
    [p̂_M/(1-p̂_M)] / [p̂_F/(1-p̂_F)] = 0.294/0.205 = 1.434
- Now recall log(x/y) = log(x) - log(y).
Idea behind logistic regression (continued)
- In the binge drinking table,
    log{[p̂_M/(1-p̂_M)] / [p̂_F/(1-p̂_F)]} = log(0.294) - log(0.205) = -1.225 - (-1.587) = 0.362
- The log odds for males differs from the log odds for females by a constant.
- Logistic regression is a model in which predictors induce changes in log(odds), similar to linear regression, where predictors induce changes in the mean of the response variable.
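The odds and log-odds arithmetic above can be checked directly from the table counts; a minimal sketch:

```python
import math

# Binge-drinking counts from the 2x2 table: men 1630/7180, women 1684/9916
p_m = 1630 / 7180
p_f = 1684 / 9916

odds_m = p_m / (1 - p_m)   # ~0.294, odds for men
odds_f = p_f / (1 - p_f)   # ~0.205, odds for women

# Difference of log odds = log of the odds ratio, ~0.362
log_or = math.log(odds_m) - math.log(odds_f)
```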
Model for Logistic Regression
- Set the log odds to be a linear combination of the predictor variables:
    log(p/(1-p)) = β0 + β1x1 + ... + βkxk
- This is the Logistic Regression Model
- Sometimes equivalently written as:
    p = e^(β0 + β1x1 + ... + βkxk) / (1 + e^(β0 + β1x1 + ... + βkxk))
Logistic regression
The Logistic Regression Model:
- Predictor variables (the x's) can be quantitative or binary
- More complex formulas for estimates than least squares
- The omnibus test of the model is now a χ² test, not an F test
- We can test each predictor variable (x_i) for its contribution, but now this is a z test, not a t test
- The assumptions of this model are quite complex and are not often checked
- The logistic regression model is widely used
- Coefficients can be derived directly in some 2-way tables
Back to the binge drinking example
Example: binge drinking survey

  Binger   Men            Women           Total
  Yes      1630  (22.7%)  1684  (17.0%)    3314  (19.4%)
  No       5550  (77.3%)  8232  (83.0%)   13782  (80.6%)
  Total    7180  (100%)   9916  (100%)    17096  (100%)
The logistic model - binge drinking
From the previous slide (odds of being a binge drinker):
- For men: log odds = -1.225
- For women: log odds = -1.587
The logistic model for one predictor (gender) is
    log(p/(1-p)) = log odds = b0 + b1X1
where Y = 1 if a binge drinker, 0 otherwise, and X1 = 1 if male, 0 if female.
So under the model:
- For men: log odds = b0 + b1 = -1.225
- For women: log odds = b0 = -1.587
Solving: b0 = -1.587 and b1 = -1.225 - b0 = 0.362
Thus the fitted logistic model for this example is
    log odds = -1.587 + 0.362X1
The logistic model - binge drinking
Working backwards to confirm this fitted model:
    log odds = log(p/(1-p)) = -1.587 + 0.362X1, where X1 = 1 if male and X1 = 0 if female
For men:
    log odds = -1.587 + 0.362(1) = -1.225, so odds = e^(-1.225) = 0.294
    Thus the proportion of binge drinkers is odds/(odds + 1) = 0.294/1.294 = 0.227
For women:
    log odds = -1.587 + 0 = -1.587, so odds = e^(-1.587) = 0.205
    Thus the proportion of binge drinkers is odds/(odds + 1) = 0.205/1.205 = 0.170
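The working-backwards check above is easy to script; a sketch using the fitted coefficients from the slide (the function name is illustrative):

```python
import math

b0, b1 = -1.587, 0.362  # fitted coefficients from the slide

def p_binge(x1):
    """Predicted proportion of binge drinkers (x1 = 1 for men, 0 for women)."""
    log_odds = b0 + b1 * x1
    o = math.exp(log_odds)      # log odds -> odds
    return o / (o + 1)          # odds -> proportion

round(p_binge(1), 3)  # ~0.227, matching the men's column of the table
round(p_binge(0), 3)  # ~0.170, matching the women's column
```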
Comparing two proportions: relative risk and odds ratio

            S        F        Total
  Group 1   a        b        a + b
  Group 2   c        d        c + d
  Total     a + c    b + d

Relative risk: RR = [a/(a+b)] / [c/(c+d)], the ratio of the proportions of successes.
Odds ratio: OR = (a/b)/(c/d) = ad/bc, the ratio of the odds of success.
Odds ratio
- As with RR, an odds ratio of 1 indicates the proportion of successes (events) is the same in both groups
- RR is easier to interpret (a ratio of sample proportions)
- When successes are rare, RR and OR are very similar
- When successes are common, RR and OR are similar only if they are close to 1; OR tends to overstate differences
- Example: binge drinking
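The RR/OR contrast shows up in the binge drinking table itself, where "success" (binging) is not rare; a sketch of both measures:

```python
# Binge drinking 2x2 counts (rows: yes/no; columns implied by naming)
a, b = 1630, 5550   # men: binger, not a binger
c, d = 1684, 8232   # women: binger, not a binger

rr = (a / (a + b)) / (c / (c + d))  # relative risk, ratio of proportions, ~1.34
odds_ratio = (a * d) / (b * c)      # cross-product ratio, ~1.44

# Binging is common (~20%), so the OR (1.44) overstates the RR (1.34)
```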
Odds ratio and logistic regression
- The odds ratio is the key output from a logistic regression; an OR is calculated for each predictor variable
- OR measures the strength of the effect on p (the probability of "success")
Example: binge drinking
    log odds = -1.587 + 0.362X1, where X1 = 1 if M, 0 if F
- For men: log odds = b0 + b1 = -1.225
- For women: log odds = b0 = -1.587
Let OR be the odds ratio of men to women:
    log(OR) = log(odds for men) - log(odds for women) = (b0 + b1) - b0 = b1
So OR = e^b1. For the binge drinking example, OR = e^0.362 = 1.436
Inference for logistic regression parameters
- A 95% confidence interval for the coefficient β1 is given by b1 ± 1.96 s.e.(b1)
- A 95% confidence interval for the odds ratio e^β1 is given by e^(b1 ± 1.96 s.e.(b1))
- To test the null hypothesis H0: β1 = 0 (i.e., no association between the response variable and the predictor variable X1), use
    Z = b1 / s.e.(b1)
  Z has (approximately) a N(0, 1) distribution when H0 is true.
Binge drinking: expanding data from a 2x2 table to a rectangular data file
The 2x2 table:

  Binger   Men    Women   Total
  Yes      1630   1684     3314
  No       5550   8232    13782
  Total    7180   9916    17096

Let Binge = 1 if a binger, 0 otherwise; let Sex = 1 if male, 0 otherwise.
Stata commands:

  input Binge Sex Count
  1 1 1630
  0 1 5550
  1 0 1684
  0 0 8232
  end
  expand Count

The expand command creates a rectangular data file with 17,096 rows (1 1, 1 1, etc.).
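The same expansion from counts to one row per subject, sketched in Python rather than Stata (variable names are illustrative):

```python
# Each tuple is (Binge, Sex, Count), mirroring the Stata input block above.
cells = [(1, 1, 1630), (0, 1, 5550), (1, 0, 1684), (0, 0, 8232)]

# Equivalent of Stata's `expand Count`: repeat each cell Count times.
rows = []
for binge, sex, count in cells:
    rows.extend([(binge, sex)] * count)

len(rows)                      # 17096 rows, matching the table total
sum(b for b, s in rows)        # 3314 bingers, matching the Yes row
```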
Binge drinking logistic regression
- Stata has 2 commands: logistic and logit
- logistic displays ORs; logit displays model coefficients
- Note: b0 = -1.587 and b1 = 0.362, as earlier
Binge drinking logistic regression
The logistic command displays the odds ratio. Notes:
- OR = 1.436, as earlier
- The 95% CI for the OR, (1.33, 1.55), excludes OR = 1
- z for the Wald test = 9.31 (P < 0.001)
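The CI and Wald z reported above follow from the inference formulas on the earlier slide; a sketch using b1 = 0.362 and an assumed s.e. of 0.0389 (back-computed from z = 9.31, so it is an approximation, not a number from the slides):

```python
import math

b1 = 0.362    # coefficient from the fit
se = 0.0389   # ASSUMED standard error, reconstructed from z = b1/se = 9.31

z = b1 / se   # Wald statistic, ~9.3

# 95% CI for the OR: exponentiate the CI for the coefficient
ci = (math.exp(b1 - 1.96 * se), math.exp(b1 + 1.96 * se))  # ~ (1.33, 1.55)
```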
Multiple logistic regression
Example: Intensive Care Unit (ICU)
- Study of 200 patients admitted to the adult ICU at Baystate Medical Center in Springfield, MA
- Response variable: survival until hospital discharge (Surv); Surv = 1 if died, 0 if survived
- Predictor variables:
  - Age, in years (Age)
  - Sex = 1 if female, 0 if male (Sex)
  - Race = 1 if white, 0 otherwise (Race)
  - Heart rate at ICU admission, beats/min. (HRate)
  - Level of consciousness at ICU admission (LOC); LOC = 1 if deep stupor or coma, 0 otherwise
Source: Hosmer & Lemeshow (2000), Wiley & Sons
[Figures: density histograms of Age and of Heart Rate at ICU Admission]

. table LOC
  Level of Consciousness at ICU Admission   Freq.
  No Coma or Deep Stupor                      185
  Deep Stupor                                   5
  Coma                                         10

. table surv
  surv   Freq.
  0        160
  1         40
Example ICU: summary of the data
Response variable: survival to hospital discharge (Surv); N = 200, 20% died
Predictor variables:
- Sex: 38% female
- Age: average is 57.5 yrs., range from 16 to 92 yrs.
- Race: 87.5% white
- HRate: average is 99, range from 39 to 192 beats/min.
- LOC: 7.5% in deep stupor or coma
ICU - predictors of death before discharge
- A logistic regression with all 5 predictors
- Age and level of consciousness (LOC) are both significant
- Re-estimate the model keeping only the significant terms
ICU - predictors of death before discharge
The final logistic model (using the logistic command):
- Age and LOC are significant
- Odds ratio for Age: 1.028 (95% CI is 1.004 to 1.064), P = 0.022
- Odds ratio for LOC: 36.16 (95% CI is 7.63 to 171.24), P < 0.001
ICU - predictors of death before discharge
The final logistic model (using the logit command) shows the model coefficients:
    log odds = log(p/(1-p)) = -3.46 + 0.028 Age + 3.59 LOC
Note: e^0.028 = 1.028 (OR for Age) and e^3.59 = 36.16 (OR for LOC)
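Exponentiating the logit coefficients recovers the ORs from the logistic output; a sketch (e^3.59 comes out near 36.2 rather than exactly 36.16 because 3.59 is a rounded coefficient):

```python
import math

# Coefficients from the fitted ICU model on the slide
b_age, b_loc = 0.028, 3.59

or_age = math.exp(b_age)  # ~1.028: odds of death rise ~2.8% per year of age
or_loc = math.exp(b_loc)  # ~36: far higher odds of death with stupor/coma
```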
Interpretation of coefficients of the logistic regression model
- The sign of the β_i term indicates whether p increases or decreases as x_i increases
  - ICU example: both β_i terms were positive, so the risk of death increases with age and with the presence of deep stupor or coma
- The magnitude of the β_i term gives the additive change in log odds for a +1 unit change in the predictor variable, holding other predictors fixed
Interpretation of coefficients
- The magnitude of the odds ratio term (= e^β_i) gives the multiplicative change in odds for a +1 change in the predictor
  - ICU example: the odds of death increase multiplicatively by 2.8% (OR = 1.028) for each year increase in age
- To see this, exponentiate both sides of the logistic model and note
    p/(1-p) = e^(β0 + β1[x+1]) = (e^β0)(e^(β1 x))(e^β1), where e^β1 = OR
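The multiplicative interpretation can be verified numerically: the odds ratio for a one-unit step is the same no matter where the step starts. A sketch using the ICU Age coefficients (function name illustrative):

```python
import math

b0, b1 = -3.46, 0.028  # intercept and Age coefficient from the ICU model

def odds_of_death(age):
    """Model-implied odds of death at a given age (LOC = 0)."""
    return math.exp(b0 + b1 * age)

# A one-year step multiplies the odds by e^b1 ~= 1.028, at any age
ratio_50_to_51 = odds_of_death(51) / odds_of_death(50)
ratio_80_to_81 = odds_of_death(81) / odds_of_death(80)
```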
Final Thoughts on Logistic Regression
- Some of you will find logistic regression useful in a project, so the last p-set has a logistic regression problem. It is not covered on the final exam, because we have not had time to digest it.
- Logistic regression extends the analysis of two-way tables: the response variable must still be binary, but predictors can now be categorical or quantitative.
- Logistic regression is an example of a class of regression models much more general than linear regression. These models are covered in detail in Stat 138 and Stat 149.