Alternatives to logistic regression
Laura Rosella, PhD
Scientist, Public Health Ontario

Acknowledgments
- Course: Categorical Data Analysis for Epidemiologic Studies (Course director: Laura Rosella, PhD)
- Dr. Marcelo Urquia, SMH

Objectives
- To understand the pros and cons of the logistic regression approach
- To discuss the appropriate use of logistic regression
- To identify alternatives to logistic regression and discuss their strengths and weaknesses
- To walk through the approaches using a worked example
Goal: Thoughtful use of logistic regression

Binomial Logistic Regression Model
- Binomial regression is based on the binomial distribution
- $\operatorname{logit}(\pi(y)) = \ln\left(\frac{\pi(y)}{1-\pi(y)}\right)$
- The ratio $\pi(y)/(1-\pi(y))$ inside the logarithm is the odds of the outcome
- The logit (i.e. log-odds) link keeps the modelled probability $\pi(y)$ bounded between 0 and 1

$\ln\left(\frac{\pi(y)}{1-\pi(y)}\right) = \alpha + \beta x$
- Logistic regression is a linear model on the log-odds scale
- For each one-unit increase in x, the log-odds increase linearly by $\beta$, which is an exponential (multiplicative) change in the odds

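The reading of the coefficient as an odds ratio follows in one step from the model. A short derivation, writing odds(x) for $\pi/(1-\pi)$ at covariate value x (this shorthand is introduced here and is not on the slide):

```latex
\begin{aligned}
\ln \operatorname{odds}(x) &= \alpha + \beta x \\
\ln \operatorname{odds}(x+1) - \ln \operatorname{odds}(x) &= \beta \\
\frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)} &= e^{\beta}
\end{aligned}
```

So $e^{\beta}$ is the odds ratio for a one-unit increase in x, whatever the starting value of x.
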
Epi 101

Exposure    Disease present    Disease absent
Present     a                  b
Absent      c                  d

- Relative Risk: $RR = \frac{a/(a+b)}{c/(c+d)}$, i.e. risk in the exposed / risk in the unexposed
- Odds Ratio: $OR = \frac{a/b}{c/d} = \frac{ad}{bc}$, i.e. the ratio of the odds of developing the outcome in the exposed compared to the unexposed
- Consensus: relative risk is preferred over the odds ratio for most prospective investigations

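The two definitions can be checked with a few lines of code. This is a minimal sketch; the function names and the counts are hypothetical and chosen only for illustration.

```python
def relative_risk(a, b, c, d):
    """Risk in the exposed divided by risk in the unexposed."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds of disease in the exposed divided by odds in the unexposed."""
    return (a * d) / (b * c)


# Hypothetical counts (not from the slides): a, b = exposed with/without disease,
# c, d = unexposed with/without disease.
a, b, c, d = 20, 80, 10, 90
print(f"RR = {relative_risk(a, b, c, d):.2f}")  # risks are 0.20 and 0.10
print(f"OR = {odds_ratio(a, b, c, d):.2f}")     # (20*90) / (80*10)
```

With a 10% risk in the unexposed, the OR is already somewhat larger than the RR.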
The strengths of the logistic regression approach
- Logistic regression can be applied to many different study designs (cohort, case-control, cross-sectional)
- The Odds Ratio (OR) provides a good approximation of the Relative Risk when the outcome is rare
- Fairly easy to run in many different statistical software packages (too easy?)
- Multivariate: several covariates can be included in the model

The problem with logistic regression
- The OR overestimates the Relative Risk (lies further from 1) when the outcome is common (rule of thumb: > 10%)
- Despite advice on the rare-outcome assumption, consumers of the health research literature often interpret the OR as a Relative Risk (RR), which can exaggerate the apparent effect
- Logistic regression became easy to use and very popular, and there is a perception that alternative methods do not exist
- But there are easy and potentially more appropriate alternatives when you want to estimate relative risk

Example

Each cell group lists Y=1, Y=0, row total. Po = prevalence among the non-exposed.

Relative Risk = 2 at prevalence among non-exposed = 0.1, 0.2 and 0.3

Po     X=1 (exposed)    X=0 (non-exposed)    Total          RR    OR
0.1    2, 8, 10         9, 81, 90            11, 89, 100    2     2.3
0.2    4, 6, 10         18, 72, 90           22, 78, 100    2     2.7
0.3    6, 4, 10         27, 63, 90           33, 67, 100    2     3.5

Relative Risk = 3 at prevalence among non-exposed = 0.1, 0.2 and 0.3

Po     X=1 (exposed)    X=0 (non-exposed)    Total          RR    OR
0.1    3, 7, 10         9, 81, 90            12, 88, 100    3     3.9
0.2    6, 4, 10         18, 72, 90           24, 76, 100    3     6
0.3    9, 1, 10         27, 63, 90           36, 64, 100    3     21

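The pattern in the example can be reproduced directly from the definitions of risk and odds. A minimal sketch, assuming a single binary exposure and no confounding; the function name is hypothetical.

```python
def odds_ratio_from_rr(rr, p0):
    """Odds ratio implied by a relative risk `rr` and a risk `p0` in the non-exposed."""
    p1 = rr * p0  # risk in the exposed
    if not 0 < p0 < 1 or not 0 < p1 < 1:
        raise ValueError("risks must lie strictly between 0 and 1")
    return (p1 / (1 - p1)) / (p0 / (1 - p0))


# Same scenarios as the example: RR of 2 or 3, Po of 0.1, 0.2 and 0.3
for rr in (2, 3):
    for p0 in (0.1, 0.2, 0.3):
        print(f"RR = {rr}, Po = {p0}: OR = {odds_ratio_from_rr(rr, p0):.2f}")
```

The printed values agree with the OR column of the example after rounding to one decimal. The gap between OR and RR widens as either the baseline risk or the RR increases.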
Zhang & Yu's simple formula, JAMA 1998
- The formula can be used to correct the adjusted OR derived from logistic regression, giving an estimate of the treatment effect that better represents the true relative risk
Zhang and Yu, 1998, JAMA

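The formula itself did not survive on this slide. As published by Zhang and Yu, it is RR = OR / ((1 - P0) + P0 × OR), where P0 is the incidence of the outcome in the non-exposed group. The sketch below applies it to the crude OR and confidence limits from the working example later in the deck; the function name is hypothetical.

```python
def zhang_yu_rr(odds_ratio, p0):
    """Zhang and Yu (1998) correction: approximate RR from an OR and baseline risk p0."""
    return odds_ratio / ((1 - p0) + p0 * odds_ratio)


# Crude values from the working example: OR 4.9847 (3.2244, 7.7060),
# incidence in the non-obese group P0 = 644/3068.
p0 = 644 / 3068
for label, value in [("estimate", 4.9847), ("lower", 3.2244), ("upper", 7.7060)]:
    print(f"{label}: {zhang_yu_rr(value, p0):.2f}")
```

Applying the formula to the confidence limits is what gives the narrower interval (2.20, 3.20) shown in the comparison slide. The formula treats P0 as known and ignores its sampling variability.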
Limitations of Zhang and Yu's formula
- Trade-off between simplicity and precision
- Not very reliable in the presence of covariates: produces confidence intervals narrower than they should be
- May slightly overestimate the RR when confounding exists
- Ignores covariance between the estimated incidence and the estimated odds ratio
- SHOULD NOT BE USED ON AN ADJUSTED OR: using the formula in this manner is incorrect and will produce a biased estimate when confounding is present

Other alternatives
- Log-binomial regression
- Poisson regression (and negative binomial)
- Poisson with robust variance estimator (modified Poisson)
- Cox regression

Hypothetical working example
- WCGS cohort study: a cohort of men in the 1960s followed up to study CVD risk factors
- Outcome: HBP (indicates whether the study participant had high blood pressure at follow-up)
- Exposure: obesity; over = 1 if classified as obese at baseline, 0 if not

    proc freq data = talk;
      tables over*hbp / nopercent nocol relrisk;
    run;

             HBP at follow-up
Obese        Yes       No        Total
Yes          49        37        86
No           644       2424      3068
Total        693       2461      3154

The OR and RR for those who were classified as obese at baseline, compared with those who were not:
- OR = (49 × 2424)/(37 × 644) = 4.98
- RR = (49/86)/(644/3068) = 2.71
- HBP at follow-up: 22% of participants (693/3154)

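The confidence limits reported by the regression models on the following slides can be checked by hand from these counts. The sketch below uses the standard large-sample (Wald) formulas on the log scale. These formulas are not shown in the slides and are assumed here; the helper name is hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def wald_ci(estimate, se_log, level=0.95):
    """Confidence interval for a ratio, built on the log scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return exp(log(estimate) - z * se_log), exp(log(estimate) + z * se_log)


# Counts from the 2x2 table: obese with/without HBP, non-obese with/without HBP
a, b, c, d = 49, 37, 644, 2424

odds_ratio = (a * d) / (b * c)
risk_ratio = (a / (a + b)) / (c / (c + d))

# Standard errors of log(OR) and log(RR) for a single 2x2 table
se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
se_log_rr = sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))

or_ci = wald_ci(odds_ratio, se_log_or)
rr_ci = wald_ci(risk_ratio, se_log_rr)
print(f"OR = {odds_ratio:.4f}, 95% CI ({or_ci[0]:.4f}, {or_ci[1]:.4f})")
print(f"RR = {risk_ratio:.4f}, 95% CI ({rr_ci[0]:.4f}, {rr_ci[1]:.4f})")
```

With a single binary covariate, these intervals should agree with the logistic and log-binomial output that follows, which makes them a quick check that a model was specified as intended.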
Logistic regression

    proc genmod data = talk descending;
      model hbp = over / dist = binomial link = logit;
      /* over is a numeric 0/1 covariate (no CLASS statement), so one coefficient is enough */
      estimate 'Beta' over 1 / exp;
      title1 'Logistic Regression';
    run;

Contrast Estimate Results
             Estimate    Confidence Limits
Exp(Beta)    4.9847      3.2244    7.7060

The same model using PROC LOGISTIC:

    proc logistic data = talk descending;
      model hbp = over;
      title1 'Logistic Regression';
    run;

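Log-Binomial

Logistic (logit link):
- $\log\left(\frac{P_j}{1-P_j}\right) = \beta_0 + \beta_1 X_j$
- X = 0: $\log\left(\frac{P_0}{1-P_0}\right) = \beta_0$
- X = 1: $\log\left(\frac{P_1}{1-P_1}\right) = \beta_0 + \beta_1$
- $\beta_1 = \log\left(\frac{P_1}{1-P_1}\right) - \log\left(\frac{P_0}{1-P_0}\right) = \log(OR)$, so $OR = e^{\beta_1}$

Log-binomial (log link):
- $\log(P_j) = \beta_0 + \beta_1 X_j$
- X = 0: $\log(P_0) = \beta_0$
- X = 1: $\log(P_1) = \beta_0 + \beta_1$
- $\beta_1 = \log(P_1) - \log(P_0) = \log(RR)$, so $RR = e^{\beta_1}$
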
Log-binomial regression

    proc genmod data = talk descending;
      /* binomial errors with a log link: exp(Beta) is the relative risk */
      model hbp = over / dist = binomial link = log;
      estimate 'Beta' over 1 / exp;
      title1 'Log Binomial Regression';
    run;

Contrast Estimate Results
             Estimate    Confidence Limits
Exp(Beta)    2.7144      2.2311    3.3023

Poisson Regression
- The model specifies log(rate) as a linear function of the covariates
- Used when the outcomes of interest are rates (and rate ratios)
- Applied to a binary outcome, a Poisson model without robust error variances gives a confidence interval that is too wide (i.e. it tends to overestimate the variance)

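One way to see why the model-based Poisson interval is too wide for a binary outcome is to compare the variance each distribution assumes for an outcome with mean (risk) $\mu$. This comparison is a standard result and is not taken from the slides:

```latex
\begin{aligned}
\text{Poisson:} \quad & \operatorname{Var}(Y) = \mu \\
\text{Binary (Bernoulli):} \quad & \operatorname{Var}(Y) = \mu(1-\mu) \le \mu
\end{aligned}
```

The Poisson variance exceeds the binary variance by a factor of $1/(1-\mu)$, so the overestimate is negligible for rare outcomes and grows as the outcome becomes common.
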
Poisson regression

    proc genmod data = talk descending;
      /* Poisson errors with a log link, model-based variance */
      model hbp = over / dist = poisson link = log;
      estimate 'Beta' over 1 / exp;
      title1 'Poisson Regression';
    run;

Contrast Estimate Results
             Estimate    Confidence Limits
Exp(Beta)    2.7144      2.0301    3.6292

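The wider Poisson interval can be reproduced by hand from the 2x2 counts. The sketch below assumes the usual closed-form standard errors for a single binary covariate: the Poisson model-based variance of log(RR) uses only the event counts, while the binomial version subtracts the group sizes. Variable names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# Events and group sizes from the working example
a, n1 = 49, 86      # obese: HBP cases, total
c, n0 = 644, 3068   # non-obese: HBP cases, total

log_rr = log((a / n1) / (c / n0))
z = NormalDist().inv_cdf(0.975)

# Model-based Poisson SE: treats the event counts as Poisson
se_poisson = sqrt(1 / a + 1 / c)
# Binomial SE: accounts for the fixed group sizes
se_binomial = sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)

poisson_ci = (exp(log_rr - z * se_poisson), exp(log_rr + z * se_poisson))
print(f"SE(log RR): Poisson {se_poisson:.4f} vs binomial {se_binomial:.4f}")
print(f"Poisson 95% CI: ({poisson_ci[0]:.4f}, {poisson_ci[1]:.4f})")
```

The point estimate is identical under both models; only the standard error differs, which is why the robust variance fix leaves the RR unchanged and tightens the interval.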
Poisson regression with robust variance (modified Poisson)

    proc genmod data = talk;
      class id;
      model hbp = over / dist = poisson link = log;
      /* REPEATED with one record per subject requests robust (sandwich) standard errors */
      repeated subject = id / type = unstr;
      estimate 'Beta' over 1 / exp;
      title1 'Poisson Regression Robust Variance';
    run;

Contrast Estimate Results
             Estimate    Confidence Limits
Exp(Beta)    2.7144      2.2311    3.3023

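Cox regression

    /* give every participant the same follow-up time */
    data talk;
      set talk;
      time = 1;
    run;

    proc phreg data = talk;
      /* hbp = 0 is treated as censored; rl requests confidence limits for the hazard ratio */
      model time*hbp(0) = over / rl;
    run;

Analysis of Maximum Likelihood Estimates
        Hazard Ratio    Confidence Limits
hbp     2.714           2.030    3.629
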
Comparison (crude OR)

Model                                      Estimate (95% CI)
Logistic regression                        OR: 4.98 (3.22, 7.71)
Zhang and Yu's formula                     RR: 2.71 (2.20, 3.20)
Log-binomial regression                    RR: 2.71 (2.23, 3.30)
Poisson regression                         RR: 2.71 (2.03, 3.63)
Poisson regression with robust variance    RR: 2.71 (2.23, 3.30)
Cox regression                             RR: 2.71 (2.03, 3.63)

Comparison (adjusted OR) McNutt et al, AJE 2003;157:940-943
Pros and cons

Zhang and Yu's formula
- Pros: Easy to use
- Cons: Ignores covariance; 10-15% bias in multivariable analyses; underestimates CIs

Log-binomial regression
- Pros: Natural approximation to binomial distribution; small standard error
- Cons: May result in convergence problems (increase iterations or try modified Poisson)

Poisson regression
- Pros: Good approximation to binomial distribution when N is large
- Cons: Conservative CIs; may estimate probabilities greater than 1

Poisson regression with robust variance (modified Poisson)
- Pros: Good approximation to binomial distribution when N is large; small standard error
- Cons: May estimate probabilities greater than 1

Cox regression
- Pros: Good approximation to binomial distribution
- Cons: Does not estimate probabilities (no intercept)

What to do?

If alternative regression methods are not feasible:
1. Zhang and Yu's approximation (acknowledging the limitations)
2. Interpret OR as OR, not as RR

If alternative regression methods are feasible:
1. Log-binomial regression
2. Modified Poisson regression (robust variance)
3. Ordinary Poisson or Cox regression

Other consequences
- Etiologic fraction (EF): the proportion of cases among the exposed in which the exposure played a causal role
- $EF = (I_E - I_O)/I_E$, where $I_E$ = incidence in the exposed and $I_O$ = incidence in the non-exposed
- Population attributable fraction: $PAF = (I_T - I_O)/I_T$, where $I_T$ = incidence in the population
- Also $PAF = \frac{P_E (RR-1)}{P_E (RR-1) + 1}$, where $P_E$ = prevalence of the exposure in the population
- Ideally (i.e. in the absence of confounding, measurement error and ignorance), the sum of all EFs or PAFs is expected to be 1 (or 100%)
- Both are based on risk, not odds! If ORs are used instead of RRs, EF and PAF may be artefactually inflated

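The size of the inflation can be illustrated with the counts from the working example. A minimal sketch, assuming the crude 2x2 table (49, 37, 644, 2424) and no confounding; the function name is hypothetical.

```python
def paf(p_exposed, ratio):
    """Population attributable fraction from exposure prevalence and a ratio measure."""
    excess = p_exposed * (ratio - 1)
    return excess / (excess + 1)


# Working example: obese with/without HBP, non-obese with/without HBP
a, b, c, d = 49, 37, 644, 2424
n = a + b + c + d

p_e = (a + b) / n                        # prevalence of obesity
rr = (a / (a + b)) / (c / (c + d))       # relative risk
odds_ratio = (a * d) / (b * c)           # odds ratio

i_t = (a + c) / n                        # incidence in the whole cohort
i_0 = c / (c + d)                        # incidence in the non-exposed
paf_direct = (i_t - i_0) / i_t           # definition based on incidences

paf_rr = paf(p_e, rr)
paf_or = paf(p_e, odds_ratio)
print(f"PAF from incidences: {paf_direct:.3f}")
print(f"PAF using RR:        {paf_rr:.3f}")
print(f"PAF using OR:        {paf_or:.3f}")
```

The RR-based formula reproduces the incidence-based definition, while substituting the OR roughly doubles the PAF in this example.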
Why do we use odds ratios in case-control studies?
- Cohort study: Exposed / Not exposed (X) → Disease outcome (Y). In statistical terms, Y is the random variable
- Case-control study: Disease outcome (Y) → look back → Exposed / Not exposed (X). In statistical terms, X is the random variable

- When the sampling design is retrospective, we can construct conditional distributions for the exposure (X) within the levels of the outcome variable (Y)
- We cannot estimate outcome probabilities with this type of design
- However, the odds ratio is the same whether it is defined for X given Y or for Y given X

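The invariance of the odds ratio can be checked numerically. The sketch below takes the cohort counts from the working example and mimics a case-control design by keeping all cases and an assumed 25% of non-cases, sampled independently of exposure. The sampling fraction and the use of expected (non-integer) counts are illustrative assumptions.

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio of a 2x2 table."""
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """Ratio of row-wise outcome proportions; only meaningful under cohort sampling."""
    return (a / (a + b)) / (c / (c + d))


# Full cohort: exposed cases, exposed non-cases, unexposed cases, unexposed non-cases
a, b, c, d = 49, 37, 644, 2424

# Case-control sample: all cases, expected counts for a 25% sample of non-cases
f = 0.25
b_cc, d_cc = b * f, d * f

or_full = odds_ratio(a, b, c, d)
or_cc = odds_ratio(a, b_cc, c, d_cc)
# Odds of exposure among cases divided by odds of exposure among non-cases (X given Y)
or_exposure = (a / c) / (b / d)

print(f"OR, full cohort:       {or_full:.3f}")
print(f"OR, case-control:      {or_cc:.3f}")
print(f"OR, exposure odds:     {or_exposure:.3f}")
print(f"'RR', full cohort:     {risk_ratio(a, b, c, d):.3f}")
print(f"'RR', case-control:    {risk_ratio(a, b_cc, c, d_cc):.3f}")
```

The three odds ratios coincide, while the ratio of proportions changes with the sampling fraction, which is why it cannot be read as a relative risk in a case-control sample.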
Interpretations in case-control versus cohort
- Interpretation of the regression coefficients (i.e. the log of the odds ratio) is identical
- In a case-control study the intercept is not readily interpretable for epidemiology, due to the nature of the sampling of the study
- Therefore the probability is also not directly interpretable

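The reason the intercept loses its meaning can be written out. Assuming cases are sampled with probability $\tau_1$ and controls with probability $\tau_0$, both independent of exposure (these symbols are introduced here and do not appear on the slides), the model fitted to the sampled data is:

```latex
\operatorname{logit} P(Y = 1 \mid x, \text{sampled})
  = \underbrace{\alpha + \ln\!\left(\frac{\tau_1}{\tau_0}\right)}_{\alpha^{*}} + \beta x
```

The slope $\beta$ is unchanged, but the fitted intercept $\alpha^{*}$ absorbs the usually unknown sampling ratio, so predicted probabilities from a case-control model do not estimate risk in the source population.
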
Thoughtful use of logistic regression
- In case-control studies, it is an excellent choice because the relative risk is not directly estimable
- In cohort or cross-sectional studies, remember that:
  - The Odds Ratio is used as a surrogate for the relative risk (cohort) or the prevalence rate ratio (cross-sectional)
  - When the frequency of the outcome is high (e.g. > 10% or > 20%), the odds ratio is biased (usually biased upwards)
- Consider alternative approaches and/or transformations of the odds ratio estimate

Further readings I: Alternatives to logistic regression
- Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA. 1998 Nov 18;280(19):1690-1. http://jama.ama-assn.org/content/280/19/1690.long
- Spiegelman D, Hertzmark E. Easy SAS calculations for risk or prevalence ratios and differences. Am J Epidemiol. 2005 Aug 1;162(3):199-200. Epub 2005 Jun 29. http://aje.oxfordjournals.org/content/162/3/199.long
- McNutt LA, Wu C, Xue X, Hafner JP. Estimating the relative risk in cohort studies and clinical trials of common outcomes. Am J Epidemiol. 2003 May 15;157(10):940-3. http://aje.oxfordjournals.org/content/157/10/940.long
- Zou G. A modified Poisson regression approach to prospective studies with binary data. Am J Epidemiol. 2004 Apr 1;159(7):702-6. http://aje.oxfordjournals.org/content/159/7/702.long
- UCLA Stat Computing > SAS > FAQ > How can I estimate relative risk in SAS using proc genmod for common outcomes in cohort studies? http://www.ats.ucla.edu/stat/sas/faq/relative_risk.htm

Further readings II: About proper use of EF, PAF, etc.
- Northridge ME. Public health methods--attributable risk as a link between causality and public health action. Am J Public Health. 1995 Sep;85(9):1202-4. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc1615585/?tool=pubmed
  Nice discussion about the interpretation and usefulness for public health
- Rockhill B, Newman B, Weinberg C. Use and misuse of population attributable fractions. Am J Public Health. 1998 Jan;88(1):15-9. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc1508384/?tool=pubmed
  Presents appropriate formulae for unadjusted and adjusted RR, and for multicategory exposures

laura.rosella@oahpp.ca