Alternatives to logistic regression. Laura Rosella, PhD Scientist, Public Health Ontario


Acknowledgments
- Course: Categorical Data Analysis for Epidemiologic Studies (Course director: Laura Rosella, PhD)
- Dr. Marcelo Urquia, SMH

Objectives
- To understand the pros and cons of the logistic regression approach
- To discuss the appropriate use of logistic regression
- To identify alternatives to logistic regression and discuss their strengths and weaknesses
- To provide an example to walk through the approaches
Goal: Thoughtful use of logistic regression

Binomial Logistic Regression Model
- Binomial regression is based on the binomial distribution.
- $\operatorname{logit}(\pi(y)) = \ln\left(\frac{\pi(y)}{1 - \pi(y)}\right)$
- The ratio $\pi(y) / (1 - \pi(y))$ is the odds of the outcome; comparing these odds between exposure levels gives the odds ratio.
- The logit (i.e. log-odds) function serves to bound the outcome probability between 0 and 1.

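$\ln\left(\frac{\pi(y)}{1 - \pi(y)}\right) = \alpha + \beta x$
- Logistic regression is a linear model on the log-odds scale.
- For each one-unit increase in x, the log-odds increase linearly by $\beta$; equivalently, the odds are multiplied by $e^{\beta}$ (an exponential increase in odds).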
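As a quick numeric check of the log-odds interpretation, the sketch below uses made-up values for alpha and beta (illustrative assumptions, not estimates from any study). It shows that each one-unit increase in x adds beta to the log-odds and multiplies the odds by exp(beta), while the predicted probability stays between 0 and 1. The helper names are hypothetical.

```python
import math

def inv_logit(eta):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients chosen only for illustration.
alpha, beta = -2.0, 0.7

def odds_at(x):
    """Odds of the outcome at covariate value x under the logistic model."""
    p = inv_logit(alpha + beta * x)
    return p / (1.0 - p)

for x in range(4):
    eta = alpha + beta * x
    print(f"x={x}  log-odds={eta:+.2f}  odds={odds_at(x):.3f}  p={inv_logit(eta):.3f}")

# The ratio of odds for consecutive x values is constant and equals exp(beta).
odds_ratio_per_unit = odds_at(1) / odds_at(0)
print(f"odds ratio per unit of x = {odds_ratio_per_unit:.4f}, exp(beta) = {math.exp(beta):.4f}")
```

The odds ratio per unit of x does not depend on where on the x scale the comparison is made, which is why a single exp(beta) summarises the effect.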

Epi 101

                    Disease present   Disease absent
Exposure present          a                 b
Exposure absent           c                 d

Relative Risk (RR) = $\frac{a/(a+b)}{c/(c+d)}$, i.e. risk in the exposed / risk in the unexposed
Odds Ratio (OR) = $\frac{a/b}{c/d} = \frac{ad}{bc}$, i.e. the ratio of the odds of developing the outcome in the exposed compared to the unexposed
Consensus: relative risk is preferred over the odds ratio for most prospective investigations

The strengths of the logistic regression approach
- Logistic regression can be applied to many different study designs (cohort, case-control, cross-sectional).
- The Odds Ratio (OR) provides a good approximation of the Relative Risk when the outcome is rare.
- Fairly easy to run using many different statistical software packages (too easy?).
- Multivariate: can adjust for several covariates at once.

The problem with logistic regression
- The OR overestimates the Relative Risk when the outcome is common (rule of thumb: > 10%).
- Despite advice on the rare-event assumption, consumers of health research literature often interpret the OR as a Relative Risk (RR), leading to its potential exaggeration.
- Logistic regression became easy to use and very popular, and there is a perception that alternative methods do not exist.
- But there are easy and potentially more appropriate alternatives when you want to estimate relative risk.

Example: OR versus RR as the prevalence among the non-exposed (Po) rises from 0.1 to 0.3
(each table has 10 exposed, X=1, and 90 unexposed, X=0, out of 100)

Relative Risk = 2
Po     X=1 (Y=1, Y=0)   X=0 (Y=1, Y=0)   Total (Y=1, Y=0)   RR   OR
0.1    2, 8             9, 81            11, 89             2    2.3
0.2    4, 6             18, 72           22, 78             2    2.7
0.3    6, 4             27, 63           33, 67             2    3.5

Relative Risk = 3
Po     X=1 (Y=1, Y=0)   X=0 (Y=1, Y=0)   Total (Y=1, Y=0)   RR   OR
0.1    3, 7             9, 81            12, 88             3    3.9
0.2    6, 4             18, 72           24, 76             3    6
0.3    9, 1             27, 63           36, 64             3    21
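The pattern in these tables can be reproduced directly from the definitions. The following is a minimal sketch, assuming only a baseline risk in the unexposed and a true relative risk; the function name rr_and_or is a hypothetical helper, not something from the slides.

```python
def rr_and_or(p0, rr):
    """Return (RR, OR) for baseline risk p0 in the unexposed and true relative risk rr."""
    p1 = p0 * rr  # risk in the exposed
    if not 0 < p1 < 1:
        raise ValueError("exposed risk must lie strictly between 0 and 1")
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    return rr, odds_ratio

results = {}
for rr in (2, 3):
    for p0 in (0.1, 0.2, 0.3):
        _, odds_ratio = rr_and_or(p0, rr)
        results[(rr, p0)] = odds_ratio
        print(f"RR={rr}  Po={p0:.1f}  OR={odds_ratio:.2f}")
```

For a fixed RR, the OR moves further from the RR as the baseline risk grows, which is the overestimation the example illustrates.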

Zhang & Yu's simple formula, JAMA 1998
- The formula can be used to correct the adjusted OR derived from logistic regression to derive a treatment effect that better represents the true relative risk.
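The formula itself did not survive in this transcript. As published by Zhang and Yu, it is RR = OR / ((1 - P0) + P0 x OR), where P0 is the incidence of the outcome in the unexposed group. The sketch below applies it to the crude OR and confidence limits from the worked example later in these slides (OR 4.9847, limits 3.2244 and 7.7060, P0 = 644/3068); the function name is hypothetical, and applying the formula to the interval limits is shown only because it reproduces the interval quoted in the comparison table.

```python
def zhang_yu_rr(odds_ratio, p0):
    """Zhang & Yu (1998) correction: RR = OR / ((1 - P0) + P0 * OR)."""
    return odds_ratio / ((1.0 - p0) + p0 * odds_ratio)

# Incidence of HBP among the non-obese in the worked example.
p0 = 644 / 3068

# Crude OR and 95% limits reported by the logistic regression output.
or_point, or_low, or_high = 4.9847, 3.2244, 7.7060

rr_point = zhang_yu_rr(or_point, p0)
rr_low = zhang_yu_rr(or_low, p0)
rr_high = zhang_yu_rr(or_high, p0)
print(f"corrected RR = {rr_point:.2f} ({rr_low:.2f}, {rr_high:.2f})")
```

For a crude OR the corrected point estimate coincides with the crude RR; the interval is only a transformation of the OR limits, which is one reason the slides that follow treat it with caution.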

Limitations of Zhang and Yu's formula
- Trade-off between simplicity and precision
- Not very reliable in the presence of covariates
- Produces confidence intervals narrower than they should be
- May slightly overestimate the RR when confounding exists
- Ignores covariance between the estimated incidence and the estimated odds ratio
- SHOULD NOT BE USED ON AN ADJUSTED OR: using the formula in this manner is incorrect and will produce a biased estimate when confounding is present

Other alternatives
- Log-binomial regression
- Poisson regression (and negative binomial)
- Poisson regression with robust variance estimator (modified Poisson)
- Cox regression

Hypothetical working example
- WCGS cohort study: a cohort of men in the 1960s followed up to study CVD risk factors
- Outcome: HBP (indicates whether study participants have HBP at follow-up)
- Exposure: Obese (variable over = 1 if classified as obese at baseline, = 0 if not)

    proc freq data = talk;
      tables over*hbp / nopercent nocol relrisk;
    run;

             HBP at follow-up
Obese      Yes      No     Total
Yes         49      37        86
No         644    2424      3068
Total      693    2461      3154

The OR and RR for those who were classified as obese at baseline, relative to those who were not:
OR = (49 x 2424)/(37 x 644) = 4.98
RR = (49/86)/(644/3068) = 2.71
Overall, 22% of the cohort (693/3154) had HBP at follow-up.
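The hand calculation above can be checked, together with approximate 95% intervals, using the standard large-sample (Wald) standard errors on the log scale. This is a minimal sketch under that assumption; the variable and function names are illustrative and it is not a reproduction of what proc freq computes internally.

```python
import math

# 2x2 counts from the cross-tabulation (rows: obese yes/no; columns: HBP yes/no).
a, b = 49, 37      # obese: HBP yes, HBP no
c, d = 644, 2424   # not obese: HBP yes, HBP no
z = 1.96           # normal quantile for an approximate 95% interval

def wald_ci(estimate, se_log):
    """Interval for a ratio measure built on the log scale."""
    return estimate * math.exp(-z * se_log), estimate * math.exp(z * se_log)

odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)

risk_ratio = (a / (a + b)) / (c / (c + d))
se_log_rr = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))

or_ci = wald_ci(odds_ratio, se_log_or)
rr_ci = wald_ci(risk_ratio, se_log_rr)
prevalence = (a + c) / (a + b + c + d)

print(f"OR = {odds_ratio:.2f} ({or_ci[0]:.2f}, {or_ci[1]:.2f})")
print(f"RR = {risk_ratio:.2f} ({rr_ci[0]:.2f}, {rr_ci[1]:.2f})")
print(f"HBP prevalence = {prevalence:.1%}")
```

With an outcome this common, the OR is nearly twice the RR, so reading the OR as a relative risk would overstate the association considerably.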

Logistic regression

    proc genmod data = talk descending;
      model hbp = over / dist = binomial link = logit;
      /* over is a numeric 0/1 variable (no CLASS statement), so it has a single coefficient */
      estimate 'Beta' over 1 / exp;
      title1 'Logistic Regression';
    run;

Contrast Estimate Results
             Estimate   Confidence Limits
Exp(Beta)    4.9847     3.2244   7.7060

The same model with proc logistic:

    proc logistic data = talk descending;
      model hbp = over;
      title1 'Logistic Regression';
    run;

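Log-Binomial

Logistic (logit link): $\log\left(\frac{P_j}{1 - P_j}\right) = \beta_0 + \beta_1 X_j$
- X = 0: $\log\left(\frac{P_0}{1 - P_0}\right) = \beta_0$
- X = 1: $\log\left(\frac{P_1}{1 - P_1}\right) = \beta_0 + \beta_1$
- $\beta_1 = \log\left(\frac{P_1}{1 - P_1}\right) - \log\left(\frac{P_0}{1 - P_0}\right) = \log(OR)$, so $OR = e^{\beta_1}$

Log-binomial (log link): $\log(P_j) = \beta_0 + \beta_1 X_j$
- X = 0: $\log(P_0) = \beta_0$
- X = 1: $\log(P_1) = \beta_0 + \beta_1$
- $\beta_1 = \log(P_1) - \log(P_0) = \log(RR)$, so $RR = e^{\beta_1}$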

Log-binomial regression

    proc genmod data = talk descending;
      model hbp = over / dist = binomial link = log;
      /* over is a numeric 0/1 variable (no CLASS statement), so it has a single coefficient */
      estimate 'Beta' over 1 / exp;
      title1 'Log Binomial Regression';
    run;

Contrast Estimate Results
             Estimate   Confidence Limits
Exp(Beta)    2.7144     2.2311   3.3023

Poisson Regression
- The model specifies the log of the outcome rate, log(rate), as a linear function of the covariates.
- Used when the outcomes of interest are rates (and rate ratios).
- When applied to a binary outcome, a Poisson model without robust error variances results in a confidence interval that is too wide (i.e. it tends to overestimate the variance).

Poisson regression

    proc genmod data = talk descending;
      model hbp = over / dist = poisson link = log;
      /* over is a numeric 0/1 variable (no CLASS statement), so it has a single coefficient */
      estimate 'Beta' over 1 / exp;
      title1 'Poisson Regression';
    run;

Contrast Estimate Results
             Estimate   Confidence Limits
Exp(Beta)    2.7144     2.0301   3.6292

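Poisson regression with robust variance (modified Poisson)

    proc genmod data = talk;
      class id;
      model hbp = over / dist = poisson link = log;
      /* one record per subject; the repeated statement requests the robust (sandwich) variance */
      repeated subject = id / type = unstr;
      /* over is a numeric 0/1 variable (not in the CLASS statement), so it has a single coefficient */
      estimate 'Beta' over 1 / exp;
      title1 'Poisson Regression Robust Variance';
    run;

Contrast Estimate Results
             Estimate   Confidence Limits
Exp(Beta)    2.7144     2.2311   3.3023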
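With a single binary exposure, the difference between the ordinary and the robust Poisson intervals can be seen by hand. The sketch below assumes the closed forms that hold in this special case only: the model-based Poisson standard error of log(RR) is sqrt(1/a + 1/c), and the uncorrected sandwich standard error reduces to sqrt(1/a - 1/n1 + 1/c - 1/n0). It is an illustration of why the intervals differ, not a description of how proc genmod computes them in general.

```python
import math

a, n1 = 49, 86        # obese: HBP cases, group size
c, n0 = 644, 3068     # not obese: HBP cases, group size
z = 1.96              # normal quantile for an approximate 95% interval

log_rr = math.log((a / n1) / (c / n0))

# Poisson variance treats each count as Poisson, ignoring that a binary
# outcome has variance p(1 - p) rather than p.
se_model = math.sqrt(1/a + 1/c)

# Robust (sandwich) variance for this one-covariate case (assumed closed form).
se_robust = math.sqrt(1/a - 1/n1 + 1/c - 1/n0)

def ci(se):
    """Confidence limits for the RR built on the log scale."""
    return math.exp(log_rr - z * se), math.exp(log_rr + z * se)

poisson_ci = ci(se_model)
robust_ci = ci(se_robust)
print(f"RR = {math.exp(log_rr):.4f}")
print(f"ordinary Poisson: ({poisson_ci[0]:.4f}, {poisson_ci[1]:.4f})")
print(f"robust variance:  ({robust_ci[0]:.4f}, {robust_ci[1]:.4f})")
```

The point estimate is the same in both cases; only the standard error changes, and the robust interval matches the log-binomial one in this example.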

Cox regression

    data talk;
      set talk;
      time = 1;   /* same follow-up time for every participant */
    run;

    proc phreg data = talk;
      model time*hbp(0) = over / rl;
    run;

Analysis of Maximum Likelihood Estimates
         Hazard Ratio   Confidence Limits
over     2.714          2.030   3.629

Comparison (crude OR)

Model                                      Estimate (95% CI)
Logistic regression                        OR: 4.98 (3.22, 7.71)
Zhang and Yu's formula                     RR: 2.71 (2.20, 3.20)
Log-binomial regression                    RR: 2.71 (2.23, 3.30)
Poisson regression                         RR: 2.71 (2.03, 3.63)
Poisson regression with robust variance    RR: 2.71 (2.23, 3.30)
Cox regression                             RR: 2.71 (2.03, 3.63)

Comparison (adjusted OR) McNutt et al, AJE 2003;157:940-943

Pros and cons

Zhang and Yu's formula
  Pros: Easy to use
  Cons: Ignores covariance; 10-15% bias in multivariable analyses; underestimates CIs

Log-binomial regression
  Pros: Natural approximation to binomial distribution; small standard error
  Cons: May result in convergence problems (increase iterations or try modified Poisson)

Poisson regression
  Pros: Good approximation to binomial distribution when N is large
  Cons: Conservative CIs; may estimate probabilities greater than 1

Poisson regression with robust variance (modified Poisson)
  Pros: Good approximation to binomial distribution when N is large; small standard error
  Cons: May estimate probabilities greater than 1

Cox regression
  Pros: Good approximation to binomial distribution
  Cons: Does not estimate probabilities (no intercept)

What to do?
If alternative regression methods are not feasible:
1. Zhang and Yu's approximation (acknowledging the limitations)
2. Interpret OR as OR, not as RR
If alternative regression methods are feasible:
1. Log-binomial regression
2. Modified Poisson regression (robust variance)
3. Ordinary Poisson or Cox regression

Other consequences
- Etiologic fraction (EF): the proportion of cases among the exposed in which the exposure played a causal role in the development of the disease.
- $EF = (I_E - I_O)/I_E$, where $I_E$ = incidence in the exposed and $I_O$ = incidence in the non-exposed
- Population attributable fraction: $PAF = (I_T - I_O)/I_T$, where $I_T$ = incidence in the population
- Also $PAF = \frac{P_E (RR - 1)}{P_E (RR - 1) + 1}$, where $P_E$ = prevalence of the exposure in the population
- Ideally (i.e., in the absence of confounding, measurement error and ignorance), the sum of all EFs or PAFs is expected to be 1 (or 100%).
- Both are based on risk, not odds! If ORs are used instead of RRs, EF and PAF may be artefactually inflated.
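To see how much substituting the OR for the RR can inflate these fractions, the sketch below reuses the counts from the obesity and HBP working example. The helper name paf_from_ratio is hypothetical, and the numbers are purely arithmetic illustrations rather than causal estimates.

```python
a, n1 = 49, 86        # obese: HBP cases, group size
c, n0 = 644, 3068     # not obese: HBP cases, group size
n = n1 + n0

i_e = a / n1          # incidence in the exposed
i_o = c / n0          # incidence in the non-exposed
i_t = (a + c) / n     # incidence in the whole cohort
p_e = n1 / n          # prevalence of the exposure

rr = i_e / i_o
odds_ratio = (a * (n0 - c)) / ((n1 - a) * c)

ef = (i_e - i_o) / i_e
paf_direct = (i_t - i_o) / i_t

def paf_from_ratio(ratio, p_exposed):
    """PAF = P_E (ratio - 1) / (P_E (ratio - 1) + 1)."""
    excess = p_exposed * (ratio - 1.0)
    return excess / (excess + 1.0)

paf_rr = paf_from_ratio(rr, p_e)          # correct input: the relative risk
paf_or = paf_from_ratio(odds_ratio, p_e)  # incorrect input: the odds ratio

print(f"EF = {ef:.3f}")
print(f"PAF from incidences = {paf_direct:.4f}")
print(f"PAF using RR = {paf_rr:.4f}")
print(f"PAF using OR = {paf_or:.4f}")
```

The two risk-based PAF calculations agree, while plugging in the OR roughly doubles the apparent attributable fraction in this example.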

Why do we use odds-ratios in case-control studies?
- Cohort study: start from exposure status (X: exposed, not exposed) and follow forward to the disease outcome (Y). In statistical terms, Y is the random variable.
- Case-control study: start from the disease outcome (Y) and look back at exposure status (X: exposed, not exposed). In statistical terms, X is the random variable.

Why do we use odds-ratios in case-control studies?
- When the sampling design is retrospective, we can construct conditional distributions for the exposure (X) within the levels of the outcome variable.
- We cannot estimate outcome probabilities with this type of design...
- However, the odds ratio can be computed the same way whether it is defined as X given Y or as Y given X.
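This symmetry can be checked numerically. The sketch below takes the 2x2 counts from the working example, then mimics a case-control sample by keeping every case and a hypothetical fraction f = 0.25 of the non-cases (an assumption made only for illustration). The odds ratio is unchanged whichever way it is conditioned, whereas a risk ratio computed naively from the sampled table is not.

```python
def or_outcome_given_exposure(a, b, c, d):
    """Odds of disease in the exposed divided by odds of disease in the unexposed."""
    return (a / b) / (c / d)

def or_exposure_given_outcome(a, b, c, d):
    """Odds of exposure among cases divided by odds of exposure among non-cases."""
    return (a / c) / (b / d)

def naive_risk_ratio(a, b, c, d):
    """Risk ratio computed as if the table came from a cohort."""
    return (a / (a + b)) / (c / (c + d))

# Rows: exposed / unexposed; columns: disease yes / disease no.
a, b, c, d = 49, 37, 644, 2424
full = (a, b, c, d)

# Hypothetical case-control sampling: all cases, a fraction f of non-cases.
f = 0.25
sampled = (a, b * f, c, d * f)

for label, table in (("full cohort", full), ("case-control sample", sampled)):
    print(
        f"{label}: OR(Y|X) = {or_outcome_given_exposure(*table):.3f}, "
        f"OR(X|Y) = {or_exposure_given_outcome(*table):.3f}, "
        f"naive RR = {naive_risk_ratio(*table):.3f}"
    )
```

Because the sampling fraction cancels in the cross-product ratio, the OR is estimable under case-control sampling even though the risks themselves are not.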

Interpretations in case-control versus cohort
- Interpretation of the regression coefficients (i.e. the log of the odds ratio) is identical.
- In a case-control study the intercept is not readily interpretable for epidemiology because of how the study participants are sampled.
- Therefore the predicted probability is also not directly interpretable.

Thoughtful use of logistic regression
- In case-control studies, it is an excellent choice because relative risk is not directly estimable.
- In cohort or cross-sectional studies remember that:
  - The odds ratio is used as a surrogate of the relative risk (cohort) or prevalence rate ratio (cross-sectional).
  - When the frequency of the outcome is high (e.g. > 10% or > 20%), the odds ratio is a biased estimate of the relative risk (usually biased upwards).
- Consider alternative approaches and/or transformations of the odds ratio estimate.

Further readings I: Alternatives to logistic regression
- Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA. 1998 Nov 18;280(19):1690-1. http://jama.ama-assn.org/content/280/19/1690.long
- Spiegelman D, Hertzmark E. Easy SAS calculations for risk or prevalence ratios and differences. Am J Epidemiol. 2005 Aug 1;162(3):199-200. Epub 2005 Jun 29. http://aje.oxfordjournals.org/content/162/3/199.long
- McNutt LA, Wu C, Xue X, Hafner JP. Estimating the relative risk in cohort studies and clinical trials of common outcomes. Am J Epidemiol. 2003 May 15;157(10):940-3. http://aje.oxfordjournals.org/content/157/10/940.long
- Zou G. A modified poisson regression approach to prospective studies with binary data. Am J Epidemiol. 2004 Apr 1;159(7):702-6. http://aje.oxfordjournals.org/content/159/7/702.long
- UCLA Stat Computing > SAS > FAQ > How can I estimate relative risk in SAS using proc genmod for common outcomes in cohort studies? http://www.ats.ucla.edu/stat/sas/faq/relative_risk.htm

Further readings II: About proper use of EF, PAF, etc.
- Northridge ME. Public health methods--attributable risk as a link between causality and public health action. Am J Public Health. 1995 Sep;85(9):1202-4. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc1615585/?tool=pubmed
  Nice discussion about the interpretation and usefulness for public health.
- Rockhill B, Newman B, Weinberg C. Use and misuse of population attributable fractions. Am J Public Health. 1998 Jan;88(1):15-9. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc1508384/?tool=pubmed
  Presents appropriate formulae for unadjusted and adjusted RR, and for multicategory exposures.

laura.rosella@oahpp.ca