Propensity Scores: What Do They Do, How Should I Use Them, and Why Should I Care?


Thomas E. Love, PhD
Center for Health Care Research & Policy, Case Western Reserve University
thomaslove@case.edu
ASA Cleveland Chapter, December 2003

The Plan for This Evening
Why should I care about propensity scores? They provide a means of adjusting for selection bias in observational studies of causal effects.
What do propensity scores do? They summarize all of the background (covariate) information about treatment selection into a single scalar.
How should I use propensity scores? Whenever you want to make a causal inference in comparing exposures, especially when you have a lot of information on how the exposures were selected.

Looking for Causal Treatment Effects
[Figure: the surgery-versus-drug treatment effect, shown as the difference between two outcomes.]

Ceteris Paribus: Other Things Being Equal
Suppose we are comparing a treated group to a control group, and we want to know if the treatment has a causal effect on the outcome.
To be fair, we must compare treated subjects and controls who are similar in terms of everything that affects the outcome, except for the receipt of treatment.
How do we, as statisticians, typically recommend that people do this?

Importance of Randomization
[Figure: a field of subjects labeled 1 or 2 according to the treatment received.]
Randomization ensures that subjects receiving different treatments are comparable.

RCT Subject Selection: MRFIT Excluded 96.4% of Potential Eligibles
361,662 men, age 35-57
EXCLUSIONS (336,117): low risk of CHD, Hx of MI, diabetes, geographic mobility, cholesterol > 350, DBP > 115
25,545 remaining
EXCLUSIONS (12,679): body wgt > 150%, angina, evidence of MI, strange diet
12,866 remaining
RANDOMIZATION: 6428 and 6438
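The counts on the MRFIT slide can be checked against the 96.4% headline with a few lines of arithmetic. This is only a sanity check on the slide's own numbers; the variable names are illustrative.

```python
# Counts copied from the MRFIT slide above.
screened = 361_662
first_exclusions = 336_117
second_exclusions = 12_679

after_first = screened - first_exclusions        # men left after the first screen
randomized = after_first - second_exclusions     # men who reached randomization
excluded_share = 1 - randomized / screened       # fraction of screened men never randomized

print(after_first, randomized, f"{100 * excluded_share:.1f}%")
```

The two randomized arms on the slide (6428 and 6438) sum to the same total that the subtraction produces.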

How Are Experiments Designed?
Most important feature of experiments: we must decide on the way data will be collected before observing the outcome.
Experiments are honest.
We want to use statistics and data to illuminate our understanding, not just to support our priors.
If we could try out lots of designs and look at what that does for the answer, we would, and we'd then always confirm our prior beliefs.

We Usually Assign Treatments At Random
Except sometimes, we can't. Or we just don't. And sometimes, these are the situations we care most about.
So how can we assess causal effects in observational studies?

Observational Studies
[Figure: a field of subjects labeled 1 or 2 according to the treatment received, not allocated at random.]
In an observational study, the researcher does not randomly allocate the treatments.

On Designing Observational Studies
- Exert as much experimental control as possible.
- Carefully consider the selection process.
- Actively collect data to reveal potential biases.
Care in design and implementation will be rewarded with useful and clear study conclusions. Elaborate analytical methods will not salvage poor design or implementation of a study. -- NAS report (Rosenbaum (2002) p. 368)

Hormone Replacement Therapy Should Be Recommended for Nearly All Women
Col NF, Eckman MH, et al. (1997) JAMA, 277: 1140-47.
CHD risk for patients using HRT was 0.60 times as high as the risk for patients not using HRT.

Do not use estrogen/progestin to prevent chronic disease
Fletcher SW, Colditz G. (2002) JAMA, 288: 366-67. Manson JE et al. for WHI Investigators (2003) NEJM, 349: 523-534.
CHD risk for patients using HRT was 1.29 times as high as the risk for patients not using HRT.

What Propensity Scores Are, and Why They Are Important
We can gather up all the background info that we have on our subjects before treatments are assigned that might plausibly affect which treatment they get.
We build a model to predict the probability that they will receive the treatment instead of the control.
Propensity Score = Pr(Treated | background info)
Groups of subjects with similar propensity scores can then be expected to have similar values of all of the background information, in the aggregate.

The Propensity Score: Pr(treatment given covariates)
Definition: The conditional probability of receiving a given exposure (treatment) given a vector of measured covariates.
Usually estimated using logistic regression:

\ln\left(\frac{PS}{1 - PS}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p

PS = \frac{\exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}
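As a minimal sketch of the second formula, the function below turns fitted logistic-regression coefficients into a propensity score. The coefficient values and the two covariates are hypothetical, chosen only for illustration; they are not estimates from the talk. The form 1 / (1 + exp(-lp)) is algebraically the same as exp(lp) / (1 + exp(lp)).

```python
import math

def propensity_score(intercept, coefs, covariates):
    """Inverse logit of the linear predictor b0 + b1*x1 + ... + bp*xp."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefs, covariates))
    return 1.0 / (1.0 + math.exp(-linear_predictor))

def logit(p):
    """Log odds of a probability p, the left-hand side of the first formula."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients (assumptions, not fitted values).
b0 = -1.0
betas = [0.8, 0.5]        # e.g. prior CAD (0/1), beta-blocker use (0/1)
subject = [1, 0]          # prior CAD, no beta-blocker

ps = propensity_score(b0, betas, subject)
print(f"linear predictor = {logit(ps):.2f}, propensity score = {ps:.3f}")
```

In practice the coefficients would come from a logistic regression of treatment on the covariates, fitted in whatever statistics package is at hand.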

The Ten Commandments of Propensity Model Development

1. Thou Shalt Value Parsimony
2. Thou Shalt Examine Thy Predictors For Collinearity
3. Thou Shalt Test All Thy Predictors For Statistical Significance
4. Thou Shalt Have Ten Times As Many Subjects As Predictors
5. Thou Shalt Carefully Examine Thy Regression Coefficients (Beta Weights)
6. Thou Shalt Perform Bootstrap Analyses To Assess Shrinkage
7. Thou Shalt Perform Regression Diagnostics and Examine Residuals With Care
8. Thou Shalt Hold Out A Sample of Thy Data for Cross-Validation
9. Thou Shalt Perform External Validation on a New Sample of Data
10. Thou Shalt Ignore Commandments 1 through 9 And Instead Simply Ensure That The Model Adequately Balances The Covariates

Apologies to Joe Schafer

Should Adjustments Be Made for All Observed Covariates? If not, how should covariates be selected?
There is no real reason to avoid adjustment for a true covariate, that is, a variable describing subjects before treatment.
In practice, though, this can increase both cost and complexity unnecessarily.
There are issues of data quality and completeness to consider.

Screening for Covariates in the Propensity Score Model
Most common method (TERRIBLE IDEA!): compare treated to non-treated on many covariates, then adjust only for significant differences.
There is no reason to think that absence of significance implies the imbalance is small enough to be ignored.
It doesn't consider the covariate-to-outcome relationship.
This process considers covariates one at a time, while the PS adjustments will control the covariates simultaneously.

OK, What Should We Do?
Give the data multiple opportunities to call attention to potential problems.
1. Select a tentative list of covariates for adjustments using problem knowledge and exploratory comparisons of treatment groups.
2. Select a tentative adjustment method and apply it to the covariates excluded from the list, identifying large imbalances after adjustment.
3. Reconsider the tentative list in light of this.

What Propensity Scores Can and Cannot Do, in a Slide
If we match treated subjects to controls with similar propensity scores, we can behave as if they had been randomly assigned to treatments.
Or, if we use regression to adjust for propensity to get treatment, we can compare treated to controls without worrying about the impact of any baseline differences on selection to treatment or control.
But if our propensity model misses an important reason why subjects are selected to treatment or control, we'll be in trouble.

Rich and Poor Covariate Sets
With a rich set of covariates, adjustments for hidden covariates may be less critical.
With less rich covariate sets, we may need to do more: say, try to find an instrument.
The conclusion after the initial design stage may be that the treatment and control groups are too far apart to produce reliable effect estimates without heroic modeling assumptions.

Why Work This Hard in the Initial Stage of Design?
No harm, no foul. Since no outcome data are available to the PS, nothing based on the PS here biases estimation of treatment effects.
Balancing covariates / PS makes subsequent model-based adjustments more reliable.
Model adjustments can be extremely unreliable when the treatment groups are far apart on covariates.
Helps covariance, relative risk, IV adjustments, etc.

How Much Overlap In The Covariates Do We Want?
[Figure: covariate of interest (0 to 1) plotted for Not Treated and Treated groups.]
If those who receive treatment don't overlap (in terms of covariates) with those who receive the control, we've got nothing to compare.
Modeling, no matter how sophisticated, can't help us to develop information out of thin air.

What if Treated and Untreated Groups Overlap, but Minimally?
[Figure: covariate of interest (0 to 1) plotted for Not Treated and Treated groups, with little overlap.]
Not much help. The information available to infer treatment effect will reside almost entirely in the few patients who overlap.
Need to think hard about whether useful inferences will be possible.

How Much Overlap In The Propensity Scores Do We Want?
[Figure: three panels, each showing propensity to receive treatment (0 to 1) for Not Treated and Treated groups.]

Aspirin Use and Mortality
6174 consecutive adults undergoing stress echocardiography for evaluation of known or suspected coronary disease.
2310 (37%) were taking aspirin (treatment).
Main outcome: all-cause mortality.
Median follow-up: 3.1 years.
Univariate analysis: 4.5% of aspirin patients died, and 4.5% of non-aspirin patients died.
Unadjusted hazard ratio: 1.08 (0.85, 1.39).
Gum PA et al. (2001) JAMA, 1187-1194.

Baseline Characteristics By Aspirin Use (in %) (before matching)

Variable                          Aspirin (n = 2310)   No Aspirin (n = 3864)   P value
Men                               77.0                 56.1                    <.001
Clinical history:
  diabetes                        16.8                 11.2                    <.001
  hypertension                    53.0                 40.6                    <.001
  prior coronary artery disease   69.7                 20.1                    <.001
  congestive heart failure        5.5                  4.6                     .12
Medication use:
  Beta-blocker                    35.1                 14.2                    <.001
  ACE inhibitor                   13.0                 11.4                    <.001

Baseline characteristics appear very dissimilar: 25 of 31 covariates have p < .001, 28 of 31 have p < .05.
Aspirin user covariates indicate higher mortality risk.

Baseline Characteristics By Aspirin Use [%] (after matching)

Variable                          Aspirin (n = 1351)   No Aspirin (n = 1351)   P value
Men                               70.4                 72.1                    .33
Clinical history:
  diabetes                        15.0                 15.3                    .83
  hypertension                    50.3                 51.7                    .46
  prior coronary artery disease   48.3                 48.8                    .79
  congestive heart failure        5.8                  6.6                     .43
Medication use:
  Beta-blocker                    26.1                 26.5                    .79
  ACE inhibitor                   15.5                 15.8                    .79

Baseline characteristics are similar in matched users and non-users.
30 of 31 covariates show NS difference between matched users and non-users. [Peak exercise capacity for men is p = .01]

Using Standardized Differences to Measure Covariate Balance
Standardized differences greater than 10% in absolute value indicate serious imbalance.

For continuous variables:
d = \frac{100\,(\bar{x}_{treatment} - \bar{x}_{control})}{\sqrt{(s^2_{treatment} + s^2_{control})/2}}

For binary variables:
d = \frac{100\,(p_T - p_C)}{\sqrt{[\,p_T(1 - p_T) + p_C(1 - p_C)\,]/2}}

Standardized Differences > 10% Indicate Serious Imbalance
Before match:
811/2310 (35.1%) aspirin users used β-blockers
550/3864 (14.2%) non-aspirin users used β-blockers
Standardized difference is 49.9%; P value for difference is < .001
After match:
352/1351 (26.1%) aspirin users used β-blockers
358/1351 (26.5%) non-aspirin users used β-blockers
Standardized difference is 1.0% (in absolute value); P value for difference is .79
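The β-blocker figures can be reproduced directly from the standardized-difference formulas. The sketch below is a plain transcription of those two formulas; the function names are illustrative, and the counts are the ones quoted on the slide.

```python
import math

def std_diff_binary(p_treated, p_control):
    """Standardized difference (in %) for a binary covariate."""
    pooled_var = (p_treated * (1 - p_treated) + p_control * (1 - p_control)) / 2
    return 100 * (p_treated - p_control) / math.sqrt(pooled_var)

def std_diff_continuous(mean_t, mean_c, sd_t, sd_c):
    """Standardized difference (in %) for a continuous covariate."""
    return 100 * (mean_t - mean_c) / math.sqrt((sd_t ** 2 + sd_c ** 2) / 2)

# Beta-blocker use among aspirin users vs. non-users, counts from the slide.
before = std_diff_binary(811 / 2310, 550 / 3864)
after = std_diff_binary(352 / 1351, 358 / 1351)

print(f"before match: {before:.1f}%   after match: {after:.1f}%")
```

The signed after-match value is slightly negative because matched non-users used β-blockers marginally more often; its absolute value is the 1.0% reported above.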

Covariate Balance for Aspirin Study
[Figure: standardized difference (%), Asp - No Asp, for each of 31 covariates before matching. Horizontal axis runs from -50 to 100.]

Covariate Balance for Aspirin Study
[Figure: standardized difference (%), Asp - No Asp, for each of 31 covariates, before and after matching. Horizontal axis runs from -50 to 100. One point is annotated "Stdzd Diff = 16%".]

Absolute Standardized Differences
[Figure: absolute standardized difference (%), Asp - No Asp, for each of 31 covariates, before and after matching. Horizontal axis runs from 0 to 120.]

Incomplete vs. Inexact Matching
There is a trade-off between:
- failing to match all treated subjects (incomplete matching), and
- matching dissimilar subjects (inexact matching).
There can be a severe bias due to incomplete matching. It's often better to match all treated subjects, then follow with analytical adjustments for residual imbalances in the covariates.
In practice, the concern has been inexactness.
It is certainly worthwhile to define the comparison group and carefully explore why subjects match.

What if Treated and Untreated Groups Don't Overlap Completely?
[Figure: propensity score (0 to 1) plotted for Not Treated and Treated groups, with some treated subjects beyond the range of the untreated.]
Inferences for the causal effects of treatment on the subjects with no overlap cannot be drawn without heroic modeling assumptions.
Usually, we'd exclude these treated subjects, and explain separately.
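One simple way to identify treated subjects outside the region of overlap is sketched below. It assumes a min/max rule for common support, which is one convention among several and is not specified in the talk; the scores are made up for illustration.

```python
def common_support(ps_treated, ps_control):
    """Range of propensity scores observed in BOTH groups (min/max rule)."""
    low = max(min(ps_treated), min(ps_control))
    high = min(max(ps_treated), max(ps_control))
    return low, high

def split_by_support(ps_treated, ps_control):
    """Split treated scores into those inside and outside common support."""
    low, high = common_support(ps_treated, ps_control)
    inside = [p for p in ps_treated if low <= p <= high]
    outside = [p for p in ps_treated if not (low <= p <= high)]
    return inside, outside

# Made-up propensity scores (assumptions, not study data).
treated = [0.35, 0.50, 0.62, 0.81, 0.93, 0.97]
control = [0.05, 0.12, 0.30, 0.44, 0.58, 0.70]

inside, outside = split_by_support(treated, control)
print("comparable treated subjects:", inside)
print("treated subjects to set aside and describe separately:", outside)
```

A min/max rule is sensitive to a single extreme score, so a plot of the two score distributions remains the more informative check.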

Which Aspirin Users Get Matched?
Generally, characteristics of unmatched aspirin users tend to indicate high propensity scores.
Overall, 37% of patients were taking aspirin. The rate was much higher in some populations: 67% of prior CAD patients were taking aspirin. So prior CAD patients had higher propensity scores for aspirin use.
Of the unmatched aspirin users, 99.8% (957/959) had prior coronary artery disease.
So it's likely that the unmatched users tended towards larger propensity scores than the matched users.

Who's Getting Matched Here? Where Do The Propensity Scores Overlap?
[Figure: propensity to use aspirin (0 to 1) for non-users and users. Annotations: aspirin users we won't be able to match effectively; aspirin users for whom we might realistically find a good match in our pool of non-users.]
Caveat: This simulation depicts what often happens.

Matching with Propensity Scores
1351 aspirin subjects (58%) were matched to non-aspirin subjects: a big improvement in covariate balance.
On the measured covariates, the matched group looks like an RCT.
Matching is still incomplete, but results on the PS-matched group mirrored the results for the covariate-adjusted group as a whole.
Resulting matched pairs were analyzed using standard statistical methods, e.g. Kaplan-Meier, Cox proportional hazards models.

Estimating The Hazard Ratios

Approach                                              n      Hazard Ratio   95% CI
Full sample, no adjustment                            6174   1.08           (.85, 1.39)
Full sample with no PS, adjusted for all covariates   6174   0.67           (.51, .87)
PS-matched sample                                     2702   0.53           (.38, .74)
PS-matched, adjusted for PS and all covariates        2702   0.56           (.40, .78)

During follow-up, 153 (6%) of the 2702 propensity score-matched patients died.
Aspirin use was associated with a lower risk of death in the matched group (4% vs. 8%, p = .002).

Aspirin Conclusions / Caveats
Patients included in this study may be a more representative sample of real-world patients than an RCT would provide.
PS matching is still not randomization: it can only account for the factors measured, and only as well as the instruments can measure them.
No information on aspirin dose, aspirin allergy, or duration of treatment, or on medication adjustments.

Propensity Score Matching in the Design of an Observational Study
Pair up treated and control subjects with similar values of the propensity score, discarding all unmatched units.
Not limited to 1-1 matches: can do 1-many.
Can find an optimal match, without discarding any units, then follow with adjustments. Technically more valid, but a difficult sell in practice.
Common: one-to-one Mahalanobis matching within calipers defined by logit(propensity).
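The pairing step can be sketched as follows. Note that this is a deliberately simplified stand-in: it does greedy 1:1 nearest-neighbor matching on logit(propensity) within a caliper, not the Mahalanobis matching within calipers mentioned above, and not optimal matching. The caliper width and the scores are arbitrary assumptions for illustration.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def greedy_caliper_match(ps_treated, ps_control, caliper):
    """Greedy 1:1 matching without replacement on logit(propensity).

    Returns (treated_index, control_index) pairs. A treated unit whose
    nearest unused control lies outside the caliper stays unmatched.
    """
    available = {j: logit(p) for j, p in enumerate(ps_control)}
    pairs = []
    # Handle the highest-propensity (hardest to match) treated units first.
    order = sorted(range(len(ps_treated)), key=lambda i: ps_treated[i], reverse=True)
    for i in order:
        if not available:
            break
        target = logit(ps_treated[i])
        j = min(available, key=lambda k: abs(available[k] - target))
        if abs(available[j] - target) <= caliper:
            pairs.append((i, j))
            del available[j]          # each control is used at most once
    return pairs

# Made-up propensity scores (assumptions, not study data).
treated = [0.30, 0.55, 0.95]
control = [0.10, 0.28, 0.50, 0.65]

pairs = greedy_caliper_match(treated, control, caliper=0.25)
matched_treated = {i for i, _ in pairs}
print("pairs:", sorted(pairs))
print("unmatched treated:", [i for i in range(len(treated)) if i not in matched_treated])
```

Here the treated unit with score 0.95 has no control within the caliper and is left unmatched, which is the incomplete-matching situation described for the high-propensity aspirin users.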

Adjusting for Severity vs. Adjusting for Selection vs. Both
[Figure: bar chart of relative risk (axis from 2 to 7) under four analyses: Crude, Severity, Selection, Both.]
Effect of pneumonia as a complication of stroke (on 30-day mortality). Katzan, Cebul et al. (2003) Neurology.
Relative risk of death by 30 days for patients with pneumonia vs. patients without.
Severity = risk-adjusted; Selection = PS-adjusted.

Adjusting for Severity vs. Adjusting for Selection vs. Both
[Figure: bar chart of relative risk (axis from 1.2 to 2.0) under four analyses: Crude, Severity, Selection, Both.]
Effect of rehabilitation on discharge to home from nursing home. Murray et al. (2003) Arch Phys Med Rehab.
Severity = risk-adjusted; Selection = PS-adjusted.
Selection bias blunts the rehab effect size.

On Planning an Observational Study (Rosenbaum, 2002)
A convincing observational study (OS) is the result of active observation, a search for those rare circumstances in which tangible evidence may be obtained to distinguish treatment effects from the most plausible biases.
Experimental control is replaced in a good OS by careful choice of environment.
Design is crucial! Options narrow as an investigation proceeds.
These are reasonable methods with large samples, especially if we have a good selection model using multiple covariates.

What should always be done in an OS and often isn't?
- Collect data so as to be able to model selection
- Demonstrate selection bias (the need for PS)
- Ensure covariate overlap for comparability
- Evaluate covariate balance after PS application
- Specify relevant post-adjustment population carefully
- Model or estimate treatment effect in light of PS adjustment / matching / stratification
- Estimate sensitivity of results to potential hidden bias

What I Discussed This Evening According to the Abstract
1. Investigating the issue of selection bias well, and how selection bias can affect results,
2. Assessing the degree of overlap with regard to background characteristics in the two exposure groups, to determine if an effect can be sensibly estimated using the available data,
3. Estimating treatment effects adjusting for observed differences in background characteristics using propensity scores,
4. Assessing whether the propensity score worked well, in the sense of producing fair comparison groups.

Where You Can Go for More Information
http://www.chrp.org/propensity
Free issue of Health Services and Outcomes Research Methodology (December 2001, v2 issues 3-4): go to http://www.kluweronline.com/issn/1387-3741
I especially recommend the article by DB Rubin (using PS in designing a complex observational study) and the article by Landrum and Ayanian (PS vs. Instrumental Variables).
Rosenbaum PR (2002) Observational Studies, Springer.
Email me at thomaslove@case.edu