Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD

1 Speaker, First Plenary Session
THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD?
William H. Crown, PhD
Optum Labs, Cambridge, MA, USA
Statistical Methods and Machine Learning
ISPOR International Meeting, Montreal, Canada

2 Overview
Explosion in Data Availability
Traditional Methods for Analyzing Observational Data
Machine Learning Methods
Widely used outside of health care, especially in consumer retail
Many methods
Model development and testing approach
Is more data better?
Traditional focus on prediction versus estimation of treatment effects
How Can We Find the Best of Both?

Confidential property of Optum. Do not distribute or reproduce without express permission from Optum.

The Growing Availability of Data

3 Market Context
[Figure labels: Volume, Velocity, Variety, Complexity, Future (Gartner model, adapted)]
Data sources shown:
Tests and Treatments (Medical, Lab, Pharmacy Claims, Standardized Costs)
Health Risk Assessments
Socioeconomic (Race, Income, Education, Language, ...)
Vital Signs
Medication Orders
Admissions, Discharges, Transfers
Patient Health Survey (PHQ-9)
Health Survey Measurement (SF-12, SF-36)
Care Coaching Engagements
Evidence Based Medicine (Recommended Care Pathways)
Mobile Applications / Social Networking
Medical Research
Genomic

Examples of Data Partnerships
[Figure labels: Multi-Stakeholder; Life Sciences/Data and Analytics; PCORI CDRN; PCORI PCORnet; FDA Sentinel; Government; Life Sciences/Payer; Delivery System/Partners; Payer/IT; Life Sciences/PBM]

4 Traditional Health Services Research and Epidemiological Methods
Statistical Analysis of Observational Data
Good methods exist for developing well-matched control groups (e.g., propensity scores), but there are no magic bullets.
These methods control only for observables. They do not control for endogeneity or for confounding from unobservables.
Johnson M., Crown W., Martin B., Dormuth C., Siebert U. Good Research Practices for Comparative Effectiveness Research: Analytic Methods to Improve Causal Inference from Nonrandomized Studies of Treatment Effects Using Secondary Data Sources. Report of the ISPOR Retrospective Database Analysis Task Force Part III. Value in Health 12(8).
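To illustrate how a matched control group might be built from propensity scores, here is a minimal sketch. It assumes the scores have already been estimated, and the function name `greedy_match`, the caliper of 0.05, and all score values are hypothetical choices for illustration, not anything taken from the talk. As the slide cautions, matching of this kind balances only what went into the score, that is, observed covariates.

```python
def greedy_match(treated_scores, control_scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Returns {treated_index: control_index} for pairs whose propensity
    scores differ by at most `caliper`. Treated units with no control
    inside the caliper are left unmatched.
    """
    available = set(range(len(control_scores)))
    matches = {}
    for t_index, t_score in enumerate(treated_scores):
        if not available:
            break
        # Closest still-unused control on the propensity score.
        nearest = min(available, key=lambda c: abs(control_scores[c] - t_score))
        if abs(control_scores[nearest] - t_score) <= caliper:
            matches[t_index] = nearest
            available.remove(nearest)
    return matches


# Hypothetical propensity scores (illustrative values only).
treated = [0.80, 0.60, 0.95]
controls = [0.10, 0.58, 0.79, 0.30]
pairs = greedy_match(treated, controls)
print(pairs)
```

The third treated unit has no control within the caliper and is dropped, which is one way matched analyses lose sample in exchange for comparability.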

5 Machine Learning Methods
Many methods:
Classification Trees
Neural Networks
Random Forests
Ridge and Lasso Regression
Support Vector Machines
And many others
Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. New York: Springer.

6 Basic Approach
Use learning datasets to develop a highly accurate classification algorithm.
Apply the algorithm to another dataset to predict classification.
Rules should be as simple as possible while maintaining accuracy.
The algorithm should be able to classify data without human intervention.
The algorithm should be efficient with very large datasets.
Source: Rob Schapire. Machine Learning Algorithms for Classification.

7 K-Fold Cross-Validation
Randomly divide the full dataset into learning/validation datasets.
Randomly divide the learning/validation data into K equal subsamples (typically 5 or 10).
For each subsample k = 1, ..., K, fit the model using the other K-1 subsamples.
Estimate the prediction error (e.g., sum of squared errors) for subsample k using the model estimated from the other K-1 subsamples.
Pick the model specification that generates the lowest average cross-validation error.
Estimate the final model using the full dataset.

Classification and Regression Trees
Advantages
Easily handle huge datasets
Can include both qualitative and quantitative predictor variables
Very good for missing or sparse data
Small trees are easy to interpret
Disadvantages
Large trees are difficult to interpret
Overall prediction performance tends to be poor
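The K-fold cross-validation steps listed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the speaker's procedure: the simulated data, the two candidate specifications (`fit_mean` and `fit_linear`), and the helper `k_fold_cv_error` are all hypothetical, and mean squared error per fold stands in for the prediction-error measure.

```python
import random


def fit_mean(xs, ys):
    """Intercept-only specification: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Simple least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def k_fold_cv_error(fit, xs, ys, k, rng):
    """Average held-out mean squared error over k folds."""
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsamples
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sse = sum((ys[i] - model(xs[i])) ** 2 for i in held_out)
        fold_errors.append(sse / len(held_out))
    return sum(fold_errors) / k


# Simulated learning data (assumption): y depends linearly on x plus noise.
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + data_rng.gauss(0, 1) for x in xs]

specs = {"mean_only": fit_mean, "linear": fit_linear}
# Reuse the same seed so both specifications see identical folds.
cv_errors = {
    name: k_fold_cv_error(fit, xs, ys, k=5, rng=random.Random(1))
    for name, fit in specs.items()
}
best = min(cv_errors, key=cv_errors.get)  # lowest average CV error
final_model = specs[best](xs, ys)         # refit on the full dataset
print(best, cv_errors)
```

Each specification is scored only on observations it was not fitted to, which is what lets the comparison penalize overfitting rather than reward it.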

8 [Two figure slides. Source: Rob Schapire. Machine Learning Algorithms for Classification.]

9 Approach
Pick a rule to subset the data.
Using the rule, divide the data into subsets.
Keep repeating until the remaining subsets are almost pure (e.g., as measured by entropy or the Gini index).
The usual approach is to build a very large tree and then prune it back.

Neural Networks
[Diagram, three layers]
Input Layer: X_1, X_2, X_3
Hidden Layer: Z_1, Z_2, Z_3, Z_4, with $Z_i = f(b_k X_k)$
Outcome Layer: Y_1, Y_2, with $Y_i = e^{z}/(1+e^{z})$
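One step of the tree-building approach above (pick a rule, divide the data, measure purity) might look like the following sketch. It assumes a single numeric predictor, a binary outcome, and the Gini index as the purity measure; the function names and the toy data are hypothetical. Repeating the step on each resulting subset, and pruning afterwards, is not shown.

```python
def gini(labels):
    """Gini impurity for 0/1 labels: 1 - p0^2 - p1^2 = 2 * p1 * (1 - p1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2.0 * p1 * (1.0 - p1)


def best_split(values, labels):
    """Find the threshold t (rule: value <= t) with the lowest weighted impurity.

    Returns (threshold, weighted_impurity); threshold is None when no
    split improves on the impurity of the unsplit data.
    """
    best_threshold, best_impurity = None, gini(labels)
    for t in sorted(set(values))[:-1]:  # the largest value cannot split
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_threshold, best_impurity = t, weighted
    return best_threshold, best_impurity


# Toy data (assumption): age as the predictor, 1 = event occurred.
ages = [25, 32, 47, 51, 62, 70]
events = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(ages, events)
print(threshold, impurity)
```

In this toy example the rule "age <= 47" produces two pure subsets, so the search would stop; real data rarely separate this cleanly, which is why large trees are grown and then pruned.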

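A forward pass through the three-layer network in the diagram (3 inputs, 4 hidden units, 2 outcomes) can be sketched as follows. Several things here are assumptions: the hidden-layer function f is taken to be logistic, each unit uses a weighted sum of its inputs, bias terms are omitted, and every weight value is made up for illustration.

```python
import math


def logistic(z):
    # Algebraically equal to e^z / (1 + e^z), the outcome-layer form on the slide.
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, output_weights):
    """Compute outcome-layer values for one observation.

    hidden_weights: one row of input weights per hidden unit.
    output_weights: one row of hidden-unit weights per outcome.
    """
    hidden = [
        logistic(sum(w * x for w, x in zip(row, inputs)))
        for row in hidden_weights
    ]
    return [
        logistic(sum(v * h for v, h in zip(row, hidden)))
        for row in output_weights
    ]


# Hypothetical weights: 4 hidden units x 3 inputs, 2 outcomes x 4 hidden units.
hidden_weights = [
    [0.2, -0.1, 0.4],
    [-0.3, 0.8, 0.1],
    [0.5, 0.5, -0.6],
    [0.0, -0.2, 0.9],
]
output_weights = [
    [1.0, -1.0, 0.5, 0.3],
    [-0.7, 0.2, 0.9, -0.4],
]
outputs = forward([1.0, 0.5, -1.5], hidden_weights, output_weights)
print(outputs)
```

Training, meaning the choice of weights that minimizes prediction error on the learning data, is the hard part and is not shown here.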
10 Prediction Is Not the Same as Estimating Treatment Effects
Some machine learning methods (e.g., Ridge and Lasso regression) use regression methods with a penalty term to adjust for the danger of overfitting.
This enables application of the machine learning approach to the estimation of treatment effects.
But we know that results from observational studies can be sensitive to spurious correlations and to the methodological approach.

Things With Strong Trends Will Tend to Be Highly Correlated (1)
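To make the penalty-term idea concrete, here is a minimal ridge sketch for the simplest possible case: one centered predictor and no intercept, where minimizing sum((y - b*x)^2) + penalty * b^2 gives b = sum(x*y) / (sum(x^2) + penalty). The function name and data values are hypothetical, and real applications involve many predictors and a penalty chosen by cross-validation.

```python
def ridge_slope(xs, ys, penalty):
    """Closed-form ridge coefficient for one centered predictor, no intercept.

    With penalty = 0 this is the ordinary least-squares slope; larger
    penalties shrink the coefficient toward zero.
    """
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)


# Toy data (assumption): roughly y = 2x, with x already centered at zero.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

for penalty in (0.0, 1.0, 10.0):
    print(penalty, ridge_slope(xs, ys, penalty))
```

The shrinkage that protects against overfitting also biases the coefficient toward zero, which matters when the coefficient is meant to be read as a treatment effect rather than used only for prediction.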

11 Things With Strong Trends Will Tend to Be Highly Correlated (2)
[Figure]

Small Sample Sizes Can Generate Some Really Weird Findings
[Figure]
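The point about trending series can be reproduced with a small simulation. This sketch assumes two series that share nothing except an upward time trend, with independent noise; the trend slopes, noise levels, and seed are arbitrary choices for illustration.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def first_differences(series):
    """Period-to-period changes, which remove a linear trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


rng = random.Random(42)
periods = range(100)
# Two unrelated series: each is a time trend plus its own independent noise.
series_a = [1.0 * t + rng.gauss(0, 3) for t in periods]
series_b = [0.5 * t + rng.gauss(0, 2) for t in periods]

corr_levels = pearson(series_a, series_b)
corr_changes = pearson(first_differences(series_a), first_differences(series_b))
print(corr_levels, corr_changes)
```

The levels are strongly correlated purely because both rise over time, while the period-to-period changes, which strip out the trend, show a much weaker association.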

12 But Even With Big Data You Have To Be Careful!
[Figure: cumulative incidence of MI by months of follow-up, statin initiators versus statin non-initiators, two panels]
MI Outcome (Unmatched): HR=2.11, 111% (46%-204%) risk increase
MI Outcome (After Matching): HR=0.69, 31% (7%-48%) risk reduction
Seeger, John, Alexander Walker, Paige Williams, Gordon Saperia, Frank Sacks (2003) A Propensity Score-Matched Cohort Study of the Effect of Statins, Mainly Fluvastatin, on the Occurrence of Acute Myocardial Infarction. Am J. Cardiol 92.

Bigger Samples Don't Reduce Bias
[Figure: estimation error of IV and OLS estimators, N=200 versus N=10,000]
Crown, W., Henk, H., VanNess D. Some Cautions on the Use of Instrumental Variables (IV) Estimators in Outcomes Research: How Bias in IV Estimators is Affected by Instrument Strength, Instrument Contamination, and Sample Size. Value in Health 14.
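Why a bigger sample does not remove bias can be shown with a deliberately simplified simulation. This is not the instrumental-variables simulation from the cited paper: it assumes a single unobserved confounder `u` that drives both treatment and outcome, and compares an ordinary least-squares estimate that omits `u` at N=200 and N=10,000. All coefficients and the seed are arbitrary.

```python
import random

TRUE_EFFECT = 1.0  # assumed true effect of treatment on outcome


def ols_slope(xs, ys):
    """Least-squares slope from regressing ys on xs (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def estimated_effect(n, rng):
    """Simulate n observations and estimate the effect while omitting u."""
    treatment, outcome = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                       # unobserved confounder
        t = u + rng.gauss(0, 1)                   # treatment depends on u
        y = TRUE_EFFECT * t + 2.0 * u + rng.gauss(0, 1)
        treatment.append(t)
        outcome.append(y)
    return ols_slope(treatment, outcome)


rng = random.Random(7)
bias_small = estimated_effect(200, rng) - TRUE_EFFECT
bias_large = estimated_effect(10_000, rng) - TRUE_EFFECT
print(bias_small, bias_large)
```

Under these assumptions the estimate converges to 2.0 rather than 1.0, so the larger sample yields a more precise estimate of the wrong number: the noise shrinks, the bias does not.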

13 EHR/Claims Linkages Can Help Reduce Missing Variable Bias
Relevant information | Claims alone | EHR alone | Linked data
Clinical data and severity measures + +
Retail/specialty drugs across treatment settings + +
Leakage + +
Patient-reported outcomes
Selection biases due to payer type + +
Longitudinality of patient follow-up + ++
Self-pay data + +
Coding biases + +
Unstructured data + +
Timing of events + +
Continuous coverage + ++

Data from disparate sources can be linked and de-identified
Direct identifiers (EMR / Clinical): Name, Address, Birthdate, SSN, Phone, etc.
Direct identifiers (Insurer example): Name, Address, Birthdate, Member ID, Phone, etc.
Primary hash: uses a shared salt code (same for all contributors). Data is then hashed by contributors at their site.
Secondary hash: uses a confidential salt.
De-identification: statistically de-identified views.
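The two-stage hashing described above might be sketched as follows. This is an illustration under assumptions, not the actual linkage process: SHA-256, the normalization rule, the choice of name and birthdate as the linking fields, and both salt values are hypothetical.

```python
import hashlib

# Hypothetical salt values; real salts would be secret and managed securely.
SHARED_SALT = "shared-salt-example"              # same for all contributors
CONFIDENTIAL_SALT = "confidential-salt-example"  # known only to the linker


def normalize(value):
    """Lowercase and strip all whitespace so trivial variations still match."""
    return "".join(value.lower().split())


def primary_hash(name, birthdate, salt=SHARED_SALT):
    """Run at each contributor's site on direct identifiers."""
    key = "|".join([salt, normalize(name), normalize(birthdate)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def secondary_hash(primary_token, salt=CONFIDENTIAL_SALT):
    """Run centrally, so contributors cannot recompute the final token."""
    key = salt + "|" + primary_token
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# The same person as recorded by an EMR source and by an insurer.
emr_token = secondary_hash(primary_hash("Jane  Doe", "1970-01-31"))
claims_token = secondary_hash(primary_hash("jane doe", "1970-01-31"))
other_token = secondary_hash(primary_hash("John Roe", "1982-06-15"))
print(emr_token == claims_token, emr_token == other_token)
```

Because every contributor applies the same shared salt, records for one person produce the same primary token at every site and can be joined without exchanging the identifiers themselves. Hashing alone is not de-identification, which is why the slide lists statistically de-identified views as a separate step.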

14 Summary
Rapid expansion in data (volume, velocity, and variety).
Machine learning approaches focus on prediction, but some can also be used to estimate treatment effects.
Machine learning methods offer opportunities for speed to answer, but the traditional challenges with observational data do not go away.
More data doesn't help with bias problems unless it helps with control variables through data linkage.
For treatment effect estimation, we still need to think about possible sources of bias and their implications for the methodology and the data used for model building.
