Targeted Learning with Big Data




Targeted Learning with Big Data
Mark van der Laan, UC Berkeley
Center for Philosophy and History of Science, "Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge," February 20, 2014

Outline
1. Targeted Learning
2. Two-stage methodology: Super Learning + TMLE
3. Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions
4. Variable importance analysis examples of Targeted Learning
5. Scaling up Targeted Learning to handle Big Data
6. Concluding remarks

Foundations of the statistical estimation problem
Observed data: realizations of random variables with a probability distribution.
Statistical model: the set of possible data-generating distributions, defined by actual knowledge about the data; e.g., in an RCT, we know the probability of each subject receiving treatment.
Statistical target parameter: the function of the data-generating distribution that we wish to learn from the data.
Estimator: an a priori-specified algorithm that takes the observed data and returns an estimate of the target parameter, benchmarked by a dissimilarity measure (e.g., MSE) with respect to the target parameter.
Inference: establish the limit distribution of the estimator and base statistical inference on it.

Causal inference
Relies on non-testable assumptions in addition to the assumptions defining the statistical model (e.g., the no-unmeasured-confounders assumption).
Defines the causal quantity and establishes its identifiability under these assumptions; this process generates interesting statistical target parameters.
Allows a causal interpretation of the statistical parameter/estimand.
Even if we don't believe the non-testable causal assumptions, the statistical estimation problem remains the same, and the estimands still have valid statistical interpretations.

Targeted learning
Defines valid (and thus LARGE) semiparametric statistical models and interesting target parameters.
Deals exactly with the statistical challenges of high-dimensional and large data sets (Big Data).
Avoids reliance on human art and unrealistic (e.g., parametric) models.
Uses a plug-in estimator based on a fit of the (relevant part of the) data-generating distribution targeted toward the parameter of interest.
Is semiparametric efficient and robust, and comes with statistical inference.
Has been applied to: static or dynamic treatments, direct and indirect effects, parameters of MSMs, variable importance analysis in genomics, longitudinal/repeated-measures data with time-dependent confounding, censoring/missingness, case-control studies, RCTs, and networks.

Targeted Learning Book, Springer Series in Statistics, van der Laan & Rose. targetedlearningbook.com

The first chapter, "Models, Inference, and Truth" by R.J.C.M. Starmans, provides a historical and philosophical perspective on Targeted Learning. It discusses the erosion of the notions of model and truth throughout history and the resulting lack of a unified approach in statistics, and it stresses the importance of a reconciliation between machine learning and statistical inference, as provided by Targeted Learning.

Outline
1. Targeted Learning
2. Two-stage methodology: Super Learning + TMLE
3. Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions
4. Variable importance analysis examples of Targeted Learning
5. Scaling up Targeted Learning to handle Big Data
6. Concluding remarks

Two-stage methodology
Super learning (SL) (van der Laan et al. (2007), Polley et al. (2012), Polley and van der Laan (2012)):
Uses a library of candidate estimators (e.g., multiple parametric models, machine learning algorithms such as neural networks, random forests, etc.).
Builds a data-adaptive weighted combination of the candidate estimators using cross-validation.
Targeted maximum likelihood estimation (TMLE) (van der Laan and Rubin (2006)):
Updates the initial estimate, often a Super Learner fit, to remove bias for the parameter of interest.
Calculates the final parameter estimate from the updated fit of the data-generating distribution.

Super learning
No need to choose a priori a particular parametric model or machine learning algorithm for a particular problem.
Allows one to combine many data-adaptive estimators into one improved estimator.
Grounded by oracle results for loss-function-based cross-validation (Van der Laan and Dudoit (2003), van der Vaart et al. (2006)); the loss function needs to be bounded.
Performs asymptotically as well as the best (oracle) weighted combination of the candidates, or achieves a near-parametric rate of convergence when one of the candidates is a correctly specified parametric model. A minimal sketch of the cross-validation-plus-weighting idea follows below.
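
The sketch below illustrates the two steps described above: cross-validated predictions for each candidate in a library, then a non-negative weighted combination minimizing cross-validated squared error. It is an illustration only, not the SuperLearner R package; the two-candidate library, the use of non-negative least squares for the weights, and the toy data are all assumptions.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

def super_learner(X, y, candidates, n_splits=10, seed=0):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    Z = np.zeros((len(y), len(candidates)))  # cross-validated predictions
    for train, test in kf.split(X):
        for j, est in enumerate(candidates):
            Z[test, j] = clone(est).fit(X[train], y[train]).predict(X[test])
    w, _ = nnls(Z, y)                        # non-negative CV-MSE weights
    w = w / w.sum() if w.sum() > 0 else np.ones(len(w)) / len(w)
    fitted = [clone(est).fit(X, y) for est in candidates]  # refit on all data
    def predict(Xnew):
        return np.column_stack([m.predict(Xnew) for m in fitted]) @ w
    return w, predict

# Toy usage with a hypothetical two-candidate library.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)
w, sl_predict = super_learner(
    X, y, [LinearRegression(), RandomForestRegressor(n_estimators=100)])
print("library weights:", w)
```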

Super learning
Figure: relative cross-validated mean squared error (compared to main-terms least squares regression).

Super learning

TMLE algorithm

TMLE algorithm: Formal Template
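
As a concrete illustration of the TMLE template, here is a hedged sketch for the canonical example of the average treatment effect $E_0\{\bar Q_0(1,W) - \bar Q_0(0,W)\}$ with binary treatment $A$ and binary outcome $Y$: an initial fit of the outcome regression and treatment mechanism, then a one-dimensional logistic fluctuation along the "clever covariate" that removes bias for the target parameter. The simulated data and the plain logistic regressions for the initial fits are illustrative assumptions; in practice the initial fits would come from the Super Learner.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
W = rng.normal(size=(n, 3))
A = rng.binomial(1, 1 / (1 + np.exp(-(0.4 * W[:, 0] - 0.3 * W[:, 1]))))
Y = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * A + 0.5 * W[:, 0] - 0.4 * W[:, 2]))))

# Step 1: initial estimators of Qbar(A, W) = E(Y | A, W) and g(W) = P(A=1 | W).
Qfit = LogisticRegression(max_iter=1000).fit(np.column_stack([A, W]), Y)
gfit = LogisticRegression(max_iter=1000).fit(W, A)
g1 = np.clip(gfit.predict_proba(W)[:, 1], 0.01, 0.99)
Q1 = Qfit.predict_proba(np.column_stack([np.ones(n), W]))[:, 1]   # Qbar(1, W)
Q0 = Qfit.predict_proba(np.column_stack([np.zeros(n), W]))[:, 1]  # Qbar(0, W)
QA = np.where(A == 1, Q1, Q0)

# Step 2: targeting step -- fluctuate the initial fit along the clever
# covariate H(A, W) via an intercept-free logistic regression with offset.
H = A / g1 - (1 - A) / (1 - g1)
eps = sm.GLM(Y, H.reshape(-1, 1), offset=np.log(QA / (1 - QA)),
             family=sm.families.Binomial()).fit().params[0]

def expit(x):
    return 1 / (1 + np.exp(-x))

Q1s = expit(np.log(Q1 / (1 - Q1)) + eps / g1)        # updated Qbar(1, W)
Q0s = expit(np.log(Q0 / (1 - Q0)) - eps / (1 - g1))  # updated Qbar(0, W)

# Step 3: plug-in estimate of the ATE from the targeted fit.
print("TMLE ATE estimate:", np.mean(Q1s - Q0s))
```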

Outline
1. Targeted Learning
2. Two-stage methodology: Super Learning + TMLE
3. Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions
4. Variable importance analysis examples of Targeted Learning
5. Scaling up Targeted Learning to handle Big Data
6. Concluding remarks

General Longitudinal Data Structure
We observe n i.i.d. copies of a longitudinal data structure $O = (L(0), A(0), \ldots, L(K), A(K), Y = L(K+1))$, where $A(t)$ denotes a discrete-valued intervention node, $L(t)$ is an intermediate covariate realized after $A(t-1)$ and before $A(t)$, $t = 0, \ldots, K$, and $Y$ is a final outcome of interest. For example, $A(t) = (A_1(t), A_2(t))$ could be a vector of two binary indicators of censoring and treatment, respectively.

Likelihood and Statistical Model
The probability distribution $P_0$ of $O$ can be factorized according to the time-ordering as
$$P_0(O) = \prod_{t=0}^{K+1} P_0(L(t) \mid Pa(L(t))) \prod_{t=0}^{K} P_0(A(t) \mid Pa(A(t))) \equiv \prod_{t=0}^{K+1} Q_{0,L(t)}(O) \prod_{t=0}^{K} g_{0,A(t)}(O) \equiv Q_0 g_0,$$
where $Pa(L(t)) \equiv (\bar L(t-1), \bar A(t-1))$ and $Pa(A(t)) \equiv (\bar L(t), \bar A(t-1))$ denote the parents of $L(t)$ and $A(t)$ in the time-ordered sequence, respectively. The $g_0$-factor represents the intervention mechanism, e.g., the treatment and right-censoring mechanism. Statistical Model: we make no assumptions on $Q_0$, but could make assumptions on the intervention mechanism $g_0$ (e.g., $g_0$ is known in an RCT).
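
The time-ordered factorization is exactly a recipe for simulating $O$: draw each $L(t)$ and $A(t)$ in turn from its conditional distribution given its parents. A small sketch under a hypothetical data-generating law with $K = 1$ (all coefficients are made up for illustration):

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def draw_O(rng):
    L0 = rng.normal()                               # Q_{0,L(0)}
    A0 = rng.binomial(1, expit(0.5 * L0))           # g_{0,A(0)}( . | Pa(A(0)))
    L1 = rng.normal(loc=0.7 * L0 + 0.5 * A0)        # Q_{0,L(1)}( . | Pa(L(1)))
    A1 = rng.binomial(1, expit(0.5 * L1 - 0.2 * A0))
    Y = rng.binomial(1, expit(A0 + A1 + 0.3 * L1))  # Y = L(2), via Q_{0,L(2)}
    return L0, A0, L1, A1, Y

rng = np.random.default_rng(5)
print([draw_O(rng) for _ in range(3)])
```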

Statistical Target Parameter: G-computation Formula for the Post-Intervention Distribution
Let
$$P^d(l) = \prod_{t=0}^{K+1} Q^d_{L(t)}(\bar l(t)), \qquad (1)$$
where $Q^d_{L(t)}(\bar l(t)) = Q_{L(t)}(l(t) \mid \bar L(t-1) = \bar l(t-1), \bar A(t-1) = \bar d(t-1))$. Let $L_d = (L(0), L_d(1), \ldots, Y_d = L_d(K+1))$ denote the random variable with probability distribution $P^d$. This is the so-called G-computation formula for the post-intervention distribution corresponding with the dynamic intervention $d$.

Example: When to switch a failing drug regimen in HIV-infected patients
Observed data on a unit: $O = (L(0), A(0), L(1), A(1), \ldots, L(K), A(K), A_2(K)Y)$, where $L(0)$ is the baseline history, $A(t) = (A_1(t), A_2(t))$, $A_1(t)$ is the indicator of switching the drug regimen, $A_2(t)$ is the indicator of being right-censored, $t = 0, \ldots, K$, and $Y$ is the indicator of observing death by time $K+1$. Define intervention nodes $A(0), \ldots, A(K)$ and dynamic intervention rules $d_\theta$ that switch when the CD4 count drops below $\theta$ and enforce no censoring. Our target parameter is defined as the projection of $(E(Y_{d_\theta}(t)) : t, \theta)$ onto a working model $m_\beta(\theta)$ with parameter $\beta$.

A Sequential Regression G-computation Formula (Bang, Robins, 2005)
By the iterative conditional expectation rule (tower rule), we have
$$E_{P^d} Y_d = E(\cdots E(E(Y_d \mid \bar L_d(K)) \mid \bar L_d(K-1)) \cdots \mid L(0)).$$
In addition, the conditional expectation given $\bar L_d(K)$ is equivalent with conditioning on $\bar L(K), \bar A(K-1) = \bar d(K-1)$.

In this manner, one can represent $E_{P^d} Y_d$ as an iterated conditional expectation: first take the conditional expectation given $\bar L_d(K)$ (equivalent with $\bar L(K), \bar A(K-1)$), then take the conditional expectation given $\bar L_d(K-1)$ (equivalent with $\bar L(K-1), \bar A(K-2)$), and so on, until the conditional expectation given $L(0)$; finally, take the mean over $L(0)$. We developed a targeted plug-in estimator (TMLE) of general summary measures of dose-response curves $(E Y_d : d \in \mathcal{D})$ (Petersen et al., 2013; van der Laan and Gruber, 2012). An illustrative sketch of the iterated-regression recipe follows below.
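
The sketch below implements the iterated conditional expectation recipe for a two-time-point example under the static rule "always treat." The data-generating law and the use of plain linear regressions are illustrative assumptions; in practice each regression would be a Super Learner fit and the result would be targeted with TMLE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 5000
L0 = rng.normal(size=n)
A0 = rng.binomial(1, 1 / (1 + np.exp(-0.5 * L0)))
L1 = 0.7 * L0 + 0.5 * A0 + rng.normal(size=n)
A1 = rng.binomial(1, 1 / (1 + np.exp(-0.5 * L1)))
Y = 1.0 * A0 + 0.8 * A1 + 0.6 * L1 + rng.normal(size=n)

# Innermost regression: E(Y | L(0), A(0), L(1), A(1)), evaluated at A(1) = 1.
m2 = LinearRegression().fit(np.column_stack([L0, A0, L1, A1]), Y)
Q2 = m2.predict(np.column_stack([L0, A0, L1, np.ones(n)]))

# Next regression: E(Q2 | L(0), A(0)), evaluated at A(0) = 1.
m1 = LinearRegression().fit(np.column_stack([L0, A0]), Q2)
Q1 = m1.predict(np.column_stack([L0, np.ones(n)]))

# Final step: marginal mean over L(0).  Under this law the truth is
# 1.0 + 0.8 + 0.6 * E[L(1) | do(A(0)=1)] = 1.8 + 0.6 * 0.5 = 2.1.
print("G-computation estimate of E[Y_d]:", Q1.mean())
```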

Outline
1. Targeted Learning
2. Two-stage methodology: Super Learning + TMLE
3. Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions
4. Variable importance analysis examples of Targeted Learning
5. Scaling up Targeted Learning to handle Big Data
6. Concluding remarks

Variable Importance: Problem Description (Diaz, Hubbard)
Around 800 patients entered the emergency room with severe trauma. About 80 physiological and clinical variables were measured at 0, 6, 12, 24, 48, and 72 hours after admission. The objective is to predict the most likely medical outcome of a patient (e.g., survival) and to provide an ordered list of the covariates that drive this prediction (variable importance); this will help doctors decide which variables are relevant at each time point. Variables are subject to missingness. Variables are continuous, and the variable importance parameter is $\Psi(P_0) = E_0\{E_0(Y \mid A + \delta, W) - E_0(Y \mid A, W)\}$ for a user-given value $\delta$; a plug-in sketch of this parameter follows below.
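
A hedged plug-in sketch of this shift parameter on simulated data: fit a regression of $Y$ on $(A, W)$, then contrast the predictions at $A + \delta$ versus $A$ and average. The random-forest initial fit is an arbitrary choice, and in Targeted Learning this initial plug-in would be debiased by a TMLE targeting step.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 1000
W = rng.normal(size=(n, 4))
A = 0.5 * W[:, 0] + rng.normal(size=n)   # continuous exposure
Y = 2.0 * A + W[:, 1] + rng.normal(size=n)
delta = 0.1

# Fit E(Y | A, W), then average the contrast at A + delta versus A.
X = np.column_stack([A, W])
fit = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, Y)
psi_hat = np.mean(fit.predict(np.column_stack([A + delta, W])) - fit.predict(X))
print("plug-in estimate:", psi_hat, "(truth here is 2.0 * delta = 0.2)")
```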

Variable Importance: Results
Figure: effect sizes and significance codes for HCT, HGB, APC, CREA, ISS, PTT, GCS, HR, SBP, PLTS, RR, BUN, PT, DDIM, PF12NM, INR, FVIII, BDE, PC, ATIII, and FV at hours 0, 6, 12, 24, 48, and 72.

TMLE with Genomic Data
570 case-control samples on spina bifida. We want to identify genes associated with spina bifida from 115 SNPs. In the original paper (Shaw et al., 2009), a univariate analysis was performed. The original analysis missed rs1001761 and rs2853532 in the TYMS gene because they are closely linked with counteracting effects on spina bifida; with TMLE, the signals from these two SNPs were recovered. In the TMLE, $Q_0$ was obtained from the LASSO, and $g_0(W)$ was obtained from a simple regression of each SNP on its two flanking markers, to account for the confounding effects of neighboring markers.

TMLE p-values for SNPs
Figure: $-\log_{10}$ p-values for the 115 SNPs, grouped by gene (MTHFR, MTHFD2, BHMT2, BHMT, DHFR, NOS3, FOLR1, FOLR2, MTHFD1, TYMS, CBS, RFC1, among others).

Outline
1. Targeted Learning
2. Two-stage methodology: Super Learning + TMLE
3. Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions
4. Variable importance analysis examples of Targeted Learning
5. Scaling up Targeted Learning to handle Big Data
6. Concluding remarks

Targeted Learning of Data-Dependent Target Parameters (van der Laan, Hubbard, 2013)
Define algorithms that map data into a target parameter, $\Psi_{P_n}: \mathcal{M} \to \mathbb{R}^d$, thereby generalizing the notion of a target parameter. Develop methods to obtain statistical inference for $\Psi_{P_n}(P_0)$ or $\frac{1}{V}\sum_{v=1}^{V} \Psi_{P_{n,v}}(P_0)$, where $P_{n,v}$ is the empirical distribution of the parameter-generating sample corresponding with the $v$-th sample split. We have developed a cross-validated TMLE for the latter data-dependent target parameter, without any additional conditions. In particular, this generalized framework allows us to generate a subset of target parameters among a massive set of candidate target parameters, while only having to deal with multiple testing for the data-adaptively selected set of target parameters; this is much more powerful than regular multiple testing for a fixed set of null hypotheses. A schematic of the sample-splitting idea follows below.
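
The schematic below illustrates only the split between parameter-generating and estimation samples, not the CV-TMLE itself: in each of $V$ splits, the parameter-generating sample data-adaptively selects the target (here, the covariate most correlated with the outcome), the estimation sample estimates it, and the fold-specific estimates are averaged. All data and the selection rule are toy assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n, p = 400, 20
X = rng.normal(size=(n, p))
Y = 1.5 * X[:, 7] + rng.normal(size=n)

estimates = []
for gen_idx, est_idx in KFold(n_splits=10, shuffle=True,
                              random_state=0).split(X):
    # Parameter-generating sample: data-adaptively select the target.
    cors = [abs(np.corrcoef(X[gen_idx, j], Y[gen_idx])[0, 1]) for j in range(p)]
    j_star = int(np.argmax(cors))
    # Estimation sample: estimate the coefficient of the selected covariate.
    m = LinearRegression().fit(X[est_idx][:, [j_star]], Y[est_idx])
    estimates.append(m.coef_[0])

print("average over splits:", np.mean(estimates))
```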

Online Targeted MLE: Ongoing work
Order the data if not ordered naturally, and partition them into subsets numbered 1 to K. Initiate the initial estimator and TMLE on the first subset. Update the initial estimator on the second subset, and update the TMLE on the second subset; iterate until the last subset. The final estimator is the average of all the stage-specific TMLEs. In this manner, the number of calculations for each subset is bounded by the number of observations in that subset, and the total computation time increases linearly in the number of subsets. One can still prove asymptotic efficiency of this online TMLE. The sketch below illustrates the chunked update-and-average pattern.
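
A schematic of the update-and-average pattern on toy data: the SGD-based initial estimator and the plain plug-in contrast per subset stand in for the actual online TMLE, whose subset-specific targeting step is omitted here.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(6)
n, K = 100_000, 10
W = rng.normal(size=(n, 3))
A = rng.binomial(1, 0.5, size=n)
Y = 1.0 * A + 0.5 * W[:, 0] + rng.normal(size=n)
X = np.column_stack([A, W])

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)
chunk_estimates = []
for Xk, Yk in zip(np.array_split(X, K), np.array_split(Y, K)):
    model.partial_fit(Xk, Yk)   # update the initial estimator on subset k
    X1, X0 = Xk.copy(), Xk.copy()
    X1[:, 0], X0[:, 0] = 1, 0
    # Subset-specific plug-in estimate (a TMLE targeting step would go here).
    chunk_estimates.append(np.mean(model.predict(X1) - model.predict(X0)))

print("online estimate (average of subset estimates):",
      np.mean(chunk_estimates))
```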

Outline
1. Targeted Learning
2. Two-stage methodology: Super Learning + TMLE
3. Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions
4. Variable importance analysis examples of Targeted Learning
5. Scaling up Targeted Learning to handle Big Data
6. Concluding remarks

Concluding remarks
Sound foundations of statistics are in place (data as a random variable, model, target parameter, inference based on a limit distribution), but these have eroded over many decades. However, MLE had to be revamped into TMLE to deal with large models. Big Data asks for the development of fast TMLEs without giving up on statistical properties, e.g., online TMLE. Big Data asks for research teams that consist of top statisticians and computer scientists, beyond subject-matter experts. Philosophical soundness of proposed methods is hugely important and should become a norm.

References I
H. Bang and J. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962-972, 2005.
M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. van der Laan. Targeted minimum loss based estimation of marginal structural working models. Journal of Causal Inference, submitted, 2013.
E. Polley and M. van der Laan. SuperLearner: Super Learner Prediction, 2012. URL http://cran.r-project.org/package=superlearner. R package version 2.0-6.
E. Polley, S. Rose, and M. van der Laan. Super learning. Chapter 3 in Targeted Learning, van der Laan and Rose (2012), 2012.

References II
M. Schnitzer, E. Moodie, M. van der Laan, R. Platt, and M. Klein. Marginal structural modeling of a survival outcome with targeted maximum likelihood estimation. Biometrics, to appear, 2013.
M. van der Laan and S. Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. UC Berkeley Division of Biostatistics Working Paper Series, Working Paper 130, 2003.
M. J. van der Laan and S. Gruber. Targeted minimum loss based estimation of causal effects of multiple time point interventions. The International Journal of Biostatistics, 8(1), Jan. 2012. ISSN 1557-4679. doi: 10.1515/1557-4679.1370.

References III
M. J. van der Laan and D. Rubin. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1), Jan. 2006. ISSN 1557-4679. doi: 10.2202/1557-4679.1043.
M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1), Jan. 2007. ISSN 1544-6115. doi: 10.2202/1544-6115.1309.
A. van der Vaart, S. Dudoit, and M. van der Laan. Oracle inequalities for multi-fold cross-validation. Statistics & Decisions, 24(3):351-371, 2006.