BIG DATA: CONVENTIONAL METHODS MEET UNCONVENTIONAL DATA
Harvard Medical School & Harvard School of Public Health
sharon@hcp.med.harvard.edu
October 14, 2014

THE SETTING
- Unprecedented advances in data acquisition technologies
  - The omics technologies
  - Imaging data
  - Telecommunication data
  - Social networking data
  - Medical record data and registries
- Features of Big Data
  - Number of variables (P) > number of people (N)
  - Different data types & resolutions: natural language processed, claims, laboratory
  - Potential to grow
- Most focus on solutions to storing, indexing, querying, and accessing Big Data
- Less focus on statistical inference: turning data into knowledge

BIG ISSUE - 1
- Selecting the correct approach for confounding adjustment when there are many potential confounders
  - Rarely know the exact confounders required to satisfy the no-unmeasured-confounding assumption
  - Rarely know the identity of subgroups exhibiting heterogeneous treatment effects
- Level of uncertainty is:
  - Substantially increased in big data settings
  - Typically ignored in computations
- Require approaches that account for such uncertainties in making regulatory decisions

BIG ISSUE - 2
- How much data pooling is permitted for making safety and effectiveness decisions?
- All empirical studies pool information
  - Survival analysis: event times are averaged or pooled across patients receiving device A and compared to the pooled event times among device B patients
- Pooling across different units
  - Pool information from different countries to learn about device effectiveness in a particular subpopulation
  - Pool information from many different manufacturers' devices to learn about a specific manufacturer's device
- Require a clear understanding of the theoretical assumptions, an approach to quantify the amount of pooling, and the implications of pooling

BIG ISSUE - 2 (cont.)
- Pool information from many different manufacturers' devices to learn about a specific manufacturer's device
  - $i = 1, 2, \ldots, n_j$ patients implanted with manufacturer $j$'s device
  - $j = 1, 2, \ldots, J$ manufacturers of the device
  - $y_{ji}$ = mean outcome for patient $i$ implanted with device $j$
  - $y_{ji} \sim N(\alpha_j, \sigma^2_{y,j})$ and $\alpha_j \sim N(\mu_\alpha, \sigma^2_\alpha)$
- For each manufacturer $j$, the estimate of $\alpha_j$ is
  $$\hat{\alpha}_j = \omega_j \mu_\alpha + (1 - \omega_j)\,\bar{y}_j$$
  $$0 \le \omega_j = \frac{\sigma^2_{y,j}/n_j}{\sigma^2_\alpha + \sigma^2_{y,j}/n_j} \le 1$$
  - $\omega_j = 0$: no pooling
  - $\omega_j = 1$: complete pooling
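
The shrinkage estimate above can be illustrated with a minimal Python sketch. It assumes the variance components and the overall mean $\mu_\alpha$ are known, whereas in practice they would be estimated from the data. The function names, the manufacturer labels, and all numeric values are hypothetical and chosen only for illustration.

```python
def pooling_weight(sigma2_y, sigma2_alpha, n_j):
    """Weight omega_j placed on the overall mean mu_alpha.

    sigma2_y     : within-manufacturer outcome variance (assumed known)
    sigma2_alpha : between-manufacturer variance (assumed known)
    n_j          : number of patients implanted with manufacturer j's device
    """
    if n_j <= 0:
        raise ValueError("n_j must be positive")
    within = sigma2_y / n_j  # sampling variance of the manufacturer mean
    return within / (sigma2_alpha + within)


def partial_pooling_estimate(ybar_j, n_j, mu_alpha, sigma2_y, sigma2_alpha):
    """Estimate of alpha_j: omega_j * mu_alpha + (1 - omega_j) * ybar_j."""
    omega = pooling_weight(sigma2_y, sigma2_alpha, n_j)
    return omega * mu_alpha + (1.0 - omega) * ybar_j


if __name__ == "__main__":
    # Hypothetical inputs: two manufacturers with the same observed mean
    # but very different numbers of implanted patients.
    mu_alpha, sigma2_y, sigma2_alpha = 0.10, 0.09, 0.01
    manufacturers = {"A": (0.20, 9), "B": (0.20, 900)}  # name -> (ybar_j, n_j)
    for name, (ybar, n) in manufacturers.items():
        omega = pooling_weight(sigma2_y, sigma2_alpha, n)
        estimate = partial_pooling_estimate(ybar, n, mu_alpha, sigma2_y, sigma2_alpha)
        print(f"{name}: omega={omega:.3f}, estimate={estimate:.3f}")
```

With these inputs, the manufacturer with few patients is pulled substantially toward the overall mean, while the manufacturer with many patients keeps an estimate close to its own observed mean.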

BIG ISSUE - 3
- The role of missing data in big data
  - Risk of missing data is higher in big data
- Standard strategies for filling in missing data have not been tested
  - Multiple imputation
  - Partially or completely missing variables
- Different missingness mechanisms
  - Not collected in one registry vs. patients too sick to have the variable measured
  - Mixture models for the missingness mechanism
- Missing data strategies in big data settings require systematic study
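
For readers unfamiliar with multiple imputation, the sketch below shows the standard combining step (Rubin's rules) applied to estimates from several imputed data sets. It is an illustration of the conventional method only, not a claim that it performs well in big data settings, which the slide notes is untested. The function name `rubin_pool` and the input values are hypothetical.

```python
from statistics import mean, variance


def rubin_pool(estimates, variances):
    """Combine results from m imputed data sets using Rubin's rules.

    estimates : point estimate from each imputed data set
    variances : squared standard error from each imputed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)      # pooled point estimate
    u_bar = mean(variances)      # average within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    total = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, total


if __name__ == "__main__":
    # Hypothetical estimates and variances from three imputed data sets.
    pooled, total_var = rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(f"pooled estimate={pooled:.3f}, total variance={total_var:.3f}")
```

The between-imputation term is what carries the uncertainty due to missingness; ignoring it would understate the variance of the pooled estimate.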

OTHER BIG ISSUES
- New theoretical underpinnings of asymptotic theory:
  - Large p, small n: what happens when p goes to infinity faster than n?
  - Large p, large n: what happens when p and n go to infinity at the same rate?
- Dimensionality and sparsity issues - how to reduce dimensionality?
  - Global sparsity: in genomics, expression levels are available for thousands of genes but only a handful are likely to be predictive of a specific phenotypic trait (LASSO methods)
  - Local sparsity: a partition of the p-dimensional space such that, within each region, the outcome depends upon a small number of the p variables (regression trees)
  - Mixture sparsity: data arise from several simple models (mixture models)
- How to measure strength of evidence?
  - p-values are driven by sample size
  - Bayes factors are a good solution
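
The point that p-values are driven by sample size can be seen in a small sketch. It assumes a one-sample, two-sided z-test with known unit variance, so the test statistic is the standardized effect times the square root of n. The function name and the effect size of 0.01 are hypothetical choices for illustration.

```python
import math


def two_sided_p_value(effect_size, n):
    """Two-sided z-test p-value for a fixed standardized effect size.

    Assumes a one-sample test with known unit variance, so that
    z = effect_size * sqrt(n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    z = effect_size * math.sqrt(n)
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # The same tiny effect, evaluated at increasing sample sizes.
    for n in (100, 10_000, 1_000_000):
        print(f"n={n:>9,d}  p={two_sided_p_value(0.01, n):.3g}")
```

The effect size never changes across the three runs; only n does, yet the p-value moves from clearly non-significant to vanishingly small.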