BIG DATA: CONVENTIONAL METHODS MEET UNCONVENTIONAL DATA

1 BIG DATA: CONVENTIONAL METHODS MEET UNCONVENTIONAL DATA
Harvard Medical School & Harvard School of Public Health
October 14

2 THE SETTING
- Unprecedented advances in data acquisition technologies
  - The omics technologies
  - Imaging data
  - Telecommunication data
  - Social networking data
  - Medical record data and registries
- Features of Big Data
  - Number of variables (P) > number of people (N)
  - Different data types & resolutions: natural language processed, claims, laboratory
  - Potential to grow
- Most focus on solutions to storing, indexing, querying, and accessing Big Data
- Less focus on statistical inference: turning data into knowledge

3 BIG ISSUE - 1
- Selecting the correct approach for confounding adjustment when there are many potential confounders
  - Rarely know the exact confounders required to satisfy the no unmeasured confounding assumptions
  - Rarely know the identity of subgroups exhibiting heterogeneous treatment effects
- Level of uncertainty is:
  - Substantially increased in big data settings
  - Typically ignored in computations
- Require approaches to account for such uncertainties in making regulatory decisions

4 BIG ISSUE - 2
- How much data pooling is permitted for making safety and effectiveness decisions?
- All empirical studies pool information
  - Survival analysis: event times are averaged or pooled across patients receiving device A and compared to pooled event times among device B patients
- Pooling across different units
  - Pool information from different countries to learn about device effectiveness in a particular subpopulation
  - Pool information from many different manufacturers' devices to learn about a specific manufacturer's device
- Require a clear understanding of the theoretical assumptions, an approach to quantify the amount of pooling, and the implications of pooling

5 BIG ISSUE - 2 (cont.)
Pool information from many different manufacturers' devices to learn about a specific manufacturer's device
- $i = 1, 2, \ldots, n_j$ patients implanted with manufacturer $j$'s device
- $j = 1, 2, \ldots, J$ manufacturers of the device
- $y_{ji}$ = mean outcome for patient $i$ implanted with device $j$
- $y_{ji} \sim N(\alpha_j, \sigma^2_{y,j})$ and $\alpha_j \sim N(\mu_\alpha, \sigma^2_\alpha)$
For each manufacturer $j$, the estimate of $\alpha_j$ is
$$\hat{\alpha}_j = \omega_j \mu_\alpha + (1 - \omega_j)\,\bar{y}_j$$
$$0 \;(\text{No Pooling}) \;\le\; \omega_j = \frac{\sigma^2_{y,j}/n_j}{\sigma^2_\alpha + \sigma^2_{y,j}/n_j} \;\le\; 1 \;(\text{Complete Pooling})$$
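
The shrinkage estimate on this slide can be sketched numerically. The snippet below is a minimal illustration, not the presenter's implementation: it assumes the variance components and the overall mean are known, whereas in practice they would be estimated from the data. The function names and the example numbers are hypothetical.

```python
from statistics import mean


def pooling_weight(sigma2_y, sigma2_alpha, n_j):
    """Weight omega_j placed on the overall mean mu_alpha.

    omega_j = (sigma2_y / n_j) / (sigma2_alpha + sigma2_y / n_j)
    Near 0: little pooling (trust the manufacturer's own mean).
    Near 1: close to complete pooling (trust the overall mean).
    """
    within = sigma2_y / n_j  # sampling variance of the manufacturer's sample mean
    return within / (sigma2_alpha + within)


def partial_pooling_estimate(y_j, mu_alpha, sigma2_y, sigma2_alpha):
    """alpha_hat_j = omega_j * mu_alpha + (1 - omega_j) * ybar_j."""
    omega = pooling_weight(sigma2_y, sigma2_alpha, len(y_j))
    return omega * mu_alpha + (1 - omega) * mean(y_j)


# Hypothetical inputs: both manufacturers have sample mean 3.0,
# but very different numbers of implanted patients.
MU_ALPHA, SIGMA2_Y, SIGMA2_ALPHA = 1.0, 4.0, 1.0
few_patients = [2.0, 4.0]
many_patients = [2.0, 4.0] * 20

est_few = partial_pooling_estimate(few_patients, MU_ALPHA, SIGMA2_Y, SIGMA2_ALPHA)
est_many = partial_pooling_estimate(many_patients, MU_ALPHA, SIGMA2_Y, SIGMA2_ALPHA)
print(f"n=2:  omega={pooling_weight(SIGMA2_Y, SIGMA2_ALPHA, 2):.3f}, estimate={est_few:.3f}")
print(f"n=40: omega={pooling_weight(SIGMA2_Y, SIGMA2_ALPHA, 40):.3f}, estimate={est_many:.3f}")
```

With the same sample mean, the manufacturer with fewer patients is pulled further toward the overall mean, which is the sense in which the weight quantifies the amount of pooling.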

6 BIG ISSUE - 3
The role of missing data in big data
- Risk of missing data is higher in big data
- Standard strategies for filling in missing data have not been tested
  - Multiple imputation
- Partially or completely missing variables
- Different missingness mechanisms
  - Not collected in one registry vs. patients too sick to have the variable measured
  - Mixture models for the missingness mechanism
- Missing data strategies in big data settings require systematic study
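
Why the missingness mechanism matters can be seen in a small simulation. This is a hedged sketch with made-up settings, not an analysis from the talk: one mechanism drops values at random, regardless of their size, and the other drops high values more often, a stand-in for "patients too sick to have the variable measured". The missingness probabilities, sample size, and function name are all assumptions chosen for illustration.

```python
import random
from statistics import mean


def simulate_missingness(n=20000, seed=0):
    """Return (full mean, mean under random missingness, mean under value-dependent missingness)."""
    rng = random.Random(seed)
    values = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Mechanism A (assumed): every value has the same 30% chance of being missing.
    observed_random = [v for v in values if rng.random() > 0.3]

    # Mechanism B (assumed): values above 0 ("sicker" patients) are missing
    # 60% of the time; values at or below 0 are always observed.
    observed_dependent = [v for v in values if v <= 0 or rng.random() > 0.6]

    return mean(values), mean(observed_random), mean(observed_dependent)


full, random_missing, dependent_missing = simulate_missingness()
print(f"full data mean:               {full:.3f}")
print(f"random missingness mean:      {random_missing:.3f}")
print(f"value-dependent missing mean: {dependent_missing:.3f}")
```

Under the random mechanism the observed mean stays close to the full-data mean; under the value-dependent mechanism the observed mean is biased downward, and simply adding more records does not remove that bias.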

7 OTHER BIG ISSUES
- New theoretical underpinnings of asymptotic theory:
  - Large p, small n: what happens when p goes to infinity faster than n?
  - Large p, large n: what happens when p and n go to infinity at the same rate?
- Dimensionality and sparsity issues: how to reduce dimensionality?
  - Global sparsity: in genomics, expression levels for thousands of genes but only a handful are likely to be predictive of a specific phenotypic trait (LASSO methods)
  - Local sparsity: a partition of the p-dimensional space such that, within each region, the outcome depends upon a small number of the p variables (regression trees)
  - Mixture sparsity: data arise from several simple models (mixture models)
- How to measure strength of evidence?
  - p-values are driven by sample size
  - Bayes factors are a good solution
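
The point that p-values are driven by sample size can be illustrated with a short calculation. The sketch below assumes a two-sided one-sample z-test with known standard deviation and a fixed, practically negligible effect; the effect size, standard deviation, and function name are hypothetical choices for illustration.

```python
import math


def two_sided_p_value(effect, sd, n):
    """Two-sided p-value of a z-test for a mean difference `effect`,
    with known standard deviation `sd` and sample size `n`."""
    z = effect / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


EFFECT, SD = 0.02, 1.0  # the same tiny effect at every sample size
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}: p = {two_sided_p_value(EFFECT, SD, n):.3g}")
```

The effect never changes, yet the p-value moves from clearly non-significant to vanishingly small as n grows, which is why the p-value alone is a poor measure of strength of evidence in very large data sets.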
