BIG DATA: CONVENTIONAL METHODS MEET UNCONVENTIONAL DATA




Harvard Medical School & Harvard School of Public Health
sharon@hcp.med.harvard.edu
October 14, 2014

THE SETTING

Unprecedented advances in data acquisition technologies:
- The omics technologies
- Imaging data
- Telecommunication data
- Social networking data
- Medical record data and registries

Features of Big Data:
- Number of variables (P) > number of people (N)
- Different data types and resolutions: natural language processed, claims, laboratory
- Potential to grow

- Most focus on solutions to storing, indexing, querying, and accessing Big Data
- Less focus on statistical inference: turning data into knowledge

BIG ISSUE - 1

Selecting the correct approach for confounding adjustment when there are many potential confounders:
- We rarely know the exact confounders required to satisfy no-unmeasured-confounding assumptions
- We rarely know the identity of subgroups exhibiting heterogeneous treatment effects

The level of uncertainty is:
- Substantially increased in big data settings
- Typically ignored in computations

We require approaches to account for such uncertainties in making regulatory decisions.
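As a purely illustrative sketch (not a method from the talk), one common practice is to let an L1-penalized propensity model choose which of many candidate confounders to adjust for; the slide's warning is that the uncertainty in this data-driven selection step is typically not carried into the final effect estimate. All data and variable names below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
n, p = 500, 200                       # many candidate confounders
X = rng.normal(size=(n, p))
# Simulated treatment assignment driven by only the first 3 covariates.
treatment = rng.binomial(1, 1 / (1 + np.exp(-X[:, :3].sum(axis=1))))

# L1-penalized propensity model selects a subset of the candidate confounders.
ps_model = LogisticRegressionCV(penalty="l1", solver="saga", Cs=5, cv=3, max_iter=5000)
ps_model.fit(X, treatment)
selected = np.flatnonzero(ps_model.coef_.ravel())
print("confounders retained by the selection step:", selected)
# Downstream effect estimates that condition only on `selected` treat this
# data-driven choice as fixed, ignoring the selection uncertainty noted above.
```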

BIG ISSUE - 2

How much data pooling is permitted for making safety and effectiveness decisions?

All empirical studies pool information:
- Survival analysis: event times are averaged, or pooled, across patients receiving device A and compared to the pooled event times among device B patients

Pooling across different units:
- Pool information from different countries to learn about device effectiveness in a particular subpopulation
- Pool information from many different manufacturers' devices to learn about a specific manufacturer's device

This requires a clear understanding of the theoretical assumptions, an approach to quantify the amount of pooling, and the implications of pooling.

BIG ISSUE - 2 (cont.)

Pool information from many different manufacturers' devices to learn about a specific manufacturer's device:
- i = 1, 2, ..., n_j patients implanted with manufacturer j's device
- j = 1, 2, ..., J manufacturers of the device
- y_ji = mean outcome for patient i implanted with device j
- y_ji ~ N(α_j, σ²_{y,j}) and α_j ~ N(μ_α, σ²_α)

For each manufacturer j, the estimate of α_j is the shrinkage estimator

    α̂_j = ω_j μ_α + (1 − ω_j) ȳ_j,  with  ω_j = (σ²_{y,j} / n_j) / (σ²_α + σ²_{y,j} / n_j),  0 ≤ ω_j ≤ 1,

where ω_j = 0 corresponds to no pooling (α̂_j = ȳ_j, the manufacturer's own mean) and ω_j = 1 corresponds to complete pooling (α̂_j = μ_α, the overall mean).
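A minimal numerical sketch of this shrinkage estimator (assuming, for illustration only, that μ_α, σ²_α, and the within-manufacturer variance are known rather than estimated from the full hierarchical fit; the data below are simulated):

```python
import numpy as np

def partial_pooling_estimate(y_j, mu_alpha, sigma2_alpha, sigma2_y_j):
    """Shrinkage estimate of alpha_j for one manufacturer.

    y_j          : outcomes for the n_j patients with manufacturer j's device
    mu_alpha     : overall mean across manufacturers
    sigma2_alpha : between-manufacturer variance
    sigma2_y_j   : within-manufacturer outcome variance
    """
    n_j = len(y_j)
    y_bar_j = np.mean(y_j)
    # Weight on the pooled mean; 0 = no pooling, 1 = complete pooling.
    omega_j = (sigma2_y_j / n_j) / (sigma2_alpha + sigma2_y_j / n_j)
    return omega_j * mu_alpha + (1 - omega_j) * y_bar_j

# A manufacturer with few patients is shrunk more strongly toward the overall mean.
rng = np.random.default_rng(0)
few_patients = rng.normal(loc=2.0, scale=1.0, size=5)
many_patients = rng.normal(loc=2.0, scale=1.0, size=500)
print(partial_pooling_estimate(few_patients, mu_alpha=0.0, sigma2_alpha=0.5, sigma2_y_j=1.0))
print(partial_pooling_estimate(many_patients, mu_alpha=0.0, sigma2_alpha=0.5, sigma2_y_j=1.0))
```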

BIG ISSUE - 3

The role of missing data in big data:
- The risk of missing data is higher in big data
- Standard strategies for filling in missing data have not been tested in this setting:
  - Multiple imputation
  - Partially or completely missing variables
  - Different missingness mechanisms, e.g., not collected in one registry vs. patients too sick to have the variable measured
  - Mixture models for the missingness mechanism

Missing data strategies in big data settings require systematic study.
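As a minimal sketch of the multiple-imputation strategy named above (using scikit-learn's experimental IterativeImputer on simulated data; the slide's point is precisely that such off-the-shelf strategies have not been validated in big data settings):

```python
import numpy as np
# IterativeImputer is still experimental in scikit-learn; this import enables it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.15] = np.nan   # roughly 15% of values missing at random

# Draw M completed datasets, analyze each, then combine estimates (Rubin's rules).
M = 5
means = []
for m in range(M):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    X_completed = imputer.fit_transform(X)
    means.append(X_completed.mean(axis=0))  # stand-in for the substantive analysis

print(np.mean(means, axis=0))  # combined point estimate across the M imputations
```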

OTHER BIG ISSUES

New theoretical underpinnings of asymptotic theory:
- Large p, small n: what happens when p goes to infinity faster than n?
- Large p, large n: what happens when p and n go to infinity at the same rate?

Dimensionality and sparsity issues: how to reduce dimensionality?
- Global sparsity: in genomics, expression levels for thousands of genes are measured, but only a handful are likely to be predictive of a specific phenotypic trait (LASSO methods)
- Local sparsity: a partition of the p-dimensional space such that, within each region, the outcome depends on a small number of the p variables (regression trees)
- Mixture sparsity: the data arise from several simple models (mixture models)

How to measure strength of evidence?
- p-values are driven by sample size
- Bayes factors are a good solution
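As a minimal illustration of the global-sparsity case (simulated data, not from the talk), a cross-validated LASSO fit recovers the handful of truly predictive features out of many candidates:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 1000                      # many more features than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                        # only the first 5 features truly matter
y = X @ beta + rng.normal(size=n)

# Cross-validated LASSO: the L1 penalty drives most coefficients exactly to zero.
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"selected {selected.size} features:", selected[:10])
```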