Xiao-Li Meng Department of Statistics, Harvard University Thanks to many students and colleagues


1 Statistical Paradises and Paradoxes in Big Data
Xiao-Li Meng, Department of Statistics, Harvard University
Thanks to many students and colleagues

2 Paradises
- Much larger general pipeline (figure: Statistics Concentration (Major) Size at Harvard College)
- Much better airplane conversations
- Golden era for methodological research
- Emerging theoretical foundations

3 Berkeley Group: Integrating Stat/Prob with CS, ML, IS, and Math
- Rigorous theory of the trade-off between statistical and computational efficiency, under confidentiality, etc., based on classical statistical decision theory.
- Wide-ranging statistical machine learning theory, methodology, and algorithms, using empirical process, signal processing, and information theory (e.g., the MDL principle).
- Automated Targeted Learning and Super Learning built upon well-established semiparametric and nonparametric theory.
- Algebraic statistics, e.g., studying statistical hypothesis testing via algebraic geometry and computational and combinatorial techniques.

4 BFF group: Integrating Bayes, Frequentist, and Fiducial perspectives
- Fusion learning via confidence distributions (CD)
- Combining results from multiple analyses under possibly different perspectives

5 Jianqing Fan's Group (Princeton): Bringing statistical theory and methods to the forefront of Big Data
Fan et al. (2014) Challenges of Big Data Analysis. National Science Review (China) 1.
Salient features of Big Data:
- Heterogeneity (Individuality)
- Noise accumulation
- Spurious correlation
- Incidental endogeneity
FanBigDataReview.pdf

6 Great Promises and Grand Challenges
- Multi-Resolution Inference
- Multi-Phase Inference
- Multi-Source Inference
o Meng (2014) A Trio of Inference Problems That Could Win You a Nobel Prize in Statistics (if you help fund it). COPSS 50th Anniversary Volume.
o Blocker and Meng (2013) The Potential and Perils of Preprocessing: Building New Foundations. Bernoulli, 19.
o Xie and Meng (2016) Dissecting Multiple Imputation from a Multiphase Inference Perspective: What Happens When God's, Imputer's and Analyst's Models are Uncongenial? (With discussion). Statistica Sinica, to appear.

7 OnTheMap Project of US Census Bureau
- Developed by LED (Local Employment Dynamics).
- Users zoom into any region of the US for paired employee-employer information.
- Uses diverse data sources: surveys and administrative datasets with confidential information.
Thanks to Jeremy Wu of the Census Bureau.

8 Multi-Resolution

9 Multi-Phase
- To protect confidentiality, the displayed data are synthetic: draws from a posterior.
- Each data source itself has gone through multiple clean-up processes, most of which are gray boxes or even

10 Multi-Source
Built from more than 20 data sources in the LEHD (Longitudinal Employer-Household Dynamics) system.
- Survey Samples: Monthly survey of 60,000 households, covering only 0.05% of households.
- Administrative Records: Unemployment insurance wage records covering more than 90% of the US workforce; never intended for inference purposes.
- Census Data: Quarterly census of earnings and wages covering 98% of US jobs.

11 A Trio of NP-Hard Inference Problems
- Multi-Resolution: How do we infer estimands with resolution far exceeding any possible estimators? Is it possible for such inference to be qualitatively robust even if it cannot be quantitatively robust?
- Multi-Phase: (Big) Data are almost never collected, preprocessed, and analyzed in a single phase. What theory and methods accommodate this multi-phase setup?
- Multi-Source: Which one is better: a survey sample covering 1% or an administrative record covering 95% of the population? How should we combine information from these sources? Is it worth combining?

12 So which one is better for estimating the population mean: a 1% simple random sample (SRS) or a 95% administrative (observational) dataset (AD)?
1. 1% SRS
2. 95% AD
3. It depends!
4. Is this a trick question?

13 A fundamental principle of statistics: Variance-Bias Tradeoff

$$\text{Total Error} = \text{Variance} + \text{Bias}^2$$

$$\text{Probabilistic SRS:}\quad \frac{1-f_s}{n_s}\,S^2 + 0$$

$$\text{Large non-prob data:}\quad 0 + r^2\,\frac{1-f_a}{f_a}\,S^2$$

- $f$ is the fraction of the population in the data: $f = n/N$ (subscript $s$ for the SRS, $a$ for the administrative data).
- $r$ is the correlation between the (honest) responded/recorded value $X$ and the probability of response/recording, $P(X)$.

Big Data Paradox: the larger the data, the more pronounced the bias.
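To make the two error expressions on this slide concrete, here is a minimal Python sketch that evaluates them for the 1% SRS versus 95% AD question. The population size `N = 320_000_000`, the unit variance `S2 = 1.0`, the trial values of `r`, and the function names are assumptions chosen for illustration; they do not come from the slides.

```python
def mse_srs(n, N, S2=1.0):
    """Error of the sample mean under simple random sampling.

    The estimator is unbiased, so the total error is the variance
    (1 - f) / n * S^2 with f = n / N.
    """
    f = n / N
    return (1.0 - f) / n * S2


def mse_nonprob(f, r, S2=1.0):
    """Approximate error of the sample mean of a large non-probability
    dataset: the bias term r^2 * (1 - f) / f * S^2 (variance taken as 0)."""
    return r ** 2 * (1.0 - f) / f * S2


N = 320_000_000            # assumed population size (illustrative)
srs = mse_srs(n=N // 100, N=N)   # 1% simple random sample

for r in (0.001, 0.005, 0.05):   # assumed correlations (illustrative)
    ad = mse_nonprob(f=0.95, r=r)  # 95% administrative dataset
    better = "AD" if ad < srs else "SRS"
    print(f"r={r}: SRS error={srs:.3e}, AD error={ad:.3e}, smaller: {better}")
```

Under these assumptions the 95% dataset wins only when `r` is tiny, which is one way to read the "It depends!" option.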

14 For estimating a population mean, if r=0.1, how large does an AD, as a percentage of the US population, need to be in order to produce a more accurate sample average than an SRS with n=100 does?
1. <0.5% (1.6M)
2. 5% (16M)
3. 10% (32M)
4. 20% (64M)
5. 50% (160M)
6. 75% (240M)
7. 90% (288M)
8. >95% (303M)
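The question can be worked out from the error formulas on the Variance-Bias Tradeoff slide: the AD is at least as accurate as the SRS when r^2 (1 - f_a) / f_a <= (1 - f_s) / n_s, which rearranges to f_a >= r^2 / (r^2 + (1 - f_s) / n_s). The sketch below evaluates that bound. The population size of 320 million is an assumption inferred from the answer options (0.5% = 1.6M), and `breakeven_fraction` is a hypothetical helper name.

```python
def breakeven_fraction(r, n_srs, N):
    """Smallest population fraction f_a at which the non-probability
    dataset's error r^2 (1 - f_a) / f_a drops to the SRS variance
    (1 - f_s) / n_s, with f_s = n_srs / N (S^2 cancels on both sides)."""
    v_srs = (1.0 - n_srs / N) / n_srs
    return r ** 2 / (r ** 2 + v_srs)


N = 320_000_000  # assumed US population implied by the answer options
f_a = breakeven_fraction(r=0.1, n_srs=100, N=N)
print(f"break-even fraction: {f_a:.4f}")
print(f"break-even size: about {f_a * N / 1e6:.0f} million records")
```

With r = 0.1 and n = 100 the bound comes out at roughly one half of the population, i.e. the 50% (160M) option.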

15 Big Data: Big Size or Big Fraction?
- Size matters, but only after having quality.
- Importance of combining non-probabilistic samples with probabilistic ones, however small the latter are.
- More does NOT guarantee better: "I got more data, my model is more refined, but my estimator is getting worse! Am I just dumb?" (Meng and Xie, 2014, Econometric Reviews)

16 So when/why do we need Big Data?
- Individualized treatments (e.g., medical; educational; marketing; news)
- Inference/prediction with very weak signal-to-noise ratio (e.g., climate change)
- Understanding deeply connected (spatial) networks and (temporal) dynamics

17 What does Big Data mean for you?
We see you and others more clearly
2015/11/1

18 Gift: Treatment for you based only on data from people like you.
Curse: No one is perfectly like you.

19 Personalized Treatment: Sounds heavenly, but where on Earth did they find the right guinea pig for me?
Liu and Meng (2014) A Fruitful Resolution to Simpson's Paradox via Multi-Resolution Inference, The American Statistician.

20 A Painful Problem

21 Kidney Stone Treatment
C. R. Charig, D. R. Webb, S. R. Payne, J. E. Wickham (March 1986) Br Med J (Clin Res Ed) 292 (6524)

              Small Stone      Large Stone      Overall
Treatment A   93% (81/87)      73% (192/263)    78% (273/350)
Treatment B   87% (234/270)    69% (55/80)      83% (289/350)

A: Open Surgery; B: Percutaneous Nephrolithotomy

22 [Figure: successful vs. unsuccessful outcomes by stone size. Treatment A: 73% successful for large stones, 93% for small stones, 78% overall. Treatment B: 69% successful for large stones, 87% for small stones, 83% overall.]
Uneven distribution of stone sizes across treatments makes the overall success rate misleading.
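The reversal can be checked directly from the counts on the Kidney Stone Treatment slide. The following Python sketch recomputes the per-stratum and pooled success rates; the dictionary layout and the names `counts`, `rate`, and `overall` are illustrative choices, not part of the original slides.

```python
# (successes, patients) by treatment and stone size, from the kidney stone slide
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, patients):
    """Success rate as a fraction."""
    return successes / patients


def overall(treatment):
    """Pooled success rate for one treatment, ignoring stone size."""
    successes = sum(s for s, _ in counts[treatment].values())
    patients = sum(n for _, n in counts[treatment].values())
    return rate(successes, patients)


for size in ("small", "large"):
    a = rate(*counts["A"][size])
    b = rate(*counts["B"][size])
    print(f"{size} stones: A={a:.1%}, B={b:.1%}")

print(f"overall: A={overall('A'):.1%}, B={overall('B'):.1%}")

# Share of each treatment's patients who had small (easier) stones
for t in ("A", "B"):
    small_n = counts[t]["small"][1]
    total_n = sum(n for _, n in counts[t].values())
    print(f"treatment {t}: {small_n / total_n:.0%} of patients had small stones")
```

Treatment A is ahead within each stone size yet behind overall, because most of its patients had large stones while most of Treatment B's patients had small ones.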

23 Simpson's Paradox
- Dealing with the disparities between aggregated analysis and disaggregated analyses
- Determining the right level (primary resolution) for analysis
- Understanding the bias-variance (relevance-robustness) trade-off

24 So what would be the right resolution?
Let's take a CarTalk challenge (7/111/2015)

25 From CarTalk: You test positive for D on a test with 95% accuracy. What's the chance you actually have D, given that the prevalence of D is 0.1%?
1. 1-5%
2. 5-10%
3. 10-25%
4. 25-50%
5. 50-75%
6. 75-95%
7. Could be anything
8. I have no idea.

26 It could be anything depending on the meaning of accuracy and
- Need to know how accurate the test is among those with no disease (specificity) AND among those with the disease (sensitivity).
- The probability could be 1 if specificity = 100% (no false positives).
- For a rare disease, overall accuracy ~ specificity.
- Then the answer is less than 2%, if this was a random screening test.

27 100,000 People for Screening (0.1% D, 99.9% no D)
- 100 D: 95% pos, 5% neg, i.e. 95 pos, 5 neg
- 99,900 no D: 5% pos, 95% neg, i.e. 4,995 pos, 94,905 neg
- 95/(95+4,995) = 1.87%

1,000 with Symptoms (10% D, 90% no D)
- 100 D: 95% pos, 5% neg, i.e. 95 pos, 5 neg
- 900 no D: 5% pos, 95% neg, i.e. 45 pos, 855 neg
- 95/(95+45) = 67.9%

"Conditioning is the Soul of Statistics" (Joe Blitzstein)
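The two percentages on this slide follow from Bayes' rule applied to two different reference populations. Below is a minimal Python sketch, assuming the test has 95% sensitivity and 95% specificity as in the slide's tree; the function name `positive_predictive_value` is an illustrative choice.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) by Bayes' rule."""
    true_pos = prevalence * sensitivity                # P(D and pos)
    false_pos = (1.0 - prevalence) * (1.0 - specificity)  # P(no D and pos)
    return true_pos / (true_pos + false_pos)


# Random screening: prevalence 0.1% (100 of 100,000 people)
screening = positive_predictive_value(0.001, 0.95, 0.95)

# People with symptoms: prevalence 10% (100 of 1,000 people)
symptomatic = positive_predictive_value(0.10, 0.95, 0.95)

print(f"random screening: {screening:.2%}")
print(f"with symptoms:    {symptomatic:.1%}")
```

The same test result carries very different meaning depending on which population the person is conditioned on, which is the point of the quotation on the slide.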

28 Bayes' Theorem
"When the facts change, I change my opinion. What do you do, sir?" ~ John Maynard Keynes

29 Useful Statistical Principles/Concepts for Data Science
- Data Selection and Replication Mechanisms: Randomization, sampling, experiments, observational studies, missing data mechanisms; latent variables/constructs; potential outcomes; confidentiality protections
- Conditioning vs. Marginalizing: Disaggregation vs. aggregation, sub-population analysis, individualized inference, Simpson's paradox, ecological fallacy
- Bias-Variance Trade-off: Efficiency vs. Robustness, Relevance vs. Robustness; model predictability vs. fitness
- Inference principles/perspectives: Likelihood principle; Bayesian thinking; fiducial argument for objectivity; uncertainty quantification

30 A Traditional Statistical Theme/Aim: Seeking representative samples to infer about populations
A Big-Data Statistical Theme/Aim: Constructing approximating populations to infer about individuals
[Figure labels: Targeted Individual; Approx. Population]

31 One more V for Big Data: Veracity

32 I find your presentation
1. Inspiring and thought provoking
2. Informative and I learned a few things
3. Confusing and not very helpful
4. What a waste of my time!
