The Reusable Holdout: Preserving Statistical Validity in Adaptive Data Analysis


The Reusable Holdout: Preserving Statistical Validity in Adaptive Data Analysis
Moritz Hardt, IBM Research Almaden
Joint work with Cynthia Dwork, Vitaly Feldman, Toni Pitassi, Omer Reingold, and Aaron Roth

False discovery: a growing concern
"Trouble at the Lab" (The Economist)

"Most published research findings are probably false." (John Ioannidis)
"P-hacking is trying multiple things until you get the desired result." (Uri Simonsohn)
"She is a p-hacker, she always monitors data while it is being collected." (Urban Dictionary)
"The p value was never meant to be used the way it's used today." (Steven Goodman)

Preventing false discovery
- A decades-old subject in statistics
- Powerful results, such as the Benjamini-Hochberg procedure for controlling the False Discovery Rate
- Lots of tools: cross-validation, bootstrapping, holdout sets
- Theory focuses on non-adaptive data analysis

Non-adaptive data analysis
The data analyst:
1. Specifies the exact experimental setup, e.g., the hypotheses to test
2. Collects data
3. Runs the experiment
4. Observes the outcome
The data can't be reused after observing the outcome.

Adaptive data analysis
The data analyst:
1. Specifies the exact experimental setup, e.g., the hypotheses to test
2. Collects data
3. Runs the experiment
4. Observes the outcome
5. Revises the experiment, and repeats

Adaptivity
Also known as: data dredging, data snooping, fishing, p-hacking, post-hoc analysis, the garden of forking paths.
Some caution strongly against it. Pre-registration: specify the entire experimental setup ahead of time. Humphreys, Sanchez, Windt (2013); Monogan (2013).

Adaptivity: Garden of Forking Paths
"The most valuable statistical analyses often arise only after an iterative process involving the data." Gelman, Loken (2013)

From art to science
Can we guarantee statistical validity in adaptive data analysis?
Our results: to a surprising extent, yes.
Our hope: to inform the discourse on false discovery.

A general approach
Main result: the outcome of any differentially private analysis generalizes*.
* If we sample fresh data, we will observe roughly the same outcome.
Moreover, there are powerful differentially private algorithms for adaptive data analysis.

Intuition
Differential privacy is a stability guarantee: changing one data point doesn't affect the outcome much.
Stability implies generalization; overfitting is not stable.

Does this mean I have to learn how to use differential privacy?
Resoundingly, no! Thanks to our reusable holdout method.

Standard holdout method
Split the data into training data, to which the analyst has unrestricted access, and a holdout, which is good for one validation.
Non-reusable: the analyst can't use information from the holdout in the training stage adaptively.

One corollary: a reusable holdout
Split the data into training data, to which the analyst has unrestricted access, and a reusable holdout, which can be used many times adaptively.
Essentially as good as using fresh data each time!

More formally
- Domain X; unknown distribution D over X
- Data set S of size n sampled i.i.d. from D
What the holdout will do: given a function q : X → [0,1], estimate the expectation E_D[q] from the sample S.
Definition: an estimate a is valid if |a − E_D[q]| < 0.01.
This is enough for many statistical purposes, e.g., estimating the quality of a model on the distribution D.

Example: model validation
We trained a predictive model f : Z → Y and want to know its accuracy.
Put X = Z × Y, with a joint distribution D over data × labels.
Estimate the accuracy of the classifier using the function q(z,y) = 1{ f(z) = y }:
- E_S[q] = accuracy with respect to the sample S
- E_D[q] = true accuracy with respect to the unknown D
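To make the query concrete, here is a minimal Python sketch of this validation query (the helper names make_accuracy_query and avg are my own, not from the talk):

```python
import numpy as np

def make_accuracy_query(f):
    """Turn a classifier f : Z -> Y into a query q : Z x Y -> [0, 1]."""
    return lambda zy: float(f(zy[0]) == zy[1])

def avg(q, sample):
    """Empirical average of q over a sample of (z, y) pairs, i.e., E_S[q]."""
    return np.mean([q(zy) for zy in sample])
```

avg(q, S) gives the accuracy on the sample; the question the reusable holdout answers is how long such estimates remain close to E_D[q] when many q's are chosen adaptively.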

A reusable holdout: Thresholdout
Theorem. Thresholdout gives valid estimates for any sequence of adaptively chosen functions until ≈ n² overfitting* functions have occurred.
* A function q overfits if E_S[q] − E_D[q] > 0.01. Example: the model is good on S, bad on D.

Thresholdout
Input: training data S, holdout H, threshold T > 0, tolerance σ > 0
Given a function q:
  Sample η, η′ from N(0, σ²)
  If |avg_H[q] − avg_S[q]| > T + η: output avg_H[q] + η′
  Otherwise: output avg_S[q]
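A minimal Python sketch of Thresholdout as stated on the slide (the default threshold and tolerance values are illustrative, and the paper's full version also refreshes the threshold noise and charges an overfitting budget when the check fires, which is omitted here):

```python
import numpy as np

def thresholdout(q, S, H, T=0.04, sigma=0.01, rng=None):
    """Answer one query q with the training average avg_S[q], falling back
    to a noised holdout average avg_H[q] when the two disagree by more
    than the noised threshold T."""
    rng = rng or np.random.default_rng()
    avg_S = np.mean([q(x) for x in S])
    avg_H = np.mean([q(x) for x in H])
    eta, eta_prime = rng.normal(0.0, sigma, size=2)
    if abs(avg_H - avg_S) > T + eta:
        return avg_H + eta_prime  # training answer looks overfit: use noised holdout
    return avg_S                  # training and holdout agree: reveal nothing about H
```

Because the holdout is only ever touched through a noisy comparison and a noisy output, each query leaks very little about H, which is what lets the same holdout be reused adaptively.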

An illustrative experiment
Data set with 2n = 20,000 rows and d = 10,000 variables; class labels in {−1, 1}.
The analyst performs stepwise variable selection:
1. Split the data into a training set and a holdout, each of size n
2. Select the best k variables on the training data
3. Only use variables that are also good on the holdout
4. Build a linear predictor out of the k variables
5. Find the best k = 10, 20, 30, …
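A rough sketch of the first (uncorrelated) setting, reusing the helpers above. The correlation-based ranking and sign-weighted predictor follow the paper's description of the experiment, but the variable names are mine, only a few model sizes are tried, and in the actual experiment the per-variable holdout checks of step 3 also go through Thresholdout rather than reading the holdout directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 10_000

# No correlation: random Gaussian data, labels independent of the data.
X = rng.standard_normal((2 * n, d))
y = rng.choice([-1.0, 1.0], size=2 * n)
X_tr, y_tr = X[:n], y[:n]
X_ho, y_ho = X[n:], y[n:]

corr_tr = X_tr.T @ y_tr / n  # per-variable correlation with the label

for k in (10, 20, 30):
    top = np.argsort(-np.abs(corr_tr))[:k]          # step 2: best k on training data
    w = np.sign(corr_tr[top])                       # step 4: sign-weighted linear predictor
    f = lambda z, top=top, w=w: np.sign(z[top] @ w)
    q = make_accuracy_query(f)
    sample_tr = list(zip(X_tr, y_tr))
    sample_ho = list(zip(X_ho, y_ho))
    # Validate via Thresholdout instead of reading the holdout directly.
    print(k, avg(q, sample_tr), thresholdout(q, sample_tr, sample_ho))
```

Since the labels are pure noise, the training accuracy climbs with k while the Thresholdout answer stays near 1/2, which is the overfitting detection described on the next slide.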

No correlation between data and labels: the data are random Gaussians and the labels are drawn independently at random from {−1, 1}.
Thresholdout correctly detects overfitting!

High correlation: 20 attributes are highly correlated with the target; the remaining attributes are uncorrelated.
Thresholdout correctly detects the right model size!

Conclusion
A powerful new approach for achieving statistical validity in adaptive data analysis, building on differential privacy!
The reusable holdout is:
- Broadly applicable
- Complete freedom on the training data
- Guaranteed accuracy on the holdout
- No need to understand differential privacy
- Computationally fast and easy to apply

Go read this paper for a proof:

Thank you.