Differential privacy in health care analytics and medical research An interactive tutorial




Differential privacy in health care analytics and medical research: an interactive tutorial. Speaker: Moritz Hardt, Theory Group, IBM Almaden. February 21, 2012.

Overview
1. Releasing medical data: what could go wrong?
2. Differential privacy: a formal privacy guarantee
3. A framework for differentially private data analysis
4. Open problems and discussion

Insert your favorite application: health care analytics, medical decision support, preventing epidemics, drug development. Medical data: e.g., patient records, diagnostic data, DNA sequences, drug purchases, insurance data. Goal: protect the privacy of individuals while allowing useful analyses.

What could go wrong? There are many examples, several involving medical data. Here: the case of genome-wide association studies (GWAS).

GWAS setup:
1. NIH takes DNA of, say, 100 test candidates with a common phenotype (e.g., a certain disease)
2. NIH releases minor allele frequencies (MAFs) of the test population at, say, 100,000 positions (SNPs)
Goal: find associations between SNPs and the phenotype.

Attack on GWAS data [Homer et al.]: one can infer an individual's membership in the test group from the published data, given that individual's DNA!

Test population (published MAFs):
SNP     MAF
1       0.02
2       0.03
3       0.05
...
100000  0.02

Moritz's DNA (minor allele present?):
SNP     MA
1       NO
2       NO
3       YES
...
100000  YES

Reference population (HapMap data, public):
SNP     MAF
1       0.01
2       0.04
3       0.04
...
100000  0.01

Conclusion: Moritz is probably in the test population.

Interesting characteristics:
- Only innocuous-looking data was released
- The data was HIPAA compliant
- The data curator is trusted (NIH)
- The attack uses background knowledge (the HapMap data set) available in the public domain
- The attack uses an unanticipated algorithm
We need a rigorous privacy guarantee!

Differential Privacy [Dwork-McSherry-Nissim-Smith 06]: a rigorous privacy guarantee with important properties (e.g., it handles attacks with background information, composes nicely, and is robust). Hundreds of papers in theory, databases, statistics, systems, and programming languages; downloadable libraries (PINQ) and implemented algorithms. Intuition: the presence or absence of a single individual in the data set cannot be inferred from the output of the algorithm.

Differential Privacy [Dwork-McSherry-Nissim-Smith 06]: A database is a collection of rows (tuples) from a universe U. Databases D, D' are neighboring if they differ in only one row (i.e., one individual). Example: D = {GWAS test population}, D' = D ∪ {Moritz's DNA}.

Definition. A randomized algorithm M is ε-differentially private if for all neighboring D, D' and every set S of outcomes:

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D') ∈ S]

Think of ε = 0.01 (note that exp(ε) ≈ 1 + ε for small ε).

(Figure: output densities of M(D) and M(D') over all outcomes; the density ratio is bounded by exp(ε) everywhere.)

What does it mean? Suppose the database teaches that smoking causes cancer. Smoker S's insurance premium goes up. This is true even if S is not in the database! Learning that smoking causes cancer is the whole point. Differential privacy: no harm is caused by participation in the data set, which automatically makes it resilient against GWAS-style attacks.

Is differential privacy too strong? Some tasks you can do while guaranteeing DP:
- Statistical queries on the data set: data cubes, contingency tables, range queries
- Learning algorithms: SQ learning, SVM, PCA, logistic regression, decision trees, clustering, online learning
- Monitoring data streams: heavy hitters, norm estimation
In this talk: focus on statistical queries.

Statistical queries with differential privacy: a trusted curator holds the data set D; the analyst submits a set Q of statistical queries; the curator releases a synopsis S: Q → R (e.g., a list of answers, a synthetic data set, or a data structure). Requirement: the synopsis satisfies differential privacy. Goal: maximize the usefulness of the synopsis; here, the accuracy of the answers.

Statistical queries. Example: how many people in D smoke and have cancer? Generally, a statistical query is specified by a predicate f: U → {0,1}. The answer f(D) = Σ_{i∈D} f(i) ranges between 0 and n. Fact: for every neighboring D, D' we have |f(D) − f(D')| ≤ 1.
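These definitions are a few lines of code; a minimal sketch in Python, using a hypothetical toy database of (smokes, has_cancer) rows:

```python
# A statistical (counting) query: a 0/1 predicate summed over the rows of D.
def count_query(f, D):
    """Answer of predicate f on database D; ranges between 0 and len(D)."""
    return sum(f(row) for row in D)

# Hypothetical toy database: each row is (smokes, has_cancer).
D = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1)]
f = lambda row: row[0] and row[1]  # "smokes AND has cancer"

print(count_query(f, D))  # 2

# Sensitivity check: adding or removing one row changes the answer by at most 1.
D_neighbor = D + [(1, 1)]
assert abs(count_query(f, D) - count_query(f, D_neighbor)) <= 1
```

The sensitivity bound of 1 is exactly what the Laplacian mechanism below relies on.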

Laplacian Mechanism [DMNS06]. Given query f:
1. Compute the true answer f(D)
2. Output f(D) + Lap(1/ε), where Lap(1/ε) has density proportional to exp(−ε|z|)

(Figure: output densities of M(D) and M(D'), centered at f(D) and f(D'); the density ratio is bounded by exp(ε).)

Handle k queries by adding Lap(k/ε) noise independently to each answer. The error scales as O(k/ε) per query! So we need k ≪ |D|.

Can we answer more queries? YES!
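The mechanism above is only a few lines; a minimal sketch in Python (the ε value and the example answers are illustrative, not from the talk):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Release true_answer + Lap(sensitivity/epsilon) noise.
    For counting queries the sensitivity is 1."""
    if rng is None:
        rng = np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query with true answer 42, epsilon = 0.1.
noisy = laplace_mechanism(42, sensitivity=1.0, epsilon=0.1)

# For k queries under a total budget epsilon, each answer gets
# Lap(k/epsilon) noise, so the per-query error grows as O(k/epsilon):
k, epsilon = 100, 0.1
answers = [laplace_mechanism(a, sensitivity=1.0, epsilon=epsilon / k)
           for a in range(k)]
```

Lap(b) noise has standard deviation b·√2, which is why the per-query error blows up once k approaches the database size.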

Multiplicative Weights Approach* [H-Rothblum 10, Gupta-H-Roth-Ullman 11, H-McSherry-Ligett 12]
- Nearly optimal accuracy: high accuracy on huge query sets; can have k ≫ poly(n)
- Simple and scalable implementation gives significant empirical improvements on real-world queries and data sets
- Output is a synthetic data set
* Previous work using different ideas: Blum-Ligett-Roth 08, Dwork-Naor-Reingold-Rothblum-Vadhan 09, Dwork-Rothblum-Vadhan 10, Roth-Roughgarden 10

Histogram view: represent D as a normalized histogram vector x ∈ R^N, where N = |U| and x_i = fraction of rows in D equal to item i. (Figure: histogram bars over bins 1, 2, 3, ..., N.) A statistical query becomes a vector f ∈ {0,1}^N, and the answer is f(x) = ⟨f, x⟩ (inner product).
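The histogram view in code, on a hypothetical toy universe of five items:

```python
import numpy as np

# Toy universe U = {0, 1, 2, 3, 4} (so N = 5) and a small database over it.
N = 5
D = [0, 0, 1, 3, 3, 3, 4, 4]

# Normalized histogram x in R^N: x[i] = fraction of rows equal to item i.
x = np.bincount(D, minlength=N) / len(D)

# A statistical query as a vector f in {0,1}^N: here, "is the row odd?"
f = np.array([0, 1, 0, 1, 0])

# The (normalized) answer is the inner product <f, x>.
answer = f @ x
print(answer)  # 0.5: rows 1, 3, 3, 3 out of 8
```

Everything that follows (the update rule, the query selection) operates on this vector representation.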

Algorithm.
Input: histogram x, query set Q
Idea: maintain a differentially private histogram x_t, with x_0 uniform, and increase the quality of x_t iteratively.
For t = 1, ..., T:
1. Find bad query: find the query f_t ∈ Q that maximizes |f_t(x_{t−1}) − f_t(x)|
2. Improve histogram: obtain x_t from x_{t−1} using the multiplicative weights update defined by f_t
Output: x_T

Finding maximal violations (exponential mechanism [MT 07]): compute the violation |f(x) − f(x_{t−1})| for each query f_1, ..., f_k, add Lap(1/ε) noise to each, and pick the query with the maximal noisy violation.

(Figure: bar chart of violations for f_1, f_2, f_3, ..., f_k, each perturbed by Lap(1/ε) noise, with the maximal noisy bar selected.)

Lemma. The expected violation of the picked query is at least MAX − O(log k / ε).
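A sketch of this noisy-max selection step in Python (function and variable names are illustrative):

```python
import numpy as np

def select_worst_query(queries, x_true, x_est, epsilon, rng=None):
    """Privately pick the query with (approximately) maximal violation
    |f(x_est) - f(x_true)|: add Lap(1/epsilon) noise to each violation
    and return the argmax."""
    if rng is None:
        rng = np.random.default_rng()
    violations = np.array([abs(f @ x_est - f @ x_true) for f in queries])
    noisy = violations + rng.laplace(scale=1.0 / epsilon, size=len(queries))
    return int(np.argmax(noisy))

# Toy example: histograms over N = 4 bins, three candidate queries.
x_true = np.array([0.5, 0.3, 0.1, 0.1])
x_est = np.ones(4) / 4  # uniform starting point x_0
queries = [np.array(q, dtype=float) for q in
           ([1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1])]

i = select_worst_query(queries, x_true, x_est, epsilon=5.0)
```

Because only the argmax is released, the noise cost is one Lap(1/ε) sample per round rather than per query, which is what makes large query sets feasible.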


Multiplicative weights update. (Figure: the estimate x_{t−1}, the input x, and the query f_t plotted over bins 1, 2, ..., N, before and after the update.) Suppose f_t(x_{t−1}) is too small. Update:

x_t(i) = x_{t−1}(i) · (1 + η f_t(i)) / Z

with η suitably chosen and Z a normalization factor.
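The update rule on this slide, as code (a sketch; the value of η here is illustrative):

```python
import numpy as np

def mw_update(x_prev, f, eta):
    """Multiplicative weights update toward query f:
    x_t(i) = x_{t-1}(i) * (1 + eta * f(i)) / Z, with Z the normalization."""
    x = x_prev * (1.0 + eta * f)
    return x / x.sum()  # Z = normalization

# If f(x_prev) is too small, the update shifts mass onto bins where f = 1.
x_prev = np.ones(4) / 4
f = np.array([1.0, 1.0, 0.0, 0.0])
x_next = mw_update(x_prev, f, eta=0.5)
print(f @ x_prev, f @ x_next)  # 0.5 then 0.6: the answer moved up
```

When the estimated answer is too large instead, the sign of η flips, shifting mass away from the bins where f = 1.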


Privacy/utility analysis.
Privacy: the algorithm is a composition of T ε-differentially private steps; hence it is εT-differentially private overall, by the composition theorem [DMNS06, DRV10].
Utility: why can we choose T small? Use the potential function RE(x ‖ x_t) ∈ [0, log N]. Lemma: the potential drops with every update by roughly (f_t(x) − f_t(x_{t−1}))². Conclusion: T = O(log N / α²) steps give accuracy α.
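Putting the two steps together, a heavily simplified sketch of the whole iteration (the budget split, noise scales, and learning rate here are illustrative simplifications; a real implementation accounts for the database size in the sensitivities):

```python
import numpy as np

def mw_framework_sketch(x_true, queries, T, epsilon, rng=None):
    """Sketch of the MW framework over T rounds: the privacy budget is
    split across rounds, and each round's query selection and answer
    measurement both receive Laplace noise."""
    if rng is None:
        rng = np.random.default_rng()
    eps_round = epsilon / (2 * T)           # half for select, half for measure
    x = np.ones_like(x_true) / len(x_true)  # x_0 uniform
    for _ in range(T):
        # 1. Find bad query: noisy-max over the violations.
        viol = np.array([abs(f @ x - f @ x_true) for f in queries])
        noise = rng.laplace(scale=1.0 / eps_round, size=len(queries))
        f = queries[int(np.argmax(viol + noise))]
        # 2. Measure its true answer with Laplace noise.
        m = f @ x_true + rng.laplace(scale=1.0 / eps_round)
        # 3. Improve the histogram: MW update toward the measured answer.
        eta = (m - f @ x) / 2
        x = x * np.exp(eta * f)
        x = x / x.sum()
    return x
```

Each round touches the sensitive data only through two noisy quantities, which is why T rounds compose to roughly ε-differential privacy under this budget split.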

Implementation. Computational bottleneck: enumerating over all N coordinates, and N can be exponential for high-dimensional data. A parallelizable, scalable implementation with heuristic tweaks [H-McSherry-Ligett 12] handles huge data sets with >100 attributes (so N ≈ 2^100). It works well for data cubes and range queries (important statistical query classes) on real-world data.

One-dimensional range queries on a blood transfusion data set. (Figure: y-axis = average squared error per query, for MW+EM and for the SVD lower bound on the previous approach [Li-Miklau 11]. Left panel: x-axis = number of randomly chosen queries, 0 to 5000, with epsilon fixed at 0.1. Right panel: x-axis = epsilon value, 0 to 0.12, with 2000 randomly chosen queries.)

Two-dimensional range queries on a blood transfusion data set. (Figure: y-axis = average squared error per query, for MW+EM and for the SVD lower bound on the previous approach [Li-Miklau 11]. Left panel: x-axis = number of randomly chosen queries, 0 to 5000, with epsilon fixed at 0.1. Right panel: x-axis = epsilon value, 0 to 0.12, with 2000 randomly chosen queries.)

Scalability: MW successfully ignores irrelevant attributes! (Figure: maximum error vs. elapsed milliseconds on the Adult dataset with binary attributes, and again with 50 additional random attributes; the two curves have very similar shapes.)

Scalability: most of the time is spent evaluating queries; the MW logic itself is negligible. (Figure: milliseconds spent in MW vs. milliseconds in total, on synthetic data with the number of attributes varying from 10 to 100 and from 100 to 1000.)

Summary. Differential privacy gives a formal privacy guarantee that you can trust. Powerful algorithms achieve differential privacy at surprising levels of utility. It would be exciting to try out differential privacy in an industry-scale application!

Guiding future research:
- Finding good models: who is sharing data with whom? Where do we need strong privacy guarantees?
- Finding good assumptions: what assumptions are commonly true for data arising in practice?
- Finding good problems: which algorithms/tools should we focus on?

Thank you

Differential privacy and HIPAA. HIPAA does not imply sufficient privacy protection; stronger guarantees are needed (this is increasingly clear to policy makers; see the IOM report "Beyond the HIPAA Privacy Rule"). Stronger guarantees mean less utility in some settings. Does differential privacy comply with current HIPAA regulations? Yes and no: the Statistician Provision gives some room for differential privacy, but the Safe Harbor Provision is preferred in practice for legal reasons.