Differential privacy in health care analytics and medical research: An interactive tutorial

Transcription

1 Differential privacy in health care analytics and medical research: An interactive tutorial. Speaker: Moritz Hardt, Theory Group, IBM Almaden. February 21, 2012

2 Overview: 1. Releasing medical data: What could go wrong? 2. Differential privacy: a formal privacy guarantee. 3. A framework for differentially private data analysis. 4. Open problems and discussion.

3 Insert your favorite application: health care analytics, medical decision support, preventing epidemics, drug development. Medical data (e.g., patient records, diagnostic data, DNA sequences, drug purchases, insurance data). Goal: Protect the privacy of individuals while allowing useful analyses.

4 What could go wrong? Many examples, several about medical data. Here: the case of Genome-Wide Association Studies (GWAS).

5 GWAS. Setup: 1. NIH takes DNA of, say, 100 test candidates with a common phenotype (e.g., a certain disease). 2. NIH releases minor allele frequencies (MAF) of the test population at, say, 100,000 positions (SNPs). Goal: Find associations between SNPs and the phenotype.

6 Attack on GWAS data [Homer et al.]: Can infer membership in the test group of an individual with known DNA from published data! [Figure: three tables side by side: per-SNP minor allele frequencies (MAF) of the test population; Moritz's DNA as a per-SNP minor allele (MA) YES/NO table; per-SNP MAF of the reference population (HapMap data, public).]

7 Attack on GWAS data [Homer et al.]: Can infer membership in the test group of an individual with known DNA from published data! [Same figure as on the previous slide, now annotated "probably" next to the test population.]
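
A rough simulation of this kind of membership test, as a sketch only. It uses a distance-based statistic in the spirit of Homer et al.: for each SNP, compare how far the individual's allele fraction is from the reference MAF versus the published test-group MAF, and sum the differences. The group size, SNP count, frequency range, and the use of the true frequencies as a stand-in for the public reference data are all assumptions made for illustration, and are much smaller and cleaner than the slide's 100 people and 100,000 SNPs.

```python
import random

def genotype(p, rng):
    """Minor-allele fraction of one person at one SNP: 0, 0.5 or 1."""
    return ((rng.random() < p) + (rng.random() < p)) / 2

def membership_statistic(person, test_maf, ref_maf):
    """Sum over SNPs of |y - ref| - |y - test|.

    Large positive values mean the person looks closer to the published
    test-group frequencies than to the reference population.
    """
    return sum(abs(y - r) - abs(y - t)
               for y, t, r in zip(person, test_maf, ref_maf))

rng = random.Random(0)
n_test, n_snps = 20, 10_000          # assumed toy sizes

# Assumption: the reference MAFs equal the true population frequencies.
ref_maf = [rng.uniform(0.05, 0.5) for _ in range(n_snps)]

# Simulated test group and the MAFs that would be published for it.
test_group = [[genotype(p, rng) for p in ref_maf] for _ in range(n_test)]
test_maf = [sum(column) / n_test for column in zip(*test_group)]

member = test_group[0]                           # is in the test group
outsider = [genotype(p, rng) for p in ref_maf]   # is not

t_member = membership_statistic(member, test_maf, ref_maf)
t_outsider = membership_statistic(outsider, test_maf, ref_maf)
print(f"member: {t_member:.1f}  outsider: {t_outsider:.1f}")
```

Each SNP leaks only a tiny amount, but the contributions add up over thousands of positions, which is why aggregate frequencies that look innocuous can still separate members from non-members.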

8 Interesting characteristics: Only innocuous-looking data was released. Data was HIPAA compliant. Data curator is trusted (NIH). Attack uses background knowledge (HapMap data set) available in the public domain. Attack uses an unanticipated algorithm. Need a rigorous privacy guarantee!

9 Differential Privacy [Dwork-McSherry-Nissim-Smith-06]: Rigorous privacy guarantee with important properties (e.g., handles attacks with background information, composes nicely, is robust). Hundreds of papers in theory, databases, statistics, systems, programming languages. Downloadable libraries (PINQ), implemented algorithms. Intuition: The presence or absence of a single individual in the data set cannot be inferred from the output of the algorithm.

10 Differential Privacy [Dwork-McSherry-Nissim-Smith-06]: A database is a collection of rows (tuples) from a universe $U$. Databases $D, D'$ are neighboring if they differ in only one row (i.e., one individual). Example: $D = \{\text{GWAS test population}\}$, $D' = D \cup \{\text{Moritz's DNA}\}$. Definition. A randomized algorithm $M$ is $\varepsilon$-differentially private if for all neighboring $D, D'$ and every set $S$ of outcomes: $\Pr[M(D) \in S] \le (1 + \varepsilon) \Pr[M(D') \in S]$. Think of $\varepsilon = 0.01$.

11 Differential Privacy [Dwork-McSherry-Nissim-Smith-06]: A database is a collection of rows (tuples) from a universe $U$. Databases $D, D'$ are neighboring if they differ in only one row (i.e., one individual). Example: $D = \{\text{GWAS test population}\}$, $D' = D \cup \{\text{Moritz's DNA}\}$. Definition. A randomized algorithm $M$ is $\varepsilon$-differentially private if for all neighboring $D, D'$ and every set $S$ of outcomes: $\Pr[M(D) \in S] \le \exp(\varepsilon) \Pr[M(D') \in S]$. Think of $\varepsilon = 0.01$.

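12 Definition. A randomized algorithm $M$ is $\varepsilon$-differentially private if for all neighboring $D, D'$ and every set of outcomes $S$: $\Pr[M(D) \in S] \le \exp(\varepsilon) \Pr[M(D') \in S]$. [Figure: output distributions of $M(D)$ and $M(D')$ over the outputs; density ratio bounded by $\exp(\varepsilon)$.]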
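
A short worked reading of the definition for the suggested value of $\varepsilon$. Because the neighboring relation is symmetric, the definition can be applied with the two databases swapped, which gives a two-sided bound:

```latex
\[
  e^{-\varepsilon}\,\Pr[M(D') \in S]
  \;\le\; \Pr[M(D) \in S]
  \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S],
  \qquad e^{0.01} \approx 1.01005 .
\]
```

So for $\varepsilon = 0.01$ the probability of any outcome changes by roughly one percent when a single row changes, and $\exp(\varepsilon)$ and $1 + \varepsilon$ are nearly the same factor for small $\varepsilon$.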

13 What does it mean? Suppose the database teaches that smoking causes cancer. Smoker S's insurance goes up. This is true even if S is not in the database! Learning that smoking causes cancer is the whole point. Differential privacy: No harm is caused by participation in a data set; automatically resilient against GWAS-style attacks.

14 Is it too strong? Some tasks you can do while guaranteeing DP: Statistical queries on the data set (data cubes, contingency tables, range queries). Learning algorithms (SQ learning, SVM, PCA, logistic regression, decision trees, clustering, online learning). Monitoring data streams (heavy hitters, norm estimation). In this talk: focus on statistical queries.

15 Statistical Queries with Differential Privacy. [Diagram: a trusted curator holds the data set $D$ and receives a set $Q$ of statistical queries; the analyst receives a synopsis $S: Q \to \mathbb{R}$.] Synopsis: e.g., list of answers, synthetic data set, data structure. Requirement: The synopsis satisfies differential privacy. Maximize the usefulness of the synopsis. Here: accuracy of answers.

16 Statistical queries. Example: How many people in $D$ smoke and have cancer? Generally, specified by a predicate $f: U \to \{0,1\}$. The answer $f(D) = \sum_{i \in D} f(i)$ ranges between $0$ and $n$. Fact: For all neighboring $D, D'$ we have $|f(D) - f(D')| \le 1$.
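
A minimal sketch of a statistical (counting) query, assuming rows are stored as small dictionaries. The field names and the toy data are hypothetical; the point is only that adding one person changes the answer by at most 1.

```python
def count_query(rows, predicate):
    """Answer f(D): the number of rows in D that satisfy the predicate f."""
    return sum(1 for row in rows if predicate(row))

def smokes_and_has_cancer(row):
    """The predicate f from the example question."""
    return row["smoker"] and row["cancer"]

# Hypothetical toy data set D and a neighboring data set D'
# that contains one additional individual.
D = [
    {"smoker": True, "cancer": True},
    {"smoker": True, "cancer": False},
    {"smoker": False, "cancer": False},
]
D_prime = D + [{"smoker": True, "cancer": True}]

answer = count_query(D, smokes_and_has_cancer)
answer_prime = count_query(D_prime, smokes_and_has_cancer)
print(answer, answer_prime)
```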

17 Laplacian Mechanism [DMNS06]. Given query $f$: 1. Compute the true answer $f(D)$. 2. Output $f(D) + \mathrm{Lap}(1/\varepsilon)$.

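18 Laplacian Mechanism [DMNS06]. Given query $f$: 1. Compute the true answer $f(D)$. 2. Output $f(D) + \mathrm{Lap}(1/\varepsilon)$, where the noise density is proportional to $\exp(-\varepsilon |z|)$. [Figure: output densities of $M(D)$ and $M(D')$, centered at $f(D)$ and $f(D')$; density ratio bounded by $\exp(\varepsilon) \approx 1 + \varepsilon$.]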
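
A small sketch of the mechanism for a single counting query, using only the Python standard library. Sampling Laplace noise as the difference of two exponentials is an implementation choice made here, not something the slides prescribe, and the true answers 42 and 43 are made-up values for two neighboring data sets. The last lines check numerically, on a grid of outputs, that the two output densities never differ by more than a factor of exp(ε).

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def laplace_mechanism(true_answer, epsilon, rng):
    """Output f(D) + Lap(1/epsilon) for a query whose answer changes by at most 1."""
    return true_answer + laplace_noise(1 / epsilon, rng)

def output_density(z, true_answer, epsilon):
    """Density of the mechanism's output at z when the true answer is `true_answer`."""
    return (epsilon / 2) * math.exp(-epsilon * abs(z - true_answer))

epsilon = 0.01
rng = random.Random(0)
noisy_answer = laplace_mechanism(42, epsilon, rng)

# Neighboring data sets give true answers that differ by at most 1,
# here 42 and 43. Compare the two output densities on a grid of outputs.
worst_ratio = max(
    output_density(z, 42, epsilon) / output_density(z, 43, epsilon)
    for z in range(-500, 600)
)
print(noisy_answer, worst_ratio, math.exp(epsilon))
```

With ε = 0.01 the noise has typical magnitude around 100, which is only tolerable when the data set is much larger than that.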

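19 Laplacian Mechanism [DMNS06]. Given query $f$: 1. Compute the true answer $f(D)$. 2. Output $f(D) + \mathrm{Lap}(1/\varepsilon)$, where the noise density is proportional to $\exp(-\varepsilon |z|)$. Handle $k$ queries by adding $\mathrm{Lap}(k/\varepsilon)$ independently to each answer. Error scales as $O(k/\varepsilon)$ per query! Need $k \ll |D|$.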

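20 Laplacian Mechanism [DMNS06]. Given query $f$: 1. Compute the true answer $f(D)$. 2. Output $f(D) + \mathrm{Lap}(1/\varepsilon)$, where the noise density is proportional to $\exp(-\varepsilon |z|)$. Handle $k$ queries by adding $\mathrm{Lap}(k/\varepsilon)$ independently to each answer. Error scales as $O(k/\varepsilon)$ per query! Need $k \ll |D|$. Can we answer more queries? YES!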
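
A sketch of the k-query version, assuming the same standard-library Laplace sampler as an implementation choice. The data (a list of ages) and the threshold predicates are hypothetical. Each answer gets its own independent Lap(k/ε) noise, so the typical error per query grows linearly in k.

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def answer_queries(rows, predicates, epsilon, rng):
    """Answer k counting queries, adding independent Lap(k/epsilon) noise to each."""
    k = len(predicates)
    scale = k / epsilon
    return [
        sum(1 for row in rows if predicate(row)) + laplace_noise(scale, rng)
        for predicate in predicates
    ]

# Hypothetical data: ages of 1,000 people, and k = 5 threshold queries.
rng = random.Random(0)
ages = [rng.randint(18, 90) for _ in range(1000)]
predicates = [lambda age, t=t: age >= t for t in (30, 40, 50, 60, 70)]

noisy_answers = answer_queries(ages, predicates, epsilon=1.0, rng=rng)
print([round(a, 1) for a in noisy_answers])
```

Here the expected absolute error per answer is k/ε = 5, small next to 1,000 rows; with thousands of queries the noise would swamp the counts, which is the limitation the slide points at.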

21 Multiplicative Weights Approach* [H-Rothblum 10, Gupta-H-Roth-Ullman 11, H-McSherry-Ligett 12]. Nearly optimal accuracy. High accuracy on huge query sets: can have k poly(n). Simple and scalable implementation gives significant empirical improvements on real-world queries and data sets. Output is a synthetic data set. * Previous work using different ideas: Blum-Ligett-Roth 08, Dwork-Naor-Reingold-Rothblum-Vadhan 09, Dwork-Rothblum-Vadhan 10, Roth-Roughgarden 10

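22 Histogram View. Represent $D$ as a normalized histogram vector $x \in \mathbb{R}^N$ where $N = |U|$ and $x_i$ = fraction of $i$'s in $D$. [Figure: histogram over coordinates $1, \dots, N$, with the callout "must map to similar output distributions!"] A statistical query becomes a vector $f \in \{0,1\}^N$. Answer: $f(x) = \langle f, x \rangle$ (inner product).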
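
A minimal sketch of the histogram view, assuming a tiny hypothetical universe of two binary attributes (smoker, cancer), so N = 4. Because the histogram is normalized, the inner product gives the fraction of rows satisfying the query; multiplying by the number of rows recovers the count.

```python
from collections import Counter
from itertools import product

# Hypothetical universe U: all (smoker, cancer) combinations, so N = 4.
universe = list(product([0, 1], repeat=2))

def normalized_histogram(rows, universe):
    """x_i = fraction of rows in D equal to the i-th universe element."""
    counts = Counter(rows)
    return [counts[u] / len(rows) for u in universe]

def query_vector(predicate, universe):
    """Represent a predicate f: U -> {0,1} as a vector in {0,1}^N."""
    return [1 if predicate(u) else 0 for u in universe]

def answer(f, x):
    """f(x) = <f, x>, the inner product of query vector and histogram."""
    return sum(fi * xi for fi, xi in zip(f, x))

D = [(1, 1), (1, 0), (0, 0), (1, 1)]
x = normalized_histogram(D, universe)
f = query_vector(lambda u: u == (1, 1), universe)   # smokes and has cancer

fraction = answer(f, x)
print(x, f, fraction, fraction * len(D))
```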

23 Algorithm. Input: histogram $x$, query set $Q$. Idea: Maintain a differentially private histogram $x_t$ with $x_0$ uniform, and increase the quality of $x_t$ iteratively. For $t = 1, \dots, T$: 1. Find a bad query: find the query $f_t$ which maximizes $|f_t(x_{t-1}) - f_t(x)|$. 2. Improve the histogram: obtain $x_t$ from $x_{t-1}$ using the multiplicative weights update defined by $f_t$. Output: $x_T$.

26 Finding maximal violations (exponential mechanism [MT 07]). [Figure: bar chart of the violations $|f_t(x) - f_t(x_{t-1})|$ for the queries $f_1, f_2, f_3, \dots, f_k$.]

27 Finding maximal violations (exponential mechanism [MT 07]). [Same figure.] Add $\mathrm{Lap}(1/\varepsilon)$ to each violation.

28 Finding maximal violations (exponential mechanism [MT 07]). [Same figure.] Add $\mathrm{Lap}(1/\varepsilon)$ to each violation. Pick the maximal violation.

29 Finding maximal violations (exponential mechanism [MT 07]). [Same figure.] Add $\mathrm{Lap}(1/\varepsilon)$ to each violation. Pick the maximal violation. Lemma. The expected violation is $\mathrm{MAX} - \log(k)/\varepsilon$.
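
A sketch of the selection step as the slides draw it: add Laplace noise to every violation and return the index of the largest noisy value. The slides cite the exponential mechanism [MT 07] for this step; the code below follows the picture rather than implementing the exponential mechanism itself. The violation scores are made-up numbers, and whether 1/ε is the right noise scale depends on how much one individual can change a score, which this sketch does not check.

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def pick_noisy_max(violations, epsilon, rng):
    """Add Lap(1/epsilon) to each violation and return the index of the largest."""
    noisy = [v + laplace_noise(1 / epsilon, rng) for v in violations]
    return max(range(len(noisy)), key=noisy.__getitem__)

# Hypothetical violations of k = 4 queries; the third one stands out clearly.
violations = [0, 0, 1000, 0]
chosen = pick_noisy_max(violations, epsilon=1.0, rng=random.Random(0))
print(chosen)
```

When one violation dominates, the noise almost never changes the winner; when several are close, any of them may be picked, which is consistent with the lemma's loss of about log(k)/ε.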

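31 Multiplicative Weights Update. [Figure: histograms of the estimate $x_{t-1}$ and the input $x$ over coordinates $1, \dots, N$, with the query $f_t$ marked; before the update.] Suppose $f_t(x_{t-1})$ is too small!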

32 Multiplicative Weights Update. [Same figure, after the update.] $x_t(i) = x_{t-1}(i) \, (1 + \eta f_t(i)) / Z$, with $\eta$ suitably chosen and $Z$ = normalization.
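
The update rule can be sketched in a few lines. The starting histogram, the query, and the value of η below are made-up illustration values; the slide only says η is suitably chosen. Using a negative η for the opposite case, where the estimate's answer is too large, is an assumption of this sketch.

```python
def mw_update(x_prev, f, eta):
    """x_t(i) = x_{t-1}(i) * (1 + eta * f(i)) / Z, with Z chosen so x_t sums to 1."""
    weights = [xi * (1 + eta * fi) for xi, fi in zip(x_prev, f)]
    z = sum(weights)
    return [w / z for w in weights]

def answer(f, x):
    """f(x) = <f, x>."""
    return sum(fi * xi for fi, xi in zip(f, x))

# Hypothetical example with N = 5: uniform estimate, query on the first two coordinates.
x_prev = [0.2] * 5
f_t = [1, 1, 0, 0, 0]

# The estimate's answer is assumed too small, so eta > 0 shifts mass toward f_t.
x_next = mw_update(x_prev, f_t, eta=0.5)
print(answer(f_t, x_prev), answer(f_t, x_next))
```

Coordinates where the query is 1 gain weight and all others lose weight through the normalization, so the estimate's answer on this query moves toward the true answer.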

34 Privacy/Utility analysis. Privacy: The algorithm is a composition of $T$ $\varepsilon$-differentially private steps. Hence, it is $\varepsilon T$-differentially private overall by the Composition theorem [DMNS06, DRV10]. Utility: Why can we choose $T$ small? Potential function $\mathrm{RE}(x \,\|\, x_t) \in [0, \log N]$. Lemma: The potential drops with every update by roughly $(f_t(x) - f_t(x_{t-1}))^2$. Conclusion: $O(\alpha^{-2} \log N)$ steps give accuracy $\alpha$.
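
The step from the lemma to the conclusion can be spelled out as follows. Write $\Phi_t$ for the slide's relative-entropy potential after step $t$, and suppose every selected query still has violation at least $\alpha$. The sketch ignores constants and the noise in the selection step, and the numbers in the example are made up.

```latex
\[
  0 \;\le\; \Phi_T \;\le\; \Phi_0 - T\alpha^2 \;\le\; \log N - T\alpha^2
  \quad\Longrightarrow\quad
  T \;\le\; \frac{\log N}{\alpha^2}.
\]
\[
  \text{Example: } N = 2^{20},\ \alpha = 0.1:\qquad
  T \;\le\; \frac{20 \ln 2}{0.01} \approx 1386,
  \qquad \text{total privacy cost } \varepsilon T .
\]
```

So the number of rounds depends only logarithmically on the universe size, and the per-step $\varepsilon$ has to be set with the total $\varepsilon T$ in mind.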

35 Implementation. Computational bottleneck: enumerating over all $N$ coordinates; $N$ could be exponential (high-dimensional data). Parallelizable, scalable implementation with heuristic tweaks [H-McSherry-Ligett 12]. Handles huge data sets with >100 attributes (so N ). Works well for data cubes and range queries (important statistical query classes) on real-world data.

36 One-dimensional range queries on blood transfusion data set. [Figure: two plots with legend MW+EM and SVD, annotated "lower bound on previous approach [Li-Miklau11]"; y-axis = average squared error per query (log scale); x-axis of one plot = number of queries (randomly chosen queries, epsilon fixed at 0.1); x-axis of the other plot = epsilon value (2000 randomly chosen queries).]

37 Two-dimensional range queries on blood transfusion data set. [Figure: two plots with legend MW+EM and SVD, annotated "lower bound on previous approach [Li-Miklau11]"; y-axis = average squared error per query (log scale); x-axis of one plot = number of queries (randomly chosen queries, epsilon fixed at 0.1); x-axis of the other plot = epsilon value (2000 randomly chosen queries).]

38 Scalability. MW successfully ignores irrelevant attributes! [Figure: two plots with axes "Elapsed Milliseconds" and "Maximum Error": one for the Adult data set with binary attributes, one for the Adult data set with binary attributes + 50 random attributes; very similar shapes.]

39 Scalability. Most time is spent evaluating queries; MW logic is negligible. [Figure: two plots on synthetic data with a varying number of attributes, each showing milliseconds in MW and milliseconds in total.]

40 Summary. Differential privacy gives a formal privacy guarantee that you can trust. Powerful algorithms achieve differential privacy at surprising levels of utility. It would be exciting to try out differential privacy in an industry-scale application!

41 Guiding future research. Finding good models: Who is sharing data with whom? Where do we need strong privacy guarantees? Finding good assumptions: What assumptions are commonly true for data arising in practice? Finding good problems: What algorithms/tools should we focus on?

42 Thank you

43 Differential Privacy and HIPAA. HIPAA does not imply sufficient privacy protection. Stronger guarantees are needed (increasingly clear to policy makers, see the IOM report "Beyond the HIPAA Privacy Rule"). Stronger guarantees mean less utility in some settings. Does differential privacy comply with current HIPAA regulations? Yes and no. The Statistician Provision gives some room for differential privacy. The Safe Harbor Provision is preferred in practice for legal reasons.