Differential Privacy in Health Care Analytics and Medical Research
An interactive tutorial
Speaker: Moritz Hardt, Theory Group, IBM Almaden
February 21, 2012
Overview
1. Releasing medical data: what could go wrong?
2. Differential privacy: a formal privacy guarantee
3. A framework for differentially private data analysis
4. Open problems and discussion
Insert your favorite application: health care analytics, medical decision support, preventing epidemics, drug development.
Medical data: e.g., patient records, diagnostic data, DNA sequences, drug purchases, insurance data.
Goal: protect the privacy of individuals while allowing useful analyses.
What could go wrong?
Many examples, several involving medical data.
Here: the case of genome-wide association studies (GWAS).
GWAS setup:
1. NIH takes the DNA of, say, 100 test candidates with a common phenotype (e.g., a certain disease).
2. NIH releases the minor allele frequencies (MAFs) of the test population at, say, 100,000 positions (SNPs).
Goal: find associations between SNPs and the phenotype.
Attack on GWAS data [Homer et al.]
Membership of an individual with known DNA in the test group can be inferred from the published data!

SNP      Test population MAF   Moritz's DNA (minor allele?)   Reference population MAF (HapMap, public)
1        0.02                  NO                             0.01
2        0.03                  NO                             0.04
3        0.05                  YES                            0.04
...      ...                   ...                            ...
100000   0.02                  YES                            0.01

Comparing Moritz's alleles against both frequency tables reveals that he is probably a member of the test population.
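To make the style of attack concrete, here is a minimal, hedged simulation of such a membership test. The scoring statistic and all parameters are illustrative stand-ins, not the exact test of Homer et al., and the SNP count is scaled down from the slide's 100,000.

    import random

    random.seed(0)
    NUM_SNPS = 10_000   # scaled down from the 100,000 SNPs on the slide

    # Illustrative reference allele frequencies (stand-in for public HapMap data).
    ref = [random.uniform(0.05, 0.5) for _ in range(NUM_SNPS)]

    def simulate_genome(freqs):
        # 1 = individual carries the minor allele at this SNP, 0 = does not.
        return [1 if random.random() < f else 0 for f in freqs]

    # Test population of 100 individuals; the released data are its MAFs.
    population = [simulate_genome(ref) for _ in range(100)]
    target = population[0]   # a member whose DNA the attacker knows
    test_maf = [sum(g[i] for g in population) / len(population)
                for i in range(NUM_SNPS)]

    def membership_score(genome, test_freqs, ref_freqs):
        # At each SNP, ask: is this genome closer to the test MAF or to the
        # reference MAF?  A member drags the test MAF slightly toward his
        # own alleles at every SNP, so his summed score is clearly positive,
        # while an outsider's score stays near zero.
        return sum(abs(g - r) - abs(g - t)
                   for g, t, r in zip(genome, test_freqs, ref_freqs))

    outsider = simulate_genome(ref)   # same ancestry, not in the test group
    print("member score:  ", membership_score(target, test_maf, ref))
    print("outsider score:", membership_score(outsider, test_maf, ref))

With 10,000 SNPs the member's score concentrates well above the outsider's sampling noise, even though each individual SNP frequency looks innocuous on its own.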
Interesting characteristics:
- Only innocuous-looking data was released.
- The data was HIPAA compliant.
- The data curator (NIH) is trusted.
- The attack uses background knowledge (the HapMap data set) available in the public domain.
- The attack uses an unanticipated algorithm.
We need a rigorous privacy guarantee!
Differential Privacy [Dwork-McSherry-Nissim-Smith 06]
- A rigorous privacy guarantee with important properties: it handles attacks with background information, composes nicely, and is robust.
- Hundreds of papers in theory, databases, statistics, systems, and programming languages.
- Downloadable libraries (PINQ) and implemented algorithms.
Intuition: the presence or absence of a single individual in the data set cannot be inferred from the output of the algorithm.
Differential Privacy [Dwork-McSherry-Nissim-Smith 06]
A database is a collection of rows (tuples) from a universe U. Databases D, D′ are neighboring if they differ in only one row (i.e., one individual).
Example: D = {GWAS test population}, D′ = D ∪ {Moritz's DNA}
Definition. A randomized algorithm M is ε-differentially private if for all neighboring D, D′ and every set S of outcomes:
    Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S]
(For small ε, exp(ε) ≈ 1 + ε.) Think of ε = 0.01.
Definition. A randomized algorithm M is ε-differentially private if for all neighboring D, D′ and every set S of outcomes:
    Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S]
[Figure: the output densities of M(D) and M(D′) nearly coincide; their ratio is pointwise bounded by exp(ε).]
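The definition is easiest to verify on a toy mechanism. The sketch below uses randomized response, a standard ε-differentially private primitive that does not appear on the slides: each bit is reported truthfully with probability exp(ε)/(1 + exp(ε)), and the likelihood ratio of any output under neighboring inputs is exactly exp(ε).

    import math
    import random

    def randomized_response(bit, eps):
        # Report the true bit with probability exp(eps)/(1 + exp(eps)),
        # otherwise flip it.  For any output b and any two inputs,
        # Pr[output = b | input] differs by a factor of at most exp(eps).
        p_truth = math.exp(eps) / (1 + math.exp(eps))
        return bit if random.random() < p_truth else 1 - bit

    eps = 0.5
    p = math.exp(eps) / (1 + math.exp(eps))
    # Worst-case likelihood ratio between neighboring inputs 0 and 1:
    print(p / (1 - p), "==", math.exp(eps))   # the bound is tight here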
What does it mean?
Suppose the database teaches that smoking causes cancer. Then smoker S's insurance premium goes up. This is true even if S is not in the database! Learning that smoking causes cancer is the whole point.
Differential privacy: no harm is caused by participation in the data set. In particular, it is automatically resilient against GWAS-style attacks.
Is differential privacy too strong? Some tasks you can do while guaranteeing DP:
- Statistical queries on the data set: data cubes, contingency tables, range queries
- Learning algorithms: SQ learning, SVM, PCA, logistic regression, decision trees, clustering, online learning
- Monitoring data streams: heavy hitters, norm estimation
In this talk: focus on statistical queries.
Statistical Queries with Differential Privacy
A trusted curator holds the data set D and receives a set Q of statistical queries from the analyst. The curator releases a synopsis S: Q → R, e.g., a list of answers, a synthetic data set, or a data structure.
Requirements: the synopsis satisfies differential privacy, and its usefulness is maximized (here: accuracy of the answers).
Statistical queries
Example: How many people in D smoke and have cancer?
Generally, a query is specified by a predicate f: U → {0,1}.
Answer: f(D) = Σ_{i ∈ D} f(i), ranging between 0 and n.
Fact: for every neighboring D, D′ we have |f(D) − f(D′)| ≤ 1.
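A minimal sketch of such a counting query and the sensitivity fact; the records and predicate are made up for illustration.

    # Hypothetical rows: (smokes, has_cancer) flags, one row per individual.
    D = [(True, True), (True, False), (False, True), (False, False), (True, True)]

    def f(row):
        # Predicate f: U -> {0, 1} for "smokes and has cancer".
        smokes, has_cancer = row
        return 1 if smokes and has_cancer else 0

    def answer(db):
        # f(D) = sum of the 0/1 predicate over all rows; between 0 and n.
        return sum(f(row) for row in db)

    D_neighbor = D[:-1]   # neighboring database: one individual removed
    print(answer(D), answer(D_neighbor))               # 2 1
    print(abs(answer(D) - answer(D_neighbor)) <= 1)    # sensitivity fact holds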
Laplacian Mechanism [DMNS06]
Given query f:
1. Compute the true answer f(D).
2. Output f(D) + Lap(1/ε), where Lap(1/ε) has density proportional to exp(−ε|z|).
[Figure: the noisy output distributions centered at f(D) and f(D′) overlap heavily; their density ratio is bounded by exp(ε).]
Handle k queries by adding Lap(k/ε) noise to each answer independently. The error then scales as O(k/ε) per query, so we need k ≪ |D|.
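A sketch of the mechanism as stated on the slide. The Laplace sampler uses the standard fact that a difference of two i.i.d. exponentials is Laplace-distributed, and the k/ε noise scale corresponds to simple composition across k sensitivity-1 queries.

    import random

    def lap(scale):
        # The difference of two i.i.d. exponentials is Laplace-distributed.
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    def laplace_mechanism(true_answers, eps):
        # k counting queries, each of sensitivity 1.  Splitting the privacy
        # budget eps evenly means noise of scale k/eps per query (simple
        # composition), so the per-query error grows as O(k/eps): useful
        # only while k is much smaller than |D|.
        k = len(true_answers)
        return [a + lap(k / eps) for a in true_answers]

    print(laplace_mechanism([120.0, 45.0, 3.0], eps=0.1))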
Can we answer more queries? YES!
Multiplicative Weights Approach* [H-Rothblum 10, Gupta-H-Roth-Ullman 11, H-McSherry-Ligett 12]
- Nearly optimal accuracy: high accuracy on huge query sets; can have k ≫ poly(n)
- Simple and scalable implementation gives significant empirical improvements on real-world queries and data sets
- Output is a synthetic data set
* Previous work using different ideas: Blum-Ligett-Roth 08, Dwork-Naor-Reingold-Rothblum-Vadhan 09, Dwork-Rothblum-Vadhan 10, Roth-Roughgarden 10
Histogram View
Represent D as a normalized histogram vector x ∈ R^N, where N = |U| and x_i = fraction of rows in D equal to i.
[Figure: histogram with bars over universe elements 1, 2, 3, ..., N; neighboring databases give histograms that must map to similar output distributions.]
A statistical query becomes a vector f ∈ {0,1}^N. The answer is f(x) = ⟨f, x⟩ (inner product).
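A small sketch of the histogram view on a made-up universe:

    # Toy universe of N = 4 row types; a database is a multiset of them.
    U = ["smoker+cancer", "smoker only", "cancer only", "neither"]
    D = ["smoker+cancer", "smoker only", "smoker only", "neither", "smoker+cancer"]

    # Normalized histogram x in R^N: x[i] = fraction of rows equal to U[i].
    x = [D.count(u) / len(D) for u in U]   # [0.4, 0.4, 0.0, 0.2]

    # A statistical query is a 0/1 indicator vector over the universe, and
    # its normalized answer is the inner product <f, x>.
    f = [1, 1, 0, 0]   # "what fraction smokes?"
    print(sum(fi * xi for fi, xi in zip(f, x)))   # 0.8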
Algorithm:
Input: histogram x, query set Q
Idea: maintain a differentially private histogram x_t; start with x_0 uniform and increase the quality of x_t iteratively.
For t = 1, ..., T:
1. Find a bad query: find the query f_t ∈ Q that maximizes |f_t(x_{t−1}) − f_t(x)|.
2. Improve the histogram: obtain x_t from x_{t−1} using the multiplicative weights update defined by f_t.
Output: x_T
Finding maximal violations (exponential mechanism [MT07])
[Figure: bars showing the violation |f_j(x) − f_j(x_{t−1})| for each query f_1, f_2, f_3, ..., f_k]
1. Add Lap(1/ε) noise to each violation.
2. Pick the query with the maximal noisy violation.
Lemma. The expected violation of the picked query is at least MAX − O(log k / ε).
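A hedged sketch of the selection step as described on the slide (noisy max with Laplace noise, which realizes the exponential mechanism's guarantee up to constants); answer evaluates a query on a histogram as on the previous slide.

    import random

    def lap(scale):
        # Laplace noise as the difference of two i.i.d. exponentials.
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    def answer(f, x):
        # <f, x> on the histogram representation.
        return sum(fi * xi for fi, xi in zip(f, x))

    def pick_bad_query(queries, x_true, x_est, eps):
        # Noisy max: perturb every violation with Laplace noise and return
        # the index of the largest noisy violation.  The expected violation
        # of the winner is within O(log k / eps) of the true maximum.
        scores = [abs(answer(f, x_est) - answer(f, x_true)) + lap(1.0 / eps)
                  for f in queries]
        return max(range(len(queries)), key=lambda j: scores[j])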
Multiplicative Weights Update
Compare the estimate x_{t−1} with the input x on the query f_t, and suppose f_t(x_{t−1}) is too small.
[Figure: the bars of x_{t−1} vs. the input x over bins 1, ..., N, before and after the update]
Update: x_t(i) = x_{t−1}(i) · (1 + η f_t(i)) / Z, with η suitably chosen and Z a normalization factor.
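A sketch of the update rule from the slide; the learning rate η and the direction argument (which also covers the symmetric case where the estimate answers too high) are illustrative choices.

    def mw_update(x_est, f, direction, eta=0.1):
        # Multiplicative weights step from the slide: scale each coordinate
        # by (1 + eta * f(i)) in the direction of the needed correction,
        # then renormalize so that x stays a probability distribution.
        # direction = +1 if the estimate answered too low, -1 if too high.
        scaled = [xi * (1 + eta * direction * fi) for xi, fi in zip(x_est, f)]
        Z = sum(scaled)   # normalization constant
        return [v / Z for v in scaled]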
Privacy/Utility analysis
Privacy: the algorithm is a composition of T ε-differentially private steps. Hence it is εT-differentially private overall, by the composition theorem [DMNS06, DRV10].
Utility: why can we choose T small?
Potential function: the relative entropy RE(x ‖ x_t) ∈ [0, log N].
Lemma. The potential drops with every update by roughly (f_t(x) − f_t(x_{t−1}))².
Conclusion: T = O(log N / α²) steps suffice for accuracy α.
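Putting the pieces together, a compact end-to-end sketch of the framework, reusing lap, answer, pick_bad_query, and mw_update from the sketches above. The privacy budgeting is deliberately simplified, and the explicit noisy measurement step follows the MWEM variant rather than the slide's exact presentation.

    def private_mw(x_true, queries, T=20, eps_step=0.05, eta=0.5):
        # Each of the T rounds spends eps_step on selection and eps_step on
        # measurement, so the whole run is 2*T*eps_step-differentially
        # private by simple composition.
        N = len(x_true)
        x_est = [1.0 / N] * N   # x_0: the uniform histogram
        for _ in range(T):
            f = queries[pick_bad_query(queries, x_true, x_est, eps_step)]
            # Noisy measurement of the true answer (as in MWEM); the slides
            # fold this step into the analysis.
            measured = answer(f, x_true) + lap(1.0 / eps_step)
            direction = 1 if measured > answer(f, x_est) else -1
            x_est = mw_update(x_est, f, direction, eta)
        return x_est   # private synopsis: answers any q in Q as <q, x_est>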
Implementation
- Computational bottleneck: enumerating over all N coordinates; N can be exponential (high-dimensional data).
- Parallelizable, scalable implementation with heuristic tweaks [H-McSherry-Ligett 12].
- Handles huge data sets with >100 attributes (so N ≈ 2^100).
- Works well for data cubes and range queries (important statistical query classes) on real-world data.
One-dimensional range queries on a blood transfusion data set
[Figure: two panels, y-axis = average squared error per query (log scale), comparing MW+EM, SVD, and the lower bound on the previous approach [Li-Miklau 11]. Left panel: x-axis = number of randomly chosen queries, 0 to 5000, with ε fixed at 0.1. Right panel: x-axis = ε value, 0 to 0.12, with 2000 randomly chosen queries.]
Two-dimensional range queries on a blood transfusion data set
[Figure: same setup as above for two-dimensional ranges; y-axis = average squared error per query (log scale, up to 10^7). Left panel: x-axis = number of randomly chosen queries (ε = 0.1). Right panel: x-axis = ε value (2000 randomly chosen queries). Series: MW+EM, SVD, and the [Li-Miklau 11] lower bound.]
Scalability
MW successfully ignores irrelevant attributes!
[Figure: maximum error vs. elapsed milliseconds on the Adult data set, once with binary attributes only and once with 50 additional random attributes; the two curves have very similar shapes.]
Scalability
Most time is spent evaluating the queries; the MW logic itself is negligible.
[Figure: milliseconds spent in MW vs. total milliseconds on synthetic data with a varying number of attributes, from 10 to 100 (left panel) and from 100 to 1000 (right panel).]
Summary
- Differential privacy gives a formal privacy guarantee that you can trust.
- Powerful algorithms achieve differential privacy at surprising levels of utility.
- It would be exciting to try out differential privacy in an industry-scale application!
Guiding future research
- Finding good models: Who is sharing data with whom? Where do we need strong privacy guarantees?
- Finding good assumptions: What assumptions are commonly true for data arising in practice?
- Finding good problems: Which algorithms/tools should we focus on?
Thank you
Differential Privacy and HIPAA
- HIPAA compliance does not imply sufficient privacy protection; stronger guarantees are needed (increasingly clear to policy makers, see the IOM report "Beyond the HIPAA Privacy Rule").
- Stronger guarantees mean less utility in some settings.
- Does differential privacy comply with current HIPAA regulations? Yes and no: the Statistician Provision gives some room for differential privacy, but the Safe Harbor Provision is preferred in practice for legal reasons.