Data Privacy Aspects in Big Data Facilitating Medical Data Sharing. Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland


1 Data Privacy Aspects in Big Data Facilitating Medical Data Sharing Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland September 2015

2 Content Introduction / Motivation Focus Application I: Privacy Preserving Medical Data Publishing to Support Intended Analyses Focus Application II: Identification of Privacy Vulnerabilities Summary 2

3 Data sharing. Individuals' data are increasingly shared: Netflix published movie ratings of 500K subscribers; AOL published 20M search query terms of 658K web users; TomTom sold customers' location (GPS) data to the Dutch police; the eMERGE consortium published patient data related to genome-wide association studies to biorepositories (dbGaP); Orange provided call information about its mobile subscribers, as part of the D4D challenge on mobile phone data. Benefits of data sharing: personalization (e.g., Netflix's data mining contest aimed to improve movie recommendations based on personal preferences); marketing (e.g., Tesco made 53M last year from selling shopping patterns to retailers and manufacturers, such as Nestle and Unilever); social benefits (e.g., promoting medical research studies, improving traffic management, etc.)

4 Data sharing must guarantee privacy and accommodate utility. A popular data sharing scenario (data publishing): data owners → data publisher (trusted) → data recipient (untrusted); the original data is transformed into the released data. Threats to data privacy: identity disclosure, sensitive information disclosure, membership disclosure, inferential disclosure. Data utility requirements: minimal data distortion (general-purpose use); support of specific applications/workloads (e.g., building accurate predictive models, GWAS, etc.)

5 Focus Application I: Privacy Preserving Medical Data Publishing. How can we share medical data in a way that protects patients' privacy while supporting research studies?

6 Electronic Medical Records (EMR) Relational data Registration and demographic data Transaction (set-valued) data Billing information ICD codes* are represented as numbers (up to 5 digits) and denote signs, findings, and causes of injury or disease** Sequential data DNA Text data Clinical notes Electronic Medical Records (EMR) Name YOB ICD DNA Clinical notes Jim , 185 C T (doc1) Mary , A G (doc2) Mary C G (doc3) Carol C G (doc4) Anne , G C (doc5) Anne A T (doc6) * International Statistical Classification of Diseases and Related Health Problems ** Centers for Medicare & Medicaid Services - 6

7 EMR data use in analytics: statistical analysis (e.g., correlation between YOB and ICD code 185, malignant neoplasm of prostate); querying; clustering (e.g., to control epidemics*); classification (e.g., to predict domestic violence**); association rule mining (e.g., to formulate a government policy on hypertension management***), such as IF age in [43,48] AND smoke=yes AND exercise=no AND drink=yes THEN hypertension=yes (sup=2.9%; conf=26%). * Tildesley et al. Impact of spatial clustering on disease transmission and optimal control. PNAS. ** Reis et al. Longitudinal Histories as Predictors of Future Diagnoses of Domestic Abuse: Modelling Study. BMJ: British Medical Journal, 2011. *** Chae et al. Data mining approach to policy analysis in a health insurance domain. Int. J. of Med. Inf.

8 Need for privacy. Why do we need privacy in medical data sharing? If privacy is breached, there are consequences to patients: emotional and economic embarrassment; 62% of individuals worry their EMRs will not remain confidential*; 35% expressed privacy concerns regarding the publishing of their data to dbGaP**. Patients may opt out or provide fake data → difficulty to conduct statistically powered studies. * Health Confidence Survey 2008, Employee Benefit Research Institute. ** Ludman et al. Glad You Asked: Participants' Opinions of Re-Consent for dbGaP Data Submission. Journal of Empirical Research on Human Research Ethics.

9 Need for privacy. If privacy is breached, there are consequences to organizations: legal → HIPAA, EU legislation (95/46/EC, 2002/58/EC, 2009/136/EC, etc.); financial → the average cost of a single data breach is $5.85M in the US and $4.74M in Germany; these countries have the highest per capita cost. Healthcare and Education are the most heavily regulated industries.* * Ponemon Institute Research Report, 2014 Cost of Data Breach Study: Global Analysis.

10 Protecting data privacy: data masking / removal of identifiers. Removing/masking direct identifiers: data owners → data publisher (trusted) → data recipient (untrusted); original data → de-identified data. 1. Locate the direct identifiers (attributes that uniquely identify an individual), such as SSN, patient ID, phone number, etc. 2. Remove or mask them from the data prior to data publishing. Example: Name: John Doe, search query terms: Harry Potter, The King's Speech; Name: Thelma Arnold, search query terms: hand tremors, bipolar, dry mouth, effect of nicotine on the body.
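The masking step described above can be sketched in a few lines. This is an illustrative sketch only; the field names and the set of direct identifiers are assumptions, not part of the slides.

```python
def de_identify(records, direct_identifiers):
    """Drop direct-identifier fields (e.g., SSN, name) from each record
    before publishing; the remaining attributes are released as-is."""
    return [{k: v for k, v in r.items() if k not in direct_identifiers}
            for r in records]

# Hypothetical record; field names are illustrative.
records = [{"Name": "John Doe", "SSN": "123-45-6789", "Age": 20, "Sex": "M"}]
published = de_identify(records, {"Name", "SSN", "PatientID", "Phone"})
```

As the next slide stresses, this step alone is not sufficient: the retained attributes may still act as quasi-identifiers.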

11 Protecting data privacy: data masking / removal of identifiers Masking / removal of direct identifiers is not sufficient! data owners data publisher (trusted) data recipient (untrusted) Original data Released data Main types of threats to data privacy External data Background Knowledge Identity disclosure Sensitive information disclosure Inferential disclosure 11

12 Privacy Threats: Identity disclosure. Identity disclosure in relational data (e.g., patients' demographics): individuals are linked to their published records based on quasi-identifiers (attributes that in combination can identify an individual). De-identified data: (20, NW10, M), (45, NW15, M), (22, NW30, M), (50, NW25, F). External data: Greg (20, NW10, M), Jim (45, NW15, M), Jack (22, NW30, M), Anne (50, NW25, F). 87% of US citizens can be identified by 5-digit ZIP code, gender, and date of birth.* * Sweeney. k-anonymity: a model for protecting privacy. IJUFKS.
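The linkage attack described above is essentially a join on quasi-identifier values between an identified external dataset and the de-identified release. A minimal sketch (function and field names are illustrative assumptions):

```python
def link(external, deidentified, qids):
    """Match each identified external record to de-identified rows that
    share its quasi-identifier values; a unique match discloses identity."""
    matches = {}
    for person in external:
        key = tuple(person[a] for a in qids)
        hits = [r for r in deidentified
                if tuple(r[a] for a in qids) == key]
        if len(hits) == 1:  # unique match => identity disclosed
            matches[person["Name"]] = hits[0]
    return matches

# Toy data mirroring the slide's example.
external = [{"Name": "Greg", "Age": 20, "Postcode": "NW10", "Sex": "M"}]
deid = [{"Age": 20, "Postcode": "NW10", "Sex": "M"},
        {"Age": 45, "Postcode": "NW15", "Sex": "M"}]
```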

13 Identity disclosure in sharing patients' diagnosis codes. Identity disclosure in transaction data (e.g., diagnosis codes): knowing that Mary is diagnosed with benign essential hypertension (ICD code 401.1), an attacker can determine that the second released record belongs to her → all her diagnosis codes are disclosed. Disclosure based on diagnosis codes* is a general problem for other medical terminologies as well (e.g., ICD-10, used in the EU) → sharing data susceptible to this attack may violate legislation. * Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.

14 Real-world identity disclosure cases involving medical data: Group Insurance Commission data linked with the voter list of Cambridge, MA → William Weld, former Governor of MA, re-identified; Chicago homicide database linked with the Social Security Death Index → 35% of murder victims re-identified; adverse drug reaction database linked with public obituaries → a 26-year-old girl who died from a drug re-identified.

15 Issuing attacks on medical datasets. Two-step attack using publicly available voter registration lists and hospital discharge summaries: the attacker links voter(name, ..., zip, dob, sex) with summary(zip, dob, sex, diagnoses), and then with the released data release(diagnoses, DNA). 87% of US citizens can be identified by {dob, gender, ZIP code}.* * Sweeney. k-anonymity: a model for protecting privacy. IJUFKS.

16 Issuing attacks on medical datasets. One-step attack using EMRs (insider's attack)*: an insider with access to the identified EMR(name, ..., diagnoses) links it directly to the released data release(..., diagnoses, DNA). * Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.

17 Case study: Evaluating the effectiveness of the insider's attack. Population: 1.2M de-identified EMRs (ID, ..., diagnoses) from Vanderbilt (VUMC), where each ID is a unique random number. Sample: VNEC, a de-identified/masked EMR sample of 2762 records (patients) derived from the population. Patients from VNEC were involved in a study (GWAS) for the Native Electrical Conduction of the heart; their EMRs were to be deposited into dbGaP, where the data would be made available to support other studies (GWAS).

18 Case study: Evaluating the effectiveness of the insider's attack. Linking Vanderbilt's EMR population to the VNEC dataset on ICD codes: assuming that all ICD codes are used to issue an attack, 96.5% of the patients are susceptible to identity disclosure. (Figure: % of re-identified sample vs. distinguishability on a log scale, i.e., the number of times a set of ICD codes appears in the population, equivalent to support in the data mining literature.)

19 Case study: Evaluating the effectiveness of the insider's attack. Linking on random subsets of ICD codes (combinations of 1, 2, 3, or 10 ICD codes): knowing a random combination of just 2 ICD codes can lead to unique re-identification. (Figure: % of re-identifiable sample vs. distinguishability on a log scale, i.e., the number of times a set of ICD codes appears in the population, equivalent to support count in the data mining literature.)

20 Privacy Threats: Sensitive information disclosure. Individuals are associated with sensitive information. Example (relational data): all records in the de-identified data matching Greg's background knowledge (Age 20, Postcode NW10) have Disease = HIV, so Greg's disease is disclosed. Example (transaction data): Mary is diagnosed with ... and 401.1 → she has Schizophrenia. Sensitive information disclosure can occur without identity disclosure.

21 Privacy Threats: Inferential disclosure. Sensitive knowledge patterns are exposed by data mining, e.g., "75% of patients visit the same physician more than 4 times" (→ unsolicited advertisement) or "60% of white males over 50 suffer from diabetes" (→ customer discrimination), mined from stream data collected by health monitoring systems, electronic medical records, and drug orders & costs. Business rivals can harm data publishers, and insurance, pharmaceutical & marketing companies can harm data owners.* * G. Das and N. Zhang. Privacy risks in health databases from aggregate disclosure. PETRA.

22 Anonymization of demographics. k-anonymity principle*: each record in a relational table T should have the same value over the quasi-identifiers as at least k-1 other records in T; these records collectively form a k-anonymous group. k-anonymity protects from identity disclosure: it protects data from linkage to external sources (triangulation attacks), since the probability that an individual is correctly associated with their record is at most 1/k. Example (2-anonymous data): (4*, NW1*, M), (4*, NW1*, M), (*, NW*, *), (*, NW*, *). * Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. IJUFKS.
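Checking whether a table satisfies k-anonymity is a simple group-by over the quasi-identifiers. A minimal sketch (attribute names follow the slide's example; the function name is an assumption):

```python
from collections import Counter

def is_k_anonymous(table, qids, k):
    """True iff every combination of quasi-identifier values is shared
    by at least k records (i.e., every QID group has size >= k)."""
    groups = Counter(tuple(row[a] for a in qids) for row in table)
    return all(count >= k for count in groups.values())

# The 2-anonymous generalized table from the slide.
released = [
    {"Age": "4*", "Postcode": "NW1*", "Sex": "M"},
    {"Age": "4*", "Postcode": "NW1*", "Sex": "M"},
    {"Age": "*",  "Postcode": "NW*",  "Sex": "*"},
    {"Age": "*",  "Postcode": "NW*",  "Sex": "*"},
]
```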

23 Attack on k-anonymous data. Homogeneity attack*: if all sensitive values in a k-anonymous group are the same → sensitive information disclosure. Example: Greg (Age 40, Postcode NW10) falls in the 2-anonymous group {(4*, NW1*, HIV), (4*, NW1*, HIV)}, so the attacker is confident that Greg suffers from HIV. * Machanavajjhala et al. l-diversity: Privacy Beyond k-anonymity. ICDE.

24 l-diversity principle for demographics. l-diversity*: a relational table is l-diverse if all groups of records with the same values over the quasi-identifiers (QID groups) contain no less than l well-represented values for the SA. Distinct l-diversity: "l well-represented" → l distinct values. Example: a QID group (4*, NW1*) with diseases {HIV, HIV, HIV, HIV, Flu, Cancer} has three distinct values, but the probability of HIV being disclosed is ~0.67 (4 out of 6). * Machanavajjhala et al. l-diversity: Privacy Beyond k-anonymity. ICDE.
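Distinct l-diversity can be checked by collecting the set of sensitive values per QID group. A minimal sketch (function name is an assumption), reproducing the slide's group of six records:

```python
from collections import defaultdict

def is_distinct_l_diverse(table, qids, sensitive_attr, l):
    """True iff every QID group contains at least l distinct values
    of the sensitive attribute (distinct l-diversity)."""
    groups = defaultdict(set)
    for row in table:
        groups[tuple(row[a] for a in qids)].add(row[sensitive_attr])
    return all(len(values) >= l for values in groups.values())

# One QID group with 4x HIV, 1x Flu, 1x Cancer (as on the slide).
table = [{"Age": "4*", "Postcode": "NW1*", "Disease": d}
         for d in ["HIV", "HIV", "HIV", "HIV", "Flu", "Cancer"]]
```

Note the slide's caveat: the table is distinct 3-diverse, yet HIV can still be inferred with ~0.67 probability, which is why stronger notions (entropy or recursive l-diversity) exist.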

25 Differential privacy. Objective: prevent attackers from inferring any additional information about an individual, regardless of whether the published dataset contains the individual's record or not. ε-differential privacy* is satisfied by a randomized algorithm A if Pr[A(D) = D̃] ≤ exp(ε) · Pr[A(D') = D̃] for all datasets D, D' that differ in one record, and for any possible anonymized dataset D̃, where ε is a constant and the probabilities are over the randomness of A.** Essentially, differential privacy ensures that the outcome of a calculation is insensitive to any one particular record in the dataset. * Dwork. Differential privacy. ICALP. ** Definition from Mohammed et al. Differentially private data release for data mining. KDD.

26 Offering differential privacy. Laplace mechanism: for any function f : D → R^d, the algorithm A that adds independently generated noise with distribution Lap(Δf / ε) to each of its d outputs satisfies ε-differential privacy, where Δf = max_{D,D'} ||f(D) - f(D')||_1 over all datasets D, D' that differ in exactly one record. Example: let f return the number of patients with age < 40 in the table {(20, M), (23, F), (25, M), (42, F)}. The original value is 3 and Δf = 1 (sensitivity), so the released value is 3 + Lap(1/ε). Exponential mechanism: in tasks where adding noise makes no sense (e.g., training a classifier), differential privacy is offered by randomizing the selection of an outcome from a set of possible outcomes. For any function u : (D, t) → R that measures the utility of an output t, an algorithm A that chooses t with probability proportional to exp(ε · u(D,t) / (2 · Δu)) satisfies ε-differential privacy, where Δu = max_{t,D,D'} |u(D,t) - u(D',t)|. For example, the exponential mechanism can be used to select Age or Gender given a function u that scores attributes according to the perceived utility loss. * Dwork. Differential privacy. ICALP. ** Definition from Mohammed et al. Differentially private data release for data mining. KDD.
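The Laplace mechanism for the slide's count query can be sketched directly: a counting query has sensitivity Δf = 1, so adding Lap(1/ε) noise suffices. This is an illustrative sketch (function names are assumptions; noise is sampled via the standard inverse-CDF transform):

```python
import math
import random

def laplace_noise(scale):
    """Sample from Lap(0, scale) using the inverse-CDF transform."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(data, predicate, epsilon):
    """Noisy count query: counting has sensitivity 1, so adding
    Lap(1/epsilon) noise satisfies epsilon-differential privacy."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [20, 23, 25, 42]  # the slide's toy table; true answer is 3
```

Smaller ε means stronger privacy and noisier answers; as ε grows, the released value converges to the true count of 3.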

27 ε-differential privacy. Pros (+): no/few assumptions on adversarial knowledge; composability [1] (privacy holds even when multiple differentially-private datasets are obtained by an adversary); several mechanisms exist for the interactive [2] and the non-interactive scenario [3,4]. Cons (-): overly restrictive → differential privacy leads to very high information loss (several variations [5] and improved mechanisms [6] exist, but the utility remains very low); no real-world applications, and no privacy criterion is associated with the value of ε. Important drawbacks due to the random noise addition: only noisy answers to a limited number and type of queries can be offered, or noisy summary statistics (such as histograms) can be released; anonymous datasets cannot be created/released; individuals may become associated with false information in the output of differential privacy methods; the utility loss is very high compared to syntactic approaches. There are also several misconceptions [7] and susceptibilities to attacks [8] (e.g., Cormode showed that an attacker can infer the sensitive value of an individual fairly accurately by applying Naïve Bayes classification on differentially private data). [1] Ganta et al. Composition attacks and auxiliary information in data privacy. KDD. [2] Dwork. Differential privacy: a survey of results. TAMC. [3] Mohammed et al. Differentially private data release for data mining. KDD. [4] Xiao et al. Differential privacy via wavelet transforms. ICDE. [5] Machanavajjhala et al. Data Publishing against Realistic Adversaries. PVLDB. [6] Ding et al. Differentially private data cubes: optimizing noise sources and consistency. SIGMOD. [7] Kifer et al. No free lunch in data privacy. SIGMOD. [8] Cormode. Personal privacy vs. population privacy: learning to attack anonymization. KDD.

28 Partition-based algorithms for k-anonymity. Main idea: a record projected over the QIDs is treated as a multidimensional point; a subspace (hyper-rectangle) that contains at least k points can form a k-anonymous group → multidimensional global recoding. Design questions: how to partition the space (one attribute at a time: which one to use?) and how to split the selected attribute. Example table: (20, M, HIV), (23, F, HIV), (25, M, Obesity), (27, F, HIV), (28, F, Cancer), (29, F, Obesity).

29 Example of applying Mondrian (k=3). Original data: (20, M, HIV), (23, F, HIV), (25, M, Obesity), (27, F, HIV), (28, F, Cancer), (29, F, Obesity). Anonymized data: ([20-26], {M,F}, HIV), ([20-26], {M,F}, HIV), ([20-26], {M,F}, Obesity), ([27-29], F, HIV), ([27-29], F, Cancer), ([27-29], F, Obesity).
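The partitioning idea above can be sketched as a recursive median split. This is a simplified, single-attribute variant of Mondrian for illustration only (the real algorithm picks the widest dimension at each step and uses hierarchy-aware ranges; here leaves are generalized to the observed [min, max], so ranges may be tighter than on the slide):

```python
def mondrian(records, attr, k):
    """Simplified single-attribute Mondrian: recursively split at the
    median while both halves keep >= k records; then generalize the
    attribute in each leaf partition to its (min, max) range."""
    records = sorted(records, key=lambda r: r[attr])

    def partition(rows):
        mid = len(rows) // 2
        if mid >= k and len(rows) - mid >= k:  # split keeps k-anonymity
            return partition(rows[:mid]) + partition(rows[mid:])
        lo, hi = rows[0][attr], rows[-1][attr]
        return [dict(r, **{attr: (lo, hi)}) for r in rows]

    return partition(records)

# The slide's six records, k=3.
data = [{"Age": a, "Sex": s, "Disease": d} for a, s, d in
        [(20, "M", "HIV"), (23, "F", "HIV"), (25, "M", "Obesity"),
         (27, "F", "HIV"), (28, "F", "Cancer"), (29, "F", "Obesity")]]
```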

30 Clustering-based anonymization algorithms. Main idea: 1. Create clusters containing at least k records with similar values over the QIDs (seed selection, similarity measurement, stopping criterion). 2. Anonymize the records in each cluster separately, using local recoding and/or suppression.

31 Clustering-based anonymization algorithms. Clusters need to be separated → seed selection (furthest-first, random). Clusters need to contain similar values → similarity measurement. Clusters should not be too large → stopping criterion (size-based, quality-based). All these heuristics attempt to improve data utility.

32 Example of bottom-up clustering algorithm (k=2). Original data: (20, M, HIV), (23, F, HIV), (25, M, Obesity), (27, F, HIV), (28, F, Cancer), (29, F, Obesity). Anonymized data: ([20-25], M, HIV), ([20-25], M, Obesity), ([23-27], F, HIV), ([23-27], F, HIV), ([28-29], F, Cancer), ([28-29], F, Obesity).

33 Example of top-down clustering algorithm (k=2). Original data: (20, M, HIV), (23, F, HIV), (25, M, Obesity), (27, F, HIV), (28, F, Cancer), (29, F, Obesity). Anonymized data: ([20-25], {M,F}, HIV), ([20-25], {M,F}, HIV), ([20-25], {M,F}, Obesity), ([27-29], F, HIV), ([27-29], F, Cancer), ([27-29], F, Obesity).
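The clustering idea behind these examples can be sketched with a greedy, size-based merge on a single numeric QID. This is a deliberately simplified sketch (not the exact algorithms on the slides): it merges each undersized cluster with the adjacent cluster that increases its span the least, then recodes each cluster to its [min, max] range. It assumes at least k values in total.

```python
def cluster_anonymize(values, k):
    """Greedy bottom-up clustering on one numeric QID: merge undersized
    clusters with the nearest neighbor until every cluster has >= k
    values, then report each cluster as (min, max, size)."""
    clusters = [[v] for v in sorted(values)]
    while any(len(c) < k for c in clusters):
        i = next(i for i, c in enumerate(clusters) if len(c) < k)
        if i == 0:
            j = 1
        elif i == len(clusters) - 1:
            j = i - 1
        else:
            # merge toward whichever neighbor widens the range less
            left_gap = clusters[i][0] - clusters[i - 1][0]
            right_gap = clusters[i + 1][-1] - clusters[i][-1]
            j = i - 1 if left_gap <= right_gap else i + 1
        a, b = min(i, j), max(i, j)
        clusters[a:b + 1] = [clusters[a] + clusters[b]]
    return [(c[0], c[-1], len(c)) for c in clusters]
```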

34 Preventing identity disclosure from diagnosis codes: Suppression. Suppression removes items or records from the data prior to releasing it, e.g., suppress ICD codes* appearing in less than a certain percentage of patient records. Intuition: such ICD codes can act as quasi-identifiers. * Vinterbo et al. Hiding information by cell suppression. AMIA Annual Symposium 2001.
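The suppression rule above (drop codes below a frequency threshold) can be sketched as follows; the function name and threshold are illustrative assumptions:

```python
from collections import Counter

def suppress_rare_codes(transactions, min_fraction):
    """Remove ICD codes appearing in fewer than min_fraction of the
    records; such rare codes can act as quasi-identifiers."""
    n = len(transactions)
    freq = Counter(code for t in transactions for code in set(t))
    keep = {c for c, f in freq.items() if f / n >= min_fraction}
    return [[c for c in t if c in keep] for t in transactions]

# Toy transactions; 250.0 appears in 1 of 4 records and is suppressed.
tx = [["401.1", "296.0"], ["401.1"], ["401.1", "250.0"], ["401.1", "296.0"]]
```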

35 Code suppression: a case study using Vanderbilt's EMR data. We had to suppress diagnosis codes appearing in less than 25% of the records in VNEC to prevent re-identification; doing so, we were left with only 5 out of ~6000 ICD codes!* At the 5-digit level, the surviving codes are benign essential hypertension, other malaise and fatigue, pain in limb, abdominal pain, and chest pain; at the 3-digit level these correspond to 401 (essential hypertension), 780, 729 (other disorders of soft tissues), 789 (other abdomen/pelvis symptoms), and 786; at the section level, to hypertensive disease, rheumatism excluding the back, and symptoms. * Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS 2010.

36 Preventing identity disclosure in EMR data: Generalization. Generalization replaces items with more general ones, usually with the help of a domain hierarchy (5-digit ICD codes → 3-digit ICD codes → sections → chapters → Any). Example: generalize ICD codes to their 3-digit representation, e.g., 401.1 (benign essential hypertension) → 401 (essential hypertension).
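The hierarchy climb from 5-digit to 3-digit codes is a simple string truncation in ICD-9's dotted notation (`CCC.xx` → `CCC`). A minimal sketch, with illustrative function names:

```python
def generalize_icd9(code):
    """Map a 5-digit ICD-9 code (e.g., '401.1') to its 3-digit
    category ('401') by dropping everything after the dot."""
    return code.split(".")[0]

def generalize_records(transactions):
    """Apply the 3-digit generalization to every transaction,
    deduplicating codes that collapse to the same category."""
    return [sorted({generalize_icd9(c) for c in t}) for t in transactions]
```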

37 Code generalization: a case study using Vanderbilt's EMR data. Generalizing ICD codes from VNEC* to their 3-digit representation, even combined with suppression (no suppression, 5%, 15%, and 25% levels tested): 95% of the patients remain re-identifiable. (Figure: % of re-identified sample vs. distinguishability on a log scale, for 5-digit and 3-digit ICD codes; 96.5% re-identifiable at the 5-digit level with no suppression.) * Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.

38 Complete k-anonymity & k^m-anonymity. Complete k-anonymity: knowing that an individual is associated with any itemset, an attacker should not be able to associate this individual with fewer than k transactions. k^m-anonymity: knowing that an individual is associated with any m-itemset, an attacker should not be able to associate this individual with fewer than k transactions. The slide illustrates both with an original transaction dataset and its 2-complete anonymous and 2^2-anonymous versions (in the latter, 5-digit codes are generalized to 401).
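A k^m-anonymity check enumerates every itemset of up to m codes that occurs in some transaction and verifies it is supported by at least k transactions. A brute-force sketch for small data (function name is an assumption; real algorithms avoid full enumeration):

```python
from collections import Counter
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    """True iff every itemset of size <= m occurring in any transaction
    is supported by at least k transactions (k^m-anonymity)."""
    support = Counter()
    for size in range(1, m + 1):
        for t in transactions:
            for combo in combinations(sorted(set(t)), size):
                support[combo] += 1
    return all(count >= k for count in support.values())

# Toy transactions: every 1- and 2-itemset has support >= 2.
tx = [["401.1", "250.0"], ["401.1", "250.0"],
      ["401.1", "296.0"], ["401.1", "296.0"]]
```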

39 Applicability of complete k-anonymity and k^m-anonymity to medical data. These models are limited in the specification of privacy requirements: they assume too powerful attackers (all m-itemsets, i.e., combinations of m diagnosis codes, need protection), whereas medical data publishers have detailed privacy requirements. They also explore only a small number of possible generalizations and do not take utility requirements into account. Example: if attackers know who is diagnosed with {a,b,c} or {d,e,f,g,h}, these models protect all 5-itemsets instead of just the 2 itemsets specified as privacy constraints.

40 Policy-based anonymization model. Policy-based anonymization for ICD codes*: a global anonymization model that captures both generalization and suppression. Each original ICD code is replaced by a unique set of ICD codes, so there is no need for generalization hierarchies. A generalized ICD code such as (493.00, ...) is interpreted as either code, or both; a suppressed ICD code, denoted Φ(...), is not released. * Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS 2010.

41 Policy-based anonymization: Privacy model. Data publishers specify the diagnosis codes that need protection. Privacy model: knowing that an individual is associated with one or more specific itemsets (privacy constraints), an attacker should not be able to associate this individual with fewer than k transactions. Privacy policy: the set of all specified privacy constraints. Privacy is achieved when all privacy constraints are supported by at least k transactions in the published data, or do not appear at all. Example: the generalized code (401.2, 401.4) appears in at least 2 transactions of the anonymized data.
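The "supported by at least k transactions or not at all" condition can be verified directly. A minimal sketch of the check (not the anonymization algorithm itself; names are illustrative):

```python
def policy_satisfied(transactions, privacy_constraints, k):
    """True iff each privacy constraint (an itemset of codes) is
    supported by >= k transactions or does not appear at all."""
    for constraint in privacy_constraints:
        needed = set(constraint)
        support = sum(1 for t in transactions if needed <= set(t))
        if 0 < support < k:  # appears, but too rarely: vulnerable
            return False
    return True

# Toy data: the constraint {401.2, 401.4} is supported by 2 transactions.
tx = [["401.1", "401.2", "401.4"], ["401.1"], ["401.2", "401.4"], ["401.3"]]
policy = [["401.2", "401.4"]]
```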

42 Policy-based anonymization: Data utility considerations. Utility constraints: the published data must remain as useful as the original data for conducting a GWAS on a disease or trait → the number of cases and controls in a GWAS must be preserved. Supporting utility constraints: ICD codes from the utility policy are generalized together; a larger part of the solution space is searched than when using domain generalization hierarchies.

43 Policy-based anonymization algorithms. Goal: anonymize medical records so that privacy is guaranteed, utility is high (→ many GWAS are supported simultaneously), and the incurred information loss is minimal. This is a challenging optimization problem: it is NP-hard, and feasibility depends on the constraints. Heuristic algorithms: Utility-Guided Anonymization of Clinical Profiles (UGACLIP), which favors efficiency and scalability, and the Clustering-Based Anonymization (CBA) algorithm, which favors utility.

44 Anonymization algorithms: UGACLIP. Input: a privacy policy, a utility policy, k=2, and the EMR data. The UGACLIP algorithm produces anonymized EMR data in which the data is protected ({296.00, ...} appears 2 times) and remains useful for a GWAS on bipolar disorder, since the associations between (296.00, ...) and the DNA region CT...A are preserved.

45 Anonymization algorithms: CBA. Sketch of CBA: retrieve the ICD codes that need less protection from the privacy policy; gradually build a cluster of codes that can be anonymized according to the utility policy with minimal utility loss (UL); if the ICD codes are still not protected, suppress no more ICD codes than required to protect privacy, until the privacy requirements are met. Example: privacy constraints p1 = {i1}, p2 = {i5, i6} with k=3; singleton clusters are merged, driven by UL.

46 Case Study: EMRs from Vanderbilt University Medical Center. Datasets: VNEC (2762 de-identified EMRs from Vanderbilt involved in a GWAS); VNECkc (a subset of VNEC for which we know which diseases are controls for others); BIOVU (all 79087 de-identified EMRs from Vanderbilt's biobank, the largest dataset in the medical data privacy literature*). Methods: UGACLIP and CBA, compared against ACLIP (a state-of-the-art method that does not take the utility policy into account). * Loukides, Gkoulalas-Divanis. Utility-aware anonymization of diagnosis codes. IEEE TITB.

47 UGACLIP & CBA: First algorithms to offer data utility in GWAS. Setting: k = 5, protecting single visits of patients, 18 GWAS-related diseases (all GWAS reported in Manolio*), no utility constraints. Results: the output of ACLIP (the best competitor) is useless for validating GWAS, whereas UGACLIP preserves 11 out of 18 GWAS and CBA preserves 14 out of 18 GWAS simultaneously. * Manolio et al. A HapMap harvest of insights into the genetics of common disease. J. Clinic. Inv.

48 Utility beyond GWAS. Supporting clinical case counts in addition to GWAS: learn the number of patients with sets of codes appearing in 10% of the records, which is useful for epidemiology and data mining applications. On VNECkc, queries can be estimated accurately (Average Relative Error < 1.25), comparably to ACLIP. Anonymized data can thus support both GWAS and studies on clinical case counts.

49 Anonymizing BIOVU (79K EMRs). Supporting clinical case counts in BIOVU: very low error in query answering (Average Relative Error < 1). All EMRs in the VUMC biobank can be anonymized and remain useful.

50 Anonymization of RT-datasets (e.g., demographics + diagnosis codes). Privacy threat: attackers know some relational attribute values (e.g., demographics) plus some sensitive items (e.g., diagnoses) for an individual.* * G. Poulis, G. Loukides, A. Gkoulalas-Divanis, S. Skiadopoulos. Anonymizing data with relational and transactional attributes. PKDD.

51 Focus Application II: Identification of Privacy Vulnerabilities How can we analyze datasets to discover privacy vulnerabilities and select the best protection mechanism?

52 Privacy-preserving data publishing. Example dataset: direct identifiers (ID, Name, Address, SSN), quasi-identifiers (Birth, Gender, ZIP, Marital status), and sensitive/other info (A1, A2), with records for Maria, Jenny, Nick, Tom, John, Bob, Noeleen, Eleni, Dave, and Thomas.

53 Identification of Privacy Vulnerabilities (IPV) Privacy-preserving data publishing Goal: Automatically analyze a dataset to expose privacy vulnerabilities, which could lead to privacy attacks, and to validate its protection level Methods: Identify existing, publicly available datasets that could be used by adversaries to perform triangulation attacks Identify privacy risks in the data itself (e.g., outliers, uniqueness or rarity of certain records based on some attributes, frequent behaviour of an individual that is infrequent for many others, etc.) Outcome: A list of identified privacy vulnerabilities and configuration options for the data anonymization algorithm 53

54 Discover QIDs based on publicly available datasets. Considered publicly available datasets: (Name, Birth, ZIP, Disease) with John, Bob, Noeleen; (Name, Gender, ZIP, Salary) with Maria, Jenny, Nick; (Name, ZIP, Marital status) with Nick, Tom, John. Dataset to be published: (ID, Birth, Gender, ZIP, Marital status), records 0 (09/64, Female, Divorced), 1 (09/64, Female, Divorced), 2 (04/64, Male, Widow), 3 (04/64, Male, Married), 4 (03/63, Male, Married), 5 (03/63, Male, Married), 6 (09/64, Female, Married), 7 (09/61, Female, Married), 8 (05/61, Male, Single), 9 (05/61, Male, Single). Open questions: Have we found all the existing publicly available datasets that could be linked to our data? What if a dataset becomes available after the data has been published and can be linked to the published dataset, exposing the identity of individuals?

55 Discover privacy vulnerabilities in the data itself. Discovery of privacy vulnerabilities: What could lead to re-identification attacks in my data, assuming that sufficient background knowledge and/or public datasets are available to attackers? What is unique or rare in a record? Are there any outliers in my data? Quantification of privacy risk: How many vulnerabilities exist in my dataset? How powerful should an attacker be in order to re-identify the most vulnerable individual in the data? How many individuals can be re-identified?

56 Identification of Privacy Vulnerabilities. Our recent algorithms (all part of the IPV tool): MTUI (Multi-Threaded Uniques Identification) & FPVI (Fast algorithm for Privacy Vulnerabilities Identification) compute the quasi-identifiers of a dataset to protect the data from re-identification attacks, exposing all minimal combinations of attributes that lead to unique (or rare) records. MTRA (Multi-Threaded Risk Assessment) calculates the vulnerability index for each combination of attributes in a dataset, by reporting the cardinality of the smallest group of records that share the same values for each combination. MTS2 (Multi-Threaded Sample uniques identification algorithm) identifies the specific records that are unique, along with the combination of attributes for which they are unique (or rare) in the dataset. MTLVI (Multi-Threaded Location-based Vulnerabilities Identification) identifies and reports a series of privacy vulnerabilities in the context of location/mobility data (user trajectories), such as sensitive sequences of locations (or location-time pairs) for a user, sensitive user itineraries, sensitive places-of-interest (POIs), infrequent user movement behavior, etc.

57 Example: Identification of Privacy Vulnerabilities. The search space for discovering privacy vulnerabilities in a relational/transaction dataset is the lattice of attribute combinations. Discovering quasi-identifiers. Goal: find all minimal combinations of attributes in the dataset that lead to at most k individuals. FPVI algorithm output: {M → 2}, {BZ → 1}, {GZ → 0}, where the number corresponds to the first entry at which a unique was found. Discovering vulnerability indexes. Goal: find the vulnerability index for each combination of attributes. MTRA algorithm output: {B → 2}, {G → 4}, {Z → 2}, {M → 1}, {BG → 2}, ..., where the number corresponds to the vulnerability index of the corresponding combination.
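The goal these algorithms pursue can be illustrated with a brute-force lattice walk: enumerate attribute combinations by size and keep only the minimal ones under which some record is shared by at most k individuals. This is an illustrative sketch of the problem, not the MTUI/FPVI algorithms themselves (which prune the lattice far more aggressively); data and names below are toy assumptions.

```python
from collections import Counter
from itertools import combinations

def minimal_unique_combinations(table, attrs, k=1):
    """Find minimal attribute combinations under which some record is
    shared by at most k individuals (a brute-force lattice walk)."""
    found = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            # skip supersets of an already-found combination: not minimal
            if any(set(f) <= set(combo) for f in found):
                continue
            groups = Counter(tuple(r[a] for a in combo) for r in table)
            if min(groups.values()) <= k:
                found.append(combo)
    return found

# Toy table: B(irth), G(ender), Z(IP), M(arital status); only the
# Marital-status value "Widow" is unique on its own.
table = [
    {"B": "09/64", "G": "F", "Z": "10001", "M": "Divorced"},
    {"B": "09/64", "G": "F", "Z": "10001", "M": "Divorced"},
    {"B": "04/64", "G": "M", "Z": "10002", "M": "Widow"},
    {"B": "04/64", "G": "M", "Z": "10002", "M": "Married"},
]
```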

58 Scalability of the MTUI and MTRA approaches. Scalability on datasets with various characteristics (records: 165K to 11M; attributes: 9 to 50): MTUI significantly outperforms the state-of-the-art methods on all tested datasets and with all tested parameters. It can also analyze datasets that the other algorithms are unable to process (see (g)-(h)).

59 Scalability of the FPVI algorithm. Scalability on datasets with various characteristics (records: 165K to 11M; attributes: 9 to 50): FPVI significantly outperforms the state-of-the-art methods and MTUI on all tested datasets, and can scale to datasets of millions of records and tens of attributes without requiring excessive time.

60 Risk-Utility Confidentiality Maps. Identifying a good trade-off between data privacy and data utility: an R-U confidentiality map tracks the trade-off between disclosure risk and utility.* The map ranges from no publishing (lowest risk, lowest utility) to original data publishing (highest risk, highest utility); the data publisher decides the desired trade-off, subject to a minimum level of protection required and a minimum level of acceptable utility. It is an intuitive tool that allows comparing different anonymizations. * Duncan et al. Disclosure Risk vs. Data Utility: The R-U Confidentiality Map. Tech. Rep. LA-UR, Los Alamos National Laboratory.

61 Risk-Utility Confidentiality Maps Selecting data anonymizations with a desired trade-off. The R-U confidentiality map allows data publishers to compare data anonymizations and enables the selection of the desired one. [Plot: solutions of PCTA (points d) and other candidate anonymizations (points c) lie between "no publishing" (a) and "original data publishing" (b); the anonymization with the best utility/privacy trade-off is selected.] Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, 2011. 61
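The selection step on the R-U map can be sketched as a simple constrained maximization: keep only candidates inside the acceptable region (risk below the required protection threshold, utility above the acceptable minimum) and pick the one with the highest utility. The candidate names and (risk, utility) scores below are invented for illustration; this is not how PCTA or any particular tool scores its solutions:

```python
def best_anonymization(candidates, max_risk, min_utility):
    """Among candidate anonymizations, return the one with the highest
    utility whose disclosure risk stays inside the acceptable region of
    the R-U confidentiality map. Candidates are (name, risk, utility)
    tuples with risk and utility scaled to [0, 1]."""
    feasible = [c for c in candidates
                if c[1] <= max_risk and c[2] >= min_utility]
    if not feasible:
        return None  # only "no publishing" satisfies the constraints
    return max(feasible, key=lambda c: c[2])

# Hypothetical candidates: publishing as-is, two k-anonymizations,
# and heavy suppression
candidates = [
    ("original", 0.95, 1.00),
    ("k=5",      0.30, 0.80),
    ("k=25",     0.10, 0.55),
    ("suppress", 0.02, 0.20),
]
print(best_anonymization(candidates, max_risk=0.35, min_utility=0.40))
# → ('k=5', 0.3, 0.8)
```

Tightening `max_risk` below 0.3 would shift the choice to the stronger "k=25" anonymization, which is exactly the trade-off the map makes visible.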


Find the signal in the noise

Find the signal in the noise Find the signal in the noise Electronic Health Records: The challenge The adoption of Electronic Health Records (EHRs) in the USA is rapidly increasing, due to the Health Information Technology and Clinical

More information

Li Xiong, Emory University

Li Xiong, Emory University Healthcare Industry Skills Innovation Award Proposal Hippocratic Database Technology Li Xiong, Emory University I propose to design and develop a course focused on the values and principles of the Hippocratic

More information

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining

More information

Linking Hospitalizations and Death Certificates across Minnesota Hospitals

Linking Hospitalizations and Death Certificates across Minnesota Hospitals Linking Hospitalizations and Death Certificates across Minnesota Hospitals AcademyHealth, Baltimore, June 2013 JMNaessens,ScD; SMPeterson; MBPine,MD; JSchindler; MSonneborn; JRoland; ASRahman; MGJohnson;

More information

future proof data privacy

future proof data privacy 2809 Telegraph Avenue, Suite 206 Berkeley, California 94705 leapyear.io future proof data privacy Copyright 2015 LeapYear Technologies, Inc. All rights reserved. This document does not provide you with

More information

Faculty Group Practice Patient Demographic Form

Faculty Group Practice Patient Demographic Form Name (Last, First, MI) Faculty Group Practice Patient Demographic Form Today s Date Patient Information Street Address City State Zip Home Phone Work Phone Cell Phone ( ) Preferred ( ) Preferred ( ) Preferred

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information