Privacy Challenges and Solutions for Data Sharing. Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland


1 Privacy Challenges and Solutions for Data Sharing. Aris Gkoulalas-Divanis, Smarter Cities Technology Centre, IBM Research, Ireland. June 8, 2015

2 Content
- Introduction / Motivation for data privacy
- Focus application: Privacy-Preserving Medical Data Publishing
- Electronic Medical Records (EMR) and their use in research
- Privacy threats and their effectiveness
- Privacy models for relational data (demographics)
- Privacy models for transaction/set-valued data (diagnosis codes)
- Policy-based anonymization model
- Rule-based anonymization model
- Anonymization of RT (relational-transaction) datasets
- SECRETA anonymization toolkit
- Summary

3 Data sharing. Individuals' data are increasingly shared: Netflix published movie ratings of 500K subscribers; AOL published 20M search query terms of 658K web users; TomTom sold customers' location (GPS) data to the Dutch police; the eMERGE consortium published patient data related to genome-wide association studies to biorepositories (dbGaP); Orange provided call information about its mobile subscribers as part of the D4D challenge on mobile phone data. Benefits of data sharing: personalization (e.g., Netflix's data mining contest aimed to improve movie recommendation based on personal preferences); marketing (e.g., Tesco made 53M from selling shopping patterns to retailers and manufacturers, such as Nestle and Unilever, last year); social benefits (e.g., promoting medical research studies, improving traffic management).

4 Data sharing must guarantee privacy and accommodate utility. A popular data sharing scenario (data publishing): data owners → original data → data publisher (trusted) → released data → data recipient (untrusted). Threats to data privacy: identity disclosure, sensitive information disclosure, membership disclosure, inferential disclosure. Data utility requirements: minimal data distortion (general-purpose use); support of specific applications / workloads (e.g., building accurate predictive models, GWAS).

5 Privacy Preserving Medical Data Publishing. How can we share medical data in a way that protects patients' privacy while supporting research studies?

6 Electronic Medical Records (EMR). Relational data: registration and demographic data. Transaction (set-valued) data: billing information; ICD codes* are represented as numbers (up to 5 digits) and denote signs, findings, and causes of injury or disease**. Sequential data: DNA. Text data: clinical notes. [Table: example EMR with columns Name, YOB, ICD, DNA and Clinical notes, holding six records (doc1 to doc6) for Jim, Mary, Carol and Anne.] * International Statistical Classification of Diseases and Related Health Problems. ** Centers for Medicare & Medicaid Services.

7 EMR data use in analytics. Statistical analysis: correlation between YOB and ICD code 185 (malignant neoplasm of prostate). Querying. Clustering: control epidemics*. Classification: predict domestic violence**. Association rule mining: formulate a government policy on hypertension management***, e.g., IF age in [43,48] AND smoke = yes AND exercise = no AND drink = yes THEN hypertension = yes (sup = 2.9%; conf = 26%). * Tildesley et al. Impact of spatial clustering on disease transmission and optimal control. PNAS. ** Reis et al. Longitudinal Histories as Predictors of Future Diagnoses of Domestic Abuse: Modelling Study. BMJ: British Medical Journal, 2011. *** Chae et al. Data mining approach to policy analysis in a health insurance domain. Int. J. of Med. Inf.

8 Need for privacy. Why do we need privacy in medical data sharing? If privacy is breached, there are consequences to patients: emotional and economic embarrassment. 62% of individuals worry their EMRs will not remain confidential*. 35% expressed privacy concerns regarding the publishing of their data to dbGaP**. Patients opt out or provide fake data → it becomes difficult to conduct statistically powered studies. * Health Confidence Survey 2008, Employee Benefit Research Institute. ** Ludman et al. Glad You Asked: Participants' Opinions of Re-Consent for dbGaP Data Submission. Journal of Empirical Research on Human Research Ethics.

9 Need for privacy. If privacy is breached, there are consequences to organizations. Legal: HIPAA, EU legislation (95/46/EC, 2002/58/EC, 2009/136/EC, etc.). Financial: the average cost of a single data breach is $5.85M in the US and $4.74M in Germany; these countries have the highest per capita cost*. Healthcare and Education are the most heavily regulated industries. * Ponemon Institute Research Report. 2014 Cost of Data Breach Study: Global Analysis.

10 Protecting data privacy: data masking / removal of identifiers. Removing / masking direct identifiers: data owners → original data → data publisher (trusted) → de-identified data → data recipient (untrusted). 1. Locate the direct identifiers (attributes that uniquely identify an individual), such as SSN, patient ID, phone number, etc. 2. Remove or mask them from the data prior to data publishing.
Name           Search query terms
John Doe       Harry Potter, King's speech
Thelma Arnold  Hand tremors, bipolar, dry mouth, effect of nicotine on the body

11 Protecting data privacy: data masking / removal of identifiers. Masking or removal of direct identifiers is not sufficient! The data recipient (untrusted) can combine the released data with external data and background knowledge. Main types of threats to data privacy: identity disclosure, sensitive information disclosure, inferential disclosure.

12 Privacy Threats: Identity disclosure. Identity disclosure in relational data (e.g., patients' demographics): individuals are linked to their published records based on quasi-identifiers (attributes that in combination can identify an individual).
De-identified data:
Age  Postcode  Sex
20   NW10      M
45   NW15      M
22   NW30      M
50   NW25      F
External data:
Name  Age  Postcode  Sex
Greg  20   NW10      M
Jim   45   NW15      M
Jack  22   NW30      M
Anne  50   NW25      F
87% of US citizens can be identified by gender, DOB and 5-digit ZIP code*.
* Sweeney. k-anonymity: a model for protecting privacy. IJUFKS.
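
The linkage described on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the records are the four rows shown on the slide, while the function name `link` and the dictionary layout are assumptions made for the example.

```python
# De-identified release: direct identifiers removed, quasi-identifiers kept.
released = [
    {"age": 20, "postcode": "NW10", "sex": "M"},
    {"age": 45, "postcode": "NW15", "sex": "M"},
    {"age": 22, "postcode": "NW30", "sex": "M"},
    {"age": 50, "postcode": "NW25", "sex": "F"},
]

# External data (e.g., a public list) that still carries names.
external = [
    {"name": "Greg", "age": 20, "postcode": "NW10", "sex": "M"},
    {"name": "Jim", "age": 45, "postcode": "NW15", "sex": "M"},
    {"name": "Jack", "age": 22, "postcode": "NW30", "sex": "M"},
    {"name": "Anne", "age": 50, "postcode": "NW25", "sex": "F"},
]

QIDS = ("age", "postcode", "sex")


def link(released, external, qids):
    """For each released record, list the external names with the same QID values."""
    index = {}
    for person in external:
        key = tuple(person[q] for q in qids)
        index.setdefault(key, []).append(person["name"])
    return [index.get(tuple(r[q] for q in qids), []) for r in released]


matches = link(released, external, QIDS)
# A released record is re-identified when exactly one name matches it.
reidentified = [names[0] for names in matches if len(names) == 1]
print(reidentified)
```

Every released record matches exactly one name here, which is why removing direct identifiers alone does not prevent identity disclosure.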

13 Identity disclosure in sharing patients' diagnosis codes. Identity disclosure in transaction data (e.g., diagnosis codes). [Tables: identified EMR data (ID, ICD) for Jim, Mary and Anne; released EMR data (ICD, DNA).] If an attacker knows that Mary is diagnosed with benign essential hypertension (ICD code 401.1) and only the second released record contains that code, the second record belongs to her → all her diagnosis codes are revealed. Disclosure based on diagnosis codes* → a general problem for other medical terminologies (e.g., ICD-10 used in the EU) → sharing data susceptible to the attack is against legislation. * Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.

14 Real-world identity disclosure cases involving medical data. Group Insurance Commission data + voter list of Cambridge, MA → William Weld, former Governor of MA. Chicago homicide database + Social Security death index → 35% of murder victims. Adverse Drug Reaction Database + public obituaries → 26-year-old girl who died from a drug.

15 Issuing attacks on medical datasets. Two-step attack using publicly available voter registration lists and hospital discharge summaries: voter(name, ..., zip, dob, sex); summary(zip, dob, sex, diagnoses); release(diagnoses, DNA). 87% of US citizens can be identified by {dob, gender, ZIP code}*. Voter list & discharge summary → privacy breach. * Sweeney. k-anonymity: a model for protecting privacy. IJUFKS.

16 Issuing attacks on medical datasets. One-step attack using EMRs* (insider's attack): EMR(name, ..., diagnoses); release(..., diagnoses, DNA). * Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.

17 Case study: Evaluating the effectiveness of the insider's attack. De-identified / masked EMR population from VUMC: 1.2M records (patients) from Vanderbilt, each with a unique random number for ID; de-identified EMR(ID, ..., diagnoses). VNEC de-identified / masked EMR sample: VNEC(..., diagnoses, DNA); 2762 records (patients) derived from the population. Patients from VNEC were involved in a study (GWAS) for the Native Electrical Conduction of the heart. Patients' EMRs were to be deposited into dbGaP. Data would be made available to support other studies (GWAS).

18 Case study: Evaluating the effectiveness of the insider's attack. Vanderbilt's EMR - VNEC dataset linkage on ICD codes. We assume that all ICD codes are used to issue an attack: 96.5% of patients are susceptible to identity disclosure. [Figure: % of re-identified sample vs. distinguishability (log scale), i.e., the number of times a set of ICD codes appears in the population (support count in the data mining literature).]
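
A minimal sketch of how distinguishability (the support count of a set of ICD codes in the population) could be computed. The population below is hypothetical toy data, not the Vanderbilt data, and the function name is an assumption for illustration.

```python
def distinguishability(population, codes):
    """Number of population records that contain every code in `codes`."""
    codes = set(codes)
    return sum(1 for patient_codes in population if codes <= set(patient_codes))


# Hypothetical population: one set of ICD codes per patient.
population = [
    {"401.1", "296.00"},
    {"401.1"},
    {"250.00", "401.1"},
    {"296.00"},
]

common = distinguishability(population, {"401.1"})
unique = distinguishability(population, {"401.1", "296.00"})
print(common, unique)
```

A sample record whose code set has distinguishability 1 maps to a single patient in the population, so it is uniquely re-identifiable.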

19 Case study: Evaluating the effectiveness of the insider's attack. Vanderbilt's EMR - VNEC dataset linkage on ICD codes, when a random subset of ICD codes can be used in the attack. Knowing a random combination of 2 ICD codes can lead to unique re-identification. [Figure: % of re-identifiable sample vs. distinguishability (log scale) for combinations of 1, 2, 3 and 10 ICD codes; distinguishability is the number of times a set of ICD codes appears in the population (equivalent to support count in the data mining literature).]

20 Case study: Evaluating the effectiveness of the insider's attack. VNEC dataset linkage on ICD codes using hospital discharge records. A discharge record holds all ICD codes associated with a patient for a single visit; it is difficult to know ICD codes that span visits when public discharge summaries are used. 46% of patients in VNEC are uniquely re-identifiable. [Figure axis: number of times a set of ICD codes appears in VNEC (support count in the data mining literature).]

21 Privacy Threats: Sensitive information disclosure. Individuals are associated with sensitive information. [Tables: background knowledge (Name Greg, Age 20, Postcode NW10, Sex M); de-identified data in which all four records with Age 20 have Disease = HIV, where Disease is the sensitive attribute (SA); identified EMR data (ID, ICD) for Jim and Mary; released EMR data (ID, ICD, DNA).] Mary is diagnosed with ... and 401.1 → she has schizophrenia. Sensitive information disclosure can occur without identity disclosure.

22 Sensitive information disclosure in Netflix movie rating sharing. 100M dated ratings from 480K users to 18K movies; data mining contest ($1M prize) to improve movie recommendation based on personal preferences. Movies reveal political, religious, and sexual beliefs and need protection according to the Video Privacy Protection Act. Anonymization applied: de-identification; sampling, date modification, rate suppression; movie title and year published in full. Researchers inferred movie rates of subscribers*: data are linked with IMDB w.r.t. ratings and/or dates. A lawsuit was filed and Netflix settled the lawsuit: "We will find new ways to collaborate with researchers". * Narayanan et al. Robust De-anonymization of Large Sparse Datasets. IEEE Symposium on Security and Privacy.

23 Privacy Threats: Inferential disclosure. Sensitive knowledge patterns are exposed by data mining, e.g., "75% of patients visit the same physician more than 4 times" or "60% of the white males > 50 suffer from diabetes". Data sources: stream data collected by health monitoring systems, electronic medical records, drug orders & costs. Possible harms: unsolicited advertisement, customer discrimination. Business rivals can harm data publishers, and insurance, pharmaceutical & marketing companies can harm data owners*. * G. Das and N. Zhang. Privacy risks in health databases from aggregate disclosure. PETRA.

24 Anonymization of demographics: k-anonymity principle*. Each record in a relational table T should have the same value over quasi-identifiers with at least k-1 other records in T. These records collectively form a k-anonymous group. k-anonymity protects from identity disclosure: it protects data from linkage to external sources (triangulation attacks), and the probability that an individual is correctly associated with their record is at most 1/k.
External data:
Name  Age  Postcode  Sex
Greg  40   NW10      M
Jim   45   NW15      M
Jack  22   NW30      M
Anne  50   NW25      F
2-anonymous data:
Age  Postcode  Sex
4*   NW1*      M
4*   NW1*      M
*    NW*       *
*    NW*       *
* Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. IJUFKS.
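
As a sketch of the principle, the following Python check groups records by their quasi-identifier values and verifies that every group has at least k members. The tables are the ones from the slide; the function name `is_k_anonymous` is an assumption made for this example.

```python
from collections import Counter


def is_k_anonymous(table, qids, k):
    """True if every combination of QID values occurs in at least k records."""
    groups = Counter(tuple(row[q] for q in qids) for row in table)
    return all(size >= k for size in groups.values())


QIDS = ("age", "postcode", "sex")

original = [
    {"age": "40", "postcode": "NW10", "sex": "M"},
    {"age": "45", "postcode": "NW15", "sex": "M"},
    {"age": "22", "postcode": "NW30", "sex": "M"},
    {"age": "50", "postcode": "NW25", "sex": "F"},
]

generalized = [
    {"age": "4*", "postcode": "NW1*", "sex": "M"},
    {"age": "4*", "postcode": "NW1*", "sex": "M"},
    {"age": "*", "postcode": "NW*", "sex": "*"},
    {"age": "*", "postcode": "NW*", "sex": "*"},
]

print(is_k_anonymous(original, QIDS, 2))     # False: every record is unique
print(is_k_anonymous(generalized, QIDS, 2))  # True: two groups of size 2
```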

25 Anonymization of demographics: k-anonymity. Pros: a baseline model; intuitive; has been implemented in many real-world systems; follows & impacts privacy legislation. Cons: does not protect against sensitive information disclosure; requires the data owner to specify the quasi-identifiers (QIDs) and the k-value.

26 Attack on k-anonymous data: homogeneity attack*. All sensitive values in a k-anonymous group are the same → sensitive information disclosure.
External data:
Name  Age  Postcode
Greg  40   NW10
2-anonymous data:
Age  Postcode  Disease
4*   NW1*      HIV
4*   NW1*      HIV
5*   NW*       Ovarian Cancer
5*   NW*       Flu
The attacker is confident that Greg suffers from HIV.
* Machanavajjhala et al. l-diversity: Privacy Beyond k-anonymity. ICDE.

27 l-diversity principle for demographics. l-diversity*: a relational table is l-diverse if all groups of records with the same values over quasi-identifiers (QID groups) contain no fewer than l "well-represented" values for the sensitive attribute (SA). Distinct l-diversity: l "well-represented" → l distinct.
6-anonymous group:
Age  Postcode  Disease
4*   NW1*      HIV
4*   NW1*      HIV
4*   NW1*      HIV
4*   NW1*      HIV
4*   NW1*      Flu
4*   NW1*      Cancer
Three distinct values, but the probability of HIV being disclosed is ~
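
The weakness of distinct l-diversity shown on this slide can be checked with a short Python sketch. The six-record group is taken from the slide; the helper names are assumptions, and the "disclosure probability" is simply the relative frequency of the most common sensitive value within a QID group.

```python
from collections import Counter, defaultdict


def sa_distributions(table, qids, sa):
    """Count sensitive-attribute values per QID group."""
    groups = defaultdict(Counter)
    for row in table:
        groups[tuple(row[q] for q in qids)][row[sa]] += 1
    return groups


def is_distinct_l_diverse(table, qids, sa, l):
    """True if every QID group holds at least l distinct sensitive values."""
    return all(len(c) >= l for c in sa_distributions(table, qids, sa).values())


def max_disclosure_probability(table, qids, sa):
    """Largest relative frequency of a single sensitive value in any QID group."""
    return max(
        max(c.values()) / sum(c.values())
        for c in sa_distributions(table, qids, sa).values()
    )


QIDS = ("age", "postcode")
group = [
    {"age": "4*", "postcode": "NW1*", "disease": d}
    for d in ["HIV", "HIV", "HIV", "HIV", "Flu", "Cancer"]
]

print(is_distinct_l_diverse(group, QIDS, "disease", 3))   # True
print(max_disclosure_probability(group, QIDS, "disease"))  # 4 of 6 records are HIV
```

The group passes the distinct 3-diversity test even though four of its six records share the same disease, which is the motivation for the stronger models on the next slide.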

28 Further improvements over l-diversity. Sensitive values may not need the same level of protection → (a,k)-anonymity [1]. l-diversity is difficult to achieve when the SA values are skewed → t-closeness [2]. l-diversity does not consider semantic similarity of SA values → (e,m)-anonymity [3], range diversity [4]. Can patients decide the level of protection for their SA values? → Personalized privacy [5]. [1] Wong et al. (alpha, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. KDD. [2] Li et al. t-closeness: Privacy Beyond k-anonymity and l-diversity. ICDE. [3] Li et al. Preservation of proximity privacy in publishing numerical sensitive data. SIGMOD. [4] Loukides et al. Preventing range disclosure in k-anonymised data. Expert Syst. Appl. [5] Xiao et al. Personalized privacy preservation. SIGMOD.

29 Partition-based algorithms for k-anonymity. Main idea of partition-based algorithms: a record projected over QIDs is treated as a multidimensional point; a subspace (hyper-rectangle) that contains at least k points can form a k-anonymous group → multidimensional global recoding.
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
How to partition the space? One attribute at a time: which to use? How to split the selected attribute?

30 Mondrian algorithm
Mondrian(D, k)*
  Attribute selection: find the QID attribute Q with the largest domain
  Attribute split: find the median µ of Q; create subspace S with all records of D whose value in Q is less than µ; create subspace S' with all records of D whose value in Q is at least µ
  Recursive execution: if |S| ≥ k and |S'| ≥ k, return Mondrian(S, k) ∪ Mondrian(S', k); else return D
* LeFevre et al. Mondrian multidimensional k-anonymity. ICDE.

31 Example of applying Mondrian (k=2)
Original data:
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
Anonymized data:
Age      Sex    Disease
[20-26]  {M,F}  HIV
[20-26]  {M,F}  HIV
[20-26]  {M,F}  Obesity
[27-29]  F      HIV
[27-29]  F      Cancer
[27-29]  F      Obesity
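
A minimal Python sketch of the simplified Mondrian steps from the previous slide, run on the six records of this example. It is not the authors' implementation: it assumes "largest domain" means the most distinct values in the current partition, uses `statistics.median_high` so that the split value is always an actual data value, and recurses only when both halves hold at least k records.

```python
from statistics import median_high


def mondrian(records, qids, k):
    """Return a list of groups (lists of records), each with at least k records."""
    # Attribute selection: the QID with the most distinct values in this partition.
    q = max(qids, key=lambda a: len({r[a] for r in records}))
    # Attribute split around the median of the selected attribute.
    mu = median_high([r[q] for r in records])
    left = [r for r in records if r[q] < mu]
    right = [r for r in records if r[q] >= mu]
    # Recursive execution only if both subspaces can form k-anonymous groups.
    if len(left) >= k and len(right) >= k:
        return mondrian(left, qids, k) + mondrian(right, qids, k)
    return [records]


records = [
    {"age": 20, "sex": "M", "disease": "HIV"},
    {"age": 23, "sex": "F", "disease": "HIV"},
    {"age": 25, "sex": "M", "disease": "Obesity"},
    {"age": 27, "sex": "F", "disease": "HIV"},
    {"age": 28, "sex": "F", "disease": "Cancer"},
    {"age": 29, "sex": "F", "disease": "Obesity"},
]

groups = mondrian(records, ["age", "sex"], k=2)
for group in groups:
    ages = [r["age"] for r in group]
    sexes = sorted({r["sex"] for r in group})
    print(f"[{min(ages)}-{max(ages)}]", sexes, [r["disease"] for r in group])
```

The sketch yields the same two groups of three records as the slide. It reports each age range from the observed minimum and maximum, whereas the slide labels the first group with the split boundary.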

32 Other works on partition-based algorithms. R-tree based algorithm [1]. Optimized partitioning for intended tasks [2]: classification, regression, query answering. Algorithms for disk-resident data [3]. Extensions to prevent sensitive information disclosure [4]. [1] Iwuchukwu et al. K-anonymization as spatial indexing: toward scalable and incremental anonymization. VLDB. [2] LeFevre et al. Workload-aware anonymization. KDD. [3] LeFevre et al. Workload-aware anonymization techniques for large-scale datasets. TODS. [4] Loukides et al. Preventing range disclosure in k-anonymised data. Expert Syst. Appl.

33 Clustering-based anonymization algorithms. Main idea of clustering-based anonymization: 1. Create clusters containing at least k records with similar values over QIDs (seed selection, similarity measurement, stopping criterion). 2. Anonymize records in each cluster separately (local recoding and/or suppression).

34 Clustering-based anonymization algorithms. Clusters need to be separated → seed selection: furthest-first or random. Clusters need to contain similar values → similarity measurement. Clusters should not be too large → stopping criterion: size-based or quality-based. All these heuristics attempt to improve data utility.

35 Bottom-up clustering algorithm*
  Each record is selected as a seed to start a cluster
  While there exists a group G s.t. |G| < k
    For each group G s.t. |G| < k
      Find the group G' s.t. NCP(G ∪ G') is min. and merge G and G'
    For each group G s.t. |G| > 2k
      Split G into ⌊|G|/k⌋ groups s.t. each group has at least k records
  Generalize the QID values in each group
  Return all groups
* Xu et al. Utility-Based Anonymization Using Local Recoding. KDD.

36 Example of bottom-up clustering algorithm (k=2)
Original data:
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
Anonymized data:
Age      Sex  Disease
[20-25]  M    HIV
[20-25]  M    Obesity
[23-27]  F    HIV
[23-27]  F    HIV
[28-29]  F    Cancer
[28-29]  F    Obesity

37 Example of top-down clustering algorithm (k=2)
Original data:
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
Anonymized data:
Age      Sex    Disease
[20-25]  {M,F}  HIV
[20-25]  {M,F}  HIV
[20-25]  {M,F}  Obesity
[27-29]  F      HIV
[27-29]  F      Cancer
[27-29]  F      Obesity

38 Other works on clustering-based anonymization Constant factor approximation algorithms* Publish only the cluster centers along with radius information Combine partitioning with clustering for efficiency** * Aggarwal et al. Achieving anonymity via clustering. ACM Trans. on Algorithms, ** Loukides et al. Preventing range disclosure in k-anonymised data. Expert Syst. Appl

39 Preventing identity disclosure from diagnosis codes: Suppression. Suppression removes items or records from data prior to releasing the data. Suppress ICD codes* appearing in fewer than a certain percentage of patient records. Intuition: such ICD codes can act as quasi-identifiers. [Tables: identified EMR data (ID, ICD) for Mary and Anne; released EMR data (ICD, DNA).] * Vinterbo et al. Hiding information by cell suppression. AMIA Annual Symposium '01.
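
A small Python sketch of this kind of frequency-based code suppression. The records and the 50% threshold are hypothetical values chosen for illustration, and `suppress_rare_codes` is an assumed helper name, not a function from the cited work.

```python
from collections import Counter


def suppress_rare_codes(records, min_fraction):
    """Drop every code that appears in fewer than min_fraction of the records."""
    counts = Counter(code for codes in records for code in set(codes))
    threshold = min_fraction * len(records)
    keep = {code for code, n in counts.items() if n >= threshold}
    return [[code for code in codes if code in keep] for codes in records]


# Hypothetical diagnosis-code lists, one per patient.
records = [
    ["401.1", "780.79"],
    ["401.1"],
    ["401.1", "729.5"],
    ["789.00"],
]

released = suppress_rare_codes(records, min_fraction=0.5)
print(released)
```

Only the code present in at least half of the records survives. The last patient is left with no codes at all, which hints at the utility loss reported in the case study that follows.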

40 Code suppression: a case study using Vanderbilt's EMR data. We had to suppress diagnosis codes appearing in fewer than 25% of the records in VNEC to prevent re-identification; doing so, we were left with only 5 out of ~6000 ICD codes!* [Table: the remaining 5-digit ICD-9 codes (benign essential hypertension, other malaise and fatigue, pain in limb, abdominal pain, chest pain) with their 3-digit ICD-9 codes and ICD-9 sections.] * Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS.

41 Preventing identity disclosure in EMR data: Generalization. Generalization replaces items with more general ones (usually with the help of a domain hierarchy): Any → Chapters → Sections → 3-digit ICD codes → 5-digit ICD codes. Example: generalize ICD codes to their 3-digit representation, e.g., benign essential hypertension → 401 (essential hypertension). [Tables: identified EMR data (ID, ICD) for Mary and Anne; released EMR data (ICD, DNA).]

42 Code generalization: a case study using Vanderbilt's EMR data. Generalizing ICD codes from VNEC* from 5-digit to 3-digit ICD codes: 95% of the patients remain re-identifiable. [Figure: % of re-identified sample vs. distinguishability (log scale), comparing no suppression with suppression at 5%, 15% and 25%; plotted labels: 96.5%, 75.0%, 25.6%, 95%.] * Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.

43 Complete k-anonymity & k^m-anonymity. Complete k-anonymity: knowing that an individual is associated with any itemset, an attacker should not be able to associate this individual with fewer than k transactions. [Tables: original data (ICD, DNA) and the corresponding 2-complete anonymous data.] k^m-anonymity: knowing that an individual is associated with any m-itemset, an attacker should not be able to associate this individual with fewer than k transactions. [Tables: original data (ICD, DNA) and the corresponding anonymous data, in which every ICD code is generalized to 401.]
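
The k^m-anonymity condition can be sketched as a brute-force check in Python. The transactions below are hypothetical, and the check enumerates all itemsets of up to m items, which is only practical for small data.

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(transactions, k, m):
    """True if every itemset of at most m items that occurs in the data
    is supported by at least k transactions."""
    support = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, m + 1):
            for itemset in combinations(items, size):
                support[itemset] += 1
    return all(count >= k for count in support.values())


# Hypothetical diagnosis codes, one transaction per patient.
original = [["401.0", "401.1"], ["401.1", "401.2"], ["401.2", "401.4"]]
# The same data after generalizing every code to its 3-digit form.
generalized = [["401"], ["401"], ["401"]]

print(is_km_anonymous(original, k=2, m=2))     # False
print(is_km_anonymous(generalized, k=2, m=2))  # True
```

Code 401.0 appears in a single transaction of the original data, so an attacker who knows that one code can already single out a patient.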

44 Applicability of complete k-anonymity and k^m-anonymity to medical data. Limited in the specification of privacy requirements: they assume too powerful attackers, so all m-itemsets (combinations of m diagnosis codes) need protection, but medical data publishers have detailed privacy requirements. Example: if attackers know who is diagnosed with abc or defgh, these models protect all 5-itemsets instead of only the 2 itemsets that form the privacy constraints. They explore a small number of possible generalizations. They do not take into account utility requirements.

45 Policy-based anonymization model. Policy-based anonymization for ICD codes*: a global anonymization model that models both generalization and suppression. Each original ICD code is replaced by a unique set of ICD codes, so there is no need for generalization hierarchies. ICD codes → anonymized codes: (493.00, ) and (296.01, ) are generalized ICD codes, each interpreted as either of its codes or both; Φ ( ) is a suppressed ICD code, which is not released. * Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS '10.

46 Policy-based anonymization: Privacy model. Data publishers specify diagnosis codes that need protection. Privacy model: knowing that an individual is associated with one or more specific itemsets (privacy constraints), an attacker should not be able to associate this individual with fewer than k transactions. Privacy policy: the set of all specified privacy constraints. Privacy is achieved when all privacy constraints are supported by at least k transactions in the published data or do not appear at all. [Tables: original data (ICD, DNA) and anonymized data, in which two records carry the generalized code (401.2, 401.4).]
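
A minimal sketch of checking a privacy policy against released data, assuming a global mapping from original codes to generalized items. The transactions, the policy, the choice of k and all helper names are hypothetical; only the generalized item (401.2, 401.4) echoes the slide. Suppression is not modeled in this sketch.

```python
def generalize(transactions, mapping):
    """Replace each code by its generalized item (codes not in the mapping stay as-is)."""
    return [{mapping.get(code, code) for code in t} for t in transactions]


def support(transactions, itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def policy_satisfied(transactions, policy, k):
    """Each privacy constraint must appear in at least k transactions or in none."""
    return all(
        support(transactions, p) == 0 or support(transactions, p) >= k
        for p in policy
    )


# Hypothetical data: one set of diagnosis codes per patient.
original = [{"401.2", "296.00"}, {"401.4", "296.00"}, {"401.2"}, {"296.00"}]
policy = [{"401.4"}]  # attackers may know who is diagnosed with 401.4

# Global generalization: both codes map to the same generalized item.
mapping = {"401.2": ("401.2", "401.4"), "401.4": ("401.2", "401.4")}
released = generalize(original, mapping)
mapped_policy = [{mapping.get(code, code) for code in p} for p in policy]

print(policy_satisfied(original, policy, k=2))         # False
print(policy_satisfied(released, mapped_policy, k=2))  # True
```

Before generalization the constraint is supported by a single transaction; afterwards its generalized form is supported by three, so the policy holds for k = 2.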

47 Policy-based anonymization: Data utility considerations. Utility constraints: published data must remain as useful as the original data for conducting a GWAS on a disease or trait → the number of cases and controls in a GWAS must be preserved. Supporting utility constraints: ICD codes from the utility policy are generalized together, e.g., (296.00, ); a larger part of the solution space is searched than when using domain generalization hierarchies.

48 Policy-based anonymization: Measuring information loss. Utility Loss (UL): a measure to quantify the level of information loss incurred by anonymization. It favors (493.01) over (493.01, ), captures the introduced uncertainty of interpreting an anonymized item, and is customizable. It depends on the number of items mapped to the generalized item, a weight (semantic closeness), and the fraction of affected transactions.

49 Policy-based anonymization algorithms. Goal: anonymize medical records so that privacy is guaranteed, utility is high (→ many GWAS are supported simultaneously), and the incurred information loss is minimal. This is a challenging optimization problem: it is NP-hard, and feasibility depends on the constraints. Heuristic algorithms: Utility-Guided Anonymization of Clinical Profiles (UGACLIP) and Clustering-based Anonymization (CBA). [Table: comparison of UGACLIP and CBA on efficiency, scalability and utility.]

50 Anonymization algorithms: UGACLIP
Sketch of UGACLIP (PNAS)
Input: EMRs, Privacy Policy, Utility Policy, k
Output: Anonymized EMRs
  While the Privacy Policy is not satisfied
    Select the privacy constraint p that corresponds to most patients
    While p is not protected
      Select the ICD code i in p that corresponds to fewest patients
      If i can be anonymized according to the Utility Policy
        generalize i to (i, i')
      Else
        suppress each unprotected ICD code in p
UGACLIP considers privacy constraints in a certain order and protects a privacy constraint by set-based anonymization: generalization when the Utility Policy is satisfied, otherwise suppression.

51 Anonymization algorithms: UGACLIP example (k=2). Input: Privacy Policy, Utility Policy and EMR data (ICD, DNA). Output of the UGACLIP algorithm: anonymized EMR data (ICD, DNA) in which two records carry the generalized code (296.00, ). Data is protected: {296.00, , } appears 2 times. Data remains useful for a GWAS on bipolar disorder: associations between (296.00, ) and DNA region CT A are preserved.

52 Anonymization algorithms: CBA. Sketch of CBA: retrieve the ICD codes that need less protection from the Privacy Policy; gradually build a cluster of codes that can be anonymized according to the Utility Policy and with minimal UL; if the ICD codes are not protected, suppress no more ICD codes than required to protect privacy. [Figure: with p1 = {i1}, p2 = {i5, i6}, p3 = {i3, i4} and k=3, singleton clusters are merged (driven by UL) until the privacy requirements are met.]

53 Case Study: EMRs from Vanderbilt University Medical Center. Datasets: VNEC (2762 de-identified EMRs from Vanderbilt, involved in a GWAS); VNECkc (a subset of VNEC for which we know which diseases are controls for others); BIOVU (all 79087 de-identified EMRs from Vanderbilt's biobank, the largest dataset in the medical data privacy literature)*. Methods: UGACLIP and CBA; ACLIP (state-of-the-art method; it does not take the utility policy into account). * Loukides, Gkoulalas-Divanis. Utility-aware anonymization of diagnosis codes. IEEE TITB.

54 UGACLIP & CBA: First algorithms to offer data utility in GWAS. Setting: k = 5, protecting single visits of patients, 18 GWAS-related diseases* (diseases related to all GWAS reported in Manolio et al.). Best competitor (no utility constraints): the result of ACLIP is useless for validating GWAS. UGACLIP preserves 11 out of 18 GWAS; CBA preserves 14 out of 18 GWAS simultaneously. * Manolio et al. A HapMap harvest of insights into the genetics of common disease. J Clinic. Inv.

55 Utility beyond GWAS. Supporting clinical case counts in addition to GWAS: learn the number of patients with sets of codes that appear in 10% of the records; useful for epidemiology and data mining applications. On VNECkc, queries can be estimated accurately (ARE < 1.25), comparable to ACLIP. Anonymized data can support both GWAS and studies on clinical case counts.

56 Anonymizing the BIOVU (79K EMRs). Supporting clinical case counts in BIOVU: very low error in query answering (Average Relative Error < 1). All EMRs in the VUMC biobank can be anonymized and remain useful.

57 Rule-based anonymization. Our approach*: we use PS-rules to express protection requirements against both identity and sensitive information disclosure. A PS-rule has the form I → J, where I is a set of public items and J is a set of sensitive items: at least k records must support I (preventing identity disclosure), and at most c × 100% of the records that support I may also support J (preventing sensitive information disclosure). Contributions: a rule-based privacy model that is more flexible and general than existing models, intuitive, and able to capture real-world privacy requirements; three anonymization algorithms that are effective (better data utility and protection than the state of the art) and efficient (efficient rule checking strategies, sampling, etc.). * G. Loukides, A. Gkoulalas-Divanis, J. Shao. Anonymizing transaction data to eliminate sensitive inferences. DEXA '10 (extended to KAIS).
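
The two conditions of a PS-rule I → J can be sketched as a Python check. The six transactions are toy data with single-letter items, the thresholds k = 2 and c = 0.5 are arbitrary, and the function names are assumptions made for illustration rather than the authors' implementation.

```python
def support(transactions, itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def ps_rule_satisfied(transactions, public, sensitive, k, c):
    """I -> J holds if at least k records support I and at most a fraction c
    of those records also support J."""
    sup_i = support(transactions, public)
    if sup_i < k:
        return False  # identity disclosure is possible
    return support(transactions, public | sensitive) / sup_i <= c


# Toy transactions over single-letter items.
data = [
    set("abcdghij"),
    set("efhi"),
    set("dgj"),
    set("efgh"),
    set("abdei"),
    set("cfj"),
]

print(ps_rule_satisfied(data, {"a"}, {"j"}, k=2, c=0.5))  # True: 1 of 2 records
print(ps_rule_satisfied(data, {"c"}, {"j"}, k=2, c=0.5))  # False: 2 of 2 records
```

The second rule fails because every record that contains c also contains j, so knowing c reveals j with certainty.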

58 Rule-based anonymization: an example of offering PS-rule based anonymization*
Original dataset:
Name  Diagnosis codes
Mary  a b c d g h i j
Bob   e f h i
Tom   d g j
Anne  e f g h
Brad  a b d e i
Jim   c f j
2^2-anonymous dataset:
Name  Diagnosis codes
Mary  (a,b,c) (d,e,f) g h i j
Bob   (d,e,f) h i
Tom   (d,e,f) g j
Anne  (d,e,f) g h
Brad  (a,b,c) (d,e,f) i
Jim   (a,b,c) (d,e,f) j
PS-rules (used together with the hierarchy): a → j; c d → g; d → h i
Anonymous dataset based on the PS-rules model:
Name  Diagnosis codes
Mary  a (b,c) d g h i j
Bob   e f h i
Tom   d g j
Anne  e f g h
Brad  a (b,c) d e i
Jim   (b,c) f j
- PS-rules can be automatically discovered and specified
- Support of fine-grained, flexible privacy requirements
- Privacy protection from both identity and sensitive information disclosure
- General privacy model that incorporates existing privacy models
* Grigorios Loukides, Aris Gkoulalas-Divanis, Jianhua Shao. Efficient and flexible anonymization of transaction data. Knowledge and Information Systems (KAIS), 36(1).

59 Experimental results: Setup. Datasets: BMS1 and BMS2 contain click-stream data and POS contains sales transaction data. Evaluation: is anonymized data useful in aggregate query answering? How efficient are the algorithms? Methods: Tree-based and Sample-based vs. Baseline (no pruning) and Apriori Anonymization*. * M. Terrovitis, N. Mamoulis, P. Kalnis. Privacy-preserving anonymization of set-valued data. PVLDB.

60 Data Utility: uniform privacy requirements. BMS2 dataset, k=5, p=2 (all 2-itemsets need protection from identity disclosure). Same protection as Apriori for identity disclosure, and additionally thwarts sensitive information disclosure. Our algorithms offer many times more accurate query answering.

61 Data Utility: detailed privacy requirements. BMS2 dataset, k=5, p=2, type 2-1 rules of varying number and rules of other types. Apriori cannot take the detailed privacy requirements into account and overdistorts data. Our algorithms protect data no more than necessary to satisfy these requirements, achieving much higher data utility.

62 Efficiency. BMS2 dataset, k=5, p=2, 5K type 2-1 rules. Synthetic data of varying |D| and |I|, k=5, p=2, 5K rules of type 2-1. Sample-based is the fastest and most scalable; Apriori is the slowest.

63 Anonymization of RT-datasets (e.g., demographics + diagnosis codes). Privacy threat: attackers know some relational attribute values (e.g., demographics) plus some sensitive items (e.g., diagnoses) for an individual*. * G. Poulis, G. Loukides, A. Gkoulalas-Divanis, S. Skiadopoulos. Anonymizing data with relational and transactional attributes. PKDD.

64 SECRETA anonymization tool. SECRETA*, **: System for Evaluating and Comparing RElational and Transaction Anonymization algorithms.
- Evaluates algorithms for relational, transaction and RT-dataset anonymization
- Integrates 9 popular anonymization algorithms and 3 bounding methods for combining them
- R: supports Incognito, Cluster, Top-down and Full subtree bottom-up
- T: supports COAT, PCTA, Apriori, LRA and VPA
- Supports two modes of operation: Evaluation and Comparison
* G. Poulis, A. Gkoulalas-Divanis, G. Loukides, C. Tryfonopoulos, S. Skiadopoulos. SECRETA: A system for evaluating and comparing relational and transaction anonymization algorithms. EDBT '14.
** G. Poulis, G. Loukides, A. Gkoulalas-Divanis, S. Skiadopoulos. Anonymizing data with relational and transactional attributes. PKDD.

65 Summary. Explained the need for privacy in medical data sharing. Presented the state of the art in privacy-preserving medical data publishing to support intended analyses. Elaborated on the policy-based anonymization model, which allows data publishers to specify detailed privacy and utility constraints for the data anonymization process. Discussed methods for anonymizing data of co-existing data types, and introduced the SECRETA anonymization tool. Thank you! Questions?

66 Internship. IBM paid internships per DRL (out of ~100 applications); ~5 unpaid internships. Internship duration: 3-4 months; start date: flexible. Positions are advertised in December and filled as soon as possible. Each candidate identifies a DRL to work with and a project that is of mutual interest to the applicant and to IBM. The candidate submits a 1-2 page document on the project that he/she will be involved in if accepted. At the end of the internship, the candidate gives a talk to the lab on his/her accomplishments during the internship. Need more information? Email me at: [email protected]