Privacy Challenges and Solutions for Data Sharing. Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland




Privacy Challenges and Solutions for Data Sharing Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland June 8, 2015

Contents
Introduction / motivation for data privacy
Focus application: Privacy-Preserving Medical Data Publishing
- Electronic Medical Records (EMR) and their use in research
- Privacy threats and their effectiveness
- Privacy models for relational data (demographics)
- Privacy models for transaction/set-valued data (diagnosis codes)
- Policy-based anonymization model
- Rule-based anonymization model
- Anonymization of RT (relational-transaction) datasets
- SECRETA anonymization toolkit
Summary

Data sharing
Individuals' data are increasingly shared:
- Netflix published movie ratings of 500K subscribers
- AOL published 20M search query terms of 658K web users
- TomTom sold customers' location (GPS) data to the Dutch police
- The eMERGE consortium published patient data related to genome-wide association studies to biorepositories (dbGaP)
- Orange provided call information about its mobile subscribers as part of the D4D challenge on mobile phone data (http://www.d4d.orange.com/en/home)
Benefits of data sharing:
- Personalization (e.g., Netflix's data mining contest aimed to improve movie recommendations based on personal preferences)
- Marketing (e.g., Tesco made 53M last year from selling shopping patterns to retailers and manufacturers, such as Nestle and Unilever)
- Social benefits (e.g., promoting medical research studies, improving traffic management, etc.)

Data sharing must guarantee privacy and accommodate utility
A popular data sharing scenario (data publishing): data owners give their original data to a trusted data publisher, who releases transformed data to an untrusted data recipient.
Threats to data privacy:
- Identity disclosure
- Sensitive information disclosure
- Membership disclosure
- Inferential disclosure
Data utility requirements:
- Minimal data distortion (general-purpose use)
- Support of specific applications / workloads (e.g., building accurate predictive models, GWAS, etc.)

Privacy Preserving Medical Data Publishing
How can we share medical data in a way that protects patients' privacy while supporting research studies?

Electronic Medical Records (EMR)
- Relational data: registration and demographic data
- Transaction (set-valued) data: billing information; ICD codes* are represented as numbers (up to 5 digits) and denote signs, findings, and causes of injury or disease**
- Sequential data: DNA
- Text data: clinical notes
Example EMR table:
Name   YOB   ICD            DNA   Clinical notes
Jim    1955  493.00, 185    C T   (doc1)
Mary   1943  185, 157.3     A G   (doc2)
Mary   1943  493.01         C G   (doc3)
Carol  1965  493.02         C G   (doc4)
Anne   1973  157.9, 493.03  G C   (doc5)
Anne   1973  157.3          A T   (doc6)
* International Statistical Classification of Diseases and Related Health Problems
** Centers for Medicare & Medicaid Services - https://www.cms.gov/icd9providerdiagnosticcodes/

EMR data use in analytics
- Statistical analysis: correlation between YOB and ICD code 185 (malignant neoplasm of prostate)
- Querying
- Clustering: control epidemics*
- Classification: predict domestic violence**
- Association rule mining: formulate a government policy on hypertension management***, e.g.,
  IF age in [43,48] AND smoke = yes AND exercise = no AND drink = yes
  THEN hypertension = yes (sup = 2.9%; conf = 26%)
Example EMR table:
Name   YOB   ICD             DNA
Jim    1955  493.00, 493.01  C T
Mary   1943  185             A G
Mary   1943  493.01, 493.02  C G
Carol  1965  493.02, 157.9   C G
Anne   1973  157.9, 157.3    G C
Anne   1973  157.3           A T
* Tildesley et al. Impact of spatial clustering on disease transmission and optimal control. PNAS, 2010.
** Reis et al. Longitudinal Histories as Predictors of Future Diagnoses of Domestic Abuse: Modelling Study. BMJ, 2011.
*** Chae et al. Data mining approach to policy analysis in a health insurance domain. Int. J. of Med. Inf., 2001.

Need for privacy
Why do we need privacy in medical data sharing? If privacy is breached, there are consequences to patients:
- Emotional and economic embarrassment
- 62% of individuals worry their EMRs will not remain confidential*
- 35% expressed privacy concerns regarding the publishing of their data to dbGaP**
- Patients opt out or provide fake data → difficulty conducting statistically powered studies
* Health Confidence Survey 2008, Employee Benefit Research Institute
** Ludman et al. Glad You Asked: Participants' Opinions of Re-Consent for dbGaP Data Submission. Journal of Empirical Research on Human Research Ethics, 2010.

Need for privacy
If privacy is breached, there are consequences to organizations:
- Legal → HIPAA, EU legislation (95/46/EC, 2002/58/EC, 2009/136/EC, etc.)
- Financial → the average cost of a single data breach is $5.85M in the US and $4.74M in Germany; these countries have the highest per-capita cost. Healthcare and Education are the most heavily regulated industries.*
* Ponemon Institute Research Report, 2014 Cost of Data Breach Study: Global Analysis.

Protecting data privacy: data masking / removal of identifiers
Removing / masking direct identifiers before the trusted data publisher releases data to the untrusted data recipient:
1. Locate the direct identifiers (attributes that uniquely identify an individual), such as SSN, Patient ID, Phone number, etc.
2. Remove or mask them from the data prior to data publishing.
Example:
Name           Search Query Terms
John Doe       Harry Potter, King's Speech
Thelma Arnold  Hand tremors, bipolar, dry mouth, effect of nicotine on the body

Protecting data privacy: data masking / removal of identifiers
Masking or removal of direct identifiers is not sufficient! An untrusted data recipient can combine the released data with external data and background knowledge.
Main types of threats to data privacy:
- Identity disclosure
- Sensitive information disclosure
- Inferential disclosure

Privacy Threats: Identity disclosure
Identity disclosure in relational data (e.g., patients' demographics): individuals are linked to their published records based on quasi-identifiers (attributes that, in combination, can identify an individual).
De-identified data:
Age  Postcode  Sex
20   NW10      M
45   NW15      M
22   NW30      M
50   NW25      F
External data:
Name  Age  Postcode  Sex
Greg  20   NW10      M
Jim   45   NW15      M
Jack  22   NW30      M
Anne  50   NW25      F
87% of US citizens can be identified by gender, DOB, and 5-digit ZIP code*
* Sweeney. k-anonymity: a model for protecting privacy. IJUFKS, 2002.
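The linkage described above can be sketched in a few lines. This is an illustrative toy, not the attack tooling from the talk: the attacker joins a public identified table (e.g., a voter list) with the "de-identified" release on the quasi-identifiers; every unique match is an identity disclosure. All names and values are hypothetical.

```python
# Toy linkage (re-identification) attack on quasi-identifiers.
released = [  # de-identified release: QIDs plus a sensitive value
    {"Age": 20, "Postcode": "NW10", "Sex": "M", "Disease": "HIV"},
    {"Age": 45, "Postcode": "NW15", "Sex": "M", "Disease": "Flu"},
]
external = [  # public identified data, e.g. a voter list
    {"Name": "Greg", "Age": 20, "Postcode": "NW10", "Sex": "M"},
    {"Name": "Jim", "Age": 45, "Postcode": "NW15", "Sex": "M"},
]

QIDS = ("Age", "Postcode", "Sex")

def link(released, external, qids=QIDS):
    """Map each external identity to the sensitive value of its
    uniquely matching released record (identity disclosure)."""
    matches = {}
    for person in external:
        key = tuple(person[q] for q in qids)
        hits = [r for r in released if tuple(r[q] for q in qids) == key]
        if len(hits) == 1:  # unique QID match: the record is re-identified
            matches[person["Name"]] = hits[0]["Disease"]
    return matches

print(link(released, external))
```

Here every record links uniquely, which is exactly what k-anonymity (introduced later in the talk) is designed to prevent.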

Identity disclosure in sharing patients' diagnosis codes
Identity disclosure in transaction data (e.g., diagnosis codes):
Identified EMR data:
ID    ICD
Jim   333.4
Mary  401.0 401.1
Anne  401.0 401.2 401.3
Released EMR data:
ICD                DNA
333.4              CT A
401.0 401.1        AC T
401.0 401.2 401.3  GC C
Mary is diagnosed with benign essential hypertension (ICD code 401.1) → the second record belongs to her → all her diagnosis codes are revealed.
Disclosure based on diagnosis codes* is a general problem for other medical terminologies as well (e.g., ICD-10, used in the EU) → sharing data susceptible to the attack is against legislation.
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA, 2010.

Real-world identity disclosure cases involving medical data
- Group Insurance Commission data + voter list of Cambridge, MA → William Weld, former Governor of MA
- Chicago Homicide database + Social Security Death Index → 35% of murder victims
- Adverse Drug Reaction database + public obituaries → 26-year-old girl who died from a drug

Issuing attacks on medical datasets
Two-step attack using publicly available voter registration lists and hospital discharge summaries:
voter(name, ..., zip, dob, sex)
summary(zip, dob, sex, diagnoses)
release(diagnoses, DNA)
87% of US citizens can be identified by {dob, gender, ZIP code}* → voter list & discharge summary → privacy breach
* Sweeney. k-anonymity: a model for protecting privacy. IJUFKS, 2002.

Issuing attacks on medical datasets
One-step attack using EMRs* (insider's attack):
EMR(name, ..., diagnoses)
release(, diagnoses, DNA)
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA, 2010.

Case study: Evaluating the effectiveness of the insider's attack
De-identified / masked EMR population from VUMC: 1.2M records (patients) from Vanderbilt; a unique random number serves as the ID.
de-identified EMR(ID, ..., diagnoses)
VNEC(, diagnoses, DNA)
VNEC: a de-identified / masked EMR sample of 2,762 records (patients) derived from the population.
- Patients in VNEC were involved in a study (GWAS) on the Native Electrical Conduction of the heart.
- Patients' EMRs were to be deposited into dbGaP.
- Data would be made available to support other studies (GWAS).

Case study: Evaluating the effectiveness of the insider's attack
Vanderbilt's EMR - VNEC dataset linkage on ICD codes. [Figure: % of re-identified sample vs. distinguishability (log scale), i.e., the number of times a set of ICD codes appears in the population (the support count in the data mining literature).]
Assuming that all ICD codes are used to issue an attack, 96.5% of patients are susceptible to identity disclosure.

Case study: Evaluating the effectiveness of the insider's attack
Vanderbilt's EMR - VNEC dataset linkage on ICD codes, when a random subset of ICD codes is used in the attack. [Figure: % of re-identifiable sample vs. distinguishability (log scale), with curves for combinations of 1, 2, 3, and 10 ICD codes; distinguishability is the number of times a set of ICD codes appears in the population, equivalent to the support count in data mining.]
Knowing a random combination of just 2 ICD codes can lead to unique re-identification.

Case study: Evaluating the effectiveness of the insider's attack
VNEC dataset linkage on ICD codes using hospital discharge records: the attacker knows all ICD codes associated with a patient for a single visit; it is difficult to know ICD codes that span visits when public discharge summaries are used. [Figure: distinguishability measured as the number of times a set of ICD codes appears in VNEC (the support count in data mining).]
46% of patients in VNEC are uniquely re-identifiable.

Privacy Threats: Sensitive information disclosure
Individuals are associated with sensitive information.
De-identified data (Disease is the sensitive attribute, SA):
Age  Postcode  YOB   Disease
20   NW10      1981  HIV
20   NW10      1987  HIV
20   NW10      1998  HIV
20   NW10      1970  HIV
Background knowledge: Name Age Postcode Sex / Greg 20 NW10 M → Greg has HIV.
Identified EMR data:
ID    ICD
Jim   401.0 401.1 295
Mary  401.0 401.1 303 295
Released EMR data:
ICD                  DNA
401.0 401.1 295      C A
401.0 401.1 303 295  A T
Mary is diagnosed with 401.0 and 401.1 → she has Schizophrenia (ICD code 295).
Sensitive information disclosure can occur without identity disclosure.

Sensitive information disclosure in the Netflix movie rating sharing
- 100M dated ratings from 480K users on 18K movies; a data mining contest ($1M prize) to improve movie recommendation based on personal preferences.
- Movies can reveal political, religious, and sexual beliefs, and need protection according to the Video Privacy Protection Act.
- Anonymization applied: de-identification, sampling, date modification, rating suppression; movie title and year were published in full.
- Researchers inferred the movie ratings of subscribers* by linking the data with IMDb w.r.t. ratings and/or dates.
- A lawsuit was filed; Netflix settled the lawsuit: "We will find new ways to collaborate with researchers."
* Narayanan et al. Robust De-anonymization of Large Sparse Datasets. IEEE Symposium on Security and Privacy, 2008.

Privacy Threats: Inferential disclosure
Sensitive knowledge patterns are exposed by data mining:
- "75% of patients visit the same physician more than 4 times" → unsolicited advertisement
- "60% of white males over 50 suffer from diabetes" → customer discrimination
Data sources: stream data collected by health monitoring systems, electronic medical records, drug orders & costs.
Business rivals can harm data publishers, and insurance, pharmaceutical & marketing companies can harm data owners.*
* G. Das and N. Zhang. Privacy risks in health databases from aggregate disclosure. PETRA, 2009.

Anonymization of demographics
k-anonymity principle*: each record in a relational table T should have the same value over the quasi-identifiers as at least k-1 other records in T. These records collectively form a k-anonymous group.
k-anonymity protects from identity disclosure:
- It protects data from linkage to external sources (triangulation attacks).
- The probability that an individual is correctly associated with their record is at most 1/k.
External data:
Name  Age  Postcode  Sex
Greg  40   NW10      M
Jim   45   NW15      M
Jack  22   NW30      M
Anne  50   NW25      F
2-anonymous data:
Age  Postcode  Sex
4*   NW1*     M
4*   NW1*     M
*    NW*      *
*    NW*      *
* Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. IJUFKS, 2002.
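The definition above is easy to check mechanically: group records by their quasi-identifier values and verify that every group has at least k members. A minimal sketch (the table below mirrors the slide's 2-anonymous example; the function name is our own):

```python
from collections import Counter

def is_k_anonymous(table, qids, k):
    """True iff every combination of quasi-identifier values
    occurs in at least k records of the table."""
    counts = Counter(tuple(row[q] for q in qids) for row in table)
    return all(c >= k for c in counts.values())

# The slide's 2-anonymous release.
anonymized = [
    {"Age": "4*", "Postcode": "NW1*", "Sex": "M"},
    {"Age": "4*", "Postcode": "NW1*", "Sex": "M"},
    {"Age": "*",  "Postcode": "NW*",  "Sex": "*"},
    {"Age": "*",  "Postcode": "NW*",  "Sex": "*"},
]
print(is_k_anonymous(anonymized, ("Age", "Postcode", "Sex"), 2))
```

The 1/k bound on re-identification follows directly: within a k-anonymous group, an attacker linking on QIDs cannot do better than picking one of k indistinguishable records.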

Anonymization of demographics: k-anonymity
Pros:
- A baseline model
- Intuitive
- Has been implemented in many real-world systems
- Follows & impacts privacy legislation
Cons:
- Does not protect against sensitive information disclosure
- Requires the data owner to specify the quasi-identifiers (QIDs) and the value of k

Attack on k-anonymous data
Homogeneity attack*: all sensitive values in a k-anonymous group are the same → sensitive information disclosure.
External data: Name Age Postcode / Greg 40 NW10
2-anonymous data:
Age  Postcode  Disease
4*   NW1*     HIV
4*   NW1*     HIV
5*   NW*      Ovarian Cancer
5*   NW*      Flu
The attacker is confident that Greg suffers from HIV.
* Machanavajjhala et al. l-diversity: Privacy Beyond k-anonymity. ICDE, 2006.

l-diversity principle for demographics
l-diversity*: a relational table is l-diverse if all groups of records with the same values over the quasi-identifiers (QID groups) contain no fewer than l well-represented values for the sensitive attribute (SA).
Distinct l-diversity: "l well-represented" → l distinct values.
A 6-anonymous group:
Age  Postcode  Disease
4*   NW1*     HIV
4*   NW1*     HIV
4*   NW1*     HIV
4*   NW1*     HIV
4*   NW1*     Flu
4*   NW1*     Cancer
Three distinct values, but the probability of HIV being disclosed is ~0.67.
* Machanavajjhala et al. l-diversity: Privacy Beyond k-anonymity. ICDE, 2006.
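Both the distinct-l-diversity check and the weakness the slide points out (the 0.67 disclosure probability) can be made concrete. A small sketch with our own function names, using the slide's 6-anonymous group:

```python
from collections import Counter, defaultdict

def _qid_groups(table, qids, sa):
    """Group sensitive values by quasi-identifier combination."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[q] for q in qids)].append(row[sa])
    return groups

def is_distinct_l_diverse(table, qids, sa, l):
    """Distinct l-diversity: each QID group has >= l distinct SA values."""
    return all(len(set(v)) >= l for v in _qid_groups(table, qids, sa).values())

def max_disclosure_prob(table, qids, sa):
    """Worst-case probability of guessing the most frequent SA value."""
    return max(Counter(v).most_common(1)[0][1] / len(v)
               for v in _qid_groups(table, qids, sa).values())

# The slide's group: 4x HIV, 1x Flu, 1x Cancer.
group = [{"Age": "4*", "Disease": d}
         for d in ["HIV", "HIV", "HIV", "HIV", "Flu", "Cancer"]]
print(is_distinct_l_diverse(group, ("Age",), "Disease", 3))  # distinct values: 3
print(max_disclosure_prob(group, ("Age",), "Disease"))       # 4/6, skewed
```

This is exactly why stronger variants such as t-closeness (next slide) look at the SA value distribution, not just the count of distinct values.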

Further improvements over l-diversity
- Sensitive values may not need the same level of protection → (a,k)-anonymity [1]
- l-diversity is difficult to achieve when the SA values are skewed → t-closeness [2]
- l-diversity does not consider the semantic similarity of SA values → (e,m)-anonymity [3], range diversity [4]
- Can patients decide the level of protection for their SA values? → personalized privacy [5]
[1] Wong et al. (alpha, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. KDD, 2006.
[2] Li et al. t-closeness: Privacy Beyond k-anonymity and l-diversity. ICDE, 2007.
[3] Li et al. Preservation of proximity privacy in publishing numerical sensitive data. SIGMOD, 2008.
[4] Loukides et al. Preventing range disclosure in k-anonymised data. Expert Syst. Appl., 2011.
[5] Xiao et al. Personalized privacy preservation. SIGMOD, 2006.

Partition-based algorithms for k-anonymity
Main idea: a record projected over the QIDs is treated as a multidimensional point; a subspace (hyper-rectangle) that contains at least k points can form a k-anonymous group → multidimensional global recoding.
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
How to partition the space? One attribute at a time: which attribute to use, and how to split the selected attribute?

Mondrian algorithm
Mondrian(D, k)*:
1. Attribute selection: find the QID attribute Q with the largest domain.
2. Attribute split: find the median µ of Q; create subspace S with all records of D whose value in Q is less than µ, and subspace S' with all records of D whose value in Q is at least µ.
3. Recursive execution: if |S| ≥ k and |S'| ≥ k, return Mondrian(S, k) ∪ Mondrian(S', k); else return D.
* LeFevre et al. Mondrian multidimensional k-anonymity. ICDE, 2006.
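The recursive median-split above can be sketched in a few lines. This is a simplified version for numeric QIDs only: it picks the attribute with the widest value range as a stand-in for "largest domain", and categorical attributes (like Sex in the example) are left out. Function names are our own:

```python
def mondrian(records, qids, k):
    """Simplified Mondrian: recursively median-split numeric QIDs,
    stopping when a split would leave a partition with < k records."""
    # Pick the attribute with the widest range of values.
    q = max(qids, key=lambda a: max(r[a] for r in records) - min(r[a] for r in records))
    vals = sorted(r[q] for r in records)
    mu = vals[len(vals) // 2]  # median
    lhs = [r for r in records if r[q] < mu]
    rhs = [r for r in records if r[q] >= mu]
    if len(lhs) >= k and len(rhs) >= k:  # both halves stay k-anonymizable
        return mondrian(lhs, qids, k) + mondrian(rhs, qids, k)
    return [records]  # cannot split further: this is one anonymous group

# Ages from the slide's example (k = 2).
recs = [{"Age": a} for a in (20, 23, 25, 27, 28, 29)]
parts = mondrian(recs, ("Age",), 2)
print([sorted(r["Age"] for r in p) for p in parts])
```

On this data the split at the median (27) yields the same two groups as the slide's example, [20-26] and [27-29]; each group would then be generalized to its bounding range.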

Example of applying Mondrian (k=2)
Original data:
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
2-anonymous data:
Age      Sex    Disease
[20-26]  {M,F}  HIV
[20-26]  {M,F}  HIV
[20-26]  {M,F}  Obesity
[27-29]  F      HIV
[27-29]  F      Cancer
[27-29]  F      Obesity

Other works on partition-based algorithms
- R-tree based algorithm [1]
- Optimized partitioning for intended tasks (classification, regression, query answering) [2]
- Algorithms for disk-resident data [3]
- Extensions to prevent sensitive information disclosure [4]
[1] Iwuchukwu et al. K-anonymization as spatial indexing: toward scalable and incremental anonymization. VLDB, 2007.
[2] LeFevre et al. Workload-aware anonymization. KDD, 2006.
[3] LeFevre et al. Workload-aware anonymization techniques for large-scale datasets. TODS, 2008.
[4] Loukides et al. Preventing range disclosure in k-anonymised data. Expert Syst. Appl., 2011.

Clustering-based anonymization algorithms
Main idea of clustering-based anonymization:
1. Create clusters containing at least k records with similar values over the QIDs (seed selection, similarity measurement, stopping criterion).
2. Anonymize the records in each cluster separately (local recoding and/or suppression).

Clustering-based anonymization algorithms
- Clusters need to be separated → seed selection: furthest-first, random
- Clusters need to contain similar values → similarity measurement
- Clusters should not be too large → stopping criterion: size-based, quality-based
All these heuristics attempt to improve data utility.

Bottom-up clustering algorithm
Bottom-up clustering algorithm*:
1. Each record is selected as a seed to start a cluster.
2. While there exists a group G with |G| < k: find the group G' such that NCP(G ∪ G') is minimal, and merge G and G'.
3. For each group G with |G| > 2k: split G into groups such that each group has at least k records.
4. Generalize the QID values in each group and return all groups.
* Xu et al. Utility-Based Anonymization Using Local Recoding. KDD, 2006.
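The merge loop can be sketched concretely. This is a simplified illustration for numeric QIDs, with our own function names: NCP (Normalized Certainty Penalty) is taken as the sum, over QIDs, of the group's value range divided by the attribute's domain range, and the split step for oversized groups is omitted.

```python
def ncp(group, qids, domain):
    """Normalized Certainty Penalty of a group over numeric QIDs."""
    return sum((max(r[q] for r in group) - min(r[q] for r in group)) / domain[q]
               for q in qids)

def bottom_up(records, qids, k):
    """Merge singleton clusters greedily by minimal NCP until all
    clusters have at least k records (split step omitted)."""
    domain = {q: (max(r[q] for r in records) - min(r[q] for r in records)) or 1
              for q in qids}
    clusters = [[r] for r in records]  # each record starts as a seed
    while True:
        small = [c for c in clusters if len(c) < k]
        if not small:
            break
        g = small[0]
        clusters.remove(g)
        if not clusters:  # fewer than k records overall
            return [g]
        # Merge g into the cluster that minimizes the NCP of the union.
        best = min(clusters, key=lambda c: ncp(g + c, qids, domain))
        best.extend(g)
    return clusters

recs = [{"Age": a} for a in (20, 23, 25, 27, 28, 29)]
print([sorted(r["Age"] for r in c) for c in bottom_up(recs, ("Age",), 2)])
```

On the slide's ages this produces three tight pairs, {20,23}, {25,27}, {28,29}: local recoding then generalizes each cluster separately, which is why clustering tends to yield narrower ranges than global partitioning.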

Example of Bottom-up clustering algorithm (k=2)
Original data:
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
2-anonymous data (local recoding):
Age      Sex  Disease
[20-25]  M    HIV
[20-25]  M    Obesity
[23-27]  F    HIV
[23-27]  F    HIV
[28-29]  F    Cancer
[28-29]  F    Obesity

Example of top-down clustering algorithm (k=2)
Original data:
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
2-anonymous data:
Age      Sex    Disease
[20-25]  {M,F}  HIV
[20-25]  {M,F}  HIV
[20-25]  {M,F}  Obesity
[27-29]  F      HIV
[27-29]  F      Cancer
[27-29]  F      Obesity

Other works on clustering-based anonymization
- Constant factor approximation algorithms that publish only the cluster centers along with radius information*
- Combining partitioning with clustering for efficiency**
* Aggarwal et al. Achieving anonymity via clustering. ACM Trans. on Algorithms, 2010.
** Loukides et al. Preventing range disclosure in k-anonymised data. Expert Syst. Appl., 2011.

Preventing identity disclosure from diagnosis codes: Suppression
Suppression removes items or records from the data prior to release: suppress ICD codes* appearing in fewer than a certain percentage of patient records. Intuition: such ICD codes can act as quasi-identifiers.
Identified EMR data:
ID    ICD
Mary  401.0 401.1
Anne  401.0 401.3
Released EMR data:
ICD          DNA
401.0 401.1  AC T
401.0 401.3  GC C
* Vinterbo et al. Hiding information by cell suppression. AMIA Annual Symposium, 2001.

Code suppression: a case study using Vanderbilt's EMR data
We had to suppress diagnosis codes appearing in less than 25% of the records in VNEC to prevent re-identification; doing so left us with only 5 out of ~6,000 ICD codes!*
5-digit ICD-9 code                     3-digit ICD-9 code                   ICD-9 section
401.1 Benign essential hypertension    401 Essential hypertension           Hypertensive disease
780.79 Other malaise and fatigue       780 General symptoms                 Symptoms
729.5 Pain in limb                     729 Other disorders of soft tissues  Rheumatism, excluding the back
789.0 Abdominal pain                   789 Other abdomen/pelvis symptoms    Symptoms
786.5 Chest pain                       786 Respiratory system symptoms      Symptoms
* Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS, 2010.

Preventing identity disclosure in EMR data: Generalization
Generalization replaces items with more general ones, usually with the help of a domain hierarchy: Any → Chapters → Sections → 3-digit ICD codes → 5-digit ICD codes.
Example: generalize ICD codes to their 3-digit representation, e.g., 401.1 (benign essential hypertension) → 401 (essential hypertension).
Identified EMR data:
ID    ICD
Mary  401.0 401.1
Anne  401.0 401.3
Released EMR data:
ICD  DNA
401  AC T
401  GC C
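Climbing the hierarchy one level, from 5-digit to 3-digit codes, is just string truncation on ICD-9 codes. A minimal sketch (function names are our own):

```python
def to_3_digit(code):
    """Generalize a 5-digit ICD-9 code ('401.1') to its
    3-digit parent category ('401')."""
    return code.split(".")[0]

def generalize(records):
    """Replace every code in every record with its 3-digit parent.
    Records are modeled as sets of ICD codes."""
    return [{to_3_digit(c) for c in rec} for rec in records]

records = [{"401.0", "401.1"}, {"401.0", "401.3"}]
print(generalize(records))  # both records collapse to {'401'}
```

Note the trade-off the next slide quantifies: the two records become indistinguishable (good for privacy), but the distinction between the specific diagnoses is lost, and generalization alone still left 95% of VNEC patients re-identifiable.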

Code generalization: a case study using Vanderbilt's EMR data
Generalizing ICD codes from VNEC.* [Figure: % of re-identified sample vs. distinguishability (log scale), with curves for 5-digit and 3-digit ICD codes under no suppression and 5%, 15%, and 25% suppression; re-identification rates shown include 96.5%, 75.0%, and 25.6%.]
With generalization to 3-digit codes and no suppression, 95% of the patients remain re-identifiable.
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA, 2010.

Complete k-anonymity & k^m-anonymity
Complete k-anonymity: knowing that an individual is associated with any itemset, an attacker should not be able to associate this individual with fewer than k transactions.
Original data:
ICD          DNA
401.0 401.1  AC T
401.2 401.3  GC C
401.0 401.1  CC A
401.4 401.3  CA T
2-complete anonymous data:
ICD          DNA
401.0 401.1  AC T
401 401.3    GC C
401.0 401.1  CC A
401 401.3    CA T
k^m-anonymity: knowing that an individual is associated with any m-itemset, an attacker should not be able to associate this individual with fewer than k transactions.
4^2-anonymous data (k=4, m=2):
ICD  DNA
401  AC T
401  GC C
401  CC A
401  CA T
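A k^m-anonymity check enumerates every m-itemset occurring in the data and verifies its support. A minimal sketch (function name is our own; transactions are sets of ICD codes):

```python
from itertools import combinations
from collections import Counter

def is_km_anonymous(transactions, k, m):
    """True iff every m-itemset that appears in the data is supported
    by at least k transactions (the k^m-anonymity guarantee)."""
    support = Counter()
    for t in transactions:
        for itemset in combinations(sorted(set(t)), m):
            support[itemset] += 1
    return all(s >= k for s in support.values())

original = [{"401.0", "401.1"}, {"401.2", "401.3"},
            {"401.0", "401.1"}, {"401.4", "401.3"}]
print(is_km_anonymous(original, 2, 2))  # {401.2, 401.3} has support 1

generalized = [{"401"}] * 4             # the slide's 4^2-anonymous release
print(is_km_anonymous(generalized, 4, 1))
```

The combinatorial blow-up of enumerating all m-itemsets is exactly the "too powerful attacker" problem the next slide raises: most of those itemsets never needed protection in the first place.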

Applicability of complete k-anonymity and k^m-anonymity to medical data
- Limited in the specification of privacy requirements: they assume too powerful attackers, so all m-itemsets (combinations of m diagnosis codes) need protection, whereas medical data publishers have detailed privacy requirements. Example: attackers know who is diagnosed with abc or with defgh; the models protect all 5-itemsets instead of only the 2 itemsets given as privacy constraints.
- They explore a small number of possible generalizations.
- They do not take utility requirements into account.

Policy-based anonymization model
Policy-based anonymization for ICD codes*:
- A global anonymization model
- Models both generalization and suppression
- Each original ICD code is replaced by a unique set of ICD codes; no generalization hierarchies are needed
Example mapping:
ICD code  Anonymized code
493.00    (493.00, 493.01)  generalized; interpreted as 493.00 or 493.01 or both
493.01    (493.00, 493.01)
296.01    (296.01, 296.02)
296.02    (296.01, 296.02)
174.01    Φ ( )             suppressed; not released
* Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS, 2010.

Policy-based anonymization: Privacy model
Data publishers specify the diagnosis codes that need protection.
Privacy model: knowing that an individual is associated with one or more specific itemsets (privacy constraints), an attacker should not be able to associate this individual with fewer than k transactions.
Original data:
ICD          DNA
401.0 401.1  AC T
401.2 401.3  GC C
401.0 401.1  CC A
401.4 401.3  CA T
Anonymized data:
ICD                   DNA
401.0 401.1           AC T
(401.2, 401.4) 401.3  GC C
401.0 401.1           CC A
(401.2, 401.4) 401.3  CA T
Privacy policy: the set of all specified privacy constraints. Privacy is achieved when all privacy constraints are supported by at least k transactions in the published data, or do not appear at all.
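The satisfaction condition stated last ("supported by at least k transactions, or not at all") translates directly into a support check. A minimal sketch over original-style transactions (function name is our own; handling of generalized item sets in the released data is omitted for brevity):

```python
def constraint_satisfied(transactions, constraint, k):
    """A privacy constraint (an itemset of ICD codes) is satisfied
    when it is supported by at least k transactions, or by none."""
    support = sum(1 for t in transactions if constraint <= set(t))
    return support == 0 or support >= k

data = [{"401.0", "401.1"}, {"401.0", "401.1"}]
print(constraint_satisfied(data, {"401.0", "401.1"}, 2))  # supported twice
print(constraint_satisfied(data, {"401.2"}, 2))           # absent entirely
```

A whole privacy policy is then just the conjunction of this check over all of its constraints, which is what the UGACLIP and CBA algorithms later enforce.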

Policy-based anonymization: Data utility considerations
Utility constraints: published data must remain as useful as the original data for conducting a GWAS on a disease or trait → the numbers of cases and controls in a GWAS must be preserved.
Supporting utility constraints: ICD codes from the utility policy are generalized together, e.g., (296.00, 296.01); a larger part of the solution space is searched than when using domain generalization hierarchies.

Policy-based anonymization: Measuring information loss
Utility Loss (UL): a measure quantifying the level of information loss incurred by anonymization. It captures the introduced uncertainty of interpreting an anonymized item and is customizable, combining:
- the number of items mapped to a generalized item
- a weight (semantic closeness)
- the fraction of affected transactions
For example, UL favors (493.01) over (493.01, 493.02).
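To make the three ingredients concrete, here is an assumed form of such a measure, not the exact UL formula from the paper: the interpretation uncertainty of a generalized item of size s is taken as its 2^s - 1 possible non-empty interpretations, scaled by the semantic weight and the fraction of affected transactions.

```python
def utility_loss(generalized_item, weight, affected_fraction):
    """Sketch of a UL-style measure (assumed form): interpretation
    uncertainty of the generalized item (2^|item| - 1 non-empty
    interpretations), scaled by a semantic-closeness weight and by
    the fraction of transactions the item appears in."""
    size = len(generalized_item)
    return (2 ** size - 1) * weight * affected_fraction

# With equal weight and coverage, the singleton mapping is cheaper:
print(utility_loss({"493.01"}, 1.0, 0.5))            # size 1 -> factor 1
print(utility_loss({"493.01", "493.02"}, 1.0, 0.5))  # size 2 -> factor 3
```

Any measure of this shape reproduces the slide's preference for (493.01) over (493.01, 493.02), since the loss grows with the number of items merged into one generalized item.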

Policy-based anonymization algorithms
Goal: anonymize medical records so that privacy is guaranteed, utility is high (many GWAS are supported simultaneously), and the incurred information loss is minimal.
This is a challenging optimization problem: it is NP-hard, and feasibility depends on the constraints.
Heuristic algorithms:
- Utility-Guided Anonymization of Clinical Profiles (UGACLIP): better efficiency and scalability
- Clustering-Based Anonymization (CBA): better utility

Anonymization algorithms: UGACLIP
Sketch of UGACLIP (PNAS):
Input: EMRs, Privacy Policy, Utility Policy, k
Output: Anonymized EMRs
While the Privacy Policy is not satisfied:
  Select the privacy constraint p that corresponds to the most patients
  While p is not protected:
    Select the ICD code i in p that corresponds to the fewest patients
    If i can be anonymized according to the Utility Policy: generalize i to (i, i')
    Else: suppress each unprotected ICD code in p
UGACLIP considers privacy constraints in a certain order and protects a privacy constraint by set-based anonymization: generalization when the Utility Policy is satisfied, suppression otherwise.

Anonymization algorithms: UGACLIP example
Privacy Policy: {296.00, 296.01, 296.02}; Utility Policy: (296.00, 296.01); k = 2.
EMR data:
ICD                   DNA
296.00 296.01 296.02  CT A
295.00 295.01 295.02  AC T
296.00 296.02         GC C
Anonymized EMR data (UGACLIP):
ICD                      DNA
(296.00, 296.01) 296.02  CT A
295.00 295.01 295.02     AC T
(296.00, 296.01) 296.02  GC C
The data is protected: {296.00, 296.01, 296.02} appears 2 times. The data remains useful for GWAS on bipolar disorder: associations between (296.00, 296.01) and the DNA region CT A are preserved.

Anonymization algorithms: CBA
Sketch of CBA:
1. Retrieve the ICD codes that need less protection from the Privacy Policy.
2. Gradually build clusters of codes that can be anonymized according to the Utility Policy and with minimal UL: start from singleton clusters and merge clusters (driven by UL) until the privacy requirements are met.
3. If some ICD codes remain unprotected, suppress no more ICD codes than required to protect privacy.
Example privacy policy: p1 = {i1}, p2 = {i5, i6}, p3 = {i3, i4}, with k = 3.

Case Study: EMRs from Vanderbilt University Medical Center
Datasets:
- VNEC: 2,762 de-identified EMRs from Vanderbilt, involved in a GWAS
- VNECkc: a subset of VNEC for which we know which diseases are controls for others
- BIOVU: all 79,087 de-identified EMRs from Vanderbilt's biobank (the largest dataset in the medical data privacy literature)*
Methods: UGACLIP and CBA, compared against ACLIP (a state-of-the-art method that does not take the utility policy into account).
* Loukides, Gkoulalas-Divanis. Utility-aware anonymization of diagnosis codes. IEEE TITB, 2011.

UGACLIP & CBA: the first algorithms to offer data utility in GWAS

Setting: k = 5, protecting single visits of patients, 18 GWAS-related diseases (all GWAS reported in Manolio et al.*), no utility constraints. Best competitor: ACLIP.

Results:
- The output of ACLIP is useless for validating GWAS
- UGACLIP preserves 11 out of 18 GWAS
- CBA preserves 14 out of 18 GWAS simultaneously

* Manolio et al. A HapMap harvest of insights into the genetics of common disease. J. Clin. Invest. 2008.

Utility beyond GWAS

Supporting clinical case counts in addition to GWAS: learn the number of patients diagnosed with sets of codes that appear in 10% of the records — useful for epidemiology and data mining applications.

Results on VNECkc: queries can be estimated accurately (ARE < 1.25), comparable to ACLIP. The anonymized data can support both GWAS and studies on clinical case counts.
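The ARE figure reported above is simply the mean relative error of COUNT queries answered on the anonymized data versus the original. A minimal sketch of the metric follows; the uniform-distribution estimation step normally used for generalized items is omitted, so here the "estimate" is a plain count on the anonymized records.

```python
def count_query(records, codes):
    """COUNT of records that contain all the given codes."""
    return sum(1 for r in records if set(codes) <= r)

def average_relative_error(queries, original, anonymized):
    """Mean of |actual - estimate| / actual over a query workload;
    queries with zero actual count are skipped (a common convention)."""
    errors = []
    for q in queries:
        actual = count_query(original, q)
        if actual == 0:
            continue
        estimate = count_query(anonymized, q)
        errors.append(abs(actual - estimate) / actual)
    return sum(errors) / len(errors)
```

An ARE of 0 means every query is answered exactly; an ARE below 1 means the estimates are, on average, off by less than the true count itself.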

Anonymizing BIOVU (79K EMRs)

Supporting clinical case counts in BIOVU: very low error in query answering (Average Relative Error < 1). All EMRs in the VUMC biobank can be anonymized and remain useful.

Rule-based anonymization

Our approach*: we use PS-rules I → J to express protection requirements against both identity and sensitive information disclosure:
- I consists of public items: at least k records must support I (preventing identity disclosure)
- J consists of sensitive items: at most c x 100% of the records that support I may also support J (preventing sensitive information disclosure)

Contributions:
- A rule-based privacy model that is more flexible and general than existing models, intuitive, and able to capture real-world privacy requirements
- Three anonymization algorithms that are effective (better data utility and protection than the state of the art) and efficient (efficient rule-checking strategies, sampling, etc.)

* G. Loukides, A. Gkoulalas-Divanis, J. Shao. Anonymizing transaction data to eliminate sensitive inferences. DEXA 2010 (extended to KAIS).
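The two conditions above translate directly into code. A minimal sketch of the PS-rule check, with illustrative names:

```python
def sup(itemset, records):
    """Number of records that support (contain) the itemset."""
    return sum(1 for r in records if itemset <= r)

def ps_rule_protected(I, J, records, k, c):
    """A PS-rule I -> J is protected when at least k records support the
    public itemset I, and at most c*100% of those records also support
    the sensitive itemset J."""
    s = sup(I, records)
    if s == 0:
        return True   # I never occurs, so it cannot be used for linkage
    return s >= k and sup(I | J, records) / s <= c
```

For example, with k = 2 and c = 0.5, a rule {a} → {j} is protected if at least two records contain a and at most half of them also contain j.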

Rule-based anonymization: an example of PS-rule based anonymization

Original dataset:
  Name   Diagnosis codes
  Mary   a b c d g h i j
  Bob    e f h i
  Tom    d g j
  Anne   e f g h
  Brad   a b d e i
  Jim    c f j

PS-rules (specified over a generalization hierarchy): a → j, cd → g, d → hi

2^2-anonymous dataset:
  Mary   a (b,c) d g h i j
  Bob    e f h i
  Tom    d g j
  Anne   e f g h
  Brad   a (b,c) d e i
  Jim    (b,c) f j

Anonymous dataset based on the PS-rules model:
  Mary   (a,b,c) (d,e,f) g h i j
  Bob    (d,e,f) h i
  Tom    (d,e,f) g j
  Anne   (d,e,f) g h
  Brad   (a,b,c) (d,e,f) i
  Jim    (a,b,c) (d,e,f) j

- PS-rules can be automatically discovered and specified
- Support of fine-grained, flexible privacy requirements
- Privacy protection from both identity and sensitive information disclosure
- General privacy model that incorporates existing privacy models

* Grigorios Loukides, Aris Gkoulalas-Divanis, Jianhua Shao. Efficient and flexible anonymization of transaction data. Knowledge and Information Systems (KAIS), 36(1), pp. 153-210, 2013.

Experimental results: Setup

Datasets: BMS1 and BMS2 contain click-stream data; POS contains sales transaction data.

Evaluation questions: Is the anonymized data useful in aggregate query answering? How efficient are the algorithms?

Methods: Tree-based and Sample-based (ours) vs. Baseline (no pruning) and Apriori Anonymization*.

* M. Terrovitis, N. Mamoulis, P. Kalnis. Privacy-preserving anonymization of set-valued data. PVLDB 2008.

Data utility: uniform privacy requirements

Setting: BMS2 dataset, k = 5, p = 2 (all 2-itemsets need protection from identity disclosure).

Our algorithms offer the same protection as Apriori against identity disclosure while additionally thwarting sensitive information disclosure, and they answer queries many times more accurately.

Data utility: detailed privacy requirements

Setting: BMS2 dataset, k = 5, p = 2; type 2-1 rules of varying number, plus rules of other types.

Apriori cannot take the detailed privacy requirements into account and over-distorts the data. Our algorithms distort the data no more than necessary to satisfy these requirements, achieving much higher data utility.

Efficiency

Setting: BMS2 dataset, k = 5, p = 2, 5K type 2-1 rules; synthetic data of varying D and I, k = 5, p = 2, 5K rules of type 2-1.

Sample-based is the fastest and most scalable; Apriori is the slowest.

Anonymization of RT-datasets (e.g., demographics + diagnosis codes)*

Privacy threat: attackers know some relational attribute values (e.g., demographics) plus some sensitive items (e.g., diagnoses) for an individual.

* G. Poulis, G. Loukides, A. Gkoulalas-Divanis, S. Skiadopoulos. Anonymizing data with relational and transactional attributes. PKDD 2013.
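The threat can be made concrete with a toy example (all data below is hypothetical): demographics alone leave several candidate records, but intersecting them with a single known diagnosis code can isolate the target, which is why both attribute types must be anonymized together.

```python
# Toy RT-dataset: relational attributes (age, zip) + a set of diagnosis codes.
records = [
    {"age": 34, "zip": "30309", "codes": {"296.00", "401.9"}},
    {"age": 34, "zip": "30309", "codes": {"295.01"}},
    {"age": 52, "zip": "30310", "codes": {"296.00"}},
]

def link(records, known_demo, known_code):
    """Records matching both the known demographics and the known code."""
    return [r for r in records
            if all(r[k] == v for k, v in known_demo.items())
            and known_code in r["codes"]]

# Demographics alone match two records; adding one diagnosis code
# narrows the match to a single record: re-identification succeeds.
matches = link(records, {"age": 34, "zip": "30309"}, "296.00")
```

Anonymizing only the relational part (or only the transactional part) leaves the other side available for this kind of joint linkage.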

SECRETA anonymization tool

SECRETA*, **: a System for Evaluating and Comparing RElational and Transaction Anonymization algorithms.
- Evaluates algorithms for relational, transaction, and RT-dataset anonymization
- Integrates 9 popular anonymization algorithms and 3 bounding methods for combining them
  - Relational: Incognito, Cluster, Top-down, and Full subtree bottom-up
  - Transaction: COAT, PCTA, Apriori, LRA, and VPA
- Supports two modes of operation: Evaluation and Comparison

* G. Poulis, A. Gkoulalas-Divanis, G. Loukides, C. Tryfonopoulos, S. Skiadopoulos. SECRETA: A system for evaluating and comparing relational and transaction anonymization algorithms. EDBT 2014.
** G. Poulis, G. Loukides, A. Gkoulalas-Divanis, S. Skiadopoulos. Anonymizing data with relational and transactional attributes. PKDD 2013.

Summary

- Explained the need for privacy in medical data sharing
- Presented the state of the art in privacy-preserving medical data publishing to support intended analyses
- Elaborated on the policy-based anonymization model, which allows data publishers to specify detailed privacy and utility constraints for the data anonymization process
- Discussed methods for anonymizing data of co-existing data types, and introduced the SECRETA anonymization tool

Thank you! Questions?

Internship Opportunities @ IBM

- 10-15 paid internships per year @ DRL (out of ~100 applications); ~5 unpaid internships
- Internship duration: 3-4 months; start date: flexible
- Positions are advertised in December and filled as soon as possible
- Each candidate identifies an RSM @ DRL to work with and a project that is of mutual interest to the applicant and to IBM
- The candidate submits a 1-2 page document on the project that he/she will be involved in if accepted
- At the end of the internship, the candidate gives a talk to the lab on his/her accomplishments during the internship

Need more information? Email me at: arisdiva@ie.ibm.com