Data Privacy Aspects in Big Data Facilitating Medical Data Sharing. Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland


1 Data Privacy Aspects in Big Data Facilitating Medical Data Sharing Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland September 2015

2 Content Introduction / Motivation Focus Application I: Privacy Preserving Medical Data Publishing to Support Intended Analyses Focus Application II: Identification of Privacy Vulnerabilities Summary 2

3 Data sharing. Individuals' data are increasingly shared: Netflix published movie ratings of 500K subscribers; AOL published 20M search query terms of 658K web users; TomTom sold customers' location (GPS) data to the Dutch police; the eMERGE consortium published patient data related to genome-wide association studies to biorepositories (dbGaP); Orange provided call information about its mobile subscribers, as part of the D4D challenge on mobile phone data. Benefits of data sharing: personalization (e.g., Netflix's data mining contest aimed to improve movie recommendations based on personal preferences); marketing (e.g., Tesco made 53M last year from selling shopping patterns to retailers and manufacturers, such as Nestle and Unilever); social benefits (e.g., promoting medical research studies, improving traffic management, etc.)

4 Data sharing must guarantee privacy and accommodate utility. A popular data sharing scenario (data publishing): data owners → data publisher (trusted) → data recipient (untrusted); the original data is transformed into the released data. Threats to data privacy: identity disclosure, sensitive information disclosure, membership disclosure, inferential disclosure. Data utility requirements: minimal data distortion (general-purpose use); support of specific applications/workloads (e.g., building accurate predictive models, GWAS, etc.)

5 Focus Application I: Privacy Preserving Medical Data Publishing. How can we share medical data in a way that protects patients' privacy while supporting research studies?

6 Electronic Medical Records (EMR) Relational data Registration and demographic data Transaction (set-valued) data Billing information ICD codes* are represented as numbers (up to 5 digits) and denote signs, findings, and causes of injury or disease** Sequential data DNA Text data Clinical notes Electronic Medical Records (EMR) Name YOB ICD DNA Clinical notes Jim , 185 C T (doc1) Mary , A G (doc2) Mary C G (doc3) Carol C G (doc4) Anne , G C (doc5) Anne A T (doc6) * International Statistical Classification of Diseases and Related Health Problems ** Centers for Medicare & Medicaid Services - 6

7 EMR data use in analytics: statistical analysis (e.g., correlation between YOB and ICD code 185, malignant neoplasm of prostate); querying; clustering (e.g., to control epidemics*); classification (e.g., to predict domestic violence**); association rule mining (e.g., to formulate a government policy on hypertension management***), such as IF age in [43,48] AND smoke=yes AND exercise=no AND drink=yes THEN hypertension=yes (sup=2.9%; conf=26%). * Tildesley et al. Impact of spatial clustering on disease transmission and optimal control. PNAS. ** Reis et al. Longitudinal Histories as Predictors of Future Diagnoses of Domestic Abuse: Modelling Study. BMJ: British Medical Journal, 2011. *** Chae et al. Data mining approach to policy analysis in a health insurance domain. Int. J. of Med. Inf.

8 Need for privacy. Why do we need privacy in medical data sharing? If privacy is breached, there are consequences to patients: emotional and economic embarrassment; 62% of individuals worry their EMRs will not remain confidential*; 35% expressed privacy concerns regarding the publishing of their data to dbGaP**. Patients may opt out or provide fake data → difficulty to conduct statistically powered studies. * Health Confidence Survey 2008, Employee Benefit Research Institute. ** Ludman et al. Glad You Asked: Participants' Opinions of Re-Consent for dbGaP Data Submission. Journal of Empirical Research on Human Research Ethics.

9 Need for privacy. If privacy is breached, there are consequences to organizations: legal → HIPAA, EU legislation (95/46/EC, 2002/58/EC, 2009/136/EC, etc.); financial → the average cost of a single data breach is $5.85M in the US and $4.74M in Germany; these countries have the highest per capita cost. Healthcare and Education are the most heavily regulated industries.* * Ponemon Institute Research Report, 2014 Cost of Data Breach Study: Global Analysis.

10 Protecting data privacy: data masking / removal of identifiers. Removing/masking direct identifiers: data owners → data publisher (trusted) → data recipient (untrusted); original data → de-identified data. 1. Locate the direct identifiers (attributes that uniquely identify an individual), such as SSN, patient ID, phone number, etc. 2. Remove or mask them from the data prior to data publishing. Example: Name: John Doe, search query terms: Harry Potter, The King's Speech; Name: Thelma Arnold, search query terms: hand tremors, bipolar, dry mouth, effect of nicotine on the body.
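The masking step described above can be sketched in a few lines. This is an illustrative sketch only; the field names and the set of direct identifiers are assumptions, not part of the slides.

```python
def de_identify(records, direct_identifiers):
    """Drop direct-identifier fields (e.g., SSN, name) from each record
    before publishing; the remaining attributes are released as-is."""
    return [{k: v for k, v in r.items() if k not in direct_identifiers}
            for r in records]

# Hypothetical record; field names are illustrative.
records = [{"Name": "John Doe", "SSN": "123-45-6789", "Age": 20, "Sex": "M"}]
published = de_identify(records, {"Name", "SSN", "PatientID", "Phone"})
```

As the next slide stresses, this step alone is not sufficient: the retained attributes may still act as quasi-identifiers.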

11 Protecting data privacy: data masking / removal of identifiers Masking / removal of direct identifiers is not sufficient! data owners data publisher (trusted) data recipient (untrusted) Original data Released data Main types of threats to data privacy External data Background Knowledge Identity disclosure Sensitive information disclosure Inferential disclosure 11

12 Privacy Threats: Identity disclosure. Identity disclosure in relational data (e.g., patients' demographics): individuals are linked to their published records based on quasi-identifiers (attributes that in combination can identify an individual). De-identified data: (20, NW10, M), (45, NW15, M), (22, NW30, M), (50, NW25, F). External data: Greg (20, NW10, M), Jim (45, NW15, M), Jack (22, NW30, M), Anne (50, NW25, F). 87% of US citizens can be identified by 5-digit ZIP code, gender, and date of birth.* * Sweeney. k-anonymity: a model for protecting privacy. IJUFKS.
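The linkage attack described above is essentially a join on quasi-identifier values between an identified external dataset and the de-identified release. A minimal sketch (function and field names are illustrative assumptions):

```python
def link(external, deidentified, qids):
    """Match each identified external record to de-identified rows that
    share its quasi-identifier values; a unique match discloses identity."""
    matches = {}
    for person in external:
        key = tuple(person[a] for a in qids)
        hits = [r for r in deidentified
                if tuple(r[a] for a in qids) == key]
        if len(hits) == 1:  # unique match => identity disclosed
            matches[person["Name"]] = hits[0]
    return matches

# Toy data mirroring the slide's example.
external = [{"Name": "Greg", "Age": 20, "Postcode": "NW10", "Sex": "M"}]
deid = [{"Age": 20, "Postcode": "NW10", "Sex": "M"},
        {"Age": 45, "Postcode": "NW15", "Sex": "M"}]
```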

13 Identity disclosure in sharing patients' diagnosis codes. Identity disclosure in transaction data (e.g., diagnosis codes): knowing that Mary is diagnosed with benign essential hypertension (ICD code 401.1), an attacker can determine that the second released record belongs to her → all her diagnosis codes are disclosed. Disclosure based on diagnosis codes* is a general problem for other medical terminologies as well (e.g., ICD-10, used in the EU) → sharing data susceptible to this attack may violate legislation. * Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.

14 Real-world identity disclosure cases involving medical data: Group Insurance Commission data linked with the voter list of Cambridge, MA → William Weld, former Governor of MA, re-identified; Chicago homicide database linked with the Social Security Death Index → 35% of murder victims re-identified; adverse drug reaction database linked with public obituaries → a 26-year-old girl who died from a drug re-identified.

15 Issuing attacks on medical datasets. Two-step attack using publicly available voter registration lists and hospital discharge summaries: the attacker links voter(name, ..., zip, dob, sex) with summary(zip, dob, sex, diagnoses), and then with the released data release(diagnoses, DNA). 87% of US citizens can be identified by {dob, gender, ZIP code}.* * Sweeney. k-anonymity: a model for protecting privacy. IJUFKS.

16 Issuing attacks on medical datasets. One-step attack using EMRs (insider's attack)*: an insider with access to the identified EMR(name, ..., diagnoses) links it directly to the released data release(..., diagnoses, DNA). * Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.

17 Case study: Evaluating the effectiveness of the insider's attack. Population: 1.2M de-identified EMRs (ID, ..., diagnoses) from Vanderbilt (VUMC), where each ID is a unique random number. Sample: VNEC, a de-identified/masked EMR sample of 2762 records (patients) derived from the population. Patients from VNEC were involved in a study (GWAS) for the Native Electrical Conduction of the heart; their EMRs were to be deposited into dbGaP, where the data would be made available to support other studies (GWAS).

18 Case study: Evaluating the effectiveness of the insider's attack. Linking Vanderbilt's EMR population to the VNEC dataset on ICD codes: assuming that all ICD codes are used to issue an attack, 96.5% of the patients are susceptible to identity disclosure. (Figure: % of re-identified sample vs. distinguishability on a log scale, i.e., the number of times a set of ICD codes appears in the population, equivalent to support in the data mining literature.)

19 Case study: Evaluating the effectiveness of the insider's attack. Linking on random subsets of ICD codes (combinations of 1, 2, 3, or 10 ICD codes): knowing a random combination of just 2 ICD codes can lead to unique re-identification. (Figure: % of re-identifiable sample vs. distinguishability on a log scale, i.e., the number of times a set of ICD codes appears in the population, equivalent to support count in the data mining literature.)

20 Privacy Threats: Sensitive information disclosure. Individuals are associated with sensitive information. Example (relational data): all records in the de-identified data matching Greg's background knowledge (Age 20, Postcode NW10) have Disease = HIV, so Greg's disease is disclosed. Example (transaction data): Mary is diagnosed with ... and 401.1 → she has Schizophrenia. Sensitive information disclosure can occur without identity disclosure.

21 Privacy Threats: Inferential disclosure. Sensitive knowledge patterns are exposed by data mining, e.g., "75% of patients visit the same physician more than 4 times" (→ unsolicited advertisement) or "60% of white males over 50 suffer from diabetes" (→ customer discrimination), mined from stream data collected by health monitoring systems, electronic medical records, and drug orders & costs. Business rivals can harm data publishers, and insurance, pharmaceutical & marketing companies can harm data owners.* * G. Das and N. Zhang. Privacy risks in health databases from aggregate disclosure. PETRA.

22 Anonymization of demographics. k-anonymity principle*: each record in a relational table T should have the same value over the quasi-identifiers as at least k-1 other records in T; these records collectively form a k-anonymous group. k-anonymity protects from identity disclosure: it protects data from linkage to external sources (triangulation attacks), since the probability that an individual is correctly associated with their record is at most 1/k. Example (2-anonymous data): (4*, NW1*, M), (4*, NW1*, M), (*, NW*, *), (*, NW*, *). * Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. IJUFKS.
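Checking whether a table satisfies k-anonymity is a simple group-by over the quasi-identifiers. A minimal sketch (attribute names follow the slide's example; the function name is an assumption):

```python
from collections import Counter

def is_k_anonymous(table, qids, k):
    """True iff every combination of quasi-identifier values is shared
    by at least k records (i.e., every QID group has size >= k)."""
    groups = Counter(tuple(row[a] for a in qids) for row in table)
    return all(count >= k for count in groups.values())

# The 2-anonymous generalized table from the slide.
released = [
    {"Age": "4*", "Postcode": "NW1*", "Sex": "M"},
    {"Age": "4*", "Postcode": "NW1*", "Sex": "M"},
    {"Age": "*",  "Postcode": "NW*",  "Sex": "*"},
    {"Age": "*",  "Postcode": "NW*",  "Sex": "*"},
]
```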

23 Attack on k-anonymous data. Homogeneity attack*: if all sensitive values in a k-anonymous group are the same → sensitive information disclosure. Example: Greg (Age 40, Postcode NW10) falls in the 2-anonymous group {(4*, NW1*, HIV), (4*, NW1*, HIV)}, so the attacker is confident that Greg suffers from HIV. * Machanavajjhala et al. l-diversity: Privacy Beyond k-anonymity. ICDE.

24 l-diversity principle for demographics. l-diversity*: a relational table is l-diverse if all groups of records with the same values over the quasi-identifiers (QID groups) contain no less than l well-represented values for the SA. Distinct l-diversity: "l well-represented" → l distinct values. Example: a QID group (4*, NW1*) with diseases {HIV, HIV, HIV, HIV, Flu, Cancer} has three distinct values, but the probability of HIV being disclosed is ~0.67 (4 out of 6). * Machanavajjhala et al. l-diversity: Privacy Beyond k-anonymity. ICDE.
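Distinct l-diversity can be checked by collecting the set of sensitive values per QID group. A minimal sketch (function name is an assumption), reproducing the slide's group of six records:

```python
from collections import defaultdict

def is_distinct_l_diverse(table, qids, sensitive_attr, l):
    """True iff every QID group contains at least l distinct values
    of the sensitive attribute (distinct l-diversity)."""
    groups = defaultdict(set)
    for row in table:
        groups[tuple(row[a] for a in qids)].add(row[sensitive_attr])
    return all(len(values) >= l for values in groups.values())

# One QID group with 4x HIV, 1x Flu, 1x Cancer (as on the slide).
table = [{"Age": "4*", "Postcode": "NW1*", "Disease": d}
         for d in ["HIV", "HIV", "HIV", "HIV", "Flu", "Cancer"]]
```

Note the slide's caveat: the table is distinct 3-diverse, yet HIV can still be inferred with ~0.67 probability, which is why stronger notions (entropy or recursive l-diversity) exist.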

25 Differential privacy. Objective: prevent attackers from inferring any additional information about an individual, regardless of whether the published dataset contains the individual's record or not. ε-differential privacy* is satisfied by a randomized algorithm A if Pr[A(D) = D̃] ≤ exp(ε) · Pr[A(D') = D̃] for all datasets D, D' that differ in one record, and for any possible anonymized dataset D̃, where ε is a constant and the probabilities are over the randomness of A.** Essentially, differential privacy ensures that the outcome of a calculation is insensitive to any one particular record in the dataset. * Dwork. Differential privacy. ICALP. ** Definition from Mohammed et al. Differentially private data release for data mining. KDD.

26 Offering differential privacy. Laplace mechanism: for any function f : D → R^d, the algorithm A that adds independently generated noise with distribution Lap(Δf / ε) to each of its d outputs satisfies ε-differential privacy, where Δf = max_{D,D'} ||f(D) - f(D')||_1 over all datasets D, D' that differ in exactly one record. Example: let f return the number of patients with age < 40 in the table {(20, M), (23, F), (25, M), (42, F)}. The original value is 3 and Δf = 1 (sensitivity), so the released value is 3 + Lap(1/ε). Exponential mechanism: in tasks where adding noise makes no sense (e.g., training a classifier), differential privacy is offered by randomizing the selection of an outcome from a set of possible outcomes. For any function u : (D, t) → R that measures the utility of an output t, an algorithm A that chooses t with probability proportional to exp(ε · u(D,t) / (2 · Δu)) satisfies ε-differential privacy, where Δu = max_{t,D,D'} |u(D,t) - u(D',t)|. For example, the exponential mechanism can be used to select Age or Gender given a function u that scores attributes according to the perceived utility loss. * Dwork. Differential privacy. ICALP. ** Definition from Mohammed et al. Differentially private data release for data mining. KDD.
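The Laplace mechanism for the slide's count query can be sketched directly: a counting query has sensitivity Δf = 1, so adding Lap(1/ε) noise suffices. This is an illustrative sketch (function names are assumptions; noise is sampled via the standard inverse-CDF transform):

```python
import math
import random

def laplace_noise(scale):
    """Sample from Lap(0, scale) using the inverse-CDF transform."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(data, predicate, epsilon):
    """Noisy count query: counting has sensitivity 1, so adding
    Lap(1/epsilon) noise satisfies epsilon-differential privacy."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [20, 23, 25, 42]  # the slide's toy table; true answer is 3
```

Smaller ε means stronger privacy and noisier answers; as ε grows, the released value converges to the true count of 3.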

27 ε-differential privacy. Pros (+): no/few assumptions on adversarial knowledge; composability [1] (privacy holds even when multiple differentially-private datasets are obtained by an adversary); several mechanisms exist for the interactive [2] and the non-interactive scenario [3,4]. Cons (-): overly restrictive → differential privacy leads to very high information loss (several variations [5] and improved mechanisms [6] exist, but the utility remains very low); no real-world applications, and no privacy criterion is associated with the value of ε. Important drawbacks due to the random noise addition: only noisy answers to a limited number and type of queries can be offered, or noisy summary statistics (such as histograms) can be released; anonymous datasets cannot be created/released; individuals may become associated with false information in the output of differential privacy methods; the utility loss is very high compared to syntactic approaches. There are also several misconceptions [7] and susceptibilities to attacks [8] (e.g., Cormode showed that an attacker can infer the sensitive value of an individual fairly accurately by applying Naïve Bayes classification on differentially private data). [1] Ganta et al. Composition attacks and auxiliary information in data privacy. KDD. [2] Dwork. Differential privacy: a survey of results. TAMC. [3] Mohammed et al. Differentially private data release for data mining. KDD. [4] Xiao et al. Differential privacy via wavelet transforms. ICDE. [5] Machanavajjhala et al. Data Publishing against Realistic Adversaries. PVLDB. [6] Ding et al. Differentially private data cubes: optimizing noise sources and consistency. SIGMOD. [7] Kifer et al. No free lunch in data privacy. SIGMOD. [8] Cormode. Personal privacy vs. population privacy: learning to attack anonymization. KDD.

28 Partition-based algorithms for k-anonymity. Main idea: a record projected over the QIDs is treated as a multidimensional point; a subspace (hyper-rectangle) that contains at least k points can form a k-anonymous group → multidimensional global recoding. Design questions: how to partition the space (one attribute at a time: which one to use?) and how to split the selected attribute. Example table: (20, M, HIV), (23, F, HIV), (25, M, Obesity), (27, F, HIV), (28, F, Cancer), (29, F, Obesity).

29 Example of applying Mondrian (k=3). Original data: (20, M, HIV), (23, F, HIV), (25, M, Obesity), (27, F, HIV), (28, F, Cancer), (29, F, Obesity). Anonymized data: ([20-26], {M,F}, HIV), ([20-26], {M,F}, HIV), ([20-26], {M,F}, Obesity), ([27-29], F, HIV), ([27-29], F, Cancer), ([27-29], F, Obesity).
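The partitioning idea above can be sketched as a recursive median split. This is a simplified, single-attribute variant of Mondrian for illustration only (the real algorithm picks the widest dimension at each step and uses hierarchy-aware ranges; here leaves are generalized to the observed [min, max], so ranges may be tighter than on the slide):

```python
def mondrian(records, attr, k):
    """Simplified single-attribute Mondrian: recursively split at the
    median while both halves keep >= k records; then generalize the
    attribute in each leaf partition to its (min, max) range."""
    records = sorted(records, key=lambda r: r[attr])

    def partition(rows):
        mid = len(rows) // 2
        if mid >= k and len(rows) - mid >= k:  # split keeps k-anonymity
            return partition(rows[:mid]) + partition(rows[mid:])
        lo, hi = rows[0][attr], rows[-1][attr]
        return [dict(r, **{attr: (lo, hi)}) for r in rows]

    return partition(records)

# The slide's six records, k=3.
data = [{"Age": a, "Sex": s, "Disease": d} for a, s, d in
        [(20, "M", "HIV"), (23, "F", "HIV"), (25, "M", "Obesity"),
         (27, "F", "HIV"), (28, "F", "Cancer"), (29, "F", "Obesity")]]
```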

30 Clustering-based anonymization algorithms. Main idea: 1. Create clusters containing at least k records with similar values over the QIDs (seed selection, similarity measurement, stopping criterion). 2. Anonymize the records in each cluster separately, using local recoding and/or suppression.

31 Clustering-based anonymization algorithms. Clusters need to be separated → seed selection (furthest-first, random). Clusters need to contain similar values → similarity measurement. Clusters should not be too large → stopping criterion (size-based, quality-based). All these heuristics attempt to improve data utility.

32 Example of bottom-up clustering algorithm (k=2). Original data: (20, M, HIV), (23, F, HIV), (25, M, Obesity), (27, F, HIV), (28, F, Cancer), (29, F, Obesity). Anonymized data: ([20-25], M, HIV), ([20-25], M, Obesity), ([23-27], F, HIV), ([23-27], F, HIV), ([28-29], F, Cancer), ([28-29], F, Obesity).

33 Example of top-down clustering algorithm (k=2). Original data: (20, M, HIV), (23, F, HIV), (25, M, Obesity), (27, F, HIV), (28, F, Cancer), (29, F, Obesity). Anonymized data: ([20-25], {M,F}, HIV), ([20-25], {M,F}, HIV), ([20-25], {M,F}, Obesity), ([27-29], F, HIV), ([27-29], F, Cancer), ([27-29], F, Obesity).
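The clustering idea behind these examples can be sketched with a greedy, size-based merge on a single numeric QID. This is a deliberately simplified sketch (not the exact algorithms on the slides): it merges each undersized cluster with the adjacent cluster that increases its span the least, then recodes each cluster to its [min, max] range. It assumes at least k values in total.

```python
def cluster_anonymize(values, k):
    """Greedy bottom-up clustering on one numeric QID: merge undersized
    clusters with the nearest neighbor until every cluster has >= k
    values, then report each cluster as (min, max, size)."""
    clusters = [[v] for v in sorted(values)]
    while any(len(c) < k for c in clusters):
        i = next(i for i, c in enumerate(clusters) if len(c) < k)
        if i == 0:
            j = 1
        elif i == len(clusters) - 1:
            j = i - 1
        else:
            # merge toward whichever neighbor widens the range less
            left_gap = clusters[i][0] - clusters[i - 1][0]
            right_gap = clusters[i + 1][-1] - clusters[i][-1]
            j = i - 1 if left_gap <= right_gap else i + 1
        a, b = min(i, j), max(i, j)
        clusters[a:b + 1] = [clusters[a] + clusters[b]]
    return [(c[0], c[-1], len(c)) for c in clusters]
```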

34 Preventing identity disclosure from diagnosis codes: Suppression. Suppression removes items or records from the data prior to releasing it, e.g., suppress ICD codes* appearing in less than a certain percentage of patient records. Intuition: such ICD codes can act as quasi-identifiers. * Vinterbo et al. Hiding information by cell suppression. AMIA Annual Symposium 2001.
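The suppression rule above (drop codes below a frequency threshold) can be sketched as follows; the function name and threshold are illustrative assumptions:

```python
from collections import Counter

def suppress_rare_codes(transactions, min_fraction):
    """Remove ICD codes appearing in fewer than min_fraction of the
    records; such rare codes can act as quasi-identifiers."""
    n = len(transactions)
    freq = Counter(code for t in transactions for code in set(t))
    keep = {c for c, f in freq.items() if f / n >= min_fraction}
    return [[c for c in t if c in keep] for t in transactions]

# Toy transactions; 250.0 appears in 1 of 4 records and is suppressed.
tx = [["401.1", "296.0"], ["401.1"], ["401.1", "250.0"], ["401.1", "296.0"]]
```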

35 Code suppression: a case study using Vanderbilt's EMR data. We had to suppress diagnosis codes appearing in less than 25% of the records in VNEC to prevent re-identification; doing so, we were left with only 5 out of ~6000 ICD codes!* At the 5-digit level, the surviving codes are benign essential hypertension, other malaise and fatigue, pain in limb, abdominal pain, and chest pain; at the 3-digit level these correspond to 401 (essential hypertension), 780, 729 (other disorders of soft tissues), 789 (other abdomen/pelvis symptoms), and 786; at the section level, to hypertensive disease, rheumatism excluding the back, and symptoms. * Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS 2010.

36 Preventing identity disclosure in EMR data: Generalization. Generalization replaces items with more general ones, usually with the help of a domain hierarchy (5-digit ICD codes → 3-digit ICD codes → sections → chapters → Any). Example: generalize ICD codes to their 3-digit representation, e.g., 401.1 (benign essential hypertension) → 401 (essential hypertension).
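The hierarchy climb from 5-digit to 3-digit codes is a simple string truncation in ICD-9's dotted notation (`CCC.xx` → `CCC`). A minimal sketch, with illustrative function names:

```python
def generalize_icd9(code):
    """Map a 5-digit ICD-9 code (e.g., '401.1') to its 3-digit
    category ('401') by dropping everything after the dot."""
    return code.split(".")[0]

def generalize_records(transactions):
    """Apply the 3-digit generalization to every transaction,
    deduplicating codes that collapse to the same category."""
    return [sorted({generalize_icd9(c) for c in t}) for t in transactions]
```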

37 Code generalization: a case study using Vanderbilt's EMR data. Generalizing ICD codes from VNEC* to their 3-digit representation, even combined with suppression (no suppression, 5%, 15%, and 25% levels tested): 95% of the patients remain re-identifiable. (Figure: % of re-identified sample vs. distinguishability on a log scale, for 5-digit and 3-digit ICD codes; 96.5% re-identifiable at the 5-digit level with no suppression.) * Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.

38 Complete k-anonymity & k^m-anonymity. Complete k-anonymity: knowing that an individual is associated with any itemset, an attacker should not be able to associate this individual with fewer than k transactions. k^m-anonymity: knowing that an individual is associated with any m-itemset, an attacker should not be able to associate this individual with fewer than k transactions. The slide illustrates both with an original transaction dataset and its 2-complete anonymous and 2^2-anonymous versions (in the latter, 5-digit codes are generalized to 401).
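A k^m-anonymity check enumerates every itemset of up to m codes that occurs in some transaction and verifies it is supported by at least k transactions. A brute-force sketch for small data (function name is an assumption; real algorithms avoid full enumeration):

```python
from collections import Counter
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    """True iff every itemset of size <= m occurring in any transaction
    is supported by at least k transactions (k^m-anonymity)."""
    support = Counter()
    for size in range(1, m + 1):
        for t in transactions:
            for combo in combinations(sorted(set(t)), size):
                support[combo] += 1
    return all(count >= k for count in support.values())

# Toy transactions: every 1- and 2-itemset has support >= 2.
tx = [["401.1", "250.0"], ["401.1", "250.0"],
      ["401.1", "296.0"], ["401.1", "296.0"]]
```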

39 Applicability of complete k-anonymity and k^m-anonymity to medical data. These models are limited in the specification of privacy requirements: they assume too powerful attackers (all m-itemsets, i.e., combinations of m diagnosis codes, need protection), whereas medical data publishers have detailed privacy requirements. They also explore only a small number of possible generalizations and do not take utility requirements into account. Example: if attackers know who is diagnosed with {a,b,c} or {d,e,f,g,h}, these models protect all 5-itemsets instead of just the 2 itemsets specified as privacy constraints.

40 Policy-based anonymization model. Policy-based anonymization for ICD codes*: a global anonymization model that captures both generalization and suppression. Each original ICD code is replaced by a unique set of ICD codes, so there is no need for generalization hierarchies. A generalized ICD code such as (493.00, ...) is interpreted as either code, or both; a suppressed ICD code, denoted Φ(...), is not released. * Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS 2010.

41 Policy-based anonymization: Privacy model. Data publishers specify the diagnosis codes that need protection. Privacy model: knowing that an individual is associated with one or more specific itemsets (privacy constraints), an attacker should not be able to associate this individual with fewer than k transactions. Privacy policy: the set of all specified privacy constraints. Privacy is achieved when all privacy constraints are supported by at least k transactions in the published data, or do not appear at all. Example: the generalized code (401.2, 401.4) appears in at least 2 transactions of the anonymized data.
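The "supported by at least k transactions or not at all" condition can be verified directly. A minimal sketch of the check (not the anonymization algorithm itself; names are illustrative):

```python
def policy_satisfied(transactions, privacy_constraints, k):
    """True iff each privacy constraint (an itemset of codes) is
    supported by >= k transactions or does not appear at all."""
    for constraint in privacy_constraints:
        needed = set(constraint)
        support = sum(1 for t in transactions if needed <= set(t))
        if 0 < support < k:  # appears, but too rarely: vulnerable
            return False
    return True

# Toy data: the constraint {401.2, 401.4} is supported by 2 transactions.
tx = [["401.1", "401.2", "401.4"], ["401.1"], ["401.2", "401.4"], ["401.3"]]
policy = [["401.2", "401.4"]]
```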

42 Policy-based anonymization: Data utility considerations. Utility constraints: the published data must remain as useful as the original data for conducting a GWAS on a disease or trait → the number of cases and controls in a GWAS must be preserved. Supporting utility constraints: ICD codes from the utility policy are generalized together; a larger part of the solution space is searched than when using domain generalization hierarchies.

43 Policy-based anonymization algorithms. Goal: anonymize medical records so that privacy is guaranteed, utility is high (→ many GWAS are supported simultaneously), and the incurred information loss is minimal. This is a challenging optimization problem: it is NP-hard, and feasibility depends on the constraints. Heuristic algorithms: Utility-Guided Anonymization of Clinical Profiles (UGACLIP), which favors efficiency and scalability, and the Clustering-Based Anonymization (CBA) algorithm, which favors utility.

44 Anonymization algorithms: UGACLIP. Input: a privacy policy, a utility policy, k=2, and the EMR data. The UGACLIP algorithm produces anonymized EMR data in which the data is protected ({296.00, ...} appears 2 times) and remains useful for a GWAS on bipolar disorder, since the associations between (296.00, ...) and the DNA region CT...A are preserved.

45 Anonymization algorithms: CBA. Sketch of CBA: retrieve the ICD codes that need less protection from the privacy policy; gradually build a cluster of codes that can be anonymized according to the utility policy with minimal utility loss (UL); if the ICD codes are still not protected, suppress no more ICD codes than required to protect privacy, until the privacy requirements are met. Example: privacy constraints p1 = {i1}, p2 = {i5, i6} with k=3; singleton clusters are merged, driven by UL.

46 Case Study: EMRs from Vanderbilt University Medical Center. Datasets: VNEC (2762 de-identified EMRs from Vanderbilt involved in a GWAS); VNECkc (a subset of VNEC for which we know which diseases are controls for others); BIOVU (all 79087 de-identified EMRs from Vanderbilt's biobank, the largest dataset in the medical data privacy literature*). Methods: UGACLIP and CBA, compared against ACLIP (a state-of-the-art method that does not take the utility policy into account). * Loukides, Gkoulalas-Divanis. Utility-aware anonymization of diagnosis codes. IEEE TITB.

47 UGACLIP & CBA: First algorithms to offer data utility in GWAS. Setting: k = 5, protecting single visits of patients, 18 GWAS-related diseases (all GWAS reported in Manolio*), no utility constraints. Results: the output of ACLIP (the best competitor) is useless for validating GWAS, whereas UGACLIP preserves 11 out of 18 GWAS and CBA preserves 14 out of 18 GWAS simultaneously. * Manolio et al. A HapMap harvest of insights into the genetics of common disease. J. Clinic. Inv.

48 Utility beyond GWAS. Supporting clinical case counts in addition to GWAS: learn the number of patients with sets of codes appearing in 10% of the records, which is useful for epidemiology and data mining applications. On VNECkc, queries can be estimated accurately (Average Relative Error < 1.25), comparably to ACLIP. Anonymized data can thus support both GWAS and studies on clinical case counts.

49 Anonymizing BIOVU (79K EMRs). Supporting clinical case counts in BIOVU: very low error in query answering (Average Relative Error < 1). All EMRs in the VUMC biobank can be anonymized and remain useful.

50 Anonymization of RT-datasets (e.g., demographics + diagnosis codes). Privacy threat: attackers know some relational attribute values (e.g., demographics) plus some sensitive items (e.g., diagnoses) for an individual.* * G. Poulis, G. Loukides, A. Gkoulalas-Divanis, S. Skiadopoulos. Anonymizing data with relational and transactional attributes. PKDD.

51 Focus Application II: Identification of Privacy Vulnerabilities How can we analyze datasets to discover privacy vulnerabilities and select the best protection mechanism?

52 Privacy-preserving data publishing. Example dataset: direct identifiers (ID, Name, Address, SSN), quasi-identifiers (Birth, Gender, ZIP, Marital status), and sensitive/other info (A1, A2), with records for Maria, Jenny, Nick, Tom, John, Bob, Noeleen, Eleni, Dave, and Thomas.

53 Identification of Privacy Vulnerabilities (IPV) Privacy-preserving data publishing Goal: Automatically analyze a dataset to expose privacy vulnerabilities, which could lead to privacy attacks, and to validate its protection level Methods: Identify existing, publicly available datasets that could be used by adversaries to perform triangulation attacks Identify privacy risks in the data itself (e.g., outliers, uniqueness or rarity of certain records based on some attributes, frequent behaviour of an individual that is infrequent for many others, etc.) Outcome: A list of identified privacy vulnerabilities and configuration options for the data anonymization algorithm 53

54 Discover QIDs based on publicly available datasets. Considered publicly available datasets: (Name, Birth, ZIP, Disease) with John, Bob, Noeleen; (Name, Gender, ZIP, Salary) with Maria, Jenny, Nick; (Name, ZIP, Marital status) with Nick, Tom, John. Dataset to be published: (ID, Birth, Gender, ZIP, Marital status), records 0 (09/64, Female, Divorced), 1 (09/64, Female, Divorced), 2 (04/64, Male, Widow), 3 (04/64, Male, Married), 4 (03/63, Male, Married), 5 (03/63, Male, Married), 6 (09/64, Female, Married), 7 (09/61, Female, Married), 8 (05/61, Male, Single), 9 (05/61, Male, Single). Open questions: Have we found all the existing publicly available datasets that could be linked to our data? What if a dataset becomes available after the data has been published and can be linked to the published dataset, exposing the identity of individuals?

55 Discover privacy vulnerabilities in the data itself. Discovery of privacy vulnerabilities: What could lead to re-identification attacks in my data, assuming that sufficient background knowledge and/or public datasets are available to attackers? What is unique or rare in a record? Are there any outliers in my data? Quantification of privacy risk: How many vulnerabilities exist in my dataset? How powerful should an attacker be in order to re-identify the most vulnerable individual in the data? How many individuals can be re-identified?

56 Identification of Privacy Vulnerabilities. Our recent algorithms (all part of the IPV tool): MTUI (Multi-Threaded Uniques Identification) & FPVI (Fast algorithm for Privacy Vulnerabilities Identification) compute the quasi-identifiers of a dataset to protect the data from re-identification attacks, exposing all minimal combinations of attributes that lead to unique (or rare) records. MTRA (Multi-Threaded Risk Assessment) calculates the vulnerability index for each combination of attributes in a dataset, by reporting the cardinality of the smallest group of records that share the same values for each combination. MTS2 (Multi-Threaded Sample uniques identification algorithm) identifies the specific records that are unique, along with the combination of attributes for which they are unique (or rare) in the dataset. MTLVI (Multi-Threaded Location-based Vulnerabilities Identification) identifies and reports a series of privacy vulnerabilities in the context of location/mobility data (user trajectories), such as sensitive sequences of locations (or location-time pairs) for a user, sensitive user itineraries, sensitive places-of-interest (POIs), infrequent user movement behavior, etc.

57 Example: Identification of Privacy Vulnerabilities. The search space for discovering privacy vulnerabilities in a relational/transaction dataset is the lattice of attribute combinations. Discovering quasi-identifiers. Goal: find all minimal combinations of attributes in the dataset that lead to at most k individuals. FPVI algorithm output: {M → 2}, {BZ → 1}, {GZ → 0}, where the number corresponds to the first entry at which a unique was found. Discovering vulnerability indexes. Goal: find the vulnerability index for each combination of attributes. MTRA algorithm output: {B → 2}, {G → 4}, {Z → 2}, {M → 1}, {BG → 2}, ..., where the number corresponds to the vulnerability index of the corresponding combination.
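The goal these algorithms pursue can be illustrated with a brute-force lattice walk: enumerate attribute combinations by size and keep only the minimal ones under which some record is shared by at most k individuals. This is an illustrative sketch of the problem, not the MTUI/FPVI algorithms themselves (which prune the lattice far more aggressively); data and names below are toy assumptions.

```python
from collections import Counter
from itertools import combinations

def minimal_unique_combinations(table, attrs, k=1):
    """Find minimal attribute combinations under which some record is
    shared by at most k individuals (a brute-force lattice walk)."""
    found = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            # skip supersets of an already-found combination: not minimal
            if any(set(f) <= set(combo) for f in found):
                continue
            groups = Counter(tuple(r[a] for a in combo) for r in table)
            if min(groups.values()) <= k:
                found.append(combo)
    return found

# Toy table: B(irth), G(ender), Z(IP), M(arital status); only the
# Marital-status value "Widow" is unique on its own.
table = [
    {"B": "09/64", "G": "F", "Z": "10001", "M": "Divorced"},
    {"B": "09/64", "G": "F", "Z": "10001", "M": "Divorced"},
    {"B": "04/64", "G": "M", "Z": "10002", "M": "Widow"},
    {"B": "04/64", "G": "M", "Z": "10002", "M": "Married"},
]
```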

58 Scalability of the MTUI and MTRA approaches. Scalability on datasets with various characteristics (records: 165K to 11M; attributes: 9 to 50): MTUI significantly outperforms the state-of-the-art methods on all tested datasets and with all tested parameters. It can also analyze datasets that the other algorithms are unable to process (see (g)-(h)).

59 Scalability of the FPVI algorithm. Scalability on datasets with various characteristics (records: 165K to 11M; attributes: 9 to 50): FPVI significantly outperforms the state-of-the-art methods and MTUI on all tested datasets, and can scale to datasets of millions of records and tens of attributes without requiring excessive time.

60 Risk-Utility Confidentiality Maps. Identifying a good trade-off between data privacy and data utility: an R-U confidentiality map tracks the trade-off between disclosure risk and utility.* The map ranges from no publishing (lowest risk, lowest utility) to original data publishing (highest risk, highest utility); the data publisher decides the desired trade-off, subject to a minimum level of protection required and a minimum level of acceptable utility. It is an intuitive tool that allows comparing different anonymizations. * Duncan et al. Disclosure Risk vs. Data Utility: The R-U Confidentiality Map. Tech. Rep. LA-UR, Los Alamos National Laboratory.

61 Risk-Utility Confidentiality Maps Selecting data anonymizations with a desired trade-off. The R-U confidentiality map allows data publishers to compare data anonymizations and enables the selection of the desired one. [Plot: solutions of PCTA (points d) and other candidate anonymizations (points c) lie between "no publishing" (a) and "original data publishing" (b); the anonymization with the best utility/privacy trade-off is selected.] Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, 2011. 61
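The selection step on the R-U map can be sketched as a simple constrained maximization: keep only candidates inside the acceptable region (risk below the required protection threshold, utility above the acceptable minimum) and pick the one with the highest utility. The candidate names and (risk, utility) scores below are invented for illustration; this is not how PCTA or any particular tool scores its solutions:

```python
def best_anonymization(candidates, max_risk, min_utility):
    """Among candidate anonymizations, return the one with the highest
    utility whose disclosure risk stays inside the acceptable region of
    the R-U confidentiality map. Candidates are (name, risk, utility)
    tuples with risk and utility scaled to [0, 1]."""
    feasible = [c for c in candidates
                if c[1] <= max_risk and c[2] >= min_utility]
    if not feasible:
        return None  # only "no publishing" satisfies the constraints
    return max(feasible, key=lambda c: c[2])

# Hypothetical candidates: publishing as-is, two k-anonymizations,
# and heavy suppression
candidates = [
    ("original", 0.95, 1.00),
    ("k=5",      0.30, 0.80),
    ("k=25",     0.10, 0.55),
    ("suppress", 0.02, 0.20),
]
print(best_anonymization(candidates, max_risk=0.35, min_utility=0.40))
# → ('k=5', 0.3, 0.8)
```

Tightening `max_risk` below 0.3 would shift the choice to the stronger "k=25" anonymization, which is exactly the trade-off the map makes visible.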


Find the signal in the noise

Find the signal in the noise Find the signal in the noise Electronic Health Records: The challenge The adoption of Electronic Health Records (EHRs) in the USA is rapidly increasing, due to the Health Information Technology and Clinical

More information

Li Xiong, Emory University

Li Xiong, Emory University Healthcare Industry Skills Innovation Award Proposal Hippocratic Database Technology Li Xiong, Emory University I propose to design and develop a course focused on the values and principles of the Hippocratic

More information

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining

More information

Linking Hospitalizations and Death Certificates across Minnesota Hospitals

Linking Hospitalizations and Death Certificates across Minnesota Hospitals Linking Hospitalizations and Death Certificates across Minnesota Hospitals AcademyHealth, Baltimore, June 2013 JMNaessens,ScD; SMPeterson; MBPine,MD; JSchindler; MSonneborn; JRoland; ASRahman; MGJohnson;

More information

future proof data privacy

future proof data privacy 2809 Telegraph Avenue, Suite 206 Berkeley, California 94705 leapyear.io future proof data privacy Copyright 2015 LeapYear Technologies, Inc. All rights reserved. This document does not provide you with

More information

Faculty Group Practice Patient Demographic Form

Faculty Group Practice Patient Demographic Form Name (Last, First, MI) Faculty Group Practice Patient Demographic Form Today s Date Patient Information Street Address City State Zip Home Phone Work Phone Cell Phone ( ) Preferred ( ) Preferred ( ) Preferred

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information