Data Privacy Aspects in Big Data Facilitating Medical Data Sharing. Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland
1 Data Privacy Aspects in Big Data Facilitating Medical Data Sharing Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland September 2015
2 Content
- Introduction / Motivation
- Focus Application I: Privacy-Preserving Medical Data Publishing to Support Intended Analyses
- Focus Application II: Identification of Privacy Vulnerabilities
- Summary
3 Data sharing
Individuals' data are increasingly shared:
- Netflix published movie ratings of 500K subscribers
- AOL published 20M search query terms of 658K web users
- TomTom sold customers' location (GPS) data to the Dutch police
- The eMERGE consortium published patient data related to genome-wide association studies to biorepositories (dbGaP)
- Orange provided call information about its mobile subscribers, as part of the D4D challenge on mobile phone data
Benefits of data sharing:
- Personalization (e.g., Netflix's data mining contest aimed to improve movie recommendation based on personal preferences)
- Marketing (e.g., Tesco made £53M last year from selling shopping patterns to retailers and manufacturers, such as Nestle and Unilever)
- Social benefits (e.g., promote medical research studies, improve traffic management, etc.)
4 Data sharing must guarantee privacy and accommodate utility
A popular data sharing scenario (data publishing): data owners → data publisher (trusted) → data recipient (untrusted); the original data is transformed into released data.
Threats to data privacy:
- Identity disclosure
- Sensitive information disclosure
- Membership disclosure
- Inferential disclosure
Data utility requirements:
- Minimal data distortion (general-purpose use)
- Support of specific applications / workloads (e.g., building accurate predictive models, GWAS, etc.)
5 Focus Application I: Privacy-Preserving Medical Data Publishing
How can we share medical data in a way that protects patients' privacy while supporting research studies?
6 Electronic Medical Records (EMR)
- Relational data: registration and demographic data
- Transaction (set-valued) data: billing information; ICD codes* are represented as numbers (up to 5 digits) and denote signs, findings, and causes of injury or disease**
- Sequential data: DNA
- Text data: clinical notes
[Example EMR table with columns Name, YOB, ICD, DNA, Clinical notes: Jim (ICD 185, DNA C T, doc1), Mary (A G, doc2), Mary (C G, doc3), Carol (C G, doc4), Anne (G C, doc5), Anne (A T, doc6)]
* International Statistical Classification of Diseases and Related Health Problems
** Centers for Medicare & Medicaid Services
7 EMR data use in analytics
- Statistical analysis: correlation between YOB and ICD code 185 (malignant neoplasm of prostate)
- Querying
- Clustering: control epidemics*
- Classification: predict domestic violence**
- Association rule mining: formulate a government policy on hypertension management***
  IF age in [43,48] AND smoke = yes AND exercise = no AND drink = yes THEN hypertension = yes (sup = 2.9%; conf = 26%)
* Tildesley et al. Impact of spatial clustering on disease transmission and optimal control. PNAS.
** Reis et al. Longitudinal Histories as Predictors of Future Diagnoses of Domestic Abuse: Modelling Study. BMJ: British Medical Journal, 2011.
*** Chae et al. Data mining approach to policy analysis in a health insurance domain. Int. J. of Med. Inf.
8 Need for privacy
Why do we need privacy in medical data sharing? If privacy is breached, there are consequences to patients:
- Emotional and economic embarrassment
- 62% of individuals worry their EMRs will not remain confidential*
- 35% expressed privacy concerns regarding the publishing of their data to dbGaP**
- Patients opt out or provide fake data → difficulty in conducting statistically powered studies
* Health Confidence Survey 2008, Employee Benefit Research Institute
** Ludman et al. Glad You Asked: Participants' Opinions of Re-Consent for dbGaP Data Submission. Journal of Empirical Research on Human Research Ethics.
9 Need for privacy
If privacy is breached, there are consequences to organizations:
- Legal → HIPAA, EU legislation (95/46/EC, 2002/58/EC, 2009/136/EC, etc.)
- Financial → the average cost of a single data breach is $5.85M in the US and $4.74M in Germany; these countries have the highest per-capita cost. Healthcare and Education are the most heavily regulated industries.*
* Ponemon Institute Research Report, 2014 Cost of Data Breach Study: Global Analysis.
10 Protecting data privacy: data masking / removal of identifiers
Removing / masking direct identifiers: data owners → data publisher (trusted) → data recipient (untrusted); the original data becomes de-identified data.
1. Locate the direct identifiers (attributes that uniquely identify an individual), such as SSN, patient ID, phone number, etc.
2. Remove or mask them from the data prior to data publishing.
Example (Name → search query terms): John Doe → "Harry Potter", "The King's Speech"; Thelma Arnold → "hand tremors", "bipolar", "dry mouth", "effect of nicotine on the body"
11 Protecting data privacy: data masking / removal of identifiers
Masking / removal of direct identifiers is not sufficient! Using external data and background knowledge, an attacker can still link released data to individuals.
Main types of threats to data privacy:
- Identity disclosure
- Sensitive information disclosure
- Inferential disclosure
12 Privacy Threats: Identity disclosure
Identity disclosure in relational data (e.g., patients' demographics): individuals are linked to their published records based on quasi-identifiers (attributes that in combination can identify an individual).
De-identified data (Age, Postcode, Sex): (20, NW10, M), (45, NW15, M), (22, NW30, M), (50, NW25, F)
External data (Name, Age, Postcode, Sex): Greg (20, NW10, M), Jim (45, NW15, M), Jack (22, NW30, M), Anne (50, NW25, F)
87% of US citizens can be identified by {DOB, sex, 5-digit ZIP code}*
* Sweeney. k-anonymity: a model for protecting privacy. IJUFKS.
13 Identity disclosure in sharing patients' diagnosis codes
Identity disclosure in transaction data (e.g., diagnosis codes): Mary is diagnosed with benign essential hypertension (ICD code 401.1), so the second record in the released EMR data belongs to her → all her diagnosis codes are disclosed.
Disclosure based on diagnosis codes* is a general problem for other medical terminologies as well (e.g., ICD-10, used in the EU) → shared data remain susceptible to the attack despite legislation.
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.
14 Real-world identity disclosure cases involving medical data
- Group Insurance Commission data + voter list of Cambridge, MA → William Weld, former Governor of MA
- Chicago Homicide database + Social Security Death Index → 35% of murder victims
- Adverse Drug Reaction database + public obituaries → 26-year-old girl who died from a drug
15 Issuing attacks on medical datasets
Two-step attack using publicly available voter registration lists and hospital discharge summaries:
- Attacker's knowledge: voter(name, ..., zip, dob, sex) and summary(zip, dob, sex, diagnoses)
- Released data: release(diagnoses, DNA)
- Linking the voter list with the discharge summary re-identifies the release.
87% of US citizens can be identified by {dob, gender, ZIP code}*
* Sweeney. k-anonymity: a model for protecting privacy. IJUFKS.
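The two-step linkage can be sketched in a few lines: join the public voter list with the de-identified discharge summaries on the shared quasi-identifiers, and report records with a unique match. The records below are toy data invented for illustration (the attributes follow the slide; the names, ZIPs, and diagnoses are assumptions).

```python
# Hypothetical records illustrating a Sweeney-style linkage attack:
# join a public voter list with de-identified discharge summaries
# on the quasi-identifiers {zip, dob, sex}.
voters = [
    {"name": "William Weld", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "Jane Roe",     "zip": "02139", "dob": "1950-01-15", "sex": "F"},
]
discharges = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1950-01-15", "sex": "F", "diagnosis": "asthma"},
]

def link(voters, discharges, keys=("zip", "dob", "sex")):
    """Re-identify discharge records whose quasi-identifier values
    match exactly one voter (a unique match = identity disclosure)."""
    matches = []
    for d in discharges:
        hits = [v for v in voters if all(v[k] == d[k] for k in keys)]
        if len(hits) == 1:
            matches.append((hits[0]["name"], d["diagnosis"]))
    return matches

print(link(voters, discharges))
```

With both toy records uniquely matched, every diagnosis is attached to a name, even though the discharge data contained no direct identifiers.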
16 Issuing attacks on medical datasets
One-step attack using EMRs* (insider's attack):
- Attacker's knowledge: EMR(name, ..., diagnoses)
- Released data: release(..., diagnoses, DNA)
- Linking the EMR directly with the release re-identifies patients.
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.
17 Case study: Evaluating the effectiveness of the insider's attack
- Population: de-identified / masked EMRs of 1.2M patients from Vanderbilt University Medical Center (VUMC); a unique random number serves as the ID: de-identified EMR(ID, ..., diagnoses)
- Sample (VNEC): de-identified / masked EMRs of 2762 patients derived from the population: VNEC(..., diagnoses, DNA)
- Patients from VNEC were involved in a GWAS on the native electrical conduction of the heart
- Patients' EMRs were to be deposited into dbGaP, where the data would be made available to support other GWAS
Is the sample safe to release?
18 Case study: Evaluating the effectiveness of the insider's attack
Vanderbilt's EMR - VNEC dataset linkage on ICD codes: assuming that all ICD codes are used to issue an attack, 96.5% of the patients in the sample are susceptible to identity disclosure.
[Figure: % of re-identified sample vs. distinguishability (log scale), where distinguishability is the number of times a set of ICD codes appears in the population; this is the support count in the data mining literature.]
19 Case study: Evaluating the effectiveness of the insider's attack
Vanderbilt's EMR - VNEC dataset linkage on ICD codes, using a random subset of ICD codes in the attack: knowing a random combination of only 2 ICD codes can already lead to unique re-identification.
[Figure: % of re-identifiable sample vs. distinguishability (log scale) for attacks using combinations of 1, 2, 3, and 10 ICD codes.]
20 Privacy Threats: Sensitive information disclosure
Individuals are associated with sensitive information.
- Relational example: background knowledge that Greg is (20, NW10, M) links him to a de-identified QID group in which every record has Disease = HIV → Greg has HIV.
- Transaction example: Mary is known to be diagnosed with 401.1; her record in the released EMR data links her to Schizophrenia.
Sensitive information disclosure can occur without identity disclosure.
21 Privacy Threats: Inferential disclosure
Sensitive knowledge patterns are exposed by data mining:
- "75% of patients visit the same physician more than 4 times" → unsolicited advertisement
- "60% of white males over 50 suffer from diabetes" → customer discrimination
Data sources: stream data collected by health monitoring systems, electronic medical records, drug orders & costs.
Business rivals can harm data publishers, and insurance, pharmaceutical & marketing companies can harm data owners.*
* G. Das and N. Zhang. Privacy risks in health databases from aggregate disclosure. PETRA.
22 Anonymization of demographics
k-anonymity principle*: each record in a relational table T should have the same value over the quasi-identifiers as at least k-1 other records in T. These records collectively form a k-anonymous group.
k-anonymity protects from identity disclosure:
- It protects data from linkage to external sources (triangulation attacks)
- The probability that an individual is correctly associated with their record is at most 1/k
External data (Name, Age, Postcode, Sex): Greg (40, NW10, M), Jim (45, NW15, M), Jack (22, NW30, M), Anne (50, NW25, F)
2-anonymous data (Age, Postcode, Sex): (4*, NW1*, M), (4*, NW1*, M), (*, NW*, *), (*, NW*, *)
* Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. IJUFKS.
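The k-anonymity condition is easy to check mechanically: group the records by their quasi-identifier values and verify that every group has at least k members. A minimal sketch, using the slide's 2-anonymous release (the function name and dictionary layout are illustrative assumptions):

```python
from collections import Counter

def is_k_anonymous(records, qids, k):
    """A table is k-anonymous iff every combination of QID values
    is shared by at least k records."""
    groups = Counter(tuple(r[a] for a in qids) for r in records)
    return all(count >= k for count in groups.values())

# The generalized release from the slide.
released = [
    {"Age": "4*", "Postcode": "NW1*", "Sex": "M"},
    {"Age": "4*", "Postcode": "NW1*", "Sex": "M"},
    {"Age": "*",  "Postcode": "NW*",  "Sex": "*"},
    {"Age": "*",  "Postcode": "NW*",  "Sex": "*"},
]
print(is_k_anonymous(released, ["Age", "Postcode", "Sex"], k=2))  # True
```

Adding any record with a QID combination seen fewer than k times makes the check fail, which is exactly the linkage vulnerability k-anonymity removes.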
23 Attack on k-anonymous data
Homogeneity attack*: all sensitive values in a k-anonymous group are the same → sensitive information disclosure.
External data: Greg (40, NW10). 2-anonymous data (Age, Postcode, Disease): (4*, NW1*, HIV), (4*, NW1*, HIV), (5*, NW*, Ovarian Cancer), (5*, NW*, Flu).
The attacker is confident that Greg suffers from HIV.
* Machanavajjhala et al. l-diversity: Privacy Beyond k-anonymity. ICDE.
24 l-diversity principle for demographics
l-diversity*: a relational table is l-diverse if all groups of records with the same values over the quasi-identifiers (QID groups) contain no fewer than l well-represented values for the SA.
Distinct l-diversity: "l well-represented" → l distinct values.
Example QID group (Age 4*, Postcode NW1*) with diseases HIV, HIV, HIV, HIV, Flu, Cancer: three distinct values, but the probability of HIV being disclosed is still ~0.67 (4 out of 6 records).
* Machanavajjhala et al. l-diversity: Privacy Beyond k-anonymity. ICDE.
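Distinct l-diversity can be checked the same way as k-anonymity, but counting distinct sensitive values per QID group instead of group sizes. The sketch below (function name and layout are assumptions) reproduces the slide's example: the group passes distinct 3-diversity even though HIV dominates it.

```python
def is_distinct_l_diverse(records, qids, sa, l):
    """Each group of records sharing QID values must contain
    at least l distinct sensitive-attribute (SA) values."""
    groups = {}
    for r in records:
        groups.setdefault(tuple(r[a] for a in qids), set()).add(r[sa])
    return all(len(values) >= l for values in groups.values())

# The slide's group: 4x HIV plus Flu and Cancer -> 3 distinct values,
# yet Pr[HIV] = 4/6, showing why distinct l-diversity can be weak.
group = [{"Age": "4*", "Postcode": "NW1*", "Disease": d}
         for d in ["HIV", "HIV", "HIV", "HIV", "Flu", "Cancer"]]
print(is_distinct_l_diverse(group, ["Age", "Postcode"], "Disease", 3))  # True
```

This is why stronger variants (entropy l-diversity, recursive (c,l)-diversity) require the values to be "well-represented" rather than merely distinct.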
25 Differential privacy
Objective: prevent attackers from inferring any additional information about an individual, regardless of whether the published dataset contains the individual's record or not.
ε-differential privacy* is satisfied by a randomized algorithm A if
Pr[A(D) = D̂] ≤ exp(ε) × Pr[A(D') = D̂]
for all datasets D, D' that differ in one record and for any possible anonymized dataset D̂, where ε is a constant and the probabilities are over the randomness of A.**
Essentially, differential privacy ensures that the outcome of a calculation is insensitive to any one particular record in the dataset.
* Dwork. Differential privacy. ICALP.
** Definition from Mohammed et al. Differentially private data release for data mining. KDD.
26 Offering differential privacy
Laplace mechanism: for any function f : D → R^d, the algorithm A that adds independently generated noise with distribution Lap(Δf / ε) to each of its d outputs satisfies ε-differential privacy, where the sensitivity is Δf = max_{D,D'} ||f(D) - f(D')||_1 over all datasets D, D' that differ in exactly one record.
Example table (Age, Gender): (20, M), (23, F), (25, M), (42, F)
- Assume that f returns the number of patients with age < 40; the original value is 3
- Δf = 1 (the sensitivity of a count query)
- Add to the original value f(D) noise with distribution Lap(1/ε) → released value: 3 + Lap(1/ε)
Exponential mechanism: in tasks where adding noise makes no sense (e.g., training a classifier), differential privacy is offered by randomizing the selection of an outcome from a set of possible outcomes. For any function u : (D, t) → R that measures the utility of an output t, an algorithm A that chooses t with probability proportional to exp(ε × u(D,t) / (2Δu)) satisfies ε-differential privacy, where Δu = max over all t and all neighboring D, D' of |u(D,t) - u(D',t)|. For example, the exponential mechanism can be used to select Age or Gender given a function u that scores attributes according to the perceived utility loss.
* Dwork. Differential privacy. ICALP.
** Definition from Mohammed et al. Differentially private data release for data mining. KDD.
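The slide's counting example can be run end to end. The sketch below samples Laplace noise via the inverse-CDF transform and releases the count "patients with age < 40" under ε-differential privacy; the function names and the tiny patient table are illustrative assumptions following the slide.

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample from Lap(0, scale) using the inverse-CDF transform."""
    u = rng.random() - 0.5                      # uniform on [-0.5, 0.5)
    sign = -1.0 if u < 0 else 1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon):
    """Release a counting query under epsilon-differential privacy.
    A count has sensitivity Δf = 1 (adding or removing one record
    changes it by at most 1), so Lap(1/epsilon) noise suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# The slide's table: f(D) = |{r : age < 40}| = 3.
patients = [{"age": 20}, {"age": 23}, {"age": 25}, {"age": 42}]
noisy = dp_count(patients, lambda r: r["age"] < 40, epsilon=0.5)
```

Each call returns a different noisy value centered on the true count 3; smaller ε means a larger noise scale 1/ε and therefore stronger privacy but lower accuracy.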
27 ε-differential privacy
Pros (+):
- No/few assumptions on adversarial knowledge
- Composability [1]: privacy holds even when multiple differentially-private datasets are obtained by an adversary
- Several mechanisms exist for the interactive [2] and the non-interactive scenario [3,4]
Cons (-):
- Overly restrictive → differential privacy leads to very high information loss! Several variations [5] & improved mechanisms [6] exist, but the utility remains very low
- No real-world applications & no privacy criterion is associated with the value of ε
- Important drawbacks due to the random noise addition: only noisy answers to a limited number and type of queries can be offered, or noisy summary statistics (such as histograms) can be released; anonymous datasets cannot be created/released
- Individuals may become associated with false information in the output of differential privacy methods
- The utility loss is very high compared to syntactic approaches
- Several misconceptions [7] and susceptibility to attacks [8] (e.g., Cormode showed that an attacker can infer the sensitive value of an individual fairly accurately by applying Naive Bayes classification on differentially private data)
[1] Ganta et al. Composition attacks and auxiliary information in data privacy. KDD.
[2] Dwork. Differential privacy: a survey of results. TAMC.
[3] Mohammed et al. Differentially private data release for data mining. KDD.
[4] Xiao et al. Differential privacy via wavelet transforms. ICDE.
[5] Machanavajjhala et al. Data Publishing against Realistic Adversaries. PVLDB.
[6] Ding et al. Differentially private data cubes: optimizing noise sources and consistency. SIGMOD.
[7] Kifer et al. No free lunch in data privacy. SIGMOD.
[8] Cormode. Personal privacy vs. population privacy: learning to attack anonymization. KDD.
28 Partition-based algorithms for k-anonymity
Main idea: a record projected over the QIDs is treated as a multidimensional point; a subspace (hyper-rectangle) that contains at least k points can form a k-anonymous group → multidimensional global recoding.
Example table (Age, Sex, Disease): (20, M, HIV), (23, F, HIV), (25, M, Obesity), (27, F, HIV), (28, F, Cancer), (29, F, Obesity)
Design questions: how to partition the space? Which attribute to split on (one at a time)? How to split the selected attribute?
29 Example of applying Mondrian (k=3)
Original data (Age, Sex, Disease): (20, M, HIV), (23, F, HIV), (25, M, Obesity), (27, F, HIV), (28, F, Cancer), (29, F, Obesity)
Anonymized data: ([20-26], {M,F}, HIV), ([20-26], {M,F}, HIV), ([20-26], {M,F}, Obesity), ([27-29], F, HIV), ([27-29], F, Cancer), ([27-29], F, Obesity)
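A simplified Mondrian-style sketch is shown below: recursively median-split on the numeric QID with the widest range while both halves keep at least k records, then generalize each leaf partition. This is a toy approximation of the published algorithm (it generalizes numeric QIDs to the min-max range actually observed in each partition, so Age comes out as [20-25] where the slide's domain split shows [20-26]).

```python
def generalize(group, qids):
    """Generalize each QID in a finished partition: numeric values
    to their (min, max) range, categorical values to the set seen."""
    out = []
    for r in group:
        g = dict(r)
        for a in qids:
            vals = [x[a] for x in group]
            if all(isinstance(v, (int, float)) for v in vals):
                g[a] = (min(vals), max(vals))
            else:
                g[a] = tuple(sorted(set(vals)))
        out.append(g)
    return out

def mondrian(group, qids, k):
    """Recursively median-split on the widest numeric QID, as long
    as both halves keep at least k records (Mondrian-style sketch)."""
    best, width = None, 0
    for a in qids:
        vals = [r[a] for r in group]
        if all(isinstance(v, (int, float)) for v in vals):
            w = max(vals) - min(vals)
            if w > width:
                best, width = a, w
    if best is not None:
        vals = sorted(r[best] for r in group)
        median = vals[len(vals) // 2]
        left = [r for r in group if r[best] < median]
        right = [r for r in group if r[best] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, qids, k) + mondrian(right, qids, k)
    return generalize(group, qids)

records = [
    {"Age": 20, "Sex": "M", "Disease": "HIV"},
    {"Age": 23, "Sex": "F", "Disease": "HIV"},
    {"Age": 25, "Sex": "M", "Disease": "Obesity"},
    {"Age": 27, "Sex": "F", "Disease": "HIV"},
    {"Age": 28, "Sex": "F", "Disease": "Cancer"},
    {"Age": 29, "Sex": "F", "Disease": "Obesity"},
]
anonymized = mondrian(records, ["Age", "Sex"], k=3)
```

On the slide's table with k=3, the first median split on Age separates {20, 23, 25} from {27, 28, 29}, and neither half can be split further without violating k, matching the two groups in the slide's output.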
30 Clustering-based anonymization algorithms
Main idea of clustering-based anonymization:
1. Create clusters containing at least k records with similar values over the QIDs (seed selection, similarity measurement, stopping criterion)
2. Anonymize the records in each cluster separately (local recoding and/or suppression)
31 Clustering-based anonymization algorithms
- Clusters need to be well separated → seed selection: furthest-first or random
- Clusters need to contain similar values → similarity measurement
- Clusters should not be too large → stopping criterion: size-based or quality-based
All these heuristics attempt to improve data utility.
32 Example of a bottom-up clustering algorithm (k=2)
Original data (Age, Sex, Disease): (20, M, HIV), (23, F, HIV), (25, M, Obesity), (27, F, HIV), (28, F, Cancer), (29, F, Obesity)
Anonymized data: ([20-25], M, HIV), ([20-25], M, Obesity), ([23-27], F, HIV), ([23-27], F, HIV), ([28-29], F, Cancer), ([28-29], F, Obesity)
33 Example of a top-down clustering algorithm (k=2)
Original data (Age, Sex, Disease): (20, M, HIV), (23, F, HIV), (25, M, Obesity), (27, F, HIV), (28, F, Cancer), (29, F, Obesity)
Anonymized data: ([20-25], {M,F}, HIV), ([20-25], {M,F}, HIV), ([20-25], {M,F}, Obesity), ([27-29], F, HIV), ([27-29], F, Cancer), ([27-29], F, Obesity)
34 Preventing identity disclosure from diagnosis codes: Suppression
Suppression removes items or records from the data prior to release: suppress ICD codes* appearing in less than a certain percentage of patient records. Intuition: such rare ICD codes can act as quasi-identifiers.
* Vinterbo et al. Hiding information by cell suppression. AMIA Annual Symposium 2001.
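Support-based suppression is a one-pass filter: count how many records each code appears in, then drop codes below the threshold. A minimal sketch (the toy EMR transactions below are invented; 401.1 is the benign-essential-hypertension code used throughout the deck):

```python
from collections import Counter

def suppress_rare_codes(transactions, min_support):
    """Drop ICD codes appearing in fewer than min_support records,
    since rare codes can act as quasi-identifiers."""
    counts = Counter(code for t in transactions for code in set(t))
    return [[c for c in t if counts[c] >= min_support] for t in transactions]

# Toy EMR data: three patient records, each a set of ICD-9 codes.
emr = [["401.1", "780.7"], ["401.1", "729.5"], ["401.1"]]
print(suppress_rare_codes(emr, min_support=2))
```

Here only 401.1 survives a support threshold of 2, which previews the case-study result on the next slide: aggressive suppression protects privacy but can destroy almost all of the codes.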
35 Code suppression: a case study using Vanderbilt's EMR data
To prevent re-identification, we had to suppress diagnosis codes appearing in less than 25% of the records in VNEC; doing so left only 5 out of ~6000 ICD codes!*
Surviving codes at different granularities:
- 5-digit ICD-9 codes: benign essential hypertension; other malaise and fatigue; pain in limb; abdominal pain; chest pain
- 3-digit ICD-9 codes: 401 (essential hypertension); 780; 729 (other disorders of soft tissues); 789 (other abdomen/pelvis symptoms); 786 (respiratory system)
- ICD-9 sections: hypertensive disease; symptoms; rheumatism excluding the back
* Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS 2010.
36 Preventing identity disclosure in EMR data: Generalization
Generalization replaces items with more general ones, usually with the help of a domain hierarchy: Any → chapters → sections → 3-digit ICD codes → 5-digit ICD codes.
Example: generalize ICD codes to their 3-digit representation; 401.1 (benign essential hypertension) → 401 (essential hypertension).
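Climbing one level of the ICD-9 hierarchy is just truncation of the code string. A minimal sketch (the helper names are assumptions):

```python
def generalize_icd9(code):
    """Generalize a full ICD-9 code to its 3-digit category,
    e.g. 401.1 (benign essential hypertension) -> 401."""
    return code.split(".")[0]

def generalize_record(codes):
    """Generalize a patient's diagnosis codes, keeping each
    3-digit category only once."""
    return sorted({generalize_icd9(c) for c in codes})

print(generalize_record(["401.1", "401.9", "780.7"]))
```

Note the deduplication step: after generalization, several 5-digit codes may collapse into one category, which is exactly how generalization reduces the distinguishability of a record.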
37 Code generalization: a case study using Vanderbilt's EMR data
Generalizing ICD codes from VNEC*: even after generalizing 5-digit ICD codes to their 3-digit representation, 95% of the patients remain re-identifiable.
[Figure: % of re-identified sample vs. distinguishability (log scale), comparing 5-digit and 3-digit ICD codes with no suppression and with 5%, 15%, and 25% suppression; reported values include 96.5%, 75.0%, and 25.6%.]
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.
38 Complete k-anonymity & k^m-anonymity
Complete k-anonymity: knowing that an individual is associated with any itemset, an attacker should not be able to associate this individual with fewer than k transactions.
k^m-anonymity: knowing that an individual is associated with any m-itemset (combination of m items), an attacker should not be able to associate this individual with fewer than k transactions.
Example: in the 2^2-anonymous version of the data, the ICD codes are generalized (e.g., to 401) so that every combination of 2 codes appearing in the data is supported by at least 2 transactions.
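Checking k^m-anonymity amounts to enumerating the m-itemsets of each transaction and verifying their support. A brute-force sketch (exponential in m, so only for illustration; the function name and toy items are assumptions):

```python
from collections import Counter
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    """k^m-anonymity check: every m-itemset that occurs in some
    transaction must be supported by at least k transactions."""
    support = Counter()
    for t in transactions:
        for itemset in combinations(sorted(set(t)), m):
            support[itemset] += 1
    return all(count >= k for count in support.values())

# Toy diagnosis-code transactions.
data = [["a", "b"], ["a", "b"], ["a", "c"]]
print(is_km_anonymous(data, k=2, m=2))  # False: {a, c} appears once
```

The parameter m caps the attacker's assumed knowledge; complete k-anonymity corresponds to protecting itemsets of every size, which is why the slide calls these models overly strong for publishers with specific privacy constraints.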
39 Applicability of complete k-anonymity and k^m-anonymity to medical data
- Limited in the specification of privacy requirements: they assume too powerful attackers, so all m-itemsets (combinations of m diagnosis codes) need protection, while medical data publishers have detailed privacy requirements. For example, if attackers are known to learn who is diagnosed with {a,b,c} or {d,e,f,g,h}, these models protect all 5-itemsets instead of just the 2 itemsets in the privacy constraints.
- They explore a small number of possible generalizations
- They do not take utility requirements into account
40 Policy-based anonymization model
Policy-based anonymization for ICD codes*:
- A global anonymization model that captures both generalization and suppression
- Each original ICD code is replaced by a unique set of ICD codes; no generalization hierarchies are needed
Examples: a generalized ICD code (493.00, ...) is interpreted as one of its codes, or the other, or both; a suppressed ICD code Φ(...) is not released.
* Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS 2010.
41 Policy-based anonymization: Privacy model
Data publishers specify the diagnosis codes that need protection.
Privacy model: knowing that an individual is associated with one or more specified itemsets (privacy constraints), an attacker should not be able to associate this individual with fewer than k transactions.
Privacy policy: the set of all specified privacy constraints. Privacy is achieved when all privacy constraints are supported by at least k transactions in the published data, or do not appear at all.
Example: in the anonymized data, codes 401.2 and 401.4 are replaced by the generalized code (401.2, 401.4).
42 Policy-based anonymization: Data utility considerations
Utility constraints: the published data must remain as useful as the original data for conducting a GWAS on a disease or trait → the number of cases and controls in a GWAS must be preserved.
Supporting utility constraints: ICD codes from the utility policy are generalized together (e.g., the generalized code (296.00, ...)); a larger part of the solution space is searched than when using domain generalization hierarchies.
43 Policy-based anonymization algorithms
Goal: anonymize medical records so that privacy is guaranteed, utility is high (many GWAS are supported simultaneously), and the incurred information loss is minimal.
This is a challenging optimization problem: it is NP-hard, and feasibility depends on the constraints.
Heuristic algorithms:
- Utility-Guided Anonymization of Clinical Profiles (UGACLIP): efficiency, scalability
- Clustering-Based Anonymization (CBA): utility
44 Anonymization algorithms: UGACLIP
Given a privacy policy, a utility policy, and k=2, UGACLIP transforms the EMR data so that:
- The data is protected: the generalized itemset {296.00, ...} appears 2 times
- The data remains useful for GWAS on bipolar disorder: associations between the generalized code (296.00, ...) and the DNA region CT A are preserved
45 Anonymization algorithms: CBA
Sketch of CBA:
1. Retrieve from the privacy policy the ICD codes that need less protection
2. Gradually build clusters of codes that can be anonymized according to the utility policy and with minimal utility loss (UL): start from singleton clusters and merge them, driven by UL
3. If the ICD codes are still not protected, suppress no more ICD codes than required, until the privacy requirements are met
Example constraints: p1 = {i1}, p2 = {i5, i6}, k=3
46 Case Study: EMRs from Vanderbilt University Medical Center
Datasets:
- VNEC: 2762 de-identified EMRs from Vanderbilt, involved in a GWAS
- VNECkc: subset of VNEC for which we know which diseases are controls for others
- BIOVU: all 79087 de-identified EMRs from Vanderbilt's biobank (the largest dataset in the medical data privacy literature)*
Methods: UGACLIP, CBA, and ACLIP (a state-of-the-art method that does not take a utility policy into account)
* Loukides, Gkoulalas-Divanis. Utility-aware anonymization of diagnosis codes. IEEE TITB.
47 UGACLIP & CBA: First algorithms to offer data utility in GWAS
Setting: k = 5, protecting single visits of patients, 18 GWAS-related diseases*; the best competitor uses no utility constraints.
Results: the output of ACLIP (the best competitor) is useless for validating GWAS; UGACLIP preserves 11 out of 18 GWAS; CBA preserves 14 out of 18 GWAS simultaneously.
* Diseases related to all GWAS reported in Manolio et al. A HapMap harvest of insights into the genetics of common disease. J. Clinic. Inv.
48 Utility beyond GWAS
Supporting clinical case counts in addition to GWAS: learn the number of patients with sets of codes appearing in 10% of the records; useful for epidemiology and data mining applications.
On VNECkc, queries can be estimated accurately (Average Relative Error < 1.25), comparably to ACLIP. Anonymized data can thus support both GWAS and studies on clinical case counts.
49 Anonymizing BIOVU (79K EMRs)
Supporting clinical case counts in BIOVU: very low error in query answering (Average Relative Error < 1). All EMRs in the VUMC biobank can be anonymized and remain useful.
50 Anonymization of RT-datasets (e.g., demographics + diagnosis codes)
Privacy threat: attackers know some relational attribute values (e.g., demographics) plus some sensitive items (e.g., diagnoses) for an individual.*
* G. Poulis, G. Loukides, A. Gkoulalas-Divanis, S. Skiadopoulos. Anonymizing data with relational and transactional attributes. PKDD.
51 Focus Application II: Identification of Privacy Vulnerabilities How can we analyze datasets to discover privacy vulnerabilities and select the best protection mechanism?
52 Privacy-preserving data publishing
Example table with direct identifiers (ID, Name, Address, SSN), quasi-identifiers (Birth, Gender, ZIP, Marital status), and sensitive/other info (A1, A2):
0 Maria, 10 NY E. Avenue, 09/64, Female, Divorced
1 Jenny, 5 Brighton Street, 09/64, Female, Divorced
2 Nick, 12 Doyle Ave, 04/64, Male, Widow
3 Tom, 154 West End Av, 04/64, Male, Married
4 John, 93 Somers Str, 03/63, Male, Married
5 Bob, 35 University Av, 03/63, Male, Married
6 Noeleen, 63 Mirror Street, 09/64, Female, Married
7 Eleni, 67 Common Av, 09/61, Female, Married
8 Dave, 65 Main Str, 05/61, Male, Single
9 Thomas, 84 Main Ave, 05/61, Male, Single
53 Identification of Privacy Vulnerabilities (IPV)
Goal: automatically analyze a dataset to expose privacy vulnerabilities that could lead to privacy attacks, and to validate its protection level.
Methods:
- Identify existing, publicly available datasets that could be used by adversaries to perform triangulation attacks
- Identify privacy risks in the data itself (e.g., outliers, uniqueness or rarity of certain records based on some attributes, frequent behaviour of an individual that is infrequent for many others, etc.)
Outcome: a list of identified privacy vulnerabilities and configuration options for the data anonymization algorithm.
54 Discover QIDs based on publicly available datasets
Considered publicly available datasets: (Name, Birth, ZIP, Disease), (Name, Gender, ZIP, Salary), and (Name, ZIP, Marital status), each containing some of the individuals in the dataset to be published (IDs 0-9 with Birth, Gender, ZIP, Marital status).
Open questions:
- Have we found all the existing publicly available datasets that could be linked to our data?
- What if a dataset becomes available after the data has been published and can be linked to the published dataset, exposing the identity of individuals?
55 Discover privacy vulnerabilities in the data itself
Discovery of privacy vulnerabilities:
- What could lead to re-identification attacks in my data, assuming that sufficient background knowledge and/or public datasets are available to attackers?
- What is unique or rare in a record?
- Are there any outliers in my data?
Quantification of privacy risk:
- How many vulnerabilities exist in my dataset?
- How powerful must an attacker be in order to re-identify the most vulnerable individual in the data?
- How many individuals can be re-identified?
56 Identification of Privacy Vulnerabilities
Our recent algorithms (all part of the IPV tool):
- MTUI (Multi-Threaded Uniques Identification) & FPVI (Fast algorithm for Privacy Vulnerabilities Identification): compute the quasi-identifiers of a dataset to protect the data from re-identification attacks, exposing all minimal combinations of attributes that lead to unique (or rare) records.
- MTRA (Multi-Threaded Risk Assessment): calculates the vulnerability index for each combination of attributes in a dataset, by reporting the cardinality of the smallest group of records that share the same values for that combination.
- MTS2 (Multi-Threaded Sample uniques identification): identifies the specific records that are unique, along with the combination of attributes for which they are unique (or rare) in the dataset.
- MTLVI (Multi-Threaded Location-based Vulnerabilities Identification): identifies and reports a series of privacy vulnerabilities in the context of location/mobility data (user trajectories), such as sensitive sequences of locations (or location-time pairs) for a user, sensitive user itineraries, sensitive places-of-interest (POIs), infrequent user movement behavior, etc.
57 Example: Identification of Privacy Vulnerabilities
The search space for discovering privacy vulnerabilities in a relational/transaction dataset is the lattice of attribute combinations (B = Birth, G = Gender, Z = ZIP, M = Marital status).
Discovering quasi-identifiers. Goal: find all minimal combinations of attributes in the dataset that lead to at most k individuals. FPVI output: {M → 2}, {BZ → 1}, {GZ → 0} (the number is the first record at which a unique was found).
Discovering vulnerability indexes. Goal: find the vulnerability index for each combination of attributes. MTRA output: {B → 2}, {G → 4}, {Z → 2}, {M → 1}, {BG → 2}, ... (the number is the vulnerability index of the corresponding combination).
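The FPVI-style search can be sketched as a brute-force walk over the attribute lattice, smallest combinations first, pruning supersets of combinations already reported as vulnerable. The six-record table below is hypothetical (the slide's ZIP values are not shown, so these are invented), but it is constructed so the output has the same shape as the slide's FPVI result, {M}, {BZ}, {GZ}; real implementations like MTUI/FPVI use far more efficient pruned and parallel searches.

```python
from collections import Counter
from itertools import combinations

def minimal_vulnerable_combos(records, attrs, k=1):
    """Report all minimal attribute combinations under which some
    record is shared by at most k individuals (brute-force lattice
    search, smallest combinations first)."""
    found = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            # prune supersets of an already-reported vulnerable combo
            if any(set(f) <= set(combo) for f in found):
                continue
            groups = Counter(tuple(r[a] for a in combo) for r in records)
            if min(groups.values()) <= k:
                found.append(combo)
    return found

# Hypothetical table: B = Birth, G = Gender, Z = ZIP, M = Marital status.
table = [
    {"B": "09/64", "G": "F", "Z": "10001", "M": "Divorced"},
    {"B": "09/64", "G": "F", "Z": "10001", "M": "Divorced"},
    {"B": "04/64", "G": "M", "Z": "10002", "M": "Widow"},
    {"B": "04/64", "G": "M", "Z": "10002", "M": "Married"},
    {"B": "03/63", "G": "M", "Z": "10001", "M": "Married"},
    {"B": "03/63", "G": "M", "Z": "10002", "M": "Married"},
]
print(minimal_vulnerable_combos(table, ["B", "G", "Z", "M"]))
```

Here M alone is vulnerable (the single Widow), and {B,Z} and {G,Z} each isolate a record even though B, G, and Z are individually safe; no superset of these three combinations is reported, which is the minimality property FPVI guarantees.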
58 Scalability of the MTUI and MTRA approaches
Scalability on datasets with various characteristics (records: 165K-11M, attributes: 9-50): MTUI significantly outperforms the state-of-the-art methods on all tested datasets and with all tested parameters. It can also analyze datasets that the other algorithms are unable to process.
59 Scalability of the FPVI algorithm
Scalability on datasets with various characteristics (records: 165K-11M, attributes: 9-50): FPVI significantly outperforms the state-of-the-art methods and MTUI on all tested datasets, and can scale to datasets of millions of records and tens of attributes without requiring excessive time.
60 Risk-Utility (R-U) Confidentiality Maps
Identifying a good trade-off between data privacy and data utility: an R-U confidentiality map* tracks the trade-off between disclosure risk (vertical axis) and utility (horizontal axis), from "no publishing" (low risk, low utility) to "original data publishing" (high risk, high utility). The data publisher decides the desired trade-off, subject to a minimum level of protection required and a minimum level of acceptable utility. It is an intuitive tool that allows comparing different anonymizations.
* Duncan et al. Disclosure Risk vs. Data Utility: The R-U Confidentiality Map. Tech. Rep. LA-UR, Los Alamos National Laboratory.
61 Risk-Utility Confidentiality Maps
Selecting data anonymizations with a desired trade-off: the R-U confidentiality map allows data publishers to compare data anonymizations (e.g., the solutions of PCTA) and enables the selection of the anonymization with the best utility/privacy trade-off, between the extremes of no publishing and original data publishing.
Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, 2011.