Privacy Challenges and Solutions for Data Sharing. Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland


1 Privacy Challenges and Solutions for Data Sharing Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland June 8, 2015

2 Content
Introduction / Motivation for data privacy
Focus application: Privacy Preserving Medical Data Publishing
Electronic Medical Records (EMR) and their use in research
Privacy threats and their effectiveness
Privacy models for relational data (demographics)
Privacy models for transaction/set-valued data (diagnosis codes)
Policy-based anonymization model
Rule-based anonymization model
Anonymization of RT (relational-transaction) datasets
SECRETA anonymization toolkit
Summary

3 Data sharing
Individuals' data are increasingly shared:
  Netflix published movie ratings of 500K subscribers
  AOL published 20M search query terms of 658K web users
  TomTom sold customers' location (GPS) data to the Dutch police
  The eMERGE consortium published patient data related to genome-wide association studies to biorepositories (dbGaP)
  Orange provided call information about its mobile subscribers, as part of the D4D challenge on mobile phone data
Benefits of data sharing:
  Personalization (e.g., Netflix's data mining contest aimed to improve movie recommendation based on personal preferences)
  Marketing (e.g., Tesco made 53M from selling shopping patterns to retailers and manufacturers, such as Nestle and Unilever, last year)
  Social benefits (e.g., promoting medical research studies, improving traffic management, etc.)

4 Data sharing must guarantee privacy and accommodate utility
A popular data sharing scenario (data publishing): data owners → data publisher (trusted) → data recipient (untrusted). The publisher holds the original data; the recipient receives the released data.
Threats to data privacy: identity disclosure, sensitive information disclosure, membership disclosure, inferential disclosure
Data utility requirements: minimal data distortion (general purpose use); support of specific applications / workloads (e.g., building accurate predictive models, GWAS, etc.)

5 Privacy Preserving Medical Data Publishing How can we share medical data in a way that protects patients' privacy while supporting research studies?

6 Electronic Medical Records (EMR)
Relational data: registration and demographic data
Transaction (set-valued) data: billing information. ICD codes* are represented as numbers (up to 5 digits) and denote signs, findings, and causes of injury or disease**
Sequential data: DNA
Text data: clinical notes
Electronic Medical Records table (Name, YOB, ICD, DNA, Clinical notes): Jim , 185 C T (doc1) Mary , A G (doc2) Mary C G (doc3) Carol C G (doc4) Anne , G C (doc5) Anne A T (doc6)
* International Statistical Classification of Diseases and Related Health Problems
** Centers for Medicare & Medicaid Services

7 EMR data use in analytics
Statistical analysis: correlation between YOB and ICD code 185 (Malignant neoplasm of prostate)
Querying
Clustering: control epidemics*
Classification: predict domestic violence**
Association rule mining: formulate a government policy on hypertension management***
  IF age in [43,48] AND smoke = yes AND exercise = no AND drink = yes THEN hypertension = yes (sup = 2.9%; conf = 26%)
[Table: Electronic Medical Records (Name, YOB, ICD, DNA) for Jim, Mary, Mary, Carol, Anne, Anne]
* Tildesley et al. Impact of spatial clustering on disease transmission and optimal control. PNAS.
** Reis et al. Longitudinal Histories as Predictors of Future Diagnoses of Domestic Abuse: Modelling Study. BMJ: British Medical Journal, 2011.
*** Chae et al. Data mining approach to policy analysis in a health insurance domain. Int. J. of Med. Inf.

8 Need for privacy
Why do we need privacy in medical data sharing? If privacy is breached, there are consequences to patients:
  Emotional and economic embarrassment
  62% of individuals worry their EMRs will not remain confidential*
  35% expressed privacy concerns regarding the publishing of their data to dbGaP**
  Patients opt out or provide fake data → it becomes difficult to conduct statistically powered studies
* Health Confidence Survey 2008, Employee Benefit Research Institute
** Ludman et al. Glad You Asked: Participants' Opinions of Re-Consent for dbGaP Data Submission. Journal of Empirical Research on Human Research Ethics.

9 Need for privacy
If privacy is breached, there are consequences to organizations:
  Legal → HIPAA, EU legislation (95/46/EC, 2002/58/EC, 2009/136/EC etc.)
  Financial → The average cost of a single data breach is $5.85M in the US and $4.74M in Germany; these countries have the highest per capita cost*. Healthcare and Education are the most heavily regulated industries.
* Ponemon Institute Research Report. 2014 Cost of Data Breach Study: Global Analysis.

10 Protecting data privacy: data masking / removal of identifiers
Removing / masking direct identifiers: data owners → data publisher (trusted) → data recipient (untrusted); original data becomes de-identified data
1. Locate the direct identifiers (attributes that uniquely identify an individual), such as SSN, Patient ID, Phone number etc.
2. Remove or mask them from the data prior to data publishing
Name           Search Query Terms
John Doe       Harry Potter, King's speech
Thelma Arnold  Hand tremors, bipolar, dry mouth, effect of nicotine on the body

11 Protecting data privacy: data masking / removal of identifiers Masking or removal of direct identifiers is not sufficient! data owners data publisher (trusted) data recipient (untrusted) Original data Released data Main types of threats to data privacy Identity disclosure Sensitive information disclosure Inferential disclosure External data Background Knowledge 11

12 Privacy Threats: Identity disclosure
Identity disclosure in relational data (e.g., patients' demographics)
Individuals are linked to their published records based on quasi-identifiers (attributes that in combination can identify an individual)
De-identified data:
Age  Postcode  Sex
20   NW10      M
45   NW15      M
22   NW30      M
50   NW25      F
External data:
Name  Age  Postcode  Sex
Greg  20   NW10      M
Jim   45   NW15      M
Jack  22   NW30      M
Anne  50   NW25      F
87% of US citizens can be identified by gender, DOB, and 5-digit ZIP code*
* Sweeney. k-anonymity: a model for protecting privacy. IJUFKS.
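The linkage on this slide can be sketched as a join on the quasi-identifiers. This is a minimal illustration, not the attack code used in any cited study; the function name `link` and the dictionary layout are assumptions made for the example, while the records are the ones shown on the slide.

```python
# De-identified release: names removed, quasi-identifiers kept (values from the slide).
released = [
    {"age": 20, "postcode": "NW10", "sex": "M"},
    {"age": 45, "postcode": "NW15", "sex": "M"},
    {"age": 22, "postcode": "NW30", "sex": "M"},
    {"age": 50, "postcode": "NW25", "sex": "F"},
]

# External data that still carries names (e.g., a public register).
external = [
    {"name": "Greg", "age": 20, "postcode": "NW10", "sex": "M"},
    {"name": "Jim", "age": 45, "postcode": "NW15", "sex": "M"},
    {"name": "Jack", "age": 22, "postcode": "NW30", "sex": "M"},
    {"name": "Anne", "age": 50, "postcode": "NW25", "sex": "F"},
]

QIDS = ("age", "postcode", "sex")


def link(released, external, qids=QIDS):
    """Map each released record index to the external names sharing all QID values."""
    index = {}
    for person in external:
        key = tuple(person[q] for q in qids)
        index.setdefault(key, []).append(person["name"])
    return {i: index.get(tuple(r[q] for q in qids), []) for i, r in enumerate(released)}


matches = link(released, external)
# A record is re-identified when exactly one external individual matches it.
reidentified = {i: names[0] for i, names in matches.items() if len(names) == 1}
print(reidentified)
```

Every released record matches exactly one named individual here, which is why removing the name column alone does not prevent identity disclosure.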

13 Identity disclosure in sharing patients' diagnosis codes
Identity disclosure in transaction data (e.g., diagnosis codes)
Identified EMR data ID ICD Jim Mary Anne Released EMR Data ICD DNA CT A AC T GC C
Mary is diagnosed with benign essential hypertension (ICD code 401.1) → the second record belongs to her → all her diagnosis codes are disclosed
Disclosure based on diagnosis codes* → a general problem that also affects other medical terminologies (e.g., ICD-10 used in the EU) → sharing data susceptible to the attack is against legislation
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.

14 Real-world identity disclosure cases involving medical data
Released data + external data → re-identified
Group Insurance Commission data + voter list of Cambridge, MA → William Weld, former Governor of MA
Chicago Homicide database + Social Security death index → 35% of murder victims
Adverse Drug Reaction Database + public obituaries → 26-year-old girl who died from a drug

15 Issuing attacks on medical datasets
Two-step attack using publicly available voter registration lists and hospital discharge summaries
  voter(name, ..., zip, dob, sex)
  summary(zip, dob, sex, diagnoses)
  release(diagnoses, DNA)
87% of US citizens can be identified by {dob, gender, ZIP-code}*
voter list & discharge summary → privacy breach
* Sweeney. k-anonymity: a model for protecting privacy. IJUFKS.

16 Issuing attacks on medical datasets
One-step attack using EMRs* (insider's attack)
  EMR(name, ..., diagnoses)
  release(, diagnoses, DNA)
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.

17 Case study: Evaluating the effectiveness of the insider's attack
De-identified / masked EMR population from VUMC: de-identified EMR(ID, ..., diagnoses)
  Population: 1.2M records (patients) from Vanderbilt
  A unique random number is used as the ID
VNEC de-identified / masked EMR sample: VNEC(, diagnoses, DNA)
  2762 records (patients) derived from the population
  Patients from VNEC were involved in a genome-wide association study (GWAS) on the Native Electrical Conduction of the heart
  Patients' EMRs were to be deposited into dbGaP
  Data would be made available to support other studies (GWAS)?

18 Case study: Evaluating the effectiveness of the insider's attack
Vanderbilt's EMR - VNEC dataset linkage on ICD codes
We assume that all ICD codes are used to issue an attack: 96.5% of patients are susceptible to identity disclosure
[Figure: % of re-identified sample versus distinguishability (log scale). Distinguishability is the number of times a set of ICD codes appears in the population (the support count in the data mining literature).]

19 Case study: Evaluating the effectiveness of the insider's attack
Vanderbilt's EMR - VNEC dataset linkage on ICD codes
A random subset of ICD codes can be used in an attack: knowing a random combination of 2 ICD codes can lead to unique re-identification
[Figure: % of re-identifiable sample versus distinguishability (log scale), with one curve each for 1 ICD code and for combinations of 2, 3, and 10 ICD codes. Distinguishability is the number of times a set of ICD codes appears in the population (equivalent to the support count in the data mining literature).]
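Distinguishability, as defined on these slides, is a support count, so it can be computed directly. The sketch below is a toy illustration under assumed data: the four records and their codes are hypothetical, and the function names are chosen only for this example.

```python
def distinguishability(codes, population):
    """Number of population records whose code set contains every code in `codes`."""
    wanted = set(codes)
    return sum(wanted <= set(record) for record in population)


def uniquely_reidentifiable_fraction(sample, population):
    """Fraction of sample records whose full code set matches exactly one population record."""
    unique = sum(distinguishability(record, population) == 1 for record in sample)
    return unique / len(sample)


# Hypothetical population of diagnosis-code sets (illustrative values only).
population = [
    {"401.1", "780.79", "729.5"},
    {"401.1", "780.79"},
    {"401.1", "789.0"},
    {"786.5"},
]

print(distinguishability({"401.1"}, population))           # one common code
print(distinguishability({"401.1", "729.5"}, population))  # a pair of codes
print(uniquely_reidentifiable_fraction(population, population))
```

A single common code matches several records, while a combination of two codes can already narrow the match to one record, which mirrors the trend the figure reports.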

20 Case study: Evaluating the effectiveness of the insider's attack
VNEC dataset linkage on ICD codes using hospital discharge records
  A discharge record holds all ICD codes associated with a patient for a single visit
  It is difficult to know ICD codes that span visits when public discharge summaries are used
  46% of the patients in VNEC are uniquely re-identifiable
[Figure: x-axis is the number of times a set of ICD codes appears in the VNEC (the support count in the data mining literature).]

21 Privacy Threats: Sensitive information disclosure Individuals are associated with sensitive information Name Age Postcode Sex Greg 20 NW10 M Background knowledge Identified EMR data ID ICD Jim Mary Sensitive Attribute (SA) Age Postcode YOB Disease 20 NW HIV 20 NW HIV 20 NW HIV 20 NW HIV De-identified data Released EMR Data ID ICD DNA Jim C A Mary A T Mary is diagnosed with and 401.1 → she has Schizophrenia. Sensitive information disclosure can occur without identity disclosure.

22 Sensitive information disclosure in Netflix movie rating sharing
100M dated ratings from 480K users to 18K movies
Data mining contest ($1M prize) to improve movie recommendation based on personal preferences
Movies reveal political, religious, and sexual beliefs and need protection according to the Video Privacy Protection Act
"Anonymized" release (de-identification): sampling, date modification, rating suppression; movie title and year published in full
Researchers inferred movie ratings of subscribers*: the data are linked with IMDB w.r.t. ratings and/or dates
A lawsuit was filed and Netflix settled it: "We will find new ways to collaborate with researchers"
* Narayanan et al. Robust De-anonymization of Large Sparse Datasets. IEEE Symposium on Security and Privacy.

23 Privacy Threats: Inferential disclosure Sensitive knowledge patterns are exposed by data mining 75% of patients visit the same physician more than 4 times Unsolicited advertisement 60% of the white males > 50 suffer from diabetes Stream data collected by health monitoring systems Electronic medical records Customer discrimination Drug orders & costs Business rivals can harm data publishers and insurance, pharmaceutical & marketing companies can harm data owners* * G. Das and N. Zhang. Privacy risks in health databases from aggregate disclosure. PETRA,

24 Anonymization of demographics
k-anonymity principle*: each record in a relational table T should have the same value over quasi-identifiers with at least k-1 other records in T. These records collectively form a k-anonymous group.
k-anonymity protects from identity disclosure:
  It protects data from linkage to external sources (triangulation attacks)
  The probability that an individual is correctly associated with their record is at most 1/k
External data:
Name  Age  Postcode  Sex
Greg  40   NW10      M
Jim   45   NW15      M
Jack  22   NW30      M
Anne  50   NW25      F
2-anonymous data:
Age  Postcode  Sex
4*   NW1*      M
4*   NW1*      M
*    NW*       *
*    NW*       *
* Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. IJUFKS.
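The k-anonymity principle can be checked mechanically by counting how many records share each combination of quasi-identifier values. The following is a minimal sketch with an assumed function name (`is_k_anonymous`), applied to the 2-anonymous table from the slide.

```python
from collections import Counter


def is_k_anonymous(records, qids, k):
    """True if every combination of QID values is shared by at least k records."""
    group_sizes = Counter(tuple(r[q] for q in qids) for r in records)
    return all(size >= k for size in group_sizes.values())


# The generalized table shown on the slide.
anonymized = [
    {"age": "4*", "postcode": "NW1*", "sex": "M"},
    {"age": "4*", "postcode": "NW1*", "sex": "M"},
    {"age": "*", "postcode": "NW*", "sex": "*"},
    {"age": "*", "postcode": "NW*", "sex": "*"},
]

QIDS = ("age", "postcode", "sex")
print(is_k_anonymous(anonymized, QIDS, 2))
print(is_k_anonymous(anonymized, QIDS, 3))
```

The table forms two groups of two records, so it satisfies the principle for k = 2 but not for k = 3.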

25 Anonymization of demographics: k-anonymity
Pros:
  A baseline model
  Intuitive
  Has been implemented in many real-world systems
  Follows & impacts privacy legislation
Cons:
  Does not protect against sensitive information disclosure
  Requires the data owner to specify the quasi-identifiers (QIDs) and the k-value
[Tables: the external data and the 2-anonymous data of slide 24]

26 Attack on k-anonymous data
Homogeneity attack*: all sensitive values in a k-anonymous group are the same → sensitive information disclosure
External data:
Name  Age  Postcode
Greg  40   NW10
2-anonymous data:
Age  Postcode  Disease
4*   NW1*      HIV
4*   NW1*      HIV
5*   NW*       Ovarian Cancer
5*   NW*       Flu
The attacker is confident that Greg suffers from HIV
* Machanavajjhala et al. l-diversity: Privacy Beyond k-anonymity. ICDE.

27 l-diversity principle for demographics
l-diversity*: a relational table is l-diverse if all groups of records with the same values over quasi-identifiers (QID groups) contain no fewer than l well-represented values for the sensitive attribute (SA)
Distinct l-diversity: l well-represented → l distinct
6-anonymous group:
Age  Postcode  Disease
4*   NW1*      HIV
4*   NW1*      HIV
4*   NW1*      HIV
4*   NW1*      HIV
4*   NW1*      Flu
4*   NW1*      Cancer
Three distinct values, but the probability of HIV being disclosed is ~
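The weakness of distinct l-diversity in the 6-anonymous group can be made concrete by counting sensitive values per QID group. This sketch uses assumed function names; the data is the group from the slide, and the worst-case figure is simply the share of the most frequent sensitive value.

```python
from collections import Counter, defaultdict


def sensitive_values_by_group(records, qids, sa):
    """Collect the sensitive-attribute values of each QID group."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in qids)].append(r[sa])
    return groups


def is_distinct_l_diverse(records, qids, sa, l):
    """True if every QID group contains at least l distinct sensitive values."""
    groups = sensitive_values_by_group(records, qids, sa)
    return all(len(set(values)) >= l for values in groups.values())


def worst_case_disclosure(records, qids, sa):
    """Largest share that a single sensitive value takes within any QID group."""
    groups = sensitive_values_by_group(records, qids, sa)
    return max(Counter(v).most_common(1)[0][1] / len(v) for v in groups.values())


diseases = ["HIV", "HIV", "HIV", "HIV", "Flu", "Cancer"]
group = [{"age": "4*", "postcode": "NW1*", "disease": d} for d in diseases]

print(is_distinct_l_diverse(group, ("age", "postcode"), "disease", 3))
print(worst_case_disclosure(group, ("age", "postcode"), "disease"))
```

The group passes the distinct 3-diversity test, yet four of its six records carry HIV, so an attacker who places an individual in this group can still guess the diagnosis with high probability.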

28 Further improvements over l-diversity Sensitive values may not need the same level of protection (a,k)-anonymity [1] l-diversity is difficult to achieve when the SA values are skewed t-closeness [2] Does not consider semantic similarity of SA values (e,m)-anonymity [3], range diversity [4] Can patients decide the level of protection for their SA values? Personalized privacy [5] [1] Wong et al., (alpha, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing, KDD [2] Li et al., t-closeness: Privacy Beyond k-anonymity and l-diversity, ICDE [3] Li et al. Preservation of proximity privacy in publishing numerical sensitive data. SIGMOD [4] Loukides et al. Preventing range disclosure in k-anonymised data. Expert Syst. Appl [5] Xiao et al. Personalized privacy preservation. SIGMOD,

29 Partition-based algorithms for k-anonymity
Main idea of partition-based algorithms:
  A record projected over QIDs is treated as a multidimensional point
  A subspace (hyper-rectangle) that contains at least k points can form a k-anonymous group → multidimensional global recoding
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
How to partition the space?
  One attribute at a time: which one to use?
  How to split the selected attribute?

30 Mondrian algorithm
Mondrian(D, k)*
  Attribute selection: find the QID attribute Q with the largest domain
  Attribute split: find the median µ of Q
    Create subspace S with all records of D whose value in Q is less than µ
    Create subspace S' with all records of D whose value in Q is at least µ
  Recursive execution:
    If |S| ≥ k and |S'| ≥ k
      Return Mondrian(S, k) ∪ Mondrian(S', k)
    Else
      Return D
* LeFevre et al. Mondrian multidimensional k-anonymity. ICDE.

31 Example of applying Mondrian (k=2)
Original data:
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
Anonymized data:
Age      Sex    Disease
[20-26]  {M,F}  HIV
[20-26]  {M,F}  HIV
[20-26]  {M,F}  Obesity
[27-29]  F      HIV
[27-29]  F      Cancer
[27-29]  F      Obesity
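The recursion described for Mondrian can be sketched in a few lines and run on the six records of this example. This is a simplified illustration, not the implementation of LeFevre et al.: it assumes that only numeric QIDs are split (here, age), that "largest domain" means the most distinct values in the current partition, and that a failed split ends the recursion, as in the slide's pseudocode. All function and parameter names are chosen for the example.

```python
from statistics import median


def mondrian(records, numeric_qids, k):
    """Recursively split `records` at the median; return a list of groups."""
    # Attribute selection: the QID with the most distinct values in this partition.
    q = max(numeric_qids, key=lambda a: len({r[a] for r in records}))
    mu = median(r[q] for r in records)
    left = [r for r in records if r[q] < mu]    # values below the median
    right = [r for r in records if r[q] >= mu]  # values at or above the median
    # Recurse only when both sides can still form k-anonymous groups.
    if len(left) >= k and len(right) >= k:
        return mondrian(left, numeric_qids, k) + mondrian(right, numeric_qids, k)
    return [records]


def summarize(group):
    """Generalize a group to an age range and the set of sexes it contains."""
    ages = [r["age"] for r in group]
    return (min(ages), max(ages)), sorted({r["sex"] for r in group})


records = [
    {"age": 20, "sex": "M", "disease": "HIV"},
    {"age": 23, "sex": "F", "disease": "HIV"},
    {"age": 25, "sex": "M", "disease": "Obesity"},
    {"age": 27, "sex": "F", "disease": "HIV"},
    {"age": 28, "sex": "F", "disease": "Cancer"},
    {"age": 29, "sex": "F", "disease": "Obesity"},
]

groups = mondrian(records, ["age"], k=2)
summary = [summarize(g) for g in groups]
print(summary)
```

The sketch produces the same two groups of three records as the slide. It reports the first age range as the tight interval 20 to 25, whereas the slide labels it [20-26] because it uses the partition boundary at the median.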

32 Other works on partition-based algorithms R-tree based algorithm [1] Optimized partitioning for intended tasks [2] Classification Regression Query answering Algorithms for disk-resident data [3] Extensions to prevent sensitive information disclosure [4] [1] Iwuchukwu et al. K-anonymization as spatial indexing: toward scalable and incremental anonymization, VLDB, [2] LeFevre et al. Workload-aware anonymization. KDD, [3] LeFevre et al. Workload-aware anonymization techniques for large-scale datasets. TODS, [4] Loukides et al. Preventing range disclosure in k-anonymised data. Expert Syst. Appl

33 Clustering-based anonymization algorithms Main idea of clustering-based anonymization 1.Create clusters containing at least k records with similar values over QIDs Seed selection Similarity measurement Stopping criterion 2. Anonymize records in each cluster separately Local recoding and/or Suppression??? 33

34 Clustering-based anonymization algorithms Clusters need to be separated Seed Selection Furthest-first Random Clusters need to contain similar values Similarity measurement Stopping criterion Size-based Quality-based Clusters should not be too large All these heuristics attempt to improve data utility 34

35 Bottom-up clustering algorithm
Bottom-up clustering algorithm*
  Each record is selected as a seed to start a cluster
  While there exists a group G s.t. |G| < k
    For each group G s.t. |G| < k
      Find group G' s.t. NCP(G ∪ G') is min. and merge G and G'
    For each group G s.t. |G| > 2k
      Split G into groups s.t. each group has at least k records
  Generalize the QID values in each group
  Return all groups
* Xu et al. Utility-Based Anonymization Using Local Recoding. KDD.

36 Example of Bottom-up clustering algorithm (k=2)
Original data:
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
Anonymized data:
Age      Sex  Disease
[20-25]  M    HIV
[20-25]  M    Obesity
[23-27]  F    HIV
[23-27]  F    HIV
[28-29]  F    Cancer
[28-29]  F    Obesity

37 Example of top-down clustering algorithm (k=2)
Original data:
Age  Sex  Disease
20   M    HIV
23   F    HIV
25   M    Obesity
27   F    HIV
28   F    Cancer
29   F    Obesity
Anonymized data:
Age      Sex    Disease
[20-25]  {M,F}  HIV
[20-25]  {M,F}  HIV
[20-25]  {M,F}  Obesity
[27-29]  F      HIV
[27-29]  F      Cancer
[27-29]  F      Obesity

38 Other works on clustering-based anonymization Constant factor approximation algorithms* Publish only the cluster centers along with radius information Combine partitioning with clustering for efficiency** * Aggarwal et al. Achieving anonymity via clustering. ACM Trans. on Algorithms, ** Loukides et al. Preventing range disclosure in k-anonymised data. Expert Syst. Appl

39 Preventing identity disclosure from diagnosis codes: Suppression
Suppression removes items or records from data prior to releasing the data
Suppress ICD codes* appearing in fewer than a certain percent of patient records
Intuition: such ICD codes can act as quasi-identifiers
Identified EMR data ID ICD Mary Anne Released EMR Data ICD DNA AC T GC C
* Vinterbo et al. Hiding information by cell suppression. AMIA Annual Symposium 01.
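Threshold-based suppression of rare codes can be sketched as follows. This is an illustrative reading of the slide rather than the procedure of Vinterbo et al.; the function name, the `min_fraction` parameter, and the four sample records are assumptions.

```python
from collections import Counter


def suppress_rare_codes(transactions, min_fraction):
    """Drop every code that appears in fewer than `min_fraction` of the records."""
    n = len(transactions)
    if n == 0:
        return []
    # Count each code once per record.
    support = Counter(code for t in transactions for code in set(t))
    keep = {code for code, count in support.items() if count / n >= min_fraction}
    return [sorted(set(t) & keep) for t in transactions]


# Hypothetical diagnosis-code records (illustrative values only).
transactions = [
    {"401.1", "780.79"},
    {"401.1"},
    {"401.1", "729.5"},
    {"780.79"},
]

released = suppress_rare_codes(transactions, min_fraction=0.5)
print(released)
```

With a 50% threshold the code that occurs in only one of the four records is removed, and the other codes are kept.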

40 Code suppression: a case study using Vanderbilt's EMR data
We had to suppress diagnosis codes appearing in fewer than 25% of the records in VNEC to prevent re-identification. Doing so, we were left with only 5 out of ~6000 ICD codes!*
5-Digit ICD-9 Codes 3-Digit ICD-9 Codes ICD-9 Sections Benign essential hypertension Other malaise and fatigue 401-Essential hypertension 780- Other soft tissue Pain in limb Other disorders of soft tissues Abdominal pain 789 Other abdomen/pelvis symptoms Hypertensive disease Rheumatism excluding the back Rheumatism excluding the back Symptoms Chest pain 786 -Respiratory system Symptoms
* Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS.

41 Preventing identity disclosure in EMR data: Generalization
Generalization replaces items with more general ones (usually with the help of a domain hierarchy)
Hierarchy levels, from specific to general: 5-digit ICD codes, 3-digit ICD codes, Sections, Chapters, Any
Generalize ICD codes to their 3-digit representation: benign essential hypertension → 401 essential hypertension
Identified EMR data ID ICD Mary Anne Released EMR Data ICD DNA AC T GC C
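Generalizing a 5-digit ICD-9 code to its 3-digit category amounts to keeping the part before the decimal point. The sketch below assumes codes are stored as strings such as "401.1"; the function name and sample codes are chosen for illustration.

```python
def generalize_to_3_digit(record):
    """Replace each ICD-9 code by its 3-digit category and drop duplicates."""
    return sorted({code.split(".")[0] for code in record})


# Two hypertension codes collapse into the same 3-digit category.
print(generalize_to_3_digit(["401.1", "401.9", "780.79"]))
```

Codes that differ only after the decimal point become indistinguishable, which raises their support count at the cost of clinical detail.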

42 Code generalization: a case study using Vanderbilt's EMR data
Generalizing ICD codes from VNEC*
[Figure: % of re-identified sample versus distinguishability (log scale), with curves for 5-digit ICD codes, 3-digit ICD codes, no suppression, and suppression at 5%, 15%, and 25%. Labeled values: 96.5%, 75.0%, 25.6%, 95%.]
95% of the patients remain re-identifiable
* Loukides et al. The Disclosure of Diagnosis Codes Can Breach Research Participants' Privacy. JAMIA.

43 Complete k-anonymity & k^m-anonymity
Complete k-anonymity: knowing that an individual is associated with any itemset, an attacker should not be able to associate this individual with fewer than k transactions
k^m-anonymity: knowing that an individual is associated with any m-itemset, an attacker should not be able to associate this individual with fewer than k transactions
[Tables: original data (ICD, DNA) next to its 2-complete anonymous version, and original data next to a k^m-anonymous version in which every record's ICD codes are generalized to 401. The DNA column reads AC T, GC C, CC A, CA T.]
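A brute-force check of k^m-anonymity follows from the definition: enumerate every itemset of at most m items that occurs in some transaction and verify that it occurs in at least k transactions. This is only a sketch for small data (the enumeration grows quickly with m); the function name and the sample transactions are assumptions.

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(transactions, k, m):
    """True if every itemset of up to m items that occurs has support >= k."""
    support = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, m + 1):
            for itemset in combinations(items, size):
                support[itemset] += 1
    return all(count >= k for count in support.values())


# Hypothetical transactions of diagnosis codes (illustrative values only).
original = [["401.1", "780.79"], ["401.1"], ["780.79"]]
generalized = [["401"], ["401"], ["401", "780"], ["401", "780"]]

print(is_km_anonymous(original, k=2, m=1))
print(is_km_anonymous(original, k=2, m=2))
print(is_km_anonymous(generalized, k=2, m=2))
```

The first dataset is safe against attackers who know one code but not against those who know two, because the pair occurs in a single transaction.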

44 Applicability of complete k-anonymity and k^m-anonymity to medical data
Limited in the specification of privacy requirements:
  They assume too powerful attackers: all m-itemsets (combinations of m diagnosis codes) need protection, but medical data publishers have detailed privacy requirements
  Example: attackers know who is diagnosed with abc or defgh; these models protect all 5-itemsets instead of the 2 itemsets that are the privacy constraints
They explore a small number of possible generalizations
They do not take utility requirements into account

45 Policy-based anonymization model
Policy-based anonymization for ICD codes*
  Global anonymization model
  Models both generalization and suppression
  Each original ICD code is replaced by a unique set of ICD codes: no need for generalization hierarchies
ICD codes Anonymized codes (493.00, ) (296.01, ) Generalized ICD code interpreted as or or both Φ ( ) Suppressed ICD code Not released
* Loukides, Gkoulalas-Divanis, Malin. Anonymization of Electronic Medical Records for Validating Genome-Wide Association Studies. PNAS 10.

46 Policy-based anonymization: Privacy model Data publishers specify diagnosis codes that need protection Privacy Model: Knowing that an individual is associated with one or more specific itemsets (privacy constraints), an attacker should not be able to associate this individual to less than k transactions ICD DNA AC T GC C CC A CA T Original data ICD DNA AC T (401.2, 401.4) GC C CC A (401.2, 401.4) CA T Anonymized data Privacy Policy: The set of all specified privacy constraints Privacy is achieved when all privacy constraints are supported by at least k transactions in the published data or do not appear at all 46
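The privacy condition stated on this slide can be checked directly once anonymized items are represented as sets of original codes. In the sketch below, a generalized item such as (401.2, 401.4) is a `frozenset` and is treated as covering each of its member codes; this representation, the function names, and the sample data are assumptions made for illustration, not the authors' implementation.

```python
def supports(transaction, constraint):
    """True if every code in the constraint is covered by some item of the transaction."""
    return all(any(code in item for item in transaction) for code in constraint)


def satisfies_policy(transactions, policy, k):
    """Each constraint must be supported by at least k transactions or by none."""
    for constraint in policy:
        support = sum(supports(t, constraint) for t in transactions)
        if 0 < support < k:
            return False
    return True


# Hypothetical anonymized records; each item is a set of original ICD codes.
anonymized = [
    [frozenset({"401.2", "401.4"})],
    [frozenset({"401.2", "401.4"}), frozenset({"296.00"})],
    [frozenset({"296.00"})],
]

print(satisfies_policy(anonymized, [{"401.2"}, {"493.00"}], k=2))
print(satisfies_policy(anonymized, [{"401.2", "296.00"}], k=2))
```

The first policy holds because one constraint is supported by two records and the other does not appear at all; the second fails because the combination of codes is supported by a single record.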

47 Policy-based anonymization: Data utility considerations Utility Constraints: Published data must remain as useful as the original data for conducting a GWAS on a disease or trait à number of cases and controls in a GWAS must be preserved Supporting utility constraints: ICD codes from utility policy are generalized together; a larger part of the solution space is searched than when using domain generalization hierarchies (296.00, ) 47

48 Policy-based anonymization: Measuring information loss Utility Loss: A measure to quantify the level of information loss incurred by anonymization Favors (493.01) over (493.01, ) captures the introduced uncertainty of interpreting an anonymized item customizable # of items mapped to generalized item weight (semantic closeness) fraction of affected transactions 48

49 Policy-based anonymization algorithms
Goal: anonymize medical records so that
  Privacy is guaranteed
  Utility is high → many GWAS are supported simultaneously
  Incurred information loss is minimal
Challenging optimization problem: NP-hard; feasibility depends on constraints
Heuristic algorithms:
  Utility-Guided Anonymization of Clinical Profiles (UGACLIP)
  Clustering-based Anonymization (CBA)
[Table: UGACLIP and CBA compared on efficiency, scalability, and utility]

50 Anonymization algorithms: UGACLIP
Sketch of UGACLIP (PNAS)
Input: EMRs, Privacy Policy, Utility Policy, k
Output: Anonymized EMRs
  While the Privacy Policy is not satisfied
    Select the privacy constraint p that corresponds to most patients
    While p is not protected
      Select the ICD code i in p that corresponds to fewest patients
      If i can be anonymized according to the Utility Policy
        generalize i to (i, i')
      Else
        suppress each unprotected ICD code in p
UGACLIP considers privacy constraints in a certain order and protects a privacy constraint by set-based anonymization: generalization when the Utility Policy is satisfied, otherwise suppression

51 Anonymization algorithms: UGACLIP Privacy Policy Utility Policy k=2 EMR data ICD DNA CT A AC T GC C UGACLIP Algorithm Data is protected; {296.00, , } appears 2 times Anonymized EMR data ICD DNA (296.00, ) CT A AC T (296.00, ) GC C Data remains useful for GWAS on Bipolar disorder; associations between (296.00, ) and DNA region CT A are preserved 51

52 Anonymization algorithms: CBA Sketch of CBA Retrieve the ICD codes that need less protection from the Privacy Policy Gradually build a cluster of codes that can be anonymized according to the utility policy and with minimal UL If the ICD codes are not protected Suppress no more ICD codes than required to protect privacy privacy req. are met p 1 = {i 1 } p 2 = {i 5, i 6 } p 3 = {i 3, i 4 } k=3 clusters merging (driven by UL) singleton clusters 52

53 Case Study: EMRs from Vanderbilt University Medical Center
Datasets:
  VNEC: 2762 de-identified EMRs from Vanderbilt, involved in a GWAS
  VNECkc: subset of VNEC for which we know which diseases are controls for others
  BIOVU: all de-identified EMRs (79087) from Vanderbilt's biobank (the largest dataset in medical data privacy literature)*
Methods:
  UGACLIP and CBA
  ACLIP (state-of-the-art method; it does not take the utility policy into account)
* Loukides, Gkoulalas-Divanis. Utility-aware anonymization of diagnosis codes. IEEE TITB.

54 UGACLIP & CBA: First algorithms to offer data utility in GWAS Setting: k = 5, protecting single-visits of patients, 18 GWAS-related diseases* no utility constraints Diseases related to all GWAS reported in Manolio* Best competitor Result of ACLIP is useless for validating GWAS UGACLIP preserves 11 out of 18 GWAS CBA preserves 14 out of 18 GWAS simultaneously * Manolio et al. A HapMap harvest of insights into the genetics of common disease. J Clinic. Inv

55 Utility beyond GWAS
Supporting clinical case counts in addition to GWAS: learn the number of patients with sets of codes in 10% of the records; useful for epidemiology and data mining applications
[Figure: actual versus estimated counts on VNECkc]
Queries can be estimated accurately (ARE < 1.25), comparable to ACLIP
Anonymized data can support both GWAS and studies on clinical case counts

56 Anonymizing the BIOVU (79K EMRs)
Supporting clinical case counts in BIOVU: very low error in query answering (Average Relative Error < 1)
All EMRs in the VUMC biobank can be anonymized and remain useful
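The slides report query accuracy as Average Relative Error (ARE) without spelling out the formula. A common definition, assumed here, averages |actual - estimated| / actual over a set of count queries with non-zero actual counts; the numbers below are hypothetical and are not results from the study.

```python
def average_relative_error(actual, estimated):
    """Mean of |actual - estimated| / actual over paired query results."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    if any(a <= 0 for a in actual):
        raise ValueError("actual counts must be positive")
    errors = [abs(a - e) / a for a, e in zip(actual, estimated)]
    return sum(errors) / len(errors)


# Hypothetical counts: answers on the original data versus the anonymized data.
actual_counts = [100, 40, 10]
estimated_counts = [90, 50, 15]
print(average_relative_error(actual_counts, estimated_counts))
```

Under this definition an ARE of 0 means every query is answered exactly, and an ARE of 1 means the estimates are off by the size of the true count on average.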

57 Rule-based anonymization
Our approach*: we use PS-rules to express protection requirements against both identity and sensitive information disclosure
PS-rule I → J:
  I, public items: at least k records support I (preventing identity disclosure)
  J, sensitive items: at most c x 100% of the records that support I also support J (preventing sensitive information disclosure)
Contributions:
  Rule-based privacy model: more flexible and general than existing models; intuitive and able to capture real-world privacy requirements
  Three anonymization algorithms: effective (better data utility and protection than state-of-the-art) and efficient (efficient rule checking strategies, sampling, etc.)
* G. Loukides, A. Gkoulalas-Divanis, J. Shao. Anonymizing transaction data to eliminate sensitive inferences. DEXA 10 (extended to KAIS).
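The two conditions of a PS-rule I → J translate into two support counts. The sketch below checks them on plain (ungeneralized) item sets; the function name and the sample transactions are assumptions, and the handling of generalized items used by the actual algorithms is omitted.

```python
def rule_is_protected(transactions, public_items, sensitive_items, k, c):
    """Check a PS-rule I -> J: sup(I) >= k and sup(I and J) <= c * sup(I)."""
    sup_i = sum(public_items <= t for t in transactions)
    sup_ij = sum((public_items | sensitive_items) <= t for t in transactions)
    return sup_i >= k and sup_ij <= c * sup_i


# Hypothetical transactions over items a, b, c, d, j (illustrative values only).
transactions = [
    {"a", "b", "j"},
    {"a", "c"},
    {"a", "d"},
    {"b", "j"},
]

# Rule a -> j: three records support a, one of them also supports j.
print(rule_is_protected(transactions, {"a"}, {"j"}, k=2, c=0.5))
print(rule_is_protected(transactions, {"a"}, {"j"}, k=4, c=0.5))
print(rule_is_protected(transactions, {"a"}, {"j"}, k=2, c=0.25))
```

The first call passes both conditions, the second fails the identity condition because fewer than four records support a, and the third fails the sensitive-inference condition because one in three exceeds 25%.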

58 Rule-based anonymization
An example of offering PS-rule based anonymization
Original dataset:
Name  Diagnosis codes
Mary  a b c d g h i j
Bob   e f h i
Tom   d g j
Anne  e f g h
Brad  a b d e i
Jim   c f j
2^2-anonymous dataset:
Name  Diagnosis codes
Mary  (a,b,c) (d,e,f) g h i j
Bob   (d,e,f) h i
Tom   (d,e,f) g j
Anne  (d,e,f) g h
Brad  (a,b,c) (d,e,f) i
Jim   (a,b,c) (d,e,f) j
Anonymous dataset based on the PS-rules model:
Name  Diagnosis codes
Mary  a (b,c) d g h i j
Bob   e f h i
Tom   d g j
Anne  e f g h
Brad  a (b,c) d e i
Jim   (b,c) f j
Hierarchy and PS-rules: a → j; c d → g; d → h i
PS-rules can be automatically discovered and specified
Support of fine-grained, flexible privacy requirements
Privacy protection from both identity and sensitive information disclosure
General privacy model that incorporates existing privacy models
* Grigorios Loukides, Aris Gkoulalas-Divanis, Jianhua Shao. Efficient and flexible anonymization of transaction data. Knowledge and Information Systems (KAIS), 36(1).

59 Experimental results: Setup
Datasets: BMS1 and BMS2 contain click-stream data; POS contains sales transaction data
Evaluation: Is anonymized data useful in aggregate query answering? How efficient are the algorithms?
Methods: Tree-based and Sample-based vs Baseline (no pruning) and Apriori Anonymization*
* M. Terrovitis, N. Mamoulis, P. Kalnis. Privacy-preserving anonymization of set-valued data. PVLDB.

60 Data Utility: uniform privacy requirements BMS2 dataset, k=5,p=2 (all 2-itemsets need protection from identity disclosure) Same protection as Apriori for identity disclosure, and additionally thwarts sensitive information disclosure Our algorithms offer many times more accurate query answering 60

61 Data Utility: detailed privacy requirements BMS2 dataset, k=5,p=2, type 2-1 rules of varying number and rules of other types Apriori cannot take the detailed privacy requirements into account and overdistorts data Our algorithms protect data no more than necessary to satisfy these requirements, achieving much higher data utility 61

62 Efficiency BMS2 dataset, k=5,p=2, 5K type 2-1 rules Synthetic data of varying D and I, k=5,p=2, 5K rules of type 2-1 Sample-based is the fastest and most scalable; Apriori is the slowest 62

63 Anonymization of RT-datasets (e.g., demographics + diagnosis codes)
Privacy threat: attackers know some relational attribute values (e.g., demographics) plus some sensitive items (e.g., diagnoses) for an individual.
* G. Poulis, G. Loukides, A. Gkoulalas-Divanis, S. Skiadopoulos. Anonymizing data with relational and transactional attributes. PKDD.

64 SECRETA anonymization tool
SECRETA*, **: System for Evaluating and Comparing RElational and Transaction Anonymization algorithms
  Evaluates algorithms for relational, transaction and RT-dataset anonymization
  Integrates 9 popular anonymization algorithms and 3 bounding methods for combining them
  R (relational): supports Incognito, Cluster, Top-down and Full subtree bottom-up
  T (transaction): supports COAT, PCTA, Apriori, LRA and VPA
  Supports two modes of operation: Evaluation and Comparison
* G. Poulis, A. Gkoulalas-Divanis, G. Loukides, C. Tryfonopoulos, S. Skiadopoulos. SECRETA: A system for evaluating and comparing relational and transaction anonymization algorithms. EDBT 14.
** G. Poulis, G. Loukides, A. Gkoulalas-Divanis, S. Skiadopoulos. Anonymizing data with relational and transactional attributes. PKDD.

65 Summary Explained the need for privacy in medical data sharing Presented the state-of-the-art in privacy-preserving medical data publishing to support intended analyses Elaborated on the policy-based anonymization model, which allows data publishers to specify detailed privacy and utility constraints for the data anonymization process Discussed methods for anonymizing data of co-existing data types, and introduced the SECRETA anonymization tool Thank you! Questions?

66 Internship IBM paid internships per DRL (out of ~100 applications). ~5 unpaid internships. Internship duration: 3-4 months; start date: flexible. Positions are advertised in December and filled as soon as possible. Each candidate identifies a DRL to work with and a project that is of mutual interest to the applicant and to IBM. The candidate submits a 1-2 page document on the project that he/she will be involved in if accepted. At the end of the internship, the candidate gives a talk to the lab on his/her accomplishments during the internship. Need more information? Email me at: [email protected]
