Anonymization of Longitudinal Electronic Medical Records. Acar Tamersoy, Grigorios Loukides, Mehmet Ercan Nergiz, Yucel Saygin, and Bradley Malin

Size: px
Start display at page:

Download "Anonymization of Longitudinal Electronic Medical Records. Acar Tamersoy, Grigorios Loukides, Mehmet Ercan Nergiz, Yucel Saygin, and Bradley Malin"

Transcription

1 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY Anonymization of Longitudinal Electronic Medical Records Acar Tamersoy, Grigorios Loukides, Mehmet Ercan Nergiz, Yucel Saygin, and Bradley Malin Abstract Electronic medical record (EMR) systems have enabled healthcare providers to collect detailed patient information from the primary care domain. At the same time, longitudinal data from EMRs are increasingly combined with biorepositories to generate personalized clinical decision support protocols. Emerging policies encourage investigators to disseminate such data in a deidentified form for reuse and collaboration, but organizations are hesitant to do so because they fear such actions will jeopardize patient privacy. In particular, there are concerns that residual demographic and clinical features could be exploited for reidentification purposes. Various approaches have been developed to anonymize clinical data, but they neglect temporal information and are, thus, insufficient for emerging biomedical research paradigms. This paper proposes a novel approach to share patient-specific longitudinal data that offers robust privacy guarantees, while preserving data utility for many biomedical investigations. Our approach aggregates temporal and diagnostic information using heuristics inspired from sequence alignment and clustering methods. We demonstrate that the proposed approach can generate anonymized data that permit effective biomedical analysis using several patient cohorts derived from the EMR system of the Vanderbilt University Medical Center. Index Terms Anonymization, data privacy, electronic medical records (EMRs), longitudinal data. I. INTRODUCTION ADVANCES in health information technology have facilitated the collection of detailed, patient-level clinical data to enable efficiency, effectiveness, and safety in healthcare operations [1]. Such data are often stored in electronic medical record (EMR) systems [2], [3] and are increasingly repurposed to support clinical research (see, e.g., [4] [7]). Recently, EMRs have been combined with biorepositories to enable genome-wide association studies (GWAS) with clinical phenomena in the hopes of tailoring healthcare to genetic variants [8]. To demonstrate feasibility, EMR-based GWAS have focused on static phenotypes; i.e., where a patient is designated as disease positive Manuscript received April 7, 2011; revised September 27, 2011; accepted January 16, Date of publication January 27, 2012; date of current version May 4, This work was supported by the National Institutes of Health under Grant U01HG006378, Grant U01HG006385, and Grant R01LM009989, and by a Fellowship from the Royal Academy of Engineering Research. A. Tamersoy and B. Malin are with the Department of Biomedical Informatics, Vanderbilt University, Nashville, TN USA ( acar.tamersoy@vanderbilt.edu; b.malin@vanderbilt.edu). G. Loukides is with the School of Computer Science and Informatics, Cardiff University, Cardiff, CF24 3AA, U.K. ( g.loukides@cs.cf.ac.uk). M. E. Nergiz is with the Department of Computer Engineering, Zirve University, Gaziantep 27260, Turkey ( mehmet.nergiz@zirve.edu.tr). Y. Saygin is with the Department of Computer Science and Engineering, Sabanci University, Istanbul 34956, Turkey ( ysaygin@sabanciuniv.edu). Digital Object Identifier /TITB or negative (see, e.g., [9] [11]). As these studies mature, they will support personalized clinical decision support tools [12] and will require longitudinal data to understand how treatment influences a phenotype [13], [14]. Meanwhile, there are challenges to conducting GWAS on a scale necessary to institute changes in healthcare. First, to generate appropriate statistical power, scientists may require access to populations larger than those available in local EMR systems [15]. Second, the cost of a GWAS incurred in the setup and application of software to process medical records as well as in genome sequencing is nontrivial [16]. Thus, it can be difficult for scientists to generate novel, or validate published, associations. To mitigate this problem, the U.S. National Institutes of Health (NIH) encourages investigators to share data from NIH-supported GWAS [17] into the Database of Genotypes and Phenotypes (dbgap) [18]. This, however, may lead to privacy breaches if patients clinical or genomic information is associated with their identities. As a first line of defense against this threat, the NIH recommends investigators deidentify data by removing an enumerated list of attributes that could identify patients (e.g., personal names and residential addresses) prior to dbgap submission [19]. However, a patients DNA may still be reidentified via residual demographics [20] and clinical information (e.g., standardized International Classification of Diseases (ICD) codes) [21], as illustrated in the following example. Example 1: Each record in Fig. 1(a) corresponds to a fictional deidentified patient and is comprised of ICD codes, patient s age when a code was received, and a DNA sequence. For instance, the second record denotes that a patient was diagnosed with benign essential hypertension (code 401.1) at ages 38 and 40 and has the DNA sequence GC...A. The clinical and genomic data are derived from an EMR system and a research project beyond primary care, respectively. Publishing the data of Fig. 1(a) could allow a hospital employee with access to the EMR to associate Jane with her DNA sequence. This is because the identified record, shown in Fig. 1(b), can only be linked to the second record in Fig. 1(a) based on the ICD code and ages 38 and 40. Methods to mitigate reidentification via demographic and clinical features [22], [23] have been proposed, but they are not applicable to the longitudinal scenario. These methods assume the clinical profile is devoid of temporal or replicated diagnosis information. Consequently, these methods produce data that are unlikely to permit meaningful longitudinal investigations. In this paper, we propose the first approach to formally anonymize longitudinal patient records. Our work makes the following specific contributions /$ IEEE

2 414 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012 Fig. 1. Longitudinal data privacy problem. (a) Longitudinal research data. (b) Identified EMR. (c) 2-anonymization based on the proposed approach. 1) We propose a framework to transform each longitudinal patient record into a form that is indistinguishable from at least k 1 other records. This is achieved by iteratively clustering records and applying generalization, which replaces ICD codes and age values with more general values, and suppression, which removes ICD codes and age values. For example, applying our approach with k =2to the data of Fig. 1(a) will generate the anonymized data of Fig. 1(c). Observe that Jane s record is now linked to two DNA sequences because the diagnosis code in the first pair of this record has been replaced by the more general code ) We evaluate our approach with several cohorts of patient records from the Vanderbilt University Medical Center (VUMC) EMR system. Our results demonstrate that the anonymized data produced allow many studies focusing on clinical case counts to be performed accurately. The remainder of this paper is organized as follows. In Section II, we review related research on anonymization and its application to biomedical data. In Section III, we formalize the notions of privacy and utility and the anonymization problem addressed in this paper. We present our anonymization framework and discuss its extensions and limitations in Sections IV and VI, respectively. Finally, Section V reports the experimental results and Section VII concludes the paper. A. Relational Data We first review methods for preventing reidentification in relational data (e.g., demographics), where records have a fixed number of attributes and one value per attribute. The first category of protection methods transforms attribute values so that they no longer correspond to real individuals. Popular approaches in this category are noise addition, data swapping, and synthetic data generation (see [30] [32] for surveys). While such methods generate data that preserve aggregate statistics (e.g., the average age), they do not guarantee data that can be analyzed at the record level. This is a significant limitation that hampers the ability to use these data in various biomedical studies, including epidemiological studies [33] and GWAS [22]. In contrast, methods based on generalization and suppression preserve data truthfulness [34] [36]. Many of these methods are based on a principle called k-anonymity [24], [34], which states that each record of the published data must be equivalent to at least k 1 other records with respect to quasi-identifiers (QI) (i.e., attributes that can be linked with external resources for reidentification purposes) [37]. To enhance the utility of the anonymized data, these methods employ various search strategies, including binary search [34], [35], clustering [38], [39], evolutionary search [40], and partitioning [36]. There are methods that have been successfully applied to biomedical data [35], [41]. II. RELATED RESEARCH Reidentification concerns for clinical data via seemingly innocuous attributes were first raised in [24]. Specifically, it was shown that patients could be uniquely reidentified by linking publicly available voter registration lists to hospital discharge summaries via demographics, such as date of birth, gender, and five-digit residential zip code. The reidentification phenomenon has since attracted interest in domains beyond healthcare, and numerous techniques to guard against attacks have emerged (see [25] and [26] for surveys). In this section, we survey research related to privacy-preserving data publishing, with a focus on biomedical data. We note that the reidentification problem is not addressed by access control and encryption-based methods [27] [29] because data need to be shared beyond a small number of authorized recipients. B. Transactional Data Next, we turn our attention to approaches that deal with more complex data. Specifically, we consider transactional data, in which records have a large and variable number of values per attribute (e.g., diagnosis codes assigned to a patient during a hospital visit). Transactional data can also facilitate reidentification in the biomedical domain. For instance, deidentified clinical records can be linked to patients based on combinations of diagnosis codes that are additionally contained in publicly available hospital discharge summaries and EMR systems from which the records have been derived [21]. As it was shown in [21], more than 96% of 2700 patient records, collected in the context of a GWAS, would be susceptible to reidentification based on diagnosis codes if shared without additional controls. From the protection perspective, there are several approaches that have been developed to anonymize transactional data.

3 TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS 415 For instance, Terrovitis et al. [42] proposed k m -anonymity, a principle, and several heuristic algorithms to prevent attackers from linking an individual to less than k records. This model assumes that the adversary knows at most m attribute values of any transaction. To anonymize patient records in transactional form, Loukides et al. [22] introduced a privacy principle to ensure that sets of potentially identifying diagnosis codes are protected from reidentification, while remaining useful for GWAS validations. To enforce this principle, they proposed an algorithm that employs generalization and suppression to group semantically close diagnosis codes together in a way that enhances data utility [22], [23]. Additionally, Tamersoy et al. [43] considered protecting data in which a certain diagnosis code may occur multiple times in a patient record. They designed an algorithm which preserves patients privacy through suppressing a subset of the replications of a diagnosis code. Our work differs from the aforementioned research along two principal dimensions. First, we address reidentification in longitudinal data publishing. Second, contrary to the approaches in [22] and [23] which group diagnosis codes together, our framework is based on grouping of records, which has been shown to be highly effective in retaining data utility due to the direct identification of records being anonymized [38], [39], [44]. C. Spatiotemporal Data Spatiotemporal data are related to the problem studied in this paper. They are time and location dependent, and these unique characteristics make them challenging to protect against reidentification. Such data are typically produced as a result of queries issued by mobile subscribers to location-based service providers, who, in turn, supply information services based on specific physical locations. The principle of k-anonymity has been extended to anonymize spatiotemporal data. Abul et al. [45] proposed a technique to group at least k objects that correspond to different subscribers and appear within a certain radius of the path of every object in the same time period. In addition to generalization and suppression, Abul et al. [45] considered adding noise to the original paths so that objects appear at the same time and spatial trajectory volume. Assuming that the locations of subscribers constitute sensitive information, Terrovitis and Mamoulis [46] proposed a suppression-based methodology to prevent attackers from inferring these locations. Finally, Nergiz et al. [44] proposed an approach that employs k- anonymity, enforced using generalization, together with reconstruction to protect data against boundary-based attacks. Our heuristics are inspired from [44]; however, we employ both generalization and suppression to further enhance data utility, and we do not use reconstruction, so as to preserve data truthfulness. The aforementioned approaches are developed for anonymizing spatiotemporal data and cannot be applied to longitudinal data due to different semantics. Specifically, the data we consider record patients diagnoses and not their locations. Consequently, the objective of our approach is not to hide the locations of patients, but to prevent reidentification based on their diagnosis and time information. Fig. 2. General architecture of the longitudinal data anonymization process. III. BACKGROUND AND PROBLEM FORMULATION This section begins with a high-level overview of the proposed approach. Next, we present the notation and the definitions for the privacy and adversarial models, the data transformation strategies, and the information loss metrics. We conclude the section with a formal problem description. A. Architectural Overview Fig. 2 provides an overview of the data anonymization process. The process is initiated when the data owner supplies the following information: 1) a dataset of longitudinal patient records, each of which consists of (ICD, Age) pairs and a DNA sequence, and 2) a parameter k that expresses the desired level of privacy. Given this information, the process invokes our anonymization framework. To satisfy the k-anonymity principle, our framework forms clusters of at least k records of the original dataset, which are modified using generalization and suppression. B. Notation A dataset D consists of longitudinal records of the form T,DNA T, where T is a trajectory 1 and DNA T is a genomic sequence. Each trajectory corresponds to a distinct patient in D and is a multiset 2 of pairs (i.e., T = {t 1,...,t m }) drawn from two attributes, namely ICD and Age [i.e., t i = (u ICD,v Age)], which contain the diagnosis codes assigned to a patient and their age, respectively. D denotes the number of records in D and T the length of T, defined as the number of pairs in T.Weusethe. operator to refer to a specific attribute value in a pair (e.g., t i.icd or t i.age). To study the data temporally, we order the pairs in T with respect to Age, such that t i 1.age t i.age. C. Adversarial Model We assume that an adversary has access to the original dataset D, such as in Fig. 1(a). An adversary may perform a reidentification attack in several ways. 1) Using identified EMR data: The adversary links D with the identified EMR data, such as those of Fig. 1(b), based 1 We use the term trajectory since the diagnosis codes at different ages can be seen as a route for the patient throughout his/her life. 2 Contrary to a set, a multiset can contain an element more than once.

4 416 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012 Fig. 3. Example of the DGH structure for Age. on (ICD, Age) pairs. This scenario requires the adversary to have access to the identified EMR data, which is the case of an employee of the institution from which the longitudinal data were derived. 2) Using publicly available hospital discharge summaries and identified resources: The adversary first links D with hospital discharge summaries based on (ICD, Age) pairs to associate patients with certain demographics. In turn, these demographics are exploited in another linkage with public records, such as voter registration lists, which contain identity information. Note that in both cases, an adversary is able to link patients to their DNA sequences, which suggests that a formal approach to longitudinal data anonymization is desirable. D. Privacy Model The formal definition of k-anonymity in the longitudinal data context is provided in Definition 1. Since each trajectory often contains multiple (ICD, Age) pairs, it is difficult to know which can be used by an adversary to perform reidentification attacks. Thus, we consider the worst-case scenario in which any combination of (ICD, Age) pairs can be exploited. Regardless, k-anonymity limits an adversary s ability to perform reidentification based on (ICD, Age) pairs, because each trajectory is associated with no less than k patients. Definition 1 (k-anonymity): An anonymized dataset D, produced from D,isk-anonymous if each trajectory in D, projected over QI, appears at least k times for any QI in D. E. Data Transformation Strategies Generalization and suppression are typically guided by a domain generalization hierarchy (DGH) (Definition 2) [47]. Definition 2 (DGH): A DGH for attribute A, referred to as H A, is a partially ordered tree structure which defines valid mappings between specific and generalized values of A. The root of H A is the most generalized value of A and is returned by a function root. Example 2: Consider H Age in Fig. 3. The values in the domain of Age (i.e., 33, 34,...,40) form the leaves of H Age. These values are then mapped to two, to four, and eventually to eightyear intervals. The root of H Age is returned by root(h Age ) as [33 40]. Our approach does not impose any constraints on the structure of an attribute s DGH, such that the data owners have complete freedom in its design. For instance, for ICD codes, data owners can use the standard ICD-9-CM hierarchy. 3 For ages, data owners can use a predefined hierarchy (e.g., the age hierarchy in the HIPAA Safe Harbor Policy 4 ) or design a DGH manually. 5 According to Definition 3, each specific value of an attribute generalizes to its direct ancestor in a DGH. However, a specific value can be projected up multiple levels in a DGH via a sequence of generalizations. As a result, a generalized value A i is interpreted as any one of the leaf nodes in the subtree rooted by A i in H A. Definition 3 (Generalization and Suppression): Given a node A i root(h A ) in H A, generalization is performed using a function f:a i A j which replaces A i with its direct ancestor A j. Suppression is a special case of generalization and is performed using a function g:a i A r which replaces A i with root(h A ). Example 3: Consider the last trajectory in Fig. 1(c). The first pair (401.1, [39 40]) is interpreted as either (401.1, 39) or (401.1, 40). F. Information Loss Generalization and suppression incur information loss because values are replaced by more general ones or eliminated. To capture the amount of information loss incurred by these operations, we quantify the normalized loss for each ICD code and Age value in a pair based on the Loss Metric (LM) (Definition 4) [40]. Definition 4 (LM): The information loss incurred by replacing a node A i with its ancestor A j in H A is LM(A i, A j )= A j A i A where A i and A j denote the number of leaf nodes in the subtree rooted by A i and A j in H A, respectively, and A denotes the domain size of attribute A. Example 4: Consider H Age in Fig. 3. The information loss incurred by generalizing [33 34] to [33 36] is =0.25 because the leaf-level descendants of [33 34] are 33 and 34, those of [33 36] are 33, 34, 35, and 36, and the domain of Age consists of the values To introduce the combined LM, which captures the total LM of replacing two nodes with their ancestor, provided in Definition 6, we use the notation of lowest common ancestor (LCA), provided in Definition 5. Definition 5 (LCA): The LCA A l of nodes A i and A j in H A is the farthest node (in terms of height) from root(h A ) such that (1) A i = A l or f n (A i )=A l and (2) A j = A l orf m (A j )=A l, and is returned by a function lca. Definition 6 (Combined LM): The combined LM of replacing nodes A i and A j with their LCA A l is LM(A i + A j, A l )=LM(A i, A l )+LM(A j, A l ). (2) 3 More information is available at 4 HIPAA Safe Harbor states all ages under 89 can be retained intact, while 90 or greater must be grouped together. 5 We further note that our approach is extendible to other categorical attributes, such as SNOMED-CT and Date, provided that a DGH can be specified for each of the attributes. Such extensions, however, are beyond the scope of this paper. (1)

5 TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS 417 Next, we define the LM for an anonymized trajectory (Definition 7) and dataset (Definition 8), which we keep separate for each attribute. Definition 7 (LM for an Anonymized Trajectory): Given an anonymized trajectory T and an attribute A, the LM with respect to A is computed as T LM( T,A) = LM( t i.a, t i.a) (3) i=1 where t i A denotes the value t i A is replaced with. Definition 8 (LM for an Anonymized Dataset): Given an anonymized dataset D and an attribute A, the LM with respect to attribute A is computed as LM( D, A) = 1 LM( T,A) D T D T. (4) For clarity, we refer to an LM related to ICD and Age by ILM and ALM, respectively (e.g., we use ILM( D) instead of LM( D, ICD)). G. Problem Statement The longitudinal data anonymization problem is formally defined as follows. Problem: Given a longitudinal dataset D, a privacy parameter k, and DGHs for attributes ICD and Age, construct an anonymized dataset D, such that 1) D is k-anonymous, 2) the order of the pairs in each trajectory of D is preserved in D, and 3) ILM( D)+ALM( D) is minimized. IV. ANONYMIZATION FRAMEWORK In this section, we present our framework for longitudinal data anonymization. Many clustering algorithms can be applied to produce k- anonymous data [48], [49]. This involves organizing records into clusters of size at least k, which are anonymized together. In the context of longitudinal data, the challenge is to define a distance metric for trajectories such that a clustering algorithm groups similar trajectories. We define the distance between two trajectories as the cost (i.e., incurred information loss) of their anonymization as defined by the LM. The problem then reduces to finding an anonymized version T of two given trajectories such that ILM( T )+ALM( T ) is minimized. Finding an anonymization of two trajectories can be achieved by finding a matching between the pairs of trajectories that minimizes their cost of anonymization. This problem, which is commonly referred to as sequence alignment, has been extensively studied in various domains, notably for the alignment of DNA sequences to identify regions of similarity in a way that the total pairwise edit distance between the sequences is minimized [50], [51]. To solve the longitudinal data anonymization problem, we propose Longitudinal Data Anonymizer, a framework that incorporates alignment and clustering as separate components, as shown in Fig. 2. The objective of each component is summarized as follows. 1) Alignment attempts to find a minimal cost pair matching between two trajectories. 2) Clustering interacts with the Alignment component to create clusters of at least k records. Next, we examine each component in detail and develop methodologies to achieve their objectives. A. Alignment There are no directly comparable approaches to the method we developed in this study. So, we introduce a simple heuristic, called Baseline, for comparison purposes. Given trajectories X = {x 1,...,x m } and Y = {y 1,...,y n },ILM(X) and ALM(X), and DGHs H ICD and H Age, Baseline aligns X and Y by matching their pairs on the same index. 6 The pseudocode for Baseline is provided in Algorithm 1. This algorithm initializes an empty trajectory T to hold the output of the alignment and then assigns ILM(X) and ALM(X) to variables i and a, respectively (steps 1 and 2). Then, it determines the length of the shorter trajectory (step 3) and performs pair matching (steps 4 9). Specifically, for the pairs of the trajectories that have the same index, Baseline constructs a pair containing the LCAs of the ICD codes and Age values in these pairs (step 5), appends the constructed pair to T (step 6), and updates i and a with the information loss incurred by the generalizations (steps 7 8). Next, Baseline updates i and a with the amount of information loss incurred by suppressing the ICD codes and Age values from the unmatched pairs in the longer trajectory (steps 10 14). Finally, this algorithm returns T along with i and a, which correspond to ILM( T ) and ALM( T ), respectively (step 15). 6 ILM(X) and ALM(X) are provided as input because X may already be an anonymized version of two other trajectories.

6 418 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012 To help preserve data utility, we provide Alignment using Generalization and Suppression (A-GS), an algorithm that uses dynamic programming to construct an anonymized trajectory that incurs minimal cost. Before discussing A-GS, we briefly discuss the application of dynamic programming. The latter technique can be used to solve problems based on combining the solutions to subproblems which are not independent and share subsubproblems [52]. A dynamic programming algorithm stores the solution of a subsubproblem in a table to which it refers every time the subsubproblem is encountered. To give an example, for trajectories X = {x 1,...,x m } and Y = {y 1,...,y n }, a subproblem may be to find a minimal cost pair matching between the first to the jth pairs. A solution to this subproblem can be determined using solutions for the following subsubproblems and applying the respective operations: 1) Align X = {x 1,...,x j 1 } and Y = {y 1,...,y j 1 }, and generalize x j with y j ; 2) Align X = {x 1,...,x j 1 } and Y = {y 1,...,y j }, and suppress x j ; 3) Align X = {x 1,...,x j } and Y = {y 1,...,y j 1 }, and suppress y j. Each case is associated with a cost. Our objective is to find an anonymized trajectory T, such that ILM( T ) +ALM( T ) is minimized, so we examine each possible solution and select the one with minimum information loss. A-GS uses a similar approach to align trajectories. The algorithm accepts the same inputs as Baseline as well as weights w ICD and w Age. The latter allow A-GS to control the information loss incurred by anonymizing the values of each attribute. The data owners specify the attribute weights such that w ICD 0, w Age 0, and w ICD + w Age =1. The pseudocode for A-GS is provided in Algorithm 2. In step 1, A-GS initializes three matrices: i, a, and r. The first row (indexed 0) of each of these matrices corresponds to a null value, and starting from index 1, each row corresponds to a value in X. Similarly, the first column (indexed 0) of each of these matrices corresponds to a null value, and starting from index 1, each column corresponds to a value in Y. Specifically, for indices h and j, r h,j records which of the following operations incurs minimum information loss: 1) generalizing x h and y j (denoted with < >), 2) suppressing x h (denoted with < >), and 3) suppressing y j (denoted with < >). The entries in i h,j and a h,j keep the total ILM and ALM for aligning the subtrajectories X sub = {x 1,...,x h } and Y sub = {y 1,...,y j }, respectively. In step 2, A-GS assigns ILM(X) and ALM(X) to i 0,0 and a 0,0, respectively. We include null values in the rows and columns of i, a, and r because at some point during alignment A-GS may need to suppress some portion of the trajectories. Therefore, in steps 3 7 and 8 12, A-GS initializes i, a, and r for the values in X and Y, respectively. Specifically, for indices h and j, i h,0 and i 0,j keep the ILM for suppressing every pair in the subtrajectories X sub = {x 1,...,x h } and Y sub = {y 1,...,y j }, respectively. Similar reasoning applies to matrix a.the first row and column of r holds < > and < > for suppressing values from X and Y, respectively. In steps 13 25, A-GS performs dynamic programming. Specifically, for indices h and j, A-GS determines a minimal cost pair matching of the subtrajectories X sub = {x 1,...,x h } and Y sub = {y 1,...,y j } based on the three cases listed previously. Specifically, in steps 15 21, A-GS constructs two temporary arrays, c and g, to store the ILM and ALM for each possible solution, respectively. Next, in steps 22 and 23, A- GS determines the solution with the minimum information loss and assigns the ILM, ALM, and operation associated with the

7 TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS 419 Fig. 4. Example of the hypertension subtree in the ICD DGH. Fig. 5. Matrices i, a,andr for the subset of records from Fig. 1. This alignment uses the DGHs in Figs. 3 and 4 and assumes that w ICD = w Age =0.5. solution to i h,j, a h,j, and r h,j, respectively. If there is a tie between the solutions, A-GS selects generalization as the operation for the sake of retaining more information. In steps 26 36, A-GS constructs the anonymized trajectory T by traversing the matrix r. Specifically, for two pairs in the trajectories, if generalization incurs minimum information loss, A-GS appends to T a pair containing the LCAs of the ICD codes and Age values in these pairs. The unmatched pairs in the trajectories are ignored during this process because A-GS suppresses these pairs. Finally, in step 37, Baseline returns T along with i m,n and a m,n, which correspond to ILM( T ) and ALM( T ), respectively. Example 5: Consider applying A-GS to T 1 and T 4 in Fig. 1(a) using the DGHs shown in Figs. 3 and 4 and assuming that w ICD = w Age =0.5. The matrices i, a, and r are illustrated in Fig. 5. As T 1 and T 4 are not anonymized, we initialize i 0,0 = a 0,0 =0. Subsequently, A-GS computes the values for the entries in the first row and column of the matrices. For instance, i 0,3 keeps the ILM for suppressing all ICD codes from T 1 and has a value of 1+(1 0.5) = 1.5. This is computed by summing the ILM for suppressing the first two ICD codes (i.e., the value stored in i 0,2 ) with the weight-adjusted ILM for suppressing the third ICD code. Then, A-GS performs dynamic programming. The process starts with aligning T 1,sub = {(401.1, 33)} and T 4,sub = {(401.9, 33)}. The possible solutions for this subproblem are as follows. 1) Align T 1,sub = { } and T 4,sub = { }, and generalize with and 33 with 33. 2) Align T 1,sub = {(401.1, 33)} and T 4,sub = { }, and suppress and 33. 3) Align T 1,sub = { } and T 4,sub = {(401.9, 33)}, and suppress and 33. The ILM and ALM for the subsubproblem in the first solution are stored in i 0,0 and a 0,0, respectively. Generalizing with has an ILM of (1 + 1) 0.5 =1, and generalizing 33 with 33 has an ALM of 0. Therefore, the first solution has a total LM of 1. The ILM and ALM for the subsubproblem in the second solution are stored in i 0,1 and a 0,1, respectively. The suppression of and 33 has an ILM and ALM of =0.5. It can be seen that the second and third solutions each have a total LM of 2. The solution with the minimum information loss is the first one; hence, A-GS stores 1, 0 and < > in i 1,1, a 1,1 and r 1,1, respectively. After the values for the remaining entries are computed, A-GS uses r to construct the anonymized trajectory T. The process starts with examining the bottom-right entry, which denotes a generalization. As a result, A-GS appends (401.1, 35) to T. The process continues by following the symbols, such that A-GS returns T = {(401.1, 33), (401.1, 34), (401.1, 35)} along with i 4,3 and a 4,3, which correspond to ILM( T ) and ALM( T ), respectively. B. Clustering We base our methodology for the clustering component on the maximum distance to average vector (MDAV) algorithm [53], [54], an efficient heuristic for k-anonymity. The pseudocode for MDAV 7 and its helper function, formcluster, are provided in Algorithms 3 and 4, respectively. MDAV iteratively selects the most frequent trajectory in a longitudinal dataset (steps 3 and 12), finds its most distant trajectory X (steps 4 and 13), and forms a cluster of k trajectories around the latter trajectory 7 We refer to our algorithm as MDAV to avoid confusion.

8 420 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012 TABLE I DESCRIPTIVE SUMMARY STATISTICS OF THE DATASETS (steps 5 and 6 and 14 and 15). Cluster formation is performed by formcluster, a function that constructs a cluster C by aligning n trajectories in a consecutive manner and returns C, along with ILM(C) and ALM(C) which are the ILM and ALM for the anonymized trajectory resulting from the alignment, respectively. MDAV minimizes intercluster similarity by constructing a cluster around a trajectory Y that is most distant to X (steps 7 9). We define the distance between two trajectories as the cost of their anonymization. As such, the most distant trajectory is the one that maximizes the sum of ILM and ALM returned from A-GS. A similar reasoning applies when we form a cluster. We add the trajectory that minimizes the sum of ILM and ALM returned from A-GS. MDAV forms a final cluster of size at least k using the remaining trajectories in the dataset (steps 17 19) and returns D, ak-anonymized version of the longitudinal dataset, along with ILM( D) and ALM( D) (step 20). V. EXPERIMENTAL EVALUATION This section presents an experimental evaluation of the anonymization framework. We compare the anonymization methods on data utility, as indicated by the LM measure (see Section V-B) and aggregate query answering accuracy (see Section V-C). Furthermore, we show that our method allows balancing the level of information loss incurred by anonymizing ICD codes and Age values (see Section V-D). This is important to support different types of biomedical studies, such as geriatric and epidemiology studies that are supported well when the information contained in Age and ICD attributes, respectively, is preserved in the anonymized data. A. Experimental Setup We worked with three datasets derived from the Synthetic Derivative (SD), a collection of deidentified information extracted from the EMR system of the VUMC [55]. We issued a query to retrieve the records of patients whose DNA samples were genotyped and stored in BioVU, VUMC s DNA repository linked to the SD. Then, using the phenotype specification in [56], we identified the patients eligible to participate in a GWAS on native electrical conduction within the ventricles of the heart. Subsequently, we created a dataset called D Pop 50 by Fig. 6. Comparison of information loss for D Pop 50 using various k values. restricting our query to the 50 most frequent ICD codes that occur in at least 5% of the records in BioVU. Next, we created a dataset called D Pop 4, which is a subset of D Pop 50, containing the following comorbid ICD codes selected for [43]: 250 (diabetes mellitus), 272 (disorders of lipoid metabolism), 401 (essential hypertension), and 724 (other and unspecified disorders of the back). Finally, we created a dataset called D Smp 4, which is a subset of D Pop 4, containing the records of patients who actually participated in the aforementioned GWAS [57]. D Smp 4 is expected to be deposited into the dbgap repository and has been used in [43] with no temporal information. The characteristics of our datasets are summarized in Table I. Throughout our experiments, we varied k between 2 and 15, noting that k =5tends to be applied in practice [41]. Initially, we set w ICD = w Age =0.5. We implemented all algorithms in Java and conducted our experiments on an Intel 2.8 GHz powered system with 4-GB RAM. B. Capturing Data Utility Using LM We first compared the algorithms with respect to the LM. Fig. 6 depicts the results with D Pop 50. The ILM and ALM increase with k for both algorithms, which is expected because as k increases, a larger amount of distortion is needed to satisfy a stricter privacy requirement. Note that Baseline incurred substantially more information loss than A-GS for all k. In fact, Baseline failed to construct a practically useful result when k>2, as it suppressed all values from the dataset. Similar trends were observed between A-GS and Baseline for D Pop 4 and D Smp 4 (omitted for brevity). Interestingly, A-GS, on average, incurred 48% less information loss on D Pop 4 than D Pop 50. This is important because a relatively small number of ICD codes may suffice to study a range of different diseases [11], [43]. It is also worthwhile to note that the information loss incurred by our approach remains relatively low (i.e., below 0.5) ford Smp 4, even though it is, on average, 55% more than that for D Pop 4. This is attributed to the fact that

9 TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS 421 Fig. 7. Comparison of query answering accuracy for D Pop 50, D Pop 4,andD Smp 4 using various k values. Fig. 8. A comparison of information loss for D Pop 50 using various w ICD and w Age values. D Smp 4 is more sparse than D Pop 4, which implies it is more difficult to anonymize [58]. The ILM and ALM for different k values and datasets, which correspond to the aforementioned results, can be found in Appendix A. C. Capturing Data Utility Using Average Relative Error We next analyzed the effectiveness of our approach for supporting general biomedical analysis. We assumed a scenario in which a scientist issues queries on anonymized data to retrieve the number of trajectories that harbor a combination of (ICD, Age) pairs that appear in at least 1% of the original trajectories. Such queries are typical in many biomedical data mining applications [59]. To quantify the accuracy of answering such a workload of queries, we used the Average Relative Error (AvgRE) measure [36], which reflects the average number of trajectories that are incorrectly included as part of the query answers. Details about this measure are in Appendix B. Fig. 7 shows the AvgRE scores of running A-GS on the datasets. The results for Baseline are not reported because they were more than 6 times worse than our approach for k =2, and the worst possible for k>2. This is because Baseline suppressed all values. As expected, we find an increase in AvgRE scores as k increases, which is due to the privacy/utility tradeoff. Nonetheless, A-GS allows fairly accurate query answering on each dataset by having an AvgRE score of less than 1 for all k. The results suggest our approach can be effective, even when a high level of privacy is required. Furthermore, we observe that the AvgRE scores for D Pop 4 are lower than those of D Smp 4, which are in turn lower than those of D Pop 50. This implies that query answers are more accurate for small domain sizes and large datasets. D. Prioritizing Attributes Finally, we investigated how configurations of attribute weighting affect information loss. Fig. 8 reports the results for D Pop 50 and k =2when our algorithm is configured with weights ranging from 0.1 to 0.9. Observe that when w ICD =0.1 and w Age =0.9, A-GS distorted Age values much less than ICD values. Similarly, A-GS incurred less information loss for ICD than Age when we specified w ICD =0.9 and w Age =0.1.This result implies that data managers can use weights to achieve desired utility for either attribute. VI. DISCUSSION In this section, we discuss how our approach can be extended to prevent a privacy threat in addition to reidentification and its limitations. A. Attacks Beyond Reidentification Beyond reidentification is the threat of sensitive itemset disclosure, in which a patient is associated with a set of diagnosis codes that reveal some sensitive information (e.g., HIV status). k-anonymity does not guarantee prevention of sensitive itemset disclosure, since a large number of records that are indistinguishable with respect to the potentially identifying diagnosis codes can still contain the same sensitive itemset [59]. We note that our approach can be extended to prevent this attack by controlling generalization and suppression to ensure that an additional principle is satisfied, such as l-diversity [60], which dictates how sensitive information is grouped. This extension, however, is beyond the scope of this paper. B. Limitations The proposed approach is limited in certain aspects, which we highlight to suggest opportunities for further research. First, our algorithm induces minimal distortion to the data in practice, but it does not limit the amount of information loss incurred by generalization and suppression. Designing algorithms that provide this type of guarantee is important to enhance the quality of anonymized GWAS-related datasets, but is also computationally challenging due to the large search spaces involved, particularly for longitudinal data. Second, the approach we propose does not guarantee that the released data remain useful for scenarios in which prespecified analytic tasks, such as the validation of known GWAS [22], are known to data owners apriori.to address such scenarios, we plan to design algorithms that take the tasks for which data are anonymized into account during anonymization.

10 422 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012 VII. CONCLUSION AND FUTURE WORK This study was motivated by the growing need to disseminate patient-specific longitudinal data in a privacy-preserving manner. To the best of our knowledge, we introduced the first approach to sharing such data while providing computational privacy guarantees. Our approach uses sequence alignment and clustering-based heuristics to anonymize longitudinal patient records. Our investigations suggest that it can generate longitudinal data with a low level of information loss and remain useful for biomedical analysis. This was illustrated through extensive experiments with data derived from the EMRs of thousands of patients. The approach is not guided by specific utility (e.g., satisfaction of GWAS validation), but we are confident it can be extended to support such endeavors. APPENDIX A INFORMATION LOSS INCURRED BY ANONYMIZING THE DATASETS USING A-GS pair (401.1, [39 40]). Furthermore, observe that is a leaf node in Fig. 4; hence, the set of possible ICD codes is {401.1}. Similarly, the subtree rooted by [39 40] in Fig. 3 consists of two leaf nodes, hence the set of possible Age values is {39, 40}. Therefore, there are two possible pairs: {(401.1, 39), (401.1, 40)}, and the probability that one of the trajectories was originally harboring (401.1, 39) is 1 2. Then, an approximate answer for the query is computed as 1 2 2=1. The Relative Error (RE) for an arbitrary query Q is computed as RE(Q) = a(q) e(q) /a(q). For instance, the RE for the previous example query is 1 1 /1 = 0since the original dataset in Fig. 1(a) contains one trajectory with (401.1, 39). The AvgRE for a workload of queries is the mean RE of all issued queries. It reflects the mean error in answering the query workload. ACKNOWLEDGMENT The authors would like to thank A. Gkoulalas-Divanis (IBM Research) for helpful discussions on the formulation of this problem. REFERENCES APPENDIX B MEASURING DATA UTILITY USING QUERY WORKLOADS The AvgRE measure captures the accuracy of answering queries on an anonymized dataset. The queries we consider can be modeled as follows: Q: SELECT COUNT(*) FROM dataset WHERE (u ICD, v Age) dataset,... Let a(q) be the answer of a COUNT() query Q when it is issued on the original dataset. The value of a(q) can be easily obtained by counting the number of trajectories in the original dataset that contain the (ICD, Age) pairs in Q. Let e(q) be the answer of Q when it is issued on the anonymized dataset. This is an estimate because a generalized value is interpreted as any leaf node in the subtree rooted by that value in the DGH. Therefore, an anonymized pair may correspond to any pair of possible ICD codes and Age values, assuming each pair is equally likely. The value of e(q) can be obtained by computing the probability that a trajectory in the anonymized dataset satisfies Q, and then summing these probabilities across all trajectories. To illustrate how an estimate can be computed, assume that a data recipient issues a query for the number of patients diagnosed with ICD code at age 39 using the anonymized dataset in Fig. 1(c). Referring to the DGHs in Figs. 3 and 4, it can be seen that the only trajectories that may contain (401.1, 39) are the last two since they contain the generalized [1] D. Blumenthal, Stimulating the adoption of health information technology, NewEngl.J.Med., vol. 360, no. 15, pp , [2] D. A. Ludwick and J. Doucette, Adopting electronic medical records in primary care: Lessons learned from health information systems implementation experience in seven countries, Int. J. Med. Informat., vol. 78, pp , [3] A. K. Jha, C. M. DesRoches, E. G. Campbell, K. Donelan, S. R. Rao, T. G. Ferris, A. Shields, S. Rosenbaum, and D. Blumenthal, Use of electronic health records in U.S. hospitals, New Engl. J. Med., vol. 360, pp , [4] B. B. Dean, J. Lam, J. L. Natoli, Q. Butler, D. Aguilar, and R. J. Nordyke, Review: Use of electronic medical records for health outcomes research: A literature review, Med. Care Res. Rev., vol. 66, pp , [5] K. Holzer and W. Gall, Utilizing IHE-based electronic health record systems for secondary use, Methods Inf. Med., vol. 50, no. 4, pp , [6] C. Safran, M. Bloomrosen, W. Hammond, S. Labkoff, S. Markel-Fox, P. Tang, and D. E. Detmer, Toward a national framework for the secondary use of health data: An American medical informatics association white paper, J. Amer. Med. Informat. Assoc., vol. 14, pp. 1 9, [7] K. Tu, T. Mitiku, H. Guo, D. S. Lee, and J. V. Tu, Myocardial infarction and the validation of physician billing and hospitalization data using electronic medical records, Chron. Diseases Canada, vol. 30, pp , [8] C. A. McCarty, R. L. Chisholm, C. G. Chute, I. J. Kullo, G. P. Jarvik, E. B. Larson, R. Li, D. R. Masys, M. D. Ritchie, D. M. Roden, J. P. Struewing, and W. A. Wolf, The emerge network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies, BMC Med. Genomics, vol. 4, pp , [9] I. J. Kullo, K. Ding, H. Jouni, C. Y. Smith, and C. G. Chute, A genomewide association study of red blood cell traits using the electronic medical record, Public Library Sci. ONE, vol. 5, e13011, [10] J. A. Pacheco, P. C. Avila, J. A. Thompson, M. Law, J. A. Quraishi, A. K. Greiman, E. M. Just, and A. Kho, A highly specific algorithm for identifying asthma cases and controls for genome-wide association studies, in Proc. Amer. Med. Informat. Assoc. Annu. Symp., 2009, pp [11] M. D. Ritchie, J. C. Denny, D. C. Crawford, A. H. Ramirez, J. B. Weiner, J. M. Pulley, M. A. Basford, K. Brown-Gentry, J. R. Balser, D. R. Masys, J. L. Haines, and D. M. Roden, Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record, Amer. J. Human Genet., vol. 86, no. 4, pp , [12] M. N. Liebman, Personalized medicine: A perspective on the patient, disease and causal diagnostics, Personal. Med., vol. 4, no. 2, pp , 2007.

11 TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS 423 [13] A. Tucker and D. Garway-Heath, The pseudotemporal bootstrap for predicting glaucoma from cross-sectional visual field data, IEEE Trans. Inf. Technol. Biomed., vol. 14, no. 1, pp , Jan [14] S. Jensen, Mining medical data for predictive and sequential patterns, in Proc. 5th Eur. Conf. Principles Pract. Knowl. Discov. Databases, 2001, pp [15] P. R. Burton, A. L. Hansell, I. Fortier, T. A. Manolio, M. J. Khoury, J. Little, and P. Elliott, Size matters: Just how big is big? Quantifying realistic sample size requirements for human genome epidemiology, Int. J. Epidemiol., vol. 38, pp , [16] C. C. Spencer, Z. Su, P. Donnelly, and J. Marchini, Designing genomewide association studies: Sample size, power, imputation, and the choice of genotyping chip, Public Library Sci. Genet., vol. 5, no. 5, e , [17] Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS), Nat. Inst. Health, Bethesda, MD, NOT-OD , [18] M. D. Mailman, M. Feolo, Y. Jin, M. Kimura, K. Tryka, R. Bagoutdinov, L. Hao, A. Kiang, J. Paschall, L. Phan, N. Popova, S. Pretel, L. Ziyabari, M. Lee, Y. Shao, Z. Y. Wang, K. Sirotkin, M. Ward, M. Kholodov, K. Zbicz, J. Beck, M. Kimelman, S. Shevelev, D. Preuss, E. Yaschenko, A. Graeff, J. Ostell, and S. T. Sherry, The NCBI dbgap database of genotypes and phenotypes, Nature genet., vol. 39, no. 10, pp , [19] Standards for Protection of Electronic Health Information, Federal Register, Dept. Health, Human Services, Washington, DC, 45 CFR Pt. 164, [20] K. Benitez and B. Malin, Evaluating re-identification risks with respect to the HIPAA privacy rule, J. Amer. Med. Informat. Assoc., vol. 17,no.2, pp , [21] G. Loukides, J. C. Denny, and B. Malin, The disclosure of diagnosis codes can breach research participants privacy, J. Amer. Med. Informat. Assoc., vol. 17, no. 3, pp , [22] G. Loukides, A. Gkoulalas-Divanis, and B. Malin, Anonymization of electronic medical records for validating genome-wide association studies, Proc. Nat. Acad. Sci., vol. 107, no. 17, pp , [23] G. Loukides, A. Gkoulalas-Divanis, and B. Malin, Privacy-preserving publication of diagnosis codes for effective biomedical analysis, in Proc. 10th IEEE Int. Conf. Inf. Technol. Appl. Biomed., 2010, pp [24] L. Sweeney, k-anonymity: A model for protecting privacy, Int. J. Uncertainty, Fuzz. Knowl.-Based Syst., vol. 10, no. 5, pp , [25] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, Privacy-preserving data publishing: A survey of recent developments, ACM Comput. Surv., vol. 42, no. 4, pp. 1 53, [26] B.-C. Chen, D. Kifer, K. LeFevre, and A. Machanavajjhala, Privacypreserving data publishing, Found. Trends Databases, vol. 2, no. 1 2, pp , [27] R. S. Sandhu, E. J. Coyne, H. L. Feinstein, and C. E. Youman, Role-based access control models, IEEE Comput., vol. 29, no. 2, pp , Feb [28] B. Pinkas, Cryptographic techniques for privacy-preserving data mining, ACM Spec. Interest Group Knowl. Discov. Data Mining Explorat., vol. 4, no. 2, pp , [29] S. M. Diesburg and A.-I. A. Wang, A survey of confidential data storage and deletion methods, ACM Comput. Surv., vol. 43, no. 1, pp. 1 37, [30] N. R. Adam and J. C. Wortmann, Security-control methods for statistical databases: A comparative study, ACM Comput. Surv., vol. 21, no. 4, pp , [31] C. C. Aggarwal and P. S. Yu, A survey of randomization methods for privacy-preserving data mining, in Privacy-Preserving Data Mining: Models and Algorithms (ser. Advances in Database Systems), vol. 34. New York: Springer-Verlag, 2008, pp [32] L. Willenborg and T. De Waal, Statistical Disclosure Control in Practice (ser. Lecture Notes in Statistics), vol New York: Springer-Verlag, 1996, pp [33] N. Marsden-Haug, V. Foster, P. Gould, E. Elbert, H. Wang, and J. Pavlin, Code-based syndromic surveillance for influenzalike illness by international classification of diseases, ninth revision, Emerg. Infect. Diseases, vol. 13, pp , [34] P. Samarati, Protecting respondents identities in microdata release, IEEE Trans. Knowl. Data Eng., vol. 13, no. 6, pp , Nov./Dec [35] K. El Emam, F. K. Dankar, R. Issa, E. Jonker, D. Amyot, E. Cogo, J. P. Corriveau, M. Walker, S. Chowdhury, R. Vaillancourt, T. Roffey, and J. Bottomley, A globally optimal k-anonymity method for the deidentification of health data, J. Amer. Med. Informat. Assoc., vol. 16, no. 5, pp , [36] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, Mondrian multidimensional k-anonymity, in Proc. 22nd IEEE Int. Conf. Data Eng., 2006, pp [37] T. Dalenius, Finding a needle in a haystack or identifying anonymous census record, J. Offic. Statist., vol. 2, no. 3, pp , [38] M. E. Nergiz and C. Clifton, Thoughts on k-anonymization, Data Knowl. Eng., vol. 63, no. 3, pp , [39] G. Loukides and J. Shao, Capturing data usefulness and privacy protection in k-anonymisation, in Proc. 22nd ACM Symp. Appl. Comput., 2007, pp [40] V. S. Iyengar, Transforming data to satisfy privacy constraints, in Proc. 8th ACM Int. Conf. Knowl. Discov. Data Mining, 2002, pp [41] K. El Emam and F. K. Dankar, Protecting privacy using k-anonymity, J. Amer. Med. Informat. Assoc., vol. 15, no. 5, pp , [42] M. Terrovitis, N. Mamoulis, and P. Kalnis, Privacy-preserving anonymization of set-valued data, in Proc. Very Large Data Bases Endowment, 2008, vol. 1, no. 1, pp [43] A. Tamersoy, G. Loukides, J. C. Denny, and B. Malin, Anonymization of administrative billing codes with repeated diagnoses through censoring, in Proc. Amer. Med. Informat. Assoc. Annu. Symp., 2010, pp [44] M. E. Nergiz, M. Atzori, Y. Saygin, and B. Guc, Towards trajectory anonymization: A generalization-based approach, Trans. Data Privacy, vol. 2, no. 1, pp , [45] O. Abul, F. Bonchi, and M. Nanni, Never walk alone: Uncertainty for anonymity in moving objects databases, in Proc. 24nd IEEE Int. Conf. Data Eng., 2008, pp [46] M. Terrovitis and N. Mamoulis, Privacy preservation in the publication of trajectories, in Proc. 9th Int. Conf. Mobile Data Manag.,2008,pp [47] L. Sweeney, Achieving k-anonymity privacy protection using generalization and suppression, Int. J. Uncertainty, Fuzz. Knowl.-Based Syst., vol. 10, no. 5, pp , [48] C. C. Aggarwal and P. S. Yu, A condensation approach to privacy preserving data mining, in Proc. 9th Int. Conf. Extend. Database Technol., 2004, pp [49] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu, Achieving anonymity via clustering, in Proc. 25th ACM Symp. Principles Database Syst., 2006, pp [50] S. B. Needleman and C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Molecular Biol., vol. 48, no. 3, pp , [51] T. F. Smith and M. S. Waterman, Identification of common molecular subsequences, J. Molecular Biol., vol. 147, no. 1, pp , [52] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. Cambridge, MA: MIT Press, [53] J. Domingo-Ferrer and J. M. Mateo-Sanz, Practical data-oriented microaggregation for statistical disclosure control, IEEE Trans. Knowl. Data Eng., vol. 14, no. 1, pp , Jan./Feb [54] J. Domingo-Ferrer and V. Torra, Ordinal, continuous and heterogeneous k-anonymity through microaggregation, Data Mining Knowl. Discov., vol. 11, no. 2, pp , [55] D. M. Roden, J. M. Pulley, M. A. Basford, G. R. Bernard, E. W. Clayton, J. R. Balser, and D. R. Masys, Development of a large-scale de-identified DNA biobank to enable personalized medicine, Clin. Pharmacol. Therapeut., vol. 84, no. 3, pp , [56] A. H. Ramirez, J. S. Schildcrout, D. L. Blakemore, D. R. Masys, J. M. Pulley, M. A. Basford, D. M. Roden, and J. C. Denny, Modulators of normal electrocardiographic intervals identified in a large electronic medical record, Heart Rhythm, vol. 8, no. 2, pp , [57] J. C. Denny, M. D. Ritchie, D. C. Crawford, J. S. Schildcrout, A. H. Ramirez, J. M. Pulley, M. A. Basford, D. R. Masys, J. L. Haines, and D. M. Roden, Identification of genomic predictors of atrioventricular conduction: Using electronic medical records as a tool for genome science, Circulation, vol.122, pp ,2010. [58] C. C. Aggarwal, On k-anonymity and the curse of dimensionality, in Proc. 31st Int. Conf. Very Large Data Bases, 2005, pp [59] Y. Xu, K. Wang, A. W.-C. Fu, and P. S. Yu, Anonymizing transaction databases for publication, in Proc. 14th ACM Int. Conf. Knowl. Discov. Data Mining, 2008, pp [60] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity, in Proc. 22nd IEEE Int. Conf. Data Eng., 2006, pp Authors photographs and biographies not available at the time of publication.

RESEARCH. Acar Tamersoy. Thesis. Submitted to the Faculty of the. Graduate School of Vanderbilt University. for the degree of MASTER OF SCIENCE

RESEARCH. Acar Tamersoy. Thesis. Submitted to the Faculty of the. Graduate School of Vanderbilt University. for the degree of MASTER OF SCIENCE ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS FOR CLINICAL RESEARCH By Acar Tamersoy Thesis Submitted to the Faculty of the Graduate School of Vanderbilt University in partial fulllment of the

More information

Anonymization of Administrative Billing Codes with Repeated Diagnoses Through Censoring

Anonymization of Administrative Billing Codes with Repeated Diagnoses Through Censoring Anonymization of Administrative Billing Codes with Repeated Diagnoses Through Censoring Acar Tamersoy, Grigorios Loukides PhD, Joshua C. Denny MD MS, and Bradley Malin PhD Department of Biomedical Informatics,

More information

Current Developments of k-anonymous Data Releasing

Current Developments of k-anonymous Data Releasing Current Developments of k-anonymous Data Releasing Jiuyong Li 1 Hua Wang 1 Huidong Jin 2 Jianming Yong 3 Abstract Disclosure-control is a traditional statistical methodology for protecting privacy when

More information

Privacy Preserved Association Rule Mining For Attack Detection and Prevention

Privacy Preserved Association Rule Mining For Attack Detection and Prevention Privacy Preserved Association Rule Mining For Attack Detection and Prevention V.Ragunath 1, C.R.Dhivya 2 P.G Scholar, Department of Computer Science and Engineering, Nandha College of Technology, Erode,

More information

Protecting Patient Privacy. Khaled El Emam, CHEO RI & uottawa

Protecting Patient Privacy. Khaled El Emam, CHEO RI & uottawa Protecting Patient Privacy Khaled El Emam, CHEO RI & uottawa Context In Ontario data custodians are permitted to disclose PHI without consent for public health purposes What is the problem then? This disclosure

More information

DATA MINING - 1DL360

DATA MINING - 1DL360 DATA MINING - 1DL360 Fall 2013" An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/per1ht13 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Data Privacy and Biomedicine Syllabus - Page 1 of 6

Data Privacy and Biomedicine Syllabus - Page 1 of 6 Data Privacy and Biomedicine Syllabus - Page 1 of 6 Course: Data Privacy in Biomedicine (BMIF-380 / CS-396) Instructor: Bradley Malin, Ph.D. (b.malin@vanderbilt.edu) Semester: Spring 2015 Time: Mondays

More information

Privacy Challenges and Solutions for Data Sharing. Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland

Privacy Challenges and Solutions for Data Sharing. Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland Privacy Challenges and Solutions for Data Sharing Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland June 8, 2015 Content Introduction / Motivation for data privacy Focus Application

More information

ARX A Comprehensive Tool for Anonymizing Biomedical Data

ARX A Comprehensive Tool for Anonymizing Biomedical Data ARX A Comprehensive Tool for Anonymizing Biomedical Data Fabian Prasser, Florian Kohlmayer, Klaus A. Kuhn Chair of Biomedical Informatics Institute of Medical Statistics and Epidemiology Rechts der Isar

More information

Data attribute security and privacy in distributed database system

Data attribute security and privacy in distributed database system IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. V (Mar-Apr. 2014), PP 27-33 Data attribute security and privacy in distributed database system

More information

Information Security in Big Data using Encryption and Decryption

Information Security in Big Data using Encryption and Decryption International Research Journal of Computer Science (IRJCS) ISSN: 2393-9842 Information Security in Big Data using Encryption and Decryption SHASHANK -PG Student II year MCA S.K.Saravanan, Assistant Professor

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October-2013 ISSN 2229-5518 1582

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October-2013 ISSN 2229-5518 1582 1582 AN EFFICIENT CRYPTOGRAPHIC APPROACH FOR PRESERVING PRIVACY IN DATA MINING T.Sujitha 1, V.Saravanakumar 2, C.Saravanabhavan 3 1. M.E. Student, Sujiraj.me@gmail.com 2. Assistant Professor, visaranams@yahoo.co.in

More information

Principles and Best Practices for Sharing Data from Environmental Health Research: Challenges Associated with Data-Sharing: HIPAA De-identification

Principles and Best Practices for Sharing Data from Environmental Health Research: Challenges Associated with Data-Sharing: HIPAA De-identification Principles and Best Practices for Sharing Data from Environmental Health Research: Challenges Associated with Data-Sharing: HIPAA De-identification Daniel C. Barth-Jones, M.P.H., Ph.D Assistant Professor

More information

Medical Data Sharing: Privacy Challenges and Solutions

Medical Data Sharing: Privacy Challenges and Solutions Medical Data Sharing: Privacy Challenges and Solutions Aris Gkoulalas-Divanis agd@zurich.ibm.com IBM Research - Zurich Grigorios Loukides g.loukides@cs.cf.ac.uk Cardiff University ECML/PKDD, Athens, September

More information

Policy-based Pre-Processing in Hadoop

Policy-based Pre-Processing in Hadoop Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden yi.cheng@ericsson.com, christian.schaefer@ericsson.com Abstract While big data analytics provides

More information

Developing a Blueprint for Preserving Privacy of Electronic Health Records using Categorical Attributes

Developing a Blueprint for Preserving Privacy of Electronic Health Records using Categorical Attributes Developing a Blueprint for Preserving Privacy of Electronic Health Records using Categorical Attributes T.Kowshiga 1, T.Saranya 2, T.Jayasudha 3, Prof.M.Sowmiya 4 and Prof.S.Balamurugan 5 Department of

More information

A generalized Framework of Privacy Preservation in Distributed Data mining for Unstructured Data Environment

A generalized Framework of Privacy Preservation in Distributed Data mining for Unstructured Data Environment www.ijcsi.org 434 A generalized Framework of Privacy Preservation in Distributed Data mining for Unstructured Data Environment V.THAVAVEL and S.SIVAKUMAR* Department of Computer Applications, Karunya University,

More information

A SECURE DECISION SUPPORT ESTIMATION USING GAUSSIAN BAYES CLASSIFICATION IN HEALTH CARE SERVICES

A SECURE DECISION SUPPORT ESTIMATION USING GAUSSIAN BAYES CLASSIFICATION IN HEALTH CARE SERVICES A SECURE DECISION SUPPORT ESTIMATION USING GAUSSIAN BAYES CLASSIFICATION IN HEALTH CARE SERVICES K.M.Ruba Malini #1 and R.Lakshmi *2 # P.G.Scholar, Computer Science and Engineering, K. L. N College Of

More information

CS346: Advanced Databases

CS346: Advanced Databases CS346: Advanced Databases Alexandra I. Cristea A.I.Cristea@warwick.ac.uk Data Security and Privacy Outline Chapter: Database Security in Elmasri and Navathe (chapter 24, 6 th Edition) Brief overview of

More information

Methods for the de-identification of electronic health records for genomic research

Methods for the de-identification of electronic health records for genomic research R E V I E W Methods for the de-identification of electronic health records for genomic research Khaled El Emam 1,2 * Abstract Electronic health records are increasingly being linked to DNA repositories

More information

Major US Genomic Medicine Programs: NHGRI s Electronic Medical Records and Genomics (emerge) Network

Major US Genomic Medicine Programs: NHGRI s Electronic Medical Records and Genomics (emerge) Network Major US Genomic Medicine Programs: NHGRI s Electronic Medical Records and Genomics (emerge) Network Dan Roden Member, National Advisory Council For Human Genome Research Genomic Medicine Working Group

More information

De-Identification 101

De-Identification 101 De-Identification 101 We live in a world today where our personal information is continuously being captured in a multitude of electronic databases. Details about our health, financial status and buying

More information

Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea

Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea Proceedings of the 211 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. A BINARY PARTITION-BASED MATCHING ALGORITHM FOR DATA DISTRIBUTION MANAGEMENT Junghyun

More information

IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD

IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD Journal homepage: www.mjret.in ISSN:2348-6953 IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD Deepak Ramchandara Lad 1, Soumitra S. Das 2 Computer Dept. 12 Dr. D. Y. Patil School of Engineering,(Affiliated

More information

ACTA UNIVERSITATIS APULENSIS No 15/2008 A COMPARISON BETWEEN LOCAL AND GLOBAL RECODING ALGORITHMS FOR ACHIEVING MICRODATA P-SENSITIVE K -ANONYMITY

ACTA UNIVERSITATIS APULENSIS No 15/2008 A COMPARISON BETWEEN LOCAL AND GLOBAL RECODING ALGORITHMS FOR ACHIEVING MICRODATA P-SENSITIVE K -ANONYMITY ACTA UNIVERSITATIS APULENSIS No 15/2008 A COMPARISON BETWEEN LOCAL AND GLOBAL RECODING ALGORITHMS FOR ACHIEVING MICRODATA P-SENSITIVE K -ANONYMITY Traian Marius Truta, Alina Campan, Michael Abrinica, John

More information

Proximity-Aware Local-Recoding Anonymization with MapReduce for Scalable Big Data Privacy Preservation in Cloud

Proximity-Aware Local-Recoding Anonymization with MapReduce for Scalable Big Data Privacy Preservation in Cloud IEEE TRANSACTIONS ON COMPUTERS, TC-2013-12-0869 1 Proximity-Aware Local-Recoding Anonymization with MapReduce for Scalable Big Data Privacy Preservation in Cloud Xuyun Zhang, Wanchun Dou, Jian Pei, Fellow,

More information

Li Xiong, Emory University

Li Xiong, Emory University Healthcare Industry Skills Innovation Award Proposal Hippocratic Database Technology Li Xiong, Emory University I propose to design and develop a course focused on the values and principles of the Hippocratic

More information

International Journal of Advanced Computer Technology (IJACT) ISSN:2319-7900 PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS

International Journal of Advanced Computer Technology (IJACT) ISSN:2319-7900 PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS First A. Dr. D. Aruna Kumari, Ph.d, ; Second B. Ch.Mounika, Student, Department Of ECM, K L University, chittiprolumounika@gmail.com; Third C.

More information

Privacy-Preserving Medical Data Sharing

Privacy-Preserving Medical Data Sharing Privacy-Preserving Medical Data Sharing Aris Gkoulalas-Divanis* arisdiva@ie.ibm.com IBM Research - Ireland Grigorios Loukides* g.loukides@cs.cf.ac.uk Cardiff University SIAM Data Mining, Anaheim, CA, USA,

More information

A De-identification Strategy Used for Sharing One Data Provider s Oncology Trials Data through the Project Data Sphere Repository

A De-identification Strategy Used for Sharing One Data Provider s Oncology Trials Data through the Project Data Sphere Repository A De-identification Strategy Used for Sharing One Data Provider s Oncology Trials Data through the Project Data Sphere Repository Prepared by: Bradley Malin, Ph.D. 2525 West End Avenue, Suite 1030 Nashville,

More information

Genomics and Health Data Standards: Lessons from the Past and Present for a Genome-enabled Future

Genomics and Health Data Standards: Lessons from the Past and Present for a Genome-enabled Future Genomics and Health Data Standards: Lessons from the Past and Present for a Genome-enabled Future Daniel Masys, MD Professor and Chair Department of Biomedical Informatics Professor of Medicine Vanderbilt

More information

Guidance on De-identification of Protected Health Information November 26, 2012.

Guidance on De-identification of Protected Health Information November 26, 2012. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule November 26, 2012 OCR gratefully

More information

Towards Privacy aware Big Data analytics

Towards Privacy aware Big Data analytics Towards Privacy aware Big Data analytics Pietro Colombo, Barbara Carminati, and Elena Ferrari Department of Theoretical and Applied Sciences, University of Insubria, Via Mazzini 5, 21100 - Varese, Italy

More information

Workload-Aware Anonymization Techniques for Large-Scale Datasets

Workload-Aware Anonymization Techniques for Large-Scale Datasets Workload-Aware Anonymization Techniques for Large-Scale Datasets KRISTEN LeFEVRE University of Michigan DAVID J. DeWITT Microsoft and RAGHU RAMAKRISHNAN Yahoo! Research Protecting individual privacy is

More information

Data Privacy Aspects in Big Data Facilitating Medical Data Sharing. Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland

Data Privacy Aspects in Big Data Facilitating Medical Data Sharing. Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland Data Privacy Aspects in Big Data Facilitating Medical Data Sharing Aris Gkoulalas-Divanis Smarter Cities Technology Centre IBM Research, Ireland September 2015 Content Introduction / Motivation Focus Application

More information

A Three-Dimensional Conceptual Framework for Database Privacy

A Three-Dimensional Conceptual Framework for Database Privacy A Three-Dimensional Conceptual Framework for Database Privacy Josep Domingo-Ferrer Rovira i Virgili University UNESCO Chair in Data Privacy Department of Computer Engineering and Mathematics Av. Països

More information

Proposing a Novel Synergized K-Degree L-Diversity T- Closeness Model for Graph Based Data Anonymization

Proposing a Novel Synergized K-Degree L-Diversity T- Closeness Model for Graph Based Data Anonymization Proposing a Novel Synergized K-Degree L-Diversity T- Closeness Model for Graph Based Data Anonymization S.Charanyaa 1, K.Sangeetha 2 M.Tech. Student, Dept of Information Technology, S.N.S. College of Technology,

More information

De-Identification Framework

De-Identification Framework A Consistent, Managed Methodology for the De-Identification of Personal Data and the Sharing of Compliance and Risk Information March 205 Contents Preface...3 Introduction...4 Defining Categories of Health

More information

A 2 d -Tree-Based Blocking Method for Microaggregating Very Large Data Sets

A 2 d -Tree-Based Blocking Method for Microaggregating Very Large Data Sets A 2 d -Tree-Based Blocking Method for Microaggregating Very Large Data Sets Agusti Solanas, Antoni Martínez-Ballesté, Josep Domingo-Ferrer and Josep M. Mateo-Sanz Universitat Rovira i Virgili, Dept. of

More information

Anonymizing Healthcare Data: A Case Study on the Blood Transfusion Service

Anonymizing Healthcare Data: A Case Study on the Blood Transfusion Service Anonymizing Healthcare Data: A Case Study on the Blood Transfusion Service Noman Mohammed Benjamin C. M. Fung Patrick C. K. Hung Cheuk-kwong Lee CIISE, Concordia University, Montreal, QC, Canada University

More information

Data Mining and Sensitive Inferences

Data Mining and Sensitive Inferences Template-Based Privacy Preservation in Classification Problems Ke Wang Simon Fraser University BC, Canada V5A S6 wangk@cs.sfu.ca Benjamin C. M. Fung Simon Fraser University BC, Canada V5A S6 bfung@cs.sfu.ca

More information

Privacy-preserving Data Mining: current research and trends

Privacy-preserving Data Mining: current research and trends Privacy-preserving Data Mining: current research and trends Stan Matwin School of Information Technology and Engineering University of Ottawa, Canada stan@site.uottawa.ca Few words about our research Universit[é

More information

Version 1.0. HEAL NY Phase 5 Health IT & Public Health Team. Version Released 1.0. HEAL NY Phase 5 Health

Version 1.0. HEAL NY Phase 5 Health IT & Public Health Team. Version Released 1.0. HEAL NY Phase 5 Health Statewide Health Information Network for New York (SHIN-NY) Health Information Exchange (HIE) for Public Health Use Case (Patient Visit, Hospitalization, Lab Result and Hospital Resources Data) Version

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

The De-identification Maturity Model Authors: Khaled El Emam, PhD Waël Hassan, PhD

The De-identification Maturity Model Authors: Khaled El Emam, PhD Waël Hassan, PhD A PRIVACY ANALYTICS WHITEPAPER The De-identification Maturity Model Authors: Khaled El Emam, PhD Waël Hassan, PhD De-identification Maturity Assessment Privacy Analytics has developed the De-identification

More information

Mining Signatures in Healthcare Data Based on Event Sequences and its Applications

Mining Signatures in Healthcare Data Based on Event Sequences and its Applications Mining Signatures in Healthcare Data Based on Event Sequences and its Applications Siddhanth Gokarapu 1, J. Laxmi Narayana 2 1 Student, Computer Science & Engineering-Department, JNTU Hyderabad India 1

More information

GENETIC DATA ANALYSIS

GENETIC DATA ANALYSIS GENETIC DATA ANALYSIS 1 Genetic Data: Future of Personalized Healthcare To achieve personalization in Healthcare, there is a need for more advancements in the field of Genomics. The human genome is made

More information

Protecting Respondents' Identities in Microdata Release

Protecting Respondents' Identities in Microdata Release 1010 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 13, NO. 6, NOVEMBER/DECEMBER 2001 Protecting Respondents' Identities in Microdata Release Pierangela Samarati, Member, IEEE Computer Society

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

A GENERAL SURVEY OF PRIVACY-PRESERVING DATA MINING MODELS AND ALGORITHMS

A GENERAL SURVEY OF PRIVACY-PRESERVING DATA MINING MODELS AND ALGORITHMS Chapter 2 A GENERAL SURVEY OF PRIVACY-PRESERVING DATA MINING MODELS AND ALGORITHMS Charu C. Aggarwal IBM T. J. Watson Research Center Hawthorne, NY 10532 charu@us.ibm.com Philip S. Yu IBM T. J. Watson

More information

A Survey of Quantification of Privacy Preserving Data Mining Algorithms

A Survey of Quantification of Privacy Preserving Data Mining Algorithms A Survey of Quantification of Privacy Preserving Data Mining Algorithms Elisa Bertino, Dan Lin, and Wei Jiang Abstract The aim of privacy preserving data mining (PPDM) algorithms is to extract relevant

More information

Degrees of De-identification of Clinical Research Data

Degrees of De-identification of Clinical Research Data Vol. 7, No. 11, November 2011 Can You Handle the Truth? Degrees of De-identification of Clinical Research Data By Jeanne M. Mattern Two sets of U.S. government regulations govern the protection of personal

More information

Auditing EMR System Usage. You Chen Jan, 17, 2013 You.chen@vanderbilt.edu

Auditing EMR System Usage. You Chen Jan, 17, 2013 You.chen@vanderbilt.edu Auditing EMR System Usage You Chen Jan, 17, 2013 You.chen@vanderbilt.edu Health data being accessed by hackers, lost with laptop computers, or simply read by curious employees Anomalous Usage You Chen,

More information

Securing Health Care Information by Using Two-Tier Cipher Cloud Technology

Securing Health Care Information by Using Two-Tier Cipher Cloud Technology Securing Health Care Information by Using Two-Tier Cipher Cloud Technology M.Divya 1 J.Gayathri 2 A.Gladwin 3 1 B.TECH Information Technology, Jeppiaar Engineering College, Chennai 2 B.TECH Information

More information

Automated Problem List Generation from Electronic Medical Records in IBM Watson

Automated Problem List Generation from Electronic Medical Records in IBM Watson Proceedings of the Twenty-Seventh Conference on Innovative Applications of Artificial Intelligence Automated Problem List Generation from Electronic Medical Records in IBM Watson Murthy Devarakonda, Ching-Huei

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Clinic + - A Clinical Decision Support System Using Association Rule Mining

Clinic + - A Clinical Decision Support System Using Association Rule Mining Clinic + - A Clinical Decision Support System Using Association Rule Mining Sangeetha Santhosh, Mercelin Francis M.Tech Student, Dept. of CSE., Marian Engineering College, Kerala University, Trivandrum,

More information

Secure Personal Information by Analyzing Online Social Networks

Secure Personal Information by Analyzing Online Social Networks Secure Personal Information by Analyzing Online Social Networks T. Satyendra kumar 1, C.Vijaya Rani 2 Assistant Professor, Dept. of C.S.E, S.V. Engineering College for Women, Tirupati, Chittoor, Andhra

More information

Preserving Privacy and Frequent Sharing Patterns for Social Network Data Publishing

Preserving Privacy and Frequent Sharing Patterns for Social Network Data Publishing Preserving Privacy and Frequent Sharing Patterns for Social Network Data Publishing Benjamin C. M. Fung CIISE, Concordia University Montreal, QC, Canada fung@ciise.concordia.ca Yan an Jin Huazhong University

More information

A Study of Data Perturbation Techniques For Privacy Preserving Data Mining

A Study of Data Perturbation Techniques For Privacy Preserving Data Mining A Study of Data Perturbation Techniques For Privacy Preserving Data Mining Aniket Patel 1, HirvaDivecha 2 Assistant Professor Department of Computer Engineering U V Patel College of Engineering Kherva-Mehsana,

More information

Sustaining Privacy Protection in Personalized Web Search with Temporal Behavior

Sustaining Privacy Protection in Personalized Web Search with Temporal Behavior Sustaining Privacy Protection in Personalized Web Search with Temporal Behavior N.Jagatheshwaran 1 R.Menaka 2 1 Final B.Tech (IT), jagatheshwaran.n@gmail.com, Velalar College of Engineering and Technology,

More information

Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects

Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects Report on the Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects Background and Goals of the Workshop June 5 6, 2012 The use of genome sequencing in human research is growing

More information

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of

More information

Healthcare Big Data Exploration in Real-Time

Healthcare Big Data Exploration in Real-Time Healthcare Big Data Exploration in Real-Time Muaz A Mian A Project Submitted in partial fulfillment of the requirements for degree of Masters of Science in Computer Science and Systems University of Washington

More information

Anonymization: Enhancing Privacy and Security of Sensitive Data of Online Social Networks

Anonymization: Enhancing Privacy and Security of Sensitive Data of Online Social Networks Anonymization: Enhancing Privacy and Security of Sensitive Data of Online Social Networks Mr.Gaurav.P.R. PG Student, Dept.Of CS&E S.J.M.I.T Chitradurga, India Mr.Gururaj.T M.Tech Associate Professor, Dept.Of

More information

Differential privacy in health care analytics and medical research An interactive tutorial

Differential privacy in health care analytics and medical research An interactive tutorial Differential privacy in health care analytics and medical research An interactive tutorial Speaker: Moritz Hardt Theory Group, IBM Almaden February 21, 2012 Overview 1. Releasing medical data: What could

More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

Preventing Denial-of-request Inference Attacks in Location-sharing Services

Preventing Denial-of-request Inference Attacks in Location-sharing Services Preventing Denial-of-request Inference Attacks in Location-sharing Services Kazuhiro Minami Institute of Statistical Mathematics, Tokyo, Japan Email: kminami@ism.ac.jp Abstract Location-sharing services

More information

Data Outsourcing based on Secure Association Rule Mining Processes

Data Outsourcing based on Secure Association Rule Mining Processes , pp. 41-48 http://dx.doi.org/10.14257/ijsia.2015.9.3.05 Data Outsourcing based on Secure Association Rule Mining Processes V. Sujatha 1, Debnath Bhattacharyya 2, P. Silpa Chaitanya 3 and Tai-hoon Kim

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Practicing Differential Privacy in Health Care: A Review

Practicing Differential Privacy in Health Care: A Review TRANSACTIONS ON DATA PRIVACY 5 (2013) 35 67 Practicing Differential Privacy in Health Care: A Review Fida K. Dankar*, and Khaled El Emam* * CHEO Research Institute, 401 Smyth Road, Ottawa, Ontario E mail

More information

Privacy Challenges of Telco Big Data

Privacy Challenges of Telco Big Data Dr. Günter Karjoth June 17, 2014 ITU telco big data workshop Privacy Challenges of Telco Big Data Mobile phones are great sources of data but we must be careful about privacy 1 / 15 Sources of Big Data

More information

Implementation of P2P Reputation Management Using Distributed Identities and Decentralized Recommendation Chains

Implementation of P2P Reputation Management Using Distributed Identities and Decentralized Recommendation Chains Implementation of P2P Reputation Management Using Distributed Identities and Decentralized Recommendation Chains P.Satheesh Associate professor Dept of Computer Science and Engineering MVGR college of

More information

Outcome Data, Links to Electronic Medical Records. Dan Roden Vanderbilt University

Outcome Data, Links to Electronic Medical Records. Dan Roden Vanderbilt University Outcome Data, Links to Electronic Medical Records Dan Roden Vanderbilt University Coordinating Center Type II Diabetes Case Algorithm * Abnormal lab= Random glucose > 200mg/dl, Fasting glucose > 125 mg/dl,

More information

Health Data De-Identification by Dr. Khaled El Emam

Health Data De-Identification by Dr. Khaled El Emam RISK-BASED METHODOLOGY DEFENSIBLE COST-EFFECTIVE DE-IDENTIFICATION OPTIMAL STATISTICAL METHOD REPORTING RE-IDENTIFICATION BUSINESS ASSOCIATES COMPLIANCE HIPAA PHI REPORTING DATA SHARING REGULATORY UTILITY

More information

Technical Approaches for Protecting Privacy in the PCORnet Distributed Research Network V1.0

Technical Approaches for Protecting Privacy in the PCORnet Distributed Research Network V1.0 Technical Approaches for Protecting Privacy in the PCORnet Distributed Research Network V1.0 Guidance Document Prepared by: PCORnet Data Privacy Task Force Submitted to the PMO Approved by the PMO Submitted

More information

Email Spam Detection Using Customized SimHash Function

Email Spam Detection Using Customized SimHash Function International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

More information

EHRs and large scale comparative effectiveness research

EHRs and large scale comparative effectiveness research EHRs and large scale comparative effectiveness research September 16, 2014 Dana C. Crawford, PhD Associate Professor Epidemiology and Biostatistics Institute for Computational Biology Single Nucleotide

More information

Data Driven Approaches to Prescription Medication Outcomes Analysis Using EMR

Data Driven Approaches to Prescription Medication Outcomes Analysis Using EMR Data Driven Approaches to Prescription Medication Outcomes Analysis Using EMR Nathan Manwaring University of Utah Masters Project Presentation April 2012 Equation Consulting Who we are Equation Consulting

More information

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16 Course Director: Dr. Barry Grant (DCM&B, bjgrant@med.umich.edu) Description: This is a three module course covering (1) Foundations of Bioinformatics, (2) Statistics in Bioinformatics, and (3) Systems

More information

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Abdun Mahmood, Christopher Leckie, Parampalli Udaya Department of Computer Science and Software Engineering University of

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

Presenting data: how to convey information most effectively Centre of Research Excellence in Patient Safety 20 Feb 2015

Presenting data: how to convey information most effectively Centre of Research Excellence in Patient Safety 20 Feb 2015 Presenting data: how to convey information most effectively Centre of Research Excellence in Patient Safety 20 Feb 2015 Biomedical Informatics: helping visualization from molecules to population Dr. Guillermo

More information

Data Mining and Database Systems: Where is the Intersection?

Data Mining and Database Systems: Where is the Intersection? Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: surajitc@microsoft.com 1 Introduction The promise of decision support systems is to exploit enterprise

More information

The De-identification of Personally Identifiable Information

The De-identification of Personally Identifiable Information The De-identification of Personally Identifiable Information Khaled El Emam (PhD) www.privacyanalytics.ca 855.686.4781 info@privacyanalytics.ca 251 Laurier Avenue W, Suite 200 Ottawa, ON Canada K1P 5J6

More information

A Survey on Outlier Detection Techniques for Credit Card Fraud Detection

A Survey on Outlier Detection Techniques for Credit Card Fraud Detection IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. VI (Mar-Apr. 2014), PP 44-48 A Survey on Outlier Detection Techniques for Credit Card Fraud

More information

Privacy-Preserving Data Mining through Knowledge Model Sharing

Privacy-Preserving Data Mining through Knowledge Model Sharing Privacy-Preserving Data Mining through Knowledge Model Sharing Patrick Sharkey, Hongwei Tian, Weining Zhang, and Shouhuai Xu Department of Computer Science, University of Texas at San Antonio {psharkey,htian,wzhang,shxu}@cs.utsa.edu

More information

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I - Applications Motivation and Introduction Patient similarity application Part II

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6 International Journal of Engineering Research ISSN: 2348-4039 & Management Technology Email: editor@ijermt.org November-2015 Volume 2, Issue-6 www.ijermt.org Modeling Big Data Characteristics for Discovering

More information

The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing

The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing Justin Brickell and Vitaly Shmatikov Department of Computer Sciences The University of Texas at Austin jlbrick@cs.utexas.edu,

More information

Concept and Project Objectives

Concept and Project Objectives 3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the

More information

DRAFT NISTIR 8053 De-Identification of Personally Identifiable Information

DRAFT NISTIR 8053 De-Identification of Personally Identifiable Information 1 2 3 4 5 6 7 8 DRAFT NISTIR 8053 De-Identification of Personally Identifiable Information Simson L. Garfinkel 9 10 11 12 13 14 15 16 17 18 NISTIR 8053 DRAFT De-Identification of Personally Identifiable

More information

PRIVACY-PRESERVING DATA ANALYSIS AND DATA SHARING

PRIVACY-PRESERVING DATA ANALYSIS AND DATA SHARING PRIVACY-PRESERVING DATA ANALYSIS AND DATA SHARING Chih-Hua Tai Dept. of Computer Science and Information Engineering, National Taipei University New Taipei City, Taiwan BENEFIT OF DATA ANALYSIS Many fields

More information

Keywords: Security; data warehouse; data mining; statistical database security; privacy

Keywords: Security; data warehouse; data mining; statistical database security; privacy Security in Data Warehouses By Edgar R. Weippl, Secure Business Austria, Vienna, Austria Favoritenstrasse 16 1040 Wien Tel: +43-1-503 12 80 Fax: +43-1-505 88 88 E-mail: eweippl@securityresearch.at Keywords:

More information

EFFICIENT DETECTION IN DDOS ATTACK FOR TOPOLOGY GRAPH DEPENDENT PERFORMANCE IN PPM LARGE SCALE IPTRACEBACK

EFFICIENT DETECTION IN DDOS ATTACK FOR TOPOLOGY GRAPH DEPENDENT PERFORMANCE IN PPM LARGE SCALE IPTRACEBACK EFFICIENT DETECTION IN DDOS ATTACK FOR TOPOLOGY GRAPH DEPENDENT PERFORMANCE IN PPM LARGE SCALE IPTRACEBACK S.Abarna 1, R.Padmapriya 2 1 Mphil Scholar, 2 Assistant Professor, Department of Computer Science,

More information

Use of the Research Patient Data Registry at Partners Healthcare, Boston

Use of the Research Patient Data Registry at Partners Healthcare, Boston Use of the Research Patient Data Registry at Partners Healthcare, Boston Advancing Clinical Research with Hospital Clinical Records Shawn Murphy MD, Ph.D. Massachusetts General Hospital Outline of presentation

More information