Anonymization of Longitudinal Electronic Medical Records. Acar Tamersoy, Grigorios Loukides, Mehmet Ercan Nergiz, Yucel Saygin, and Bradley Malin


 Sheena Griffin
 1 years ago
 Views:
Transcription
1 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY Anonymization of Longitudinal Electronic Medical Records Acar Tamersoy, Grigorios Loukides, Mehmet Ercan Nergiz, Yucel Saygin, and Bradley Malin Abstract Electronic medical record (EMR) systems have enabled healthcare providers to collect detailed patient information from the primary care domain. At the same time, longitudinal data from EMRs are increasingly combined with biorepositories to generate personalized clinical decision support protocols. Emerging policies encourage investigators to disseminate such data in a deidentified form for reuse and collaboration, but organizations are hesitant to do so because they fear such actions will jeopardize patient privacy. In particular, there are concerns that residual demographic and clinical features could be exploited for reidentification purposes. Various approaches have been developed to anonymize clinical data, but they neglect temporal information and are, thus, insufficient for emerging biomedical research paradigms. This paper proposes a novel approach to share patientspecific longitudinal data that offers robust privacy guarantees, while preserving data utility for many biomedical investigations. Our approach aggregates temporal and diagnostic information using heuristics inspired from sequence alignment and clustering methods. We demonstrate that the proposed approach can generate anonymized data that permit effective biomedical analysis using several patient cohorts derived from the EMR system of the Vanderbilt University Medical Center. Index Terms Anonymization, data privacy, electronic medical records (EMRs), longitudinal data. I. INTRODUCTION ADVANCES in health information technology have facilitated the collection of detailed, patientlevel clinical data to enable efficiency, effectiveness, and safety in healthcare operations [1]. Such data are often stored in electronic medical record (EMR) systems [2], [3] and are increasingly repurposed to support clinical research (see, e.g., [4] [7]). Recently, EMRs have been combined with biorepositories to enable genomewide association studies (GWAS) with clinical phenomena in the hopes of tailoring healthcare to genetic variants [8]. To demonstrate feasibility, EMRbased GWAS have focused on static phenotypes; i.e., where a patient is designated as disease positive Manuscript received April 7, 2011; revised September 27, 2011; accepted January 16, Date of publication January 27, 2012; date of current version May 4, This work was supported by the National Institutes of Health under Grant U01HG006378, Grant U01HG006385, and Grant R01LM009989, and by a Fellowship from the Royal Academy of Engineering Research. A. Tamersoy and B. Malin are with the Department of Biomedical Informatics, Vanderbilt University, Nashville, TN USA ( G. Loukides is with the School of Computer Science and Informatics, Cardiff University, Cardiff, CF24 3AA, U.K. ( M. E. Nergiz is with the Department of Computer Engineering, Zirve University, Gaziantep 27260, Turkey ( Y. Saygin is with the Department of Computer Science and Engineering, Sabanci University, Istanbul 34956, Turkey ( Digital Object Identifier /TITB or negative (see, e.g., [9] [11]). As these studies mature, they will support personalized clinical decision support tools [12] and will require longitudinal data to understand how treatment influences a phenotype [13], [14]. Meanwhile, there are challenges to conducting GWAS on a scale necessary to institute changes in healthcare. First, to generate appropriate statistical power, scientists may require access to populations larger than those available in local EMR systems [15]. Second, the cost of a GWAS incurred in the setup and application of software to process medical records as well as in genome sequencing is nontrivial [16]. Thus, it can be difficult for scientists to generate novel, or validate published, associations. To mitigate this problem, the U.S. National Institutes of Health (NIH) encourages investigators to share data from NIHsupported GWAS [17] into the Database of Genotypes and Phenotypes (dbgap) [18]. This, however, may lead to privacy breaches if patients clinical or genomic information is associated with their identities. As a first line of defense against this threat, the NIH recommends investigators deidentify data by removing an enumerated list of attributes that could identify patients (e.g., personal names and residential addresses) prior to dbgap submission [19]. However, a patients DNA may still be reidentified via residual demographics [20] and clinical information (e.g., standardized International Classification of Diseases (ICD) codes) [21], as illustrated in the following example. Example 1: Each record in Fig. 1(a) corresponds to a fictional deidentified patient and is comprised of ICD codes, patient s age when a code was received, and a DNA sequence. For instance, the second record denotes that a patient was diagnosed with benign essential hypertension (code 401.1) at ages 38 and 40 and has the DNA sequence GC...A. The clinical and genomic data are derived from an EMR system and a research project beyond primary care, respectively. Publishing the data of Fig. 1(a) could allow a hospital employee with access to the EMR to associate Jane with her DNA sequence. This is because the identified record, shown in Fig. 1(b), can only be linked to the second record in Fig. 1(a) based on the ICD code and ages 38 and 40. Methods to mitigate reidentification via demographic and clinical features [22], [23] have been proposed, but they are not applicable to the longitudinal scenario. These methods assume the clinical profile is devoid of temporal or replicated diagnosis information. Consequently, these methods produce data that are unlikely to permit meaningful longitudinal investigations. In this paper, we propose the first approach to formally anonymize longitudinal patient records. Our work makes the following specific contributions /$ IEEE
2 414 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012 Fig. 1. Longitudinal data privacy problem. (a) Longitudinal research data. (b) Identified EMR. (c) 2anonymization based on the proposed approach. 1) We propose a framework to transform each longitudinal patient record into a form that is indistinguishable from at least k 1 other records. This is achieved by iteratively clustering records and applying generalization, which replaces ICD codes and age values with more general values, and suppression, which removes ICD codes and age values. For example, applying our approach with k =2to the data of Fig. 1(a) will generate the anonymized data of Fig. 1(c). Observe that Jane s record is now linked to two DNA sequences because the diagnosis code in the first pair of this record has been replaced by the more general code ) We evaluate our approach with several cohorts of patient records from the Vanderbilt University Medical Center (VUMC) EMR system. Our results demonstrate that the anonymized data produced allow many studies focusing on clinical case counts to be performed accurately. The remainder of this paper is organized as follows. In Section II, we review related research on anonymization and its application to biomedical data. In Section III, we formalize the notions of privacy and utility and the anonymization problem addressed in this paper. We present our anonymization framework and discuss its extensions and limitations in Sections IV and VI, respectively. Finally, Section V reports the experimental results and Section VII concludes the paper. A. Relational Data We first review methods for preventing reidentification in relational data (e.g., demographics), where records have a fixed number of attributes and one value per attribute. The first category of protection methods transforms attribute values so that they no longer correspond to real individuals. Popular approaches in this category are noise addition, data swapping, and synthetic data generation (see [30] [32] for surveys). While such methods generate data that preserve aggregate statistics (e.g., the average age), they do not guarantee data that can be analyzed at the record level. This is a significant limitation that hampers the ability to use these data in various biomedical studies, including epidemiological studies [33] and GWAS [22]. In contrast, methods based on generalization and suppression preserve data truthfulness [34] [36]. Many of these methods are based on a principle called kanonymity [24], [34], which states that each record of the published data must be equivalent to at least k 1 other records with respect to quasiidentifiers (QI) (i.e., attributes that can be linked with external resources for reidentification purposes) [37]. To enhance the utility of the anonymized data, these methods employ various search strategies, including binary search [34], [35], clustering [38], [39], evolutionary search [40], and partitioning [36]. There are methods that have been successfully applied to biomedical data [35], [41]. II. RELATED RESEARCH Reidentification concerns for clinical data via seemingly innocuous attributes were first raised in [24]. Specifically, it was shown that patients could be uniquely reidentified by linking publicly available voter registration lists to hospital discharge summaries via demographics, such as date of birth, gender, and fivedigit residential zip code. The reidentification phenomenon has since attracted interest in domains beyond healthcare, and numerous techniques to guard against attacks have emerged (see [25] and [26] for surveys). In this section, we survey research related to privacypreserving data publishing, with a focus on biomedical data. We note that the reidentification problem is not addressed by access control and encryptionbased methods [27] [29] because data need to be shared beyond a small number of authorized recipients. B. Transactional Data Next, we turn our attention to approaches that deal with more complex data. Specifically, we consider transactional data, in which records have a large and variable number of values per attribute (e.g., diagnosis codes assigned to a patient during a hospital visit). Transactional data can also facilitate reidentification in the biomedical domain. For instance, deidentified clinical records can be linked to patients based on combinations of diagnosis codes that are additionally contained in publicly available hospital discharge summaries and EMR systems from which the records have been derived [21]. As it was shown in [21], more than 96% of 2700 patient records, collected in the context of a GWAS, would be susceptible to reidentification based on diagnosis codes if shared without additional controls. From the protection perspective, there are several approaches that have been developed to anonymize transactional data.
3 TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS 415 For instance, Terrovitis et al. [42] proposed k m anonymity, a principle, and several heuristic algorithms to prevent attackers from linking an individual to less than k records. This model assumes that the adversary knows at most m attribute values of any transaction. To anonymize patient records in transactional form, Loukides et al. [22] introduced a privacy principle to ensure that sets of potentially identifying diagnosis codes are protected from reidentification, while remaining useful for GWAS validations. To enforce this principle, they proposed an algorithm that employs generalization and suppression to group semantically close diagnosis codes together in a way that enhances data utility [22], [23]. Additionally, Tamersoy et al. [43] considered protecting data in which a certain diagnosis code may occur multiple times in a patient record. They designed an algorithm which preserves patients privacy through suppressing a subset of the replications of a diagnosis code. Our work differs from the aforementioned research along two principal dimensions. First, we address reidentification in longitudinal data publishing. Second, contrary to the approaches in [22] and [23] which group diagnosis codes together, our framework is based on grouping of records, which has been shown to be highly effective in retaining data utility due to the direct identification of records being anonymized [38], [39], [44]. C. Spatiotemporal Data Spatiotemporal data are related to the problem studied in this paper. They are time and location dependent, and these unique characteristics make them challenging to protect against reidentification. Such data are typically produced as a result of queries issued by mobile subscribers to locationbased service providers, who, in turn, supply information services based on specific physical locations. The principle of kanonymity has been extended to anonymize spatiotemporal data. Abul et al. [45] proposed a technique to group at least k objects that correspond to different subscribers and appear within a certain radius of the path of every object in the same time period. In addition to generalization and suppression, Abul et al. [45] considered adding noise to the original paths so that objects appear at the same time and spatial trajectory volume. Assuming that the locations of subscribers constitute sensitive information, Terrovitis and Mamoulis [46] proposed a suppressionbased methodology to prevent attackers from inferring these locations. Finally, Nergiz et al. [44] proposed an approach that employs k anonymity, enforced using generalization, together with reconstruction to protect data against boundarybased attacks. Our heuristics are inspired from [44]; however, we employ both generalization and suppression to further enhance data utility, and we do not use reconstruction, so as to preserve data truthfulness. The aforementioned approaches are developed for anonymizing spatiotemporal data and cannot be applied to longitudinal data due to different semantics. Specifically, the data we consider record patients diagnoses and not their locations. Consequently, the objective of our approach is not to hide the locations of patients, but to prevent reidentification based on their diagnosis and time information. Fig. 2. General architecture of the longitudinal data anonymization process. III. BACKGROUND AND PROBLEM FORMULATION This section begins with a highlevel overview of the proposed approach. Next, we present the notation and the definitions for the privacy and adversarial models, the data transformation strategies, and the information loss metrics. We conclude the section with a formal problem description. A. Architectural Overview Fig. 2 provides an overview of the data anonymization process. The process is initiated when the data owner supplies the following information: 1) a dataset of longitudinal patient records, each of which consists of (ICD, Age) pairs and a DNA sequence, and 2) a parameter k that expresses the desired level of privacy. Given this information, the process invokes our anonymization framework. To satisfy the kanonymity principle, our framework forms clusters of at least k records of the original dataset, which are modified using generalization and suppression. B. Notation A dataset D consists of longitudinal records of the form T,DNA T, where T is a trajectory 1 and DNA T is a genomic sequence. Each trajectory corresponds to a distinct patient in D and is a multiset 2 of pairs (i.e., T = {t 1,...,t m }) drawn from two attributes, namely ICD and Age [i.e., t i = (u ICD,v Age)], which contain the diagnosis codes assigned to a patient and their age, respectively. D denotes the number of records in D and T the length of T, defined as the number of pairs in T.Weusethe. operator to refer to a specific attribute value in a pair (e.g., t i.icd or t i.age). To study the data temporally, we order the pairs in T with respect to Age, such that t i 1.age t i.age. C. Adversarial Model We assume that an adversary has access to the original dataset D, such as in Fig. 1(a). An adversary may perform a reidentification attack in several ways. 1) Using identified EMR data: The adversary links D with the identified EMR data, such as those of Fig. 1(b), based 1 We use the term trajectory since the diagnosis codes at different ages can be seen as a route for the patient throughout his/her life. 2 Contrary to a set, a multiset can contain an element more than once.
4 416 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012 Fig. 3. Example of the DGH structure for Age. on (ICD, Age) pairs. This scenario requires the adversary to have access to the identified EMR data, which is the case of an employee of the institution from which the longitudinal data were derived. 2) Using publicly available hospital discharge summaries and identified resources: The adversary first links D with hospital discharge summaries based on (ICD, Age) pairs to associate patients with certain demographics. In turn, these demographics are exploited in another linkage with public records, such as voter registration lists, which contain identity information. Note that in both cases, an adversary is able to link patients to their DNA sequences, which suggests that a formal approach to longitudinal data anonymization is desirable. D. Privacy Model The formal definition of kanonymity in the longitudinal data context is provided in Definition 1. Since each trajectory often contains multiple (ICD, Age) pairs, it is difficult to know which can be used by an adversary to perform reidentification attacks. Thus, we consider the worstcase scenario in which any combination of (ICD, Age) pairs can be exploited. Regardless, kanonymity limits an adversary s ability to perform reidentification based on (ICD, Age) pairs, because each trajectory is associated with no less than k patients. Definition 1 (kanonymity): An anonymized dataset D, produced from D,iskanonymous if each trajectory in D, projected over QI, appears at least k times for any QI in D. E. Data Transformation Strategies Generalization and suppression are typically guided by a domain generalization hierarchy (DGH) (Definition 2) [47]. Definition 2 (DGH): A DGH for attribute A, referred to as H A, is a partially ordered tree structure which defines valid mappings between specific and generalized values of A. The root of H A is the most generalized value of A and is returned by a function root. Example 2: Consider H Age in Fig. 3. The values in the domain of Age (i.e., 33, 34,...,40) form the leaves of H Age. These values are then mapped to two, to four, and eventually to eightyear intervals. The root of H Age is returned by root(h Age ) as [33 40]. Our approach does not impose any constraints on the structure of an attribute s DGH, such that the data owners have complete freedom in its design. For instance, for ICD codes, data owners can use the standard ICD9CM hierarchy. 3 For ages, data owners can use a predefined hierarchy (e.g., the age hierarchy in the HIPAA Safe Harbor Policy 4 ) or design a DGH manually. 5 According to Definition 3, each specific value of an attribute generalizes to its direct ancestor in a DGH. However, a specific value can be projected up multiple levels in a DGH via a sequence of generalizations. As a result, a generalized value A i is interpreted as any one of the leaf nodes in the subtree rooted by A i in H A. Definition 3 (Generalization and Suppression): Given a node A i root(h A ) in H A, generalization is performed using a function f:a i A j which replaces A i with its direct ancestor A j. Suppression is a special case of generalization and is performed using a function g:a i A r which replaces A i with root(h A ). Example 3: Consider the last trajectory in Fig. 1(c). The first pair (401.1, [39 40]) is interpreted as either (401.1, 39) or (401.1, 40). F. Information Loss Generalization and suppression incur information loss because values are replaced by more general ones or eliminated. To capture the amount of information loss incurred by these operations, we quantify the normalized loss for each ICD code and Age value in a pair based on the Loss Metric (LM) (Definition 4) [40]. Definition 4 (LM): The information loss incurred by replacing a node A i with its ancestor A j in H A is LM(A i, A j )= A j A i A where A i and A j denote the number of leaf nodes in the subtree rooted by A i and A j in H A, respectively, and A denotes the domain size of attribute A. Example 4: Consider H Age in Fig. 3. The information loss incurred by generalizing [33 34] to [33 36] is =0.25 because the leaflevel descendants of [33 34] are 33 and 34, those of [33 36] are 33, 34, 35, and 36, and the domain of Age consists of the values To introduce the combined LM, which captures the total LM of replacing two nodes with their ancestor, provided in Definition 6, we use the notation of lowest common ancestor (LCA), provided in Definition 5. Definition 5 (LCA): The LCA A l of nodes A i and A j in H A is the farthest node (in terms of height) from root(h A ) such that (1) A i = A l or f n (A i )=A l and (2) A j = A l orf m (A j )=A l, and is returned by a function lca. Definition 6 (Combined LM): The combined LM of replacing nodes A i and A j with their LCA A l is LM(A i + A j, A l )=LM(A i, A l )+LM(A j, A l ). (2) 3 More information is available at 4 HIPAA Safe Harbor states all ages under 89 can be retained intact, while 90 or greater must be grouped together. 5 We further note that our approach is extendible to other categorical attributes, such as SNOMEDCT and Date, provided that a DGH can be specified for each of the attributes. Such extensions, however, are beyond the scope of this paper. (1)
5 TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS 417 Next, we define the LM for an anonymized trajectory (Definition 7) and dataset (Definition 8), which we keep separate for each attribute. Definition 7 (LM for an Anonymized Trajectory): Given an anonymized trajectory T and an attribute A, the LM with respect to A is computed as T LM( T,A) = LM( t i.a, t i.a) (3) i=1 where t i A denotes the value t i A is replaced with. Definition 8 (LM for an Anonymized Dataset): Given an anonymized dataset D and an attribute A, the LM with respect to attribute A is computed as LM( D, A) = 1 LM( T,A) D T D T. (4) For clarity, we refer to an LM related to ICD and Age by ILM and ALM, respectively (e.g., we use ILM( D) instead of LM( D, ICD)). G. Problem Statement The longitudinal data anonymization problem is formally defined as follows. Problem: Given a longitudinal dataset D, a privacy parameter k, and DGHs for attributes ICD and Age, construct an anonymized dataset D, such that 1) D is kanonymous, 2) the order of the pairs in each trajectory of D is preserved in D, and 3) ILM( D)+ALM( D) is minimized. IV. ANONYMIZATION FRAMEWORK In this section, we present our framework for longitudinal data anonymization. Many clustering algorithms can be applied to produce k anonymous data [48], [49]. This involves organizing records into clusters of size at least k, which are anonymized together. In the context of longitudinal data, the challenge is to define a distance metric for trajectories such that a clustering algorithm groups similar trajectories. We define the distance between two trajectories as the cost (i.e., incurred information loss) of their anonymization as defined by the LM. The problem then reduces to finding an anonymized version T of two given trajectories such that ILM( T )+ALM( T ) is minimized. Finding an anonymization of two trajectories can be achieved by finding a matching between the pairs of trajectories that minimizes their cost of anonymization. This problem, which is commonly referred to as sequence alignment, has been extensively studied in various domains, notably for the alignment of DNA sequences to identify regions of similarity in a way that the total pairwise edit distance between the sequences is minimized [50], [51]. To solve the longitudinal data anonymization problem, we propose Longitudinal Data Anonymizer, a framework that incorporates alignment and clustering as separate components, as shown in Fig. 2. The objective of each component is summarized as follows. 1) Alignment attempts to find a minimal cost pair matching between two trajectories. 2) Clustering interacts with the Alignment component to create clusters of at least k records. Next, we examine each component in detail and develop methodologies to achieve their objectives. A. Alignment There are no directly comparable approaches to the method we developed in this study. So, we introduce a simple heuristic, called Baseline, for comparison purposes. Given trajectories X = {x 1,...,x m } and Y = {y 1,...,y n },ILM(X) and ALM(X), and DGHs H ICD and H Age, Baseline aligns X and Y by matching their pairs on the same index. 6 The pseudocode for Baseline is provided in Algorithm 1. This algorithm initializes an empty trajectory T to hold the output of the alignment and then assigns ILM(X) and ALM(X) to variables i and a, respectively (steps 1 and 2). Then, it determines the length of the shorter trajectory (step 3) and performs pair matching (steps 4 9). Specifically, for the pairs of the trajectories that have the same index, Baseline constructs a pair containing the LCAs of the ICD codes and Age values in these pairs (step 5), appends the constructed pair to T (step 6), and updates i and a with the information loss incurred by the generalizations (steps 7 8). Next, Baseline updates i and a with the amount of information loss incurred by suppressing the ICD codes and Age values from the unmatched pairs in the longer trajectory (steps 10 14). Finally, this algorithm returns T along with i and a, which correspond to ILM( T ) and ALM( T ), respectively (step 15). 6 ILM(X) and ALM(X) are provided as input because X may already be an anonymized version of two other trajectories.
6 418 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012 To help preserve data utility, we provide Alignment using Generalization and Suppression (AGS), an algorithm that uses dynamic programming to construct an anonymized trajectory that incurs minimal cost. Before discussing AGS, we briefly discuss the application of dynamic programming. The latter technique can be used to solve problems based on combining the solutions to subproblems which are not independent and share subsubproblems [52]. A dynamic programming algorithm stores the solution of a subsubproblem in a table to which it refers every time the subsubproblem is encountered. To give an example, for trajectories X = {x 1,...,x m } and Y = {y 1,...,y n }, a subproblem may be to find a minimal cost pair matching between the first to the jth pairs. A solution to this subproblem can be determined using solutions for the following subsubproblems and applying the respective operations: 1) Align X = {x 1,...,x j 1 } and Y = {y 1,...,y j 1 }, and generalize x j with y j ; 2) Align X = {x 1,...,x j 1 } and Y = {y 1,...,y j }, and suppress x j ; 3) Align X = {x 1,...,x j } and Y = {y 1,...,y j 1 }, and suppress y j. Each case is associated with a cost. Our objective is to find an anonymized trajectory T, such that ILM( T ) +ALM( T ) is minimized, so we examine each possible solution and select the one with minimum information loss. AGS uses a similar approach to align trajectories. The algorithm accepts the same inputs as Baseline as well as weights w ICD and w Age. The latter allow AGS to control the information loss incurred by anonymizing the values of each attribute. The data owners specify the attribute weights such that w ICD 0, w Age 0, and w ICD + w Age =1. The pseudocode for AGS is provided in Algorithm 2. In step 1, AGS initializes three matrices: i, a, and r. The first row (indexed 0) of each of these matrices corresponds to a null value, and starting from index 1, each row corresponds to a value in X. Similarly, the first column (indexed 0) of each of these matrices corresponds to a null value, and starting from index 1, each column corresponds to a value in Y. Specifically, for indices h and j, r h,j records which of the following operations incurs minimum information loss: 1) generalizing x h and y j (denoted with < >), 2) suppressing x h (denoted with < >), and 3) suppressing y j (denoted with < >). The entries in i h,j and a h,j keep the total ILM and ALM for aligning the subtrajectories X sub = {x 1,...,x h } and Y sub = {y 1,...,y j }, respectively. In step 2, AGS assigns ILM(X) and ALM(X) to i 0,0 and a 0,0, respectively. We include null values in the rows and columns of i, a, and r because at some point during alignment AGS may need to suppress some portion of the trajectories. Therefore, in steps 3 7 and 8 12, AGS initializes i, a, and r for the values in X and Y, respectively. Specifically, for indices h and j, i h,0 and i 0,j keep the ILM for suppressing every pair in the subtrajectories X sub = {x 1,...,x h } and Y sub = {y 1,...,y j }, respectively. Similar reasoning applies to matrix a.the first row and column of r holds < > and < > for suppressing values from X and Y, respectively. In steps 13 25, AGS performs dynamic programming. Specifically, for indices h and j, AGS determines a minimal cost pair matching of the subtrajectories X sub = {x 1,...,x h } and Y sub = {y 1,...,y j } based on the three cases listed previously. Specifically, in steps 15 21, AGS constructs two temporary arrays, c and g, to store the ILM and ALM for each possible solution, respectively. Next, in steps 22 and 23, A GS determines the solution with the minimum information loss and assigns the ILM, ALM, and operation associated with the
7 TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS 419 Fig. 4. Example of the hypertension subtree in the ICD DGH. Fig. 5. Matrices i, a,andr for the subset of records from Fig. 1. This alignment uses the DGHs in Figs. 3 and 4 and assumes that w ICD = w Age =0.5. solution to i h,j, a h,j, and r h,j, respectively. If there is a tie between the solutions, AGS selects generalization as the operation for the sake of retaining more information. In steps 26 36, AGS constructs the anonymized trajectory T by traversing the matrix r. Specifically, for two pairs in the trajectories, if generalization incurs minimum information loss, AGS appends to T a pair containing the LCAs of the ICD codes and Age values in these pairs. The unmatched pairs in the trajectories are ignored during this process because AGS suppresses these pairs. Finally, in step 37, Baseline returns T along with i m,n and a m,n, which correspond to ILM( T ) and ALM( T ), respectively. Example 5: Consider applying AGS to T 1 and T 4 in Fig. 1(a) using the DGHs shown in Figs. 3 and 4 and assuming that w ICD = w Age =0.5. The matrices i, a, and r are illustrated in Fig. 5. As T 1 and T 4 are not anonymized, we initialize i 0,0 = a 0,0 =0. Subsequently, AGS computes the values for the entries in the first row and column of the matrices. For instance, i 0,3 keeps the ILM for suppressing all ICD codes from T 1 and has a value of 1+(1 0.5) = 1.5. This is computed by summing the ILM for suppressing the first two ICD codes (i.e., the value stored in i 0,2 ) with the weightadjusted ILM for suppressing the third ICD code. Then, AGS performs dynamic programming. The process starts with aligning T 1,sub = {(401.1, 33)} and T 4,sub = {(401.9, 33)}. The possible solutions for this subproblem are as follows. 1) Align T 1,sub = { } and T 4,sub = { }, and generalize with and 33 with 33. 2) Align T 1,sub = {(401.1, 33)} and T 4,sub = { }, and suppress and 33. 3) Align T 1,sub = { } and T 4,sub = {(401.9, 33)}, and suppress and 33. The ILM and ALM for the subsubproblem in the first solution are stored in i 0,0 and a 0,0, respectively. Generalizing with has an ILM of (1 + 1) 0.5 =1, and generalizing 33 with 33 has an ALM of 0. Therefore, the first solution has a total LM of 1. The ILM and ALM for the subsubproblem in the second solution are stored in i 0,1 and a 0,1, respectively. The suppression of and 33 has an ILM and ALM of =0.5. It can be seen that the second and third solutions each have a total LM of 2. The solution with the minimum information loss is the first one; hence, AGS stores 1, 0 and < > in i 1,1, a 1,1 and r 1,1, respectively. After the values for the remaining entries are computed, AGS uses r to construct the anonymized trajectory T. The process starts with examining the bottomright entry, which denotes a generalization. As a result, AGS appends (401.1, 35) to T. The process continues by following the symbols, such that AGS returns T = {(401.1, 33), (401.1, 34), (401.1, 35)} along with i 4,3 and a 4,3, which correspond to ILM( T ) and ALM( T ), respectively. B. Clustering We base our methodology for the clustering component on the maximum distance to average vector (MDAV) algorithm [53], [54], an efficient heuristic for kanonymity. The pseudocode for MDAV 7 and its helper function, formcluster, are provided in Algorithms 3 and 4, respectively. MDAV iteratively selects the most frequent trajectory in a longitudinal dataset (steps 3 and 12), finds its most distant trajectory X (steps 4 and 13), and forms a cluster of k trajectories around the latter trajectory 7 We refer to our algorithm as MDAV to avoid confusion.
8 420 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012 TABLE I DESCRIPTIVE SUMMARY STATISTICS OF THE DATASETS (steps 5 and 6 and 14 and 15). Cluster formation is performed by formcluster, a function that constructs a cluster C by aligning n trajectories in a consecutive manner and returns C, along with ILM(C) and ALM(C) which are the ILM and ALM for the anonymized trajectory resulting from the alignment, respectively. MDAV minimizes intercluster similarity by constructing a cluster around a trajectory Y that is most distant to X (steps 7 9). We define the distance between two trajectories as the cost of their anonymization. As such, the most distant trajectory is the one that maximizes the sum of ILM and ALM returned from AGS. A similar reasoning applies when we form a cluster. We add the trajectory that minimizes the sum of ILM and ALM returned from AGS. MDAV forms a final cluster of size at least k using the remaining trajectories in the dataset (steps 17 19) and returns D, akanonymized version of the longitudinal dataset, along with ILM( D) and ALM( D) (step 20). V. EXPERIMENTAL EVALUATION This section presents an experimental evaluation of the anonymization framework. We compare the anonymization methods on data utility, as indicated by the LM measure (see Section VB) and aggregate query answering accuracy (see Section VC). Furthermore, we show that our method allows balancing the level of information loss incurred by anonymizing ICD codes and Age values (see Section VD). This is important to support different types of biomedical studies, such as geriatric and epidemiology studies that are supported well when the information contained in Age and ICD attributes, respectively, is preserved in the anonymized data. A. Experimental Setup We worked with three datasets derived from the Synthetic Derivative (SD), a collection of deidentified information extracted from the EMR system of the VUMC [55]. We issued a query to retrieve the records of patients whose DNA samples were genotyped and stored in BioVU, VUMC s DNA repository linked to the SD. Then, using the phenotype specification in [56], we identified the patients eligible to participate in a GWAS on native electrical conduction within the ventricles of the heart. Subsequently, we created a dataset called D Pop 50 by Fig. 6. Comparison of information loss for D Pop 50 using various k values. restricting our query to the 50 most frequent ICD codes that occur in at least 5% of the records in BioVU. Next, we created a dataset called D Pop 4, which is a subset of D Pop 50, containing the following comorbid ICD codes selected for [43]: 250 (diabetes mellitus), 272 (disorders of lipoid metabolism), 401 (essential hypertension), and 724 (other and unspecified disorders of the back). Finally, we created a dataset called D Smp 4, which is a subset of D Pop 4, containing the records of patients who actually participated in the aforementioned GWAS [57]. D Smp 4 is expected to be deposited into the dbgap repository and has been used in [43] with no temporal information. The characteristics of our datasets are summarized in Table I. Throughout our experiments, we varied k between 2 and 15, noting that k =5tends to be applied in practice [41]. Initially, we set w ICD = w Age =0.5. We implemented all algorithms in Java and conducted our experiments on an Intel 2.8 GHz powered system with 4GB RAM. B. Capturing Data Utility Using LM We first compared the algorithms with respect to the LM. Fig. 6 depicts the results with D Pop 50. The ILM and ALM increase with k for both algorithms, which is expected because as k increases, a larger amount of distortion is needed to satisfy a stricter privacy requirement. Note that Baseline incurred substantially more information loss than AGS for all k. In fact, Baseline failed to construct a practically useful result when k>2, as it suppressed all values from the dataset. Similar trends were observed between AGS and Baseline for D Pop 4 and D Smp 4 (omitted for brevity). Interestingly, AGS, on average, incurred 48% less information loss on D Pop 4 than D Pop 50. This is important because a relatively small number of ICD codes may suffice to study a range of different diseases [11], [43]. It is also worthwhile to note that the information loss incurred by our approach remains relatively low (i.e., below 0.5) ford Smp 4, even though it is, on average, 55% more than that for D Pop 4. This is attributed to the fact that
9 TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS 421 Fig. 7. Comparison of query answering accuracy for D Pop 50, D Pop 4,andD Smp 4 using various k values. Fig. 8. A comparison of information loss for D Pop 50 using various w ICD and w Age values. D Smp 4 is more sparse than D Pop 4, which implies it is more difficult to anonymize [58]. The ILM and ALM for different k values and datasets, which correspond to the aforementioned results, can be found in Appendix A. C. Capturing Data Utility Using Average Relative Error We next analyzed the effectiveness of our approach for supporting general biomedical analysis. We assumed a scenario in which a scientist issues queries on anonymized data to retrieve the number of trajectories that harbor a combination of (ICD, Age) pairs that appear in at least 1% of the original trajectories. Such queries are typical in many biomedical data mining applications [59]. To quantify the accuracy of answering such a workload of queries, we used the Average Relative Error (AvgRE) measure [36], which reflects the average number of trajectories that are incorrectly included as part of the query answers. Details about this measure are in Appendix B. Fig. 7 shows the AvgRE scores of running AGS on the datasets. The results for Baseline are not reported because they were more than 6 times worse than our approach for k =2, and the worst possible for k>2. This is because Baseline suppressed all values. As expected, we find an increase in AvgRE scores as k increases, which is due to the privacy/utility tradeoff. Nonetheless, AGS allows fairly accurate query answering on each dataset by having an AvgRE score of less than 1 for all k. The results suggest our approach can be effective, even when a high level of privacy is required. Furthermore, we observe that the AvgRE scores for D Pop 4 are lower than those of D Smp 4, which are in turn lower than those of D Pop 50. This implies that query answers are more accurate for small domain sizes and large datasets. D. Prioritizing Attributes Finally, we investigated how configurations of attribute weighting affect information loss. Fig. 8 reports the results for D Pop 50 and k =2when our algorithm is configured with weights ranging from 0.1 to 0.9. Observe that when w ICD =0.1 and w Age =0.9, AGS distorted Age values much less than ICD values. Similarly, AGS incurred less information loss for ICD than Age when we specified w ICD =0.9 and w Age =0.1.This result implies that data managers can use weights to achieve desired utility for either attribute. VI. DISCUSSION In this section, we discuss how our approach can be extended to prevent a privacy threat in addition to reidentification and its limitations. A. Attacks Beyond Reidentification Beyond reidentification is the threat of sensitive itemset disclosure, in which a patient is associated with a set of diagnosis codes that reveal some sensitive information (e.g., HIV status). kanonymity does not guarantee prevention of sensitive itemset disclosure, since a large number of records that are indistinguishable with respect to the potentially identifying diagnosis codes can still contain the same sensitive itemset [59]. We note that our approach can be extended to prevent this attack by controlling generalization and suppression to ensure that an additional principle is satisfied, such as ldiversity [60], which dictates how sensitive information is grouped. This extension, however, is beyond the scope of this paper. B. Limitations The proposed approach is limited in certain aspects, which we highlight to suggest opportunities for further research. First, our algorithm induces minimal distortion to the data in practice, but it does not limit the amount of information loss incurred by generalization and suppression. Designing algorithms that provide this type of guarantee is important to enhance the quality of anonymized GWASrelated datasets, but is also computationally challenging due to the large search spaces involved, particularly for longitudinal data. Second, the approach we propose does not guarantee that the released data remain useful for scenarios in which prespecified analytic tasks, such as the validation of known GWAS [22], are known to data owners apriori.to address such scenarios, we plan to design algorithms that take the tasks for which data are anonymized into account during anonymization.
10 422 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012 VII. CONCLUSION AND FUTURE WORK This study was motivated by the growing need to disseminate patientspecific longitudinal data in a privacypreserving manner. To the best of our knowledge, we introduced the first approach to sharing such data while providing computational privacy guarantees. Our approach uses sequence alignment and clusteringbased heuristics to anonymize longitudinal patient records. Our investigations suggest that it can generate longitudinal data with a low level of information loss and remain useful for biomedical analysis. This was illustrated through extensive experiments with data derived from the EMRs of thousands of patients. The approach is not guided by specific utility (e.g., satisfaction of GWAS validation), but we are confident it can be extended to support such endeavors. APPENDIX A INFORMATION LOSS INCURRED BY ANONYMIZING THE DATASETS USING AGS pair (401.1, [39 40]). Furthermore, observe that is a leaf node in Fig. 4; hence, the set of possible ICD codes is {401.1}. Similarly, the subtree rooted by [39 40] in Fig. 3 consists of two leaf nodes, hence the set of possible Age values is {39, 40}. Therefore, there are two possible pairs: {(401.1, 39), (401.1, 40)}, and the probability that one of the trajectories was originally harboring (401.1, 39) is 1 2. Then, an approximate answer for the query is computed as 1 2 2=1. The Relative Error (RE) for an arbitrary query Q is computed as RE(Q) = a(q) e(q) /a(q). For instance, the RE for the previous example query is 1 1 /1 = 0since the original dataset in Fig. 1(a) contains one trajectory with (401.1, 39). The AvgRE for a workload of queries is the mean RE of all issued queries. It reflects the mean error in answering the query workload. ACKNOWLEDGMENT The authors would like to thank A. GkoulalasDivanis (IBM Research) for helpful discussions on the formulation of this problem. REFERENCES APPENDIX B MEASURING DATA UTILITY USING QUERY WORKLOADS The AvgRE measure captures the accuracy of answering queries on an anonymized dataset. The queries we consider can be modeled as follows: Q: SELECT COUNT(*) FROM dataset WHERE (u ICD, v Age) dataset,... Let a(q) be the answer of a COUNT() query Q when it is issued on the original dataset. The value of a(q) can be easily obtained by counting the number of trajectories in the original dataset that contain the (ICD, Age) pairs in Q. Let e(q) be the answer of Q when it is issued on the anonymized dataset. This is an estimate because a generalized value is interpreted as any leaf node in the subtree rooted by that value in the DGH. Therefore, an anonymized pair may correspond to any pair of possible ICD codes and Age values, assuming each pair is equally likely. The value of e(q) can be obtained by computing the probability that a trajectory in the anonymized dataset satisfies Q, and then summing these probabilities across all trajectories. To illustrate how an estimate can be computed, assume that a data recipient issues a query for the number of patients diagnosed with ICD code at age 39 using the anonymized dataset in Fig. 1(c). Referring to the DGHs in Figs. 3 and 4, it can be seen that the only trajectories that may contain (401.1, 39) are the last two since they contain the generalized [1] D. Blumenthal, Stimulating the adoption of health information technology, NewEngl.J.Med., vol. 360, no. 15, pp , [2] D. A. Ludwick and J. Doucette, Adopting electronic medical records in primary care: Lessons learned from health information systems implementation experience in seven countries, Int. J. Med. Informat., vol. 78, pp , [3] A. K. Jha, C. M. DesRoches, E. G. Campbell, K. Donelan, S. R. Rao, T. G. Ferris, A. Shields, S. Rosenbaum, and D. Blumenthal, Use of electronic health records in U.S. hospitals, New Engl. J. Med., vol. 360, pp , [4] B. B. Dean, J. Lam, J. L. Natoli, Q. Butler, D. Aguilar, and R. J. Nordyke, Review: Use of electronic medical records for health outcomes research: A literature review, Med. Care Res. Rev., vol. 66, pp , [5] K. Holzer and W. Gall, Utilizing IHEbased electronic health record systems for secondary use, Methods Inf. Med., vol. 50, no. 4, pp , [6] C. Safran, M. Bloomrosen, W. Hammond, S. Labkoff, S. MarkelFox, P. Tang, and D. E. Detmer, Toward a national framework for the secondary use of health data: An American medical informatics association white paper, J. Amer. Med. Informat. Assoc., vol. 14, pp. 1 9, [7] K. Tu, T. Mitiku, H. Guo, D. S. Lee, and J. V. Tu, Myocardial infarction and the validation of physician billing and hospitalization data using electronic medical records, Chron. Diseases Canada, vol. 30, pp , [8] C. A. McCarty, R. L. Chisholm, C. G. Chute, I. J. Kullo, G. P. Jarvik, E. B. Larson, R. Li, D. R. Masys, M. D. Ritchie, D. M. Roden, J. P. Struewing, and W. A. Wolf, The emerge network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies, BMC Med. Genomics, vol. 4, pp , [9] I. J. Kullo, K. Ding, H. Jouni, C. Y. Smith, and C. G. Chute, A genomewide association study of red blood cell traits using the electronic medical record, Public Library Sci. ONE, vol. 5, e13011, [10] J. A. Pacheco, P. C. Avila, J. A. Thompson, M. Law, J. A. Quraishi, A. K. Greiman, E. M. Just, and A. Kho, A highly specific algorithm for identifying asthma cases and controls for genomewide association studies, in Proc. Amer. Med. Informat. Assoc. Annu. Symp., 2009, pp [11] M. D. Ritchie, J. C. Denny, D. C. Crawford, A. H. Ramirez, J. B. Weiner, J. M. Pulley, M. A. Basford, K. BrownGentry, J. R. Balser, D. R. Masys, J. L. Haines, and D. M. Roden, Robust replication of genotypephenotype associations across multiple diseases in an electronic medical record, Amer. J. Human Genet., vol. 86, no. 4, pp , [12] M. N. Liebman, Personalized medicine: A perspective on the patient, disease and causal diagnostics, Personal. Med., vol. 4, no. 2, pp , 2007.
11 TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS 423 [13] A. Tucker and D. GarwayHeath, The pseudotemporal bootstrap for predicting glaucoma from crosssectional visual field data, IEEE Trans. Inf. Technol. Biomed., vol. 14, no. 1, pp , Jan [14] S. Jensen, Mining medical data for predictive and sequential patterns, in Proc. 5th Eur. Conf. Principles Pract. Knowl. Discov. Databases, 2001, pp [15] P. R. Burton, A. L. Hansell, I. Fortier, T. A. Manolio, M. J. Khoury, J. Little, and P. Elliott, Size matters: Just how big is big? Quantifying realistic sample size requirements for human genome epidemiology, Int. J. Epidemiol., vol. 38, pp , [16] C. C. Spencer, Z. Su, P. Donnelly, and J. Marchini, Designing genomewide association studies: Sample size, power, imputation, and the choice of genotyping chip, Public Library Sci. Genet., vol. 5, no. 5, e , [17] Policy for Sharing of Data Obtained in NIH Supported or Conducted GenomeWide Association Studies (GWAS), Nat. Inst. Health, Bethesda, MD, NOTOD , [18] M. D. Mailman, M. Feolo, Y. Jin, M. Kimura, K. Tryka, R. Bagoutdinov, L. Hao, A. Kiang, J. Paschall, L. Phan, N. Popova, S. Pretel, L. Ziyabari, M. Lee, Y. Shao, Z. Y. Wang, K. Sirotkin, M. Ward, M. Kholodov, K. Zbicz, J. Beck, M. Kimelman, S. Shevelev, D. Preuss, E. Yaschenko, A. Graeff, J. Ostell, and S. T. Sherry, The NCBI dbgap database of genotypes and phenotypes, Nature genet., vol. 39, no. 10, pp , [19] Standards for Protection of Electronic Health Information, Federal Register, Dept. Health, Human Services, Washington, DC, 45 CFR Pt. 164, [20] K. Benitez and B. Malin, Evaluating reidentification risks with respect to the HIPAA privacy rule, J. Amer. Med. Informat. Assoc., vol. 17,no.2, pp , [21] G. Loukides, J. C. Denny, and B. Malin, The disclosure of diagnosis codes can breach research participants privacy, J. Amer. Med. Informat. Assoc., vol. 17, no. 3, pp , [22] G. Loukides, A. GkoulalasDivanis, and B. Malin, Anonymization of electronic medical records for validating genomewide association studies, Proc. Nat. Acad. Sci., vol. 107, no. 17, pp , [23] G. Loukides, A. GkoulalasDivanis, and B. Malin, Privacypreserving publication of diagnosis codes for effective biomedical analysis, in Proc. 10th IEEE Int. Conf. Inf. Technol. Appl. Biomed., 2010, pp [24] L. Sweeney, kanonymity: A model for protecting privacy, Int. J. Uncertainty, Fuzz. Knowl.Based Syst., vol. 10, no. 5, pp , [25] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, Privacypreserving data publishing: A survey of recent developments, ACM Comput. Surv., vol. 42, no. 4, pp. 1 53, [26] B.C. Chen, D. Kifer, K. LeFevre, and A. Machanavajjhala, Privacypreserving data publishing, Found. Trends Databases, vol. 2, no. 1 2, pp , [27] R. S. Sandhu, E. J. Coyne, H. L. Feinstein, and C. E. Youman, Rolebased access control models, IEEE Comput., vol. 29, no. 2, pp , Feb [28] B. Pinkas, Cryptographic techniques for privacypreserving data mining, ACM Spec. Interest Group Knowl. Discov. Data Mining Explorat., vol. 4, no. 2, pp , [29] S. M. Diesburg and A.I. A. Wang, A survey of confidential data storage and deletion methods, ACM Comput. Surv., vol. 43, no. 1, pp. 1 37, [30] N. R. Adam and J. C. Wortmann, Securitycontrol methods for statistical databases: A comparative study, ACM Comput. Surv., vol. 21, no. 4, pp , [31] C. C. Aggarwal and P. S. Yu, A survey of randomization methods for privacypreserving data mining, in PrivacyPreserving Data Mining: Models and Algorithms (ser. Advances in Database Systems), vol. 34. New York: SpringerVerlag, 2008, pp [32] L. Willenborg and T. De Waal, Statistical Disclosure Control in Practice (ser. Lecture Notes in Statistics), vol New York: SpringerVerlag, 1996, pp [33] N. MarsdenHaug, V. Foster, P. Gould, E. Elbert, H. Wang, and J. Pavlin, Codebased syndromic surveillance for influenzalike illness by international classification of diseases, ninth revision, Emerg. Infect. Diseases, vol. 13, pp , [34] P. Samarati, Protecting respondents identities in microdata release, IEEE Trans. Knowl. Data Eng., vol. 13, no. 6, pp , Nov./Dec [35] K. El Emam, F. K. Dankar, R. Issa, E. Jonker, D. Amyot, E. Cogo, J. P. Corriveau, M. Walker, S. Chowdhury, R. Vaillancourt, T. Roffey, and J. Bottomley, A globally optimal kanonymity method for the deidentification of health data, J. Amer. Med. Informat. Assoc., vol. 16, no. 5, pp , [36] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, Mondrian multidimensional kanonymity, in Proc. 22nd IEEE Int. Conf. Data Eng., 2006, pp [37] T. Dalenius, Finding a needle in a haystack or identifying anonymous census record, J. Offic. Statist., vol. 2, no. 3, pp , [38] M. E. Nergiz and C. Clifton, Thoughts on kanonymization, Data Knowl. Eng., vol. 63, no. 3, pp , [39] G. Loukides and J. Shao, Capturing data usefulness and privacy protection in kanonymisation, in Proc. 22nd ACM Symp. Appl. Comput., 2007, pp [40] V. S. Iyengar, Transforming data to satisfy privacy constraints, in Proc. 8th ACM Int. Conf. Knowl. Discov. Data Mining, 2002, pp [41] K. El Emam and F. K. Dankar, Protecting privacy using kanonymity, J. Amer. Med. Informat. Assoc., vol. 15, no. 5, pp , [42] M. Terrovitis, N. Mamoulis, and P. Kalnis, Privacypreserving anonymization of setvalued data, in Proc. Very Large Data Bases Endowment, 2008, vol. 1, no. 1, pp [43] A. Tamersoy, G. Loukides, J. C. Denny, and B. Malin, Anonymization of administrative billing codes with repeated diagnoses through censoring, in Proc. Amer. Med. Informat. Assoc. Annu. Symp., 2010, pp [44] M. E. Nergiz, M. Atzori, Y. Saygin, and B. Guc, Towards trajectory anonymization: A generalizationbased approach, Trans. Data Privacy, vol. 2, no. 1, pp , [45] O. Abul, F. Bonchi, and M. Nanni, Never walk alone: Uncertainty for anonymity in moving objects databases, in Proc. 24nd IEEE Int. Conf. Data Eng., 2008, pp [46] M. Terrovitis and N. Mamoulis, Privacy preservation in the publication of trajectories, in Proc. 9th Int. Conf. Mobile Data Manag.,2008,pp [47] L. Sweeney, Achieving kanonymity privacy protection using generalization and suppression, Int. J. Uncertainty, Fuzz. Knowl.Based Syst., vol. 10, no. 5, pp , [48] C. C. Aggarwal and P. S. Yu, A condensation approach to privacy preserving data mining, in Proc. 9th Int. Conf. Extend. Database Technol., 2004, pp [49] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu, Achieving anonymity via clustering, in Proc. 25th ACM Symp. Principles Database Syst., 2006, pp [50] S. B. Needleman and C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Molecular Biol., vol. 48, no. 3, pp , [51] T. F. Smith and M. S. Waterman, Identification of common molecular subsequences, J. Molecular Biol., vol. 147, no. 1, pp , [52] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. Cambridge, MA: MIT Press, [53] J. DomingoFerrer and J. M. MateoSanz, Practical dataoriented microaggregation for statistical disclosure control, IEEE Trans. Knowl. Data Eng., vol. 14, no. 1, pp , Jan./Feb [54] J. DomingoFerrer and V. Torra, Ordinal, continuous and heterogeneous kanonymity through microaggregation, Data Mining Knowl. Discov., vol. 11, no. 2, pp , [55] D. M. Roden, J. M. Pulley, M. A. Basford, G. R. Bernard, E. W. Clayton, J. R. Balser, and D. R. Masys, Development of a largescale deidentified DNA biobank to enable personalized medicine, Clin. Pharmacol. Therapeut., vol. 84, no. 3, pp , [56] A. H. Ramirez, J. S. Schildcrout, D. L. Blakemore, D. R. Masys, J. M. Pulley, M. A. Basford, D. M. Roden, and J. C. Denny, Modulators of normal electrocardiographic intervals identified in a large electronic medical record, Heart Rhythm, vol. 8, no. 2, pp , [57] J. C. Denny, M. D. Ritchie, D. C. Crawford, J. S. Schildcrout, A. H. Ramirez, J. M. Pulley, M. A. Basford, D. R. Masys, J. L. Haines, and D. M. Roden, Identification of genomic predictors of atrioventricular conduction: Using electronic medical records as a tool for genome science, Circulation, vol.122, pp ,2010. [58] C. C. Aggarwal, On kanonymity and the curse of dimensionality, in Proc. 31st Int. Conf. Very Large Data Bases, 2005, pp [59] Y. Xu, K. Wang, A. W.C. Fu, and P. S. Yu, Anonymizing transaction databases for publication, in Proc. 14th ACM Int. Conf. Knowl. Discov. Data Mining, 2008, pp [60] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, ldiversity: Privacy beyond kanonymity, in Proc. 22nd IEEE Int. Conf. Data Eng., 2006, pp Authors photographs and biographies not available at the time of publication.
Trail ReIdentification: Learning Who You Are From Where You Have Been
Trail ReIdentification: Learning Who You Are From Where You Have Been Bradley Malin Data Privacy Laboratory School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 USA malin@cs.cmu.edu
More information1 NOT ALL ANSWERS ARE EQUALLY
1 NOT ALL ANSWERS ARE EQUALLY GOOD: ESTIMATING THE QUALITY OF DATABASE ANSWERS Amihai Motro, Igor Rakov Department of Information and Software Systems Engineering George Mason University Fairfax, VA 220304444
More informationClean Answers over Dirty Databases: A Probabilistic Approach
Clean Answers over Dirty Databases: A Probabilistic Approach Periklis Andritsos University of Trento periklis@dit.unitn.it Ariel Fuxman University of Toronto afuxman@cs.toronto.edu Renée J. Miller University
More informationARTICLE 29 DATA PROTECTION WORKING PARTY
ARTICLE 29 DATA PROTECTION WORKING PARTY 0829/14/EN WP216 Opinion 05/2014 on Anonymisation Techniques Adopted on 10 April 2014 This Working Party was set up under Article 29 of Directive 95/46/EC. It is
More informationTHE PROBLEM OF finding localized energy solutions
600 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 45, NO. 3, MARCH 1997 Sparse Signal Reconstruction from Limited Data Using FOCUSS: A Reweighted Minimum Norm Algorithm Irina F. Gorodnitsky, Member, IEEE,
More informationPrivacy Impact Assessment: care.data
High quality care for all, now and for future generations Document Control Document Purpose Document Name Information Version 1.0 Publication Date 15/01/2014 Description Associated Documents Issued by
More informationGeneric Statistical Business Process Model
Joint UNECE/Eurostat/OECD Work Session on Statistical Metadata (METIS) Generic Statistical Business Process Model Version 4.0 April 2009 Prepared by the UNECE Secretariat 1 I. Background 1. The Joint UNECE
More informationData Quality Assessment: A Reviewer s Guide EPA QA/G9R
United States Office of Environmental EPA/240/B06/002 Environmental Protection Information Agency Washington, DC 20460 Data Quality Assessment: A Reviewer s Guide EPA QA/G9R FOREWORD This document is
More informationWhy Do We Need Them? What Might They Be? APA Publications and Communications Board Working Group on Journal Article Reporting Standards
Reporting Standards for Research in Psychology Why Do We Need Them? What Might They Be? APA Publications and Communications Board Working Group on Journal Article Reporting Standards In anticipation of
More informationTHE development of methods for automatic detection
Learning to Detect Objects in Images via a Sparse, PartBased Representation Shivani Agarwal, Aatif Awan and Dan Roth, Member, IEEE Computer Society 1 Abstract We study the problem of detecting objects
More informationEngineering Privacy by Design
Engineering Privacy by Design Seda Gürses, Carmela Troncoso, and Claudia Diaz K.U. Leuven/IBBT, ESAT/SCDCOSIC firstname.lastname@esat.kuleuven.be Abstract. The design and implementation of privacy requirements
More informationCLoud Computing is the long dreamed vision of
1 Enabling Secure and Efficient Ranked Keyword Search over Outsourced Cloud Data Cong Wang, Student Member, IEEE, Ning Cao, Student Member, IEEE, Kui Ren, Senior Member, IEEE, Wenjing Lou, Senior Member,
More informationUsing Root Cause Analysis to Handle Intrusion Detection Alarms
Using Root Cause Analysis to Handle Intrusion Detection Alarms Dissertation zur Erlangung des Grades eines Doktors der Naturwissenschaften der Universität Dortmund am Fachbereich Informatik von Klaus Julisch
More informationClinical Decision Support Systems: State of the Art
Clinical Decision Support Systems: State of the Art Prepared for: Agency for Healthcare Research and Quality U.S. Department of Health and Human Services 540 Gaither Road Rockville, MD 20850 www.ahrq.gov
More informationWhat s Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks
Journal of Machine Learning Research 6 (2005) 1961 1998 Submitted 8/04; Revised 3/05; Published 12/05 What s Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks
More informationCostAware Strategies for Query Result Caching in Web Search Engines
CostAware Strategies for Query Result Caching in Web Search Engines RIFAT OZCAN, ISMAIL SENGOR ALTINGOVDE, and ÖZGÜR ULUSOY, Bilkent University Search engines and largescale IR systems need to cache
More informationChord: A Scalable Peertopeer Lookup Service for Internet Applications
Chord: A Scalable Peertopeer Lookup Service for Internet Applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT Laboratory for Computer Science chord@lcs.mit.edu
More informationSteering User Behavior with Badges
Steering User Behavior with Badges Ashton Anderson Daniel Huttenlocher Jon Kleinberg Jure Leskovec Stanford University Cornell University Cornell University Stanford University ashton@cs.stanford.edu {dph,
More informationRevealing Information while Preserving Privacy
Revealing Information while Preserving Privacy Irit Dinur Kobbi Nissim NEC Research Institute 4 Independence Way Princeton, NJ 08540 {iritd,kobbi }@research.nj.nec.com ABSTRACT We examine the tradeoff
More informationYou Might Also Like: Privacy Risks of Collaborative Filtering
You Might Also Like: Privacy Risks of Collaborative Filtering Joseph A. Calandrino 1, Ann Kilzer 2, Arvind Narayanan 3, Edward W. Felten 1, and Vitaly Shmatikov 2 1 Dept. of Computer Science, Princeton
More informationMisunderstandings between experimentalists and observationalists about causal inference
J. R. Statist. Soc. A (2008) 171, Part 2, pp. 481 502 Misunderstandings between experimentalists and observationalists about causal inference Kosuke Imai, Princeton University, USA Gary King Harvard University,
More informationTop 10 algorithms in data mining
Knowl Inf Syst (2008) 14:1 37 DOI 10.1007/s1011500701142 SURVEY PAPER Top 10 algorithms in data mining Xindong Wu Vipin Kumar J. Ross Quinlan Joydeep Ghosh Qiang Yang Hiroshi Motoda Geoffrey J. McLachlan
More informationAnomalous System Call Detection
Anomalous System Call Detection Darren Mutz, Fredrik Valeur, Christopher Kruegel, and Giovanni Vigna Reliable Software Group, University of California, Santa Barbara Secure Systems Lab, Technical University
More informationExtracting k Most Important Groups from Data Efficiently
Extracting k Most Important Groups from Data Efficiently Man Lung Yiu a, Nikos Mamoulis b, Vagelis Hristidis c a Department of Computer Science, Aalborg University, DK9220 Aalborg, Denmark b Department
More informationClimate Surveys: Useful Tools to Help Colleges and Universities in Their Efforts to Reduce and Prevent Sexual Assault
Climate Surveys: Useful Tools to Help Colleges and Universities in Their Efforts to Reduce and Prevent Sexual Assault Why are we releasing information about climate surveys? Sexual assault is a significant
More informationDefining and Testing EMR Usability: Principles and Proposed Methods of EMR Usability Evaluation and Rating
Defining and Testing EMR Usability: Principles and Proposed Methods of EMR Usability Evaluation and Rating HIMSS EHR Usability Task Force June 2009 CONTENTS EXECUTIVE SUMMARY... 1 INTRODUCTION... 2 WHAT
More informationChapter 2 Survey of Biodata Analysis from a Data Mining Perspective
Chapter 2 Survey of Biodata Analysis from a Data Mining Perspective Peter Bajcsy, Jiawei Han, Lei Liu, and Jiong Yang Summary Recent progress in biology, medical science, bioinformatics, and biotechnology
More informationA Principled Approach to Bridging the Gap between Graph Data and their Schemas
A Principled Approach to Bridging the Gap between Graph Data and their Schemas Marcelo Arenas,2, Gonzalo Díaz, Achille Fokoue 3, Anastasios Kementsietsidis 3, Kavitha Srinivas 3 Pontificia Universidad
More informationRECENTLY, there has been a great deal of interest in
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 1, JANUARY 1999 187 An Affine Scaling Methodology for Best Basis Selection Bhaskar D. Rao, Senior Member, IEEE, Kenneth KreutzDelgado, Senior Member,
More informationNo Free Lunch in Data Privacy
No Free Lunch in Data Privacy Daniel Kifer Penn State University dan+sigmod11@cse.psu.edu Ashwin Machanavajjhala Yahoo! Research mvnak@yahooinc.com ABSTRACT Differential privacy is a powerful tool for
More information