# Trail Re-Identification: Learning Who You Are From Where You Have Been

Save this PDF as:

Size: px
Start display at page:

Download "Trail Re-Identification: Learning Who You Are From Where You Have Been"

## Transcription

3 kinds of releases are provided in this subsection, but first, assumptions inherent in this work are stated. Assumption 2.1 (Per Location Release). Each data-collecting location c releases data that was collected at c and from no external source. Assumption 2.2 (Uniqueness of Tuples). In a location s deidentified and identified tables, each tuple is unique. Therefore, a de-identified or identified table represents a set of references to people, machines, or other entities that have visited the location, but not necessarily the frequency of visits. These references narrowly relate to a person, machine, household or other entity to be identified. Definition 2.2 presents a definition for released data that adheres to a representative property in which only tuples present in the de-identified table have corresponding tuples in the identified table, and vice versa. Example 2.2 provides a sample. Definition 2.2 (Representative) Let the table τ be vertically partitioned by V id and V de such that V id: τ τ + and V de: τ τ -, where τ - is the de-identified table, and τ + is the identified table. The tables τ - and τ + are representative if and only if: (1) t id τ + and t de τ -, V id -1 (t id )=V de -1 (t de ); and, (2) τ + = τ -. 1 In releases that adhere to the representative property, every tuple from the data-collecting location present in the de-identified table is also present in the identified table, and vice versa. Example 2.2 Figure 1 depicts a representative release resulting from a representative vertical partitioning of the data in Figure 2 into a de-identified table (τ - ) and an identified table (τ + ). Each released tuple has values in both tables. Name Birthdate Gender ID Zip DNA John Smith 2/18/45 M acag t Mary Doe 4/9/75 F accg a Bob Little 2/26/49 M cttg a Kate Erwin 11/3/54 F atcg t Fran Booth 1/8/71 F accg t Figure 2. Original data collected by a hospital. Releases that are representative are not always practical. In some situations, a data-collecting location may not have collected both identified and de-identified data on all visitors to their location or may not want to share all collected information. In these cases, either the de-identified table or the identified table is incomplete, providing a release that is appropriate (Definition 2.3). Samples are provided in Examples 2.3 and 2.4. Definition 2.3 (Appropriate) Let the table τ be vertically partitioned by V id and V de such that V id: τ τ + and V de: τ τ -, where τ - is the de-identified table, and τ + is the identified table. The tables τ - and τ + are appropriate if either (1) t id τ +,there exist t de τ - such that V -1 id (t id )=V -1 de (t de ); or, (2) t de τ -,there exist t id τ + such that V -1 id (t id ) = V -1 de (t de ). In (1), τ + is the appropriate table to τ - ; and in (2), τ - is the appropriate table to τ +. Example 2.3 Consider an online store in which all purchases are made at the store s website. An online consumer may visit the 1 V -1 is the inverse function of V. store and not necessarily make a purchase; of so, his de-identified IP address may be collected, but there is no accompanying name and address due to lack of purchase. The store s release of all names of purchasers as the identified table and all logged IP addresses as the de-identified table is a release that adheres to the appropriate property. Example 2.4 Figure 3 depicts a release resulting from an appropriate vertical partitioning of the data in Figure 2 into a deidentified table (τ - ) and an identified table (τ + ). DNA sequences for Bob Little and Kate Erwin appear in τ -, but there are no DNA sequences for John Smith or Mary Doe. τ + τ - Name Birthdate Gender Zip DNA John Smith 2/18/45 M cttg a Mary Doe 4/9/75 F atcg t Bob Little 2/26/49 M Kate Erwin 11/3/54 F Figure 3. Release by a hospital, with partitioning into an identified table (τ + ) of patient demographics and an appropriate de-identified table (τ - ) containing DNA sequences. 2.2 Data Trails Given a set of data-collecting locations, where all locations share either releases in a representative or an appropriate manner, visits across locations can be tracked by observing which locations reported which visits. These observations are made explicit by constructing a matrix of shared de-identified data and a matrix of shared identified data. These matrices are termed a de-identified track and an identified track, respectively, (Definition 2.4). Definition 2.4 (De-identified and Identified Tracks) Let C be the set of data-collecting locations that share their identified tables, τ + c, over the attributes A + and de-identified tables, τ - c,over the attributes A -,wherec C. Let B be a vector containing the members of C, andt + be the set of all {τ + c }andt - be the set of all {τ - c } for each c C. Either: (1) τ + c and τ - c are representative; (2) τ + c is appropriate to τ - c for each c C; or, (3) τ - c is appropriate to τ + c for each c C. The de-identified track, N, is a matrix having A - + C columns. The contents of N are the same as those realized by FillTrack(N, A -, T - ). Similarly, the identified track, P, is a matrix having A + + C columns and the contents of P are the same as those realized by FillTrack(P, A +, T + ). The number of rows in N and P are: C U τ c c= 1 and U C τ + c c= 1, respectively. FillTrack(Track T, Attributes A, Tables {τ c }) Steps: let each cell in T be initialized to 0 for each location c C: For each tuple t i τ c let b be the index of c in B if there does not exist T j[1,, A ] t i, where j=1,, T then: let k be the first unused row in N // has all 0 s N k [1,, A ] = t i and N k[ A +b] =1 Else: N j[ A +b] =1 // another location found // N x[y] is the cell in the x-th row and y-th column of N // N x[a,...,b] is the vector in the x-th row having columns a to b of N 3

4 A de-identified track (and an identified track) is a large matrix where each row contains information about a visit and lists the locations in which that visit was reported. The first group of columns in the track is the information collected about a subject on the subject s visit to a location. The second group of columns is a list of locations. Values associated with locations are 1 if the subject visited the location and a 0 otherwise. In a representative release, the identified track (P) andthede- identified track (N) are representative. If the tables that construct N are each appropriate to the tables that constitute P, track N is appropriate to P. Likewise, if the tables that construct P are each appropriate to the tables that constitute N, track P is appropriate to N. Examples 2.5 and 2.6 provide samples. Example 2.5 Figure 5 shows the identified track (P) and deidentified track (N) for the releases found in Figure 4, which adhere to the representative property. P N Name h 1 h 2 H 3 DNA h 1 h 2 h 3 John acag t Mary accg a Bob cttg a Kate atcg t Figure 4. De-identified track (N) and identified track (P). Example 2.6 Figure 6 shows 3 hospitals performing appropriate releases in which the de-identified DNA tables are appropriate to the tables of names. These releases provide the tracks in Figure 7 in which track N, which results from the DNA tables, is appropriate to P, which results from the tables of names. In both a de-identified track and an identified track, the rightmost columns are associated with locations. The vectors of binary values associated with those columns are trails. They show the locations where the person, machine, or entity that is the subject of the visits has been. Trails are described in Definition 2.5. Definition 2.5 (Trail) Let C be the set of data-collecting locations that share their identified (or de-identified) tables in track T. The shared tables are over the attributes A. Atrail for subject j is the vector T j[ A +1,, A + C ]. For convenience, T j[ A +1,, A + C ]. is written trail(t, j). A trail for a subject is a vector of binary values where a value of 1 indicates the subject visited the location and 0 otherwise. See Example 2.7. Hosp1 (+) Hosp1 (-) Hosp2 (+) Hosp2 (-) Name DNA Name DNA John acag t John acag t Mary accg a Bob cttg a Hosp3 (+) Hosp3 (-) Name DNA Mary accg a Bob cttg a Kate atcg t Figure 5. Releases by 3 hospitals that adhere to the representative property. Example 2.7 Given the identified track P in Figure 5, [1,1,0] is a trail for John and [0,1,1] is a trail for Bob. Given the deidentified track N in Figure 5, [1,0,1] is a trail for accg a and [0,0,1] is a trail for atcg t. The de-identified and identified tracks in Figure 5 were constructed from the releases reported in Figure 4. These tracks adhere to the representative property, and the resulting trails are complete trails. The notion of a complete trail is presented in Definition 2.6. Definition 2.6 (Complete Trail) Let C be the set of datacollecting locations that share their identified (or de-identified) tables in track T such that the shared tables from each location c C is representative. A complete trail is a trail in T. In a complete trail, values represent the unambiguous presence or absence of a subject at a location such that 0 signifies the subject of the trail did not visit the location and 1 signifies the subject of the trail definitely visited the location. If de-identified and identified tracks are constructed from a release that adheres to the appropriate property, then the trails in the appropriate track are all incomplete trails. Definition 2.7 presents an incomplete trail and samples are provided in Example 2.8. Definition 2.7 (Incomplete Trail) Let C be the set of datacollecting locations that share their identified and de-identified tables in tracks T 1 and T 2 such that T 1 is the appropriate track of T 2 for all data holdrs c C. An incomplete trail is a trail in T 1. In an incomplete trail, a value of 1 represents the definite presence of the subject at a location and a value of 0 suggests ambiguity. The subject may or may not have visited the location. Hosp1 (+) Hosp1 (-) Hosp2 (+) Hosp2 (-) Name DNA Name DNA John acag t John acag t Mary Bob cttg a Hosp3 (+) Hosp3 (-) Name DNA Mary atcg t Bob Kate Figure 6. Released identified tables (+) and de-identified tables (-) with the appropriate property. Example 2.8 In Figure 7 the de-identified track N has the following incomplete trails: [0,1,0]; [1,0,0]; [0,1,0]; and [0,0,1]. All the trails in the identified track P are complete. P N Name h 1 h 2 h 3 DNA h 1 h 2 h 3 John acag t Mary accg a Bob cttg a Kate atcg t Figure 7. De-identified track (N) and identified track (P) from the partitions in Figure 6. P has complete trails. N has incomplete trails. An incomplete trail can match ambiguously to several complete trails. This notion of containment forms the basis for subtrails and supertrails. See Definition 2.8 and Example 2.9. Definition 2.8 (Subtrails / Supertrails) Let C be the set of datacollecting locations that share their identified and de-identified 4

5 tables in tracks T 1 and T 2 such that T 1 is appropriate to T 2. The shared tables are over the attributes A. Let x beatrailsfromt 1 and y beatrailfromt 2. xis a subtrail of y (written x y)and y is a supertrail of x (written y x) if and only if: T 1[x] [d] T 2[y][d] for d= A +1,, A + C. Example 2.9 [1,0,0], [0,1,0], and [1,1,0] are subtrails of [1,1,0]. [1,1,0] and [0,1,1] are supertrails of [0,1,0]. With respect to tracks, recall the properties of representative (2.2) and appropriate (2.3). If tracks N and P are constructed from representative releases, then for any particular entity x in the tracks, N[x][d] must equal P[x][d] for all locations d. Furthermore, if N and P are constructed from releases that are appropriate, then for any particular entity x in the tracks, N[x][d] must be P [x][d] for all locations d if N is appropriate to P. N consists of incomplete trails and P consists of complete trails. The converse is true if P is appropriate to N. Tracks A and B are one-to-one if for every entity x represented by a trail in track Athere exists only one trail in track B that correctly corresponds to x. Tracks A and B are one-to-many if every trail in A may correctly be linked to one or more trails in track B. Now that de-identified and identified tracks are understood and how complete and incomplete trails relate to these tracks, the trail re-identification problem can be presented. See Definition 2.9 and Example Definition 2.9 (Trail Re-identification Problem) Let C be the set of data-collecting locations whose shared tables result in deidentified track N and identified track P over the attributes A - and A +, respectively. Let there exist a function f: A B, wherea {N, P}and B={N, P}-{A}. A trail re-identification results for a subject s when there exists an i, such that f(a s[1,, A - ]) = B i[1,, A + ]. The goal is to determine the proper function f. Example 2.10 The de-identified track N and identified track P in Figure 4 result from the hospital releases shown in Figure 5. f([ acag t ]) = [ John ], f([ accg a ]) = [ Mary ], f([ cttg a ]) = Bob ], and f([ atcg t ]) = [ Kate ] are all correct trail reidentifications. This section precisely described how people machines, and other entities leave information behind at visited locations, how that information can be shared resulting in trails and how those trails can pose a trail re-identification problem. In the next section, three novel algorithms for performing trail re-identifications are presented. Afterwards, results are reported on real-world data. The paper ends with discussions on related work and pertinent issues. 3. REIDIT ALGORITHMS Given C, the set of data-collecting locations whose shared tables result in de-identified track N and identified track P over the attributes A - and A +, respectively, algorithms that exploit the uniqueness of trails in N and P can be written to perform trail reidentifications. The three algorithms presented in this section are variants of this approach. Collectively, they are termed Reidentification of Data in Trails (REIDIT). 3.1 REIDIT-Complete The first algorithm is named REIDIT-C. It performs exact match on the trails of N and P. REIDIT-C assumes that both N and P are representative, and therefore, REIDIT-C only works on complete trails. For every trail in N, REIDIT-C determines if there exists one and only one trail in P such that the trails are equal. When there is an exact and unique match, then trail(n,n) is re-identified to explicitly identifying information in P. If trail(n,n) is equal to trail(p,p), and there exists another trail(p,p ) also equal to trail(n,n), then there is ambiguity and no re-identification can occur. The formalization of REIDIT-C is provided in Figure 8. Complexity. First, the outer loop iterates over all of the records in N, whichis N iterations. Second, for each iteration in N, the algorithm iterates a maximum of P times. This provides O( N P ) or O( N 2 ) because N = P. This is an artifact of the way in which the pseudo code is written. Another version could be written in which each set of trails are sorted and then compared, resulting in O( N log N ). Algorithm: REIDIT-C(N, P) Input: De-identified and Identified Tracks N and P over attributes A - and A +, respectively, for the same data-collecting locations. Output: Set of trail re-identifications R Assumes: 1)N and P are representative, 2) N and P are one-toone Steps: let R = for n=1 to N let M = for p= 1to P if trail(n,n) trail(p,p) M = M P p[1,, A + ] s = p if M 1 R = R {(P s[1,, A + ], N n[1,, A - )} return R Figure 8. Pseudocode for REIDIT-C. Theorem 3.1 Trail re-identifications from REIDIT-C are correctly re-identified. PROOF: First, recall the underlying assumption of the completerelease model: tuples of both tables N and P consist only of complete trails. Thus, at location i, a visit from an entity must be - recorded in both T i and T + i. Since this holds true for every location, for each trail(n,n), there must exist at minimum one equivalent trail(p,p). If there exists more than one equivalent trail in P for trail(n,n), then multiple trails will be recognized and the singleton requirement will not be satisfied. No re-identification will be recorded. 3.2 REIDIT-Incomplete The second algorithm is named REIDIT-I. It performs subtrail/suptertrail matching on the trails of N and P. REIDIT-I assumes either N is appropriate to P or P is appropriate to N. When an incomplete trail can be matched to a single complete trail, a trail re-identification occurs. Unlike REIDIT-C, equality cannot be used for matching trails. Instead, containment of subtrails by supertrails is used. For each trail in the track containing incomplete trails, the set of its supertrails from the 5

6 track containing complete trails are found. If there is only one supertrail, then a trail re-identification has occurred. The reidentified trails from N and from P are removed. Processing continues until no more re-identifications can be made because one of two conditions is satisfied: either (1) N or P have no more trails to process; or, (2) there are no re-identifications made in the current iteration. REIDIT-I appears in Figure 9. Complexity. Let X be N and Y be P. First, the outer loop iterates over all of the records in N, whichis N iterations. Second, for each iteration in N, the algorithm iterates a maximum of P times. Finally, the nested for process continues until no re-identifications are made during the while loop and the while may iterate a maximum of N. times. This provides O( N 2 P ). Algorithm: REIDIT-I (X, Y) Input: From de-identified and Identified Tracks N and P over attributes A - and A +, respectively, for the same data-collecting locations, X is the appropriate table of N or P and Y is the other table. Output: Set of trail re-identifications R Assumes: 1) X has incomplete trails and Y has complete trails. 2) X and Y are one-to-one. Steps let R = Do FoundOne = False for n=1 to X let M = for p= 1to Y if trail(x,n) [null,, null] and trail(y,p) [null,, null] and trail(x,n) trail(y,p) M = M Y p[1,, A + ] s = p If M 1 R = R {(Y s[1,, A + ], X n[1,, A - )} X n[ A - +1,, A - + C ] = [null,, null] // remove Y s[ A + +1,, A + + C ] = [null,, null] // remove FoundOne = True while X has non-null tuples and FoundOne false return R Figure 9. Pseudocode for REIDIT-I. Theorem 3.2 Trail re-identifications from REIDIT-I are correctly re-identified. PROOF: For convenience, assume that N is appropriate to P. N has incomplete trails and P has complete trails. From the definition of incomplete trails, all 1 s are correct, so it must be true that for an arbitrary trail in N, there must exist a non-null set of supertrails whose identifying information appears in M ( M 1) for trail(n,n). If M is equal to 1, then there exists only one complete supertrail that could be reconstructed for trail(n,n) through the replacement of 0 s with 1 s. Therefore trail(n,n) is re-identified in M. In the event when M > 1, then the algorithm can still converge to a correct re-identification as follows. Let M equal k. When a re-identification is made for a trail other than trail(n,n), then M decreases by 1. Because it is already known that M has a minimum of 1, if M -1 re-identifications are made for trails of N, excluding trail(n,n), each with a member from M, then the remaining member of M must re-identify trail(n,n). 3.3 REIDIT-Multiple The third algorithm is named REIDIT-M. It allows multiple references in P to be related to only one reference in N, orvice versa. For example, multiple individuals in a shared setting, such as a household, can use the same computer. Online purchasers, in this case, would have multiple identities related to the same IP address. The reverse is also possible. One person could use more than one computer and therefore one reference in P would relate to multiple references in N. REIDIT-M addresses collocation issues, such as these. REIDIT-M assumes either N is appropriate to P or P is appropriate to N. Unlike REIDIT-I, the REIDIT-M algorithm relaxes the assumption that there must be one-to-one relationship between trails. If an incomplete trail is a subtrail of only one supertrail, then a re-identification occurs via a linkage between these two trails. Multiple subtrails can map to the same supertrail and permit a re-identification. REIDIT-M is provided in Figure 10. Algorithm: REIDIT-M (X, Y) Input: From de-identified and Identified Tracks N and P over attributes A - and A +, respectively, for the same data-collecting locations, X is the appropriate table of N or P and Y is the other table. Output: Set of trail re-identifications R Assumes: 1) X has incomplete trails and Y has complete trails. 2) X to Y is one-to-many. Steps let R = for n=1 to X let M = for p= 1to Y if trails(x,n) trails(y,p) M = M Y p[1,, A + ] s = p if M 1 R = R {(Y s[1,, A + ], X n[1,, A - )} return R Figure 10. Pseudocode for REIDIT-M. Complexity. Let X be N and Y be P. First, the outer loop iterates over all of the records in N, whichis N iterations. Second, for each iteration in N, the algorithm iterates a maximum of P times. Thus the algorithm is O( N P ). 4. THEORETICAL VS. ACTUAL RE- IDENTIFICATION Theoretically, for both REIDIT-C and REIDIT-I, the maximum number of trail re-identifications is dependent on the number of permutations of a binary string. Given an identified track P, containing references to subjects and the locations visited, and a set of data-collecting locations C, if P C, then the maximum number of trail re-identifications is bounded by the number of subjects P, which implicates that all trails may be re-identified. When P > C, the maximum number of trail re-identifications is bounded by the number of locations in the exponential manner 2 C -1. When P >2 C, it will be impossible to re-identify all trails. In contrast, for REIDIT-M, the number of re-identifications is independent of the number of data-collecting locations, because it is possible for multiple trails in P to be mapped to a single 6

10 7. CONCLUDING REMARKS AND FUTURE RESEARCH The REIDIT algorithms provide deterministic methods for learning who (by name or explicit identity) has been where. The methodology involves constructing trails across locations from small amounts of seemingly anonymous or innocuous evidence the person has been there. Trails are also constructed on places where the person has left explicit information of their presence. Identifying uniqueness and inferences across these two sets of the trails relates information about where the person has been to who they are. Only binary trails were examined in this work. However, one possible extension of this research is in the design and evaluation of models that allow for the probabilistic qualification of trail bits. This qualification would permit an interesting optimization problem for re-identification, allowing some locations to be weighted more than others. Future research will also need to address such issues as error and other kinds of incomplete and data quality issues. The REIDIT algorithms are challenging and timely to society. The American public wants to feel safe and is therefore looking for protection through various kinds of electronic surveillance systems. An extension of the REIDIT algorithms can be used to track when suspicious people tend to travel together or be in the same places. The REIDIT algorithms are also important to society because Americans are seeking more safety without compromising privacy unnecessarily. Clearly, the REIDIT algorithms exasperate privacy concerns. The fact that trail re-identification can be done, as evidenced by the existence of this work, informs society and data privacy researchers of a real challenge. Currently, there is no work documenting how particular protection schemas might thwart the various trail re-identification methods presented herein. Because trail re-identification is a novel strategy for reidentification, it provides a new and important direction for future research in data privacy. The challenge is for some researchers to attempt to thwart these approaches by improving data privacy methods, while others try to improve their ability to learn. In this open and aggressive pursuit on both sides, we as computer scientists can best inform society and help play a crucial role in the debates of our time. [2] Fellegi, I.P., and Sunter, A.B. A theory for record linkage. Journal of the American Statistical Association. 64: , [3] Kraut, R., Kiesler, S., Boneva, B., Cummings, J., Helgeson, V., and Crawford, A. Internet paradox revisited. Journal of Social Issues. 58: 49-74, [4] Malin, B., and Sweeney, L. Compromising privacy in distributed population-based databases with trail matching: A DNA Example. Tech Report CMU-CS School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. Dec [5] Malin, B., and Sweeney, L.. Re-identification of DNA through an automated process. In Proc American Medical Informatics Association Annual Symposium, Washington, DC, pp , Nov [6] Newcombe, H.B., Kennedy, J.M., Axford, S.J., and James, A.P. Automatic linkage of vital records. Science. 130: , [7] State of Illinois Health Care Cost Containment Council. Data release overview. Springfield, 1998.Sweeney, L. Guaranteeing anonymity when sharing medical data, the Datafly system. In Proc American Medical Informatics Association Annual Symposium, pp 51-55, Washington, DC, Nov [8] Sweeney, L. Uniqueness of simple demographics in the U.S. population. LIDAP-WP4. Laboratory for International Data Privacy, Carnegie Mellon University [9] Sweeney, L. Information explosion. Confidentiality, disclosure, and data access: theory and practical applications for statistical agencies, L. Zayatz, P. Doyle, J. Theeuwes and J. Lane (eds), Urban Institute, DC, [10] Sweeney, L. K-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems. 10(7): , [11] Torra, V. Re-identifying individuals using OWA operators. In Proceedings of the 6 th International Conference on Soft Computing. Iizuka, Fukuoka, Japan [12] Winkler, W.E. Matching and record linkage. In Business Survey Methods, New York, J. Wiley. pp , ACKNOWLEDGEMENTS The authors would like to thank Yiheng Li, Robert Murphy, Rema Padman, and Victor Weedn for useful discussions. Additional thanks are extended to Alan Montgomery, Robert Kraut, and the Homenet project for the use of their data. We also thank Joe Lombardo and Julie Pavlin for their support. This work was funded by the Data Privacy Laboratory at Carnegie Mellon University. REFERENCES [1] Altman, R.B. and Klein, T.E. Challenges for biomedical informatics and pharmacogenomics. Annu Rev Pharmacol Toxicol, 42, ,

### A Secure Protocol to Distribute Unlinkable Health Data

A Secure Protocol to Distribute Unlinkable Health Data Bradley Malin MS and Latanya Sweeney PhD Institute for Software Research International, School of Computer Science Carnegie Mellon University, Pittsburgh,

### Respected Chairman and the Members of the Board, thank you for the opportunity to testify today on emerging technologies that are impacting privacy.

Statement of Latanya Sweeney, PhD Associate Professor of Computer Science, Technology and Policy Director, Data Privacy Laboratory Carnegie Mellon University before the Privacy and Integrity Advisory Committee

### Data Privacy and Biomedicine Syllabus - Page 1 of 6

Data Privacy and Biomedicine Syllabus - Page 1 of 6 Course: Data Privacy in Biomedicine (BMIF-380 / CS-396) Instructor: Bradley Malin, Ph.D. (b.malin@vanderbilt.edu) Semester: Spring 2015 Time: Mondays

### Anonymization of Administrative Billing Codes with Repeated Diagnoses Through Censoring

Anonymization of Administrative Billing Codes with Repeated Diagnoses Through Censoring Acar Tamersoy, Grigorios Loukides PhD, Joshua C. Denny MD MS, and Bradley Malin PhD Department of Biomedical Informatics,

### Preserving Privacy by De-identifying Facial Images

Preserving Privacy by De-identifying Facial Images Elaine Newton Latanya Sweeney Bradley Malin March 2003 CMU-CS-03-119 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213-3890 Abstract

### Guidance on De-identification of Protected Health Information November 26, 2012.

Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule November 26, 2012 OCR gratefully

### Introduction to Data Mining

Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association

### IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria

### Degrees of De-identification of Clinical Research Data

Vol. 7, No. 11, November 2011 Can You Handle the Truth? Degrees of De-identification of Clinical Research Data By Jeanne M. Mattern Two sets of U.S. government regulations govern the protection of personal

### DATA MINING - 1DL360

DATA MINING - 1DL360 Fall 2013" An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/per1ht13 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

### Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

### How to De-identify Data. Xulei Shirley Liu Department of Biostatistics Vanderbilt University 03/07/2008

How to De-identify Data Xulei Shirley Liu Department of Biostatistics Vanderbilt University 03/07/2008 1 Outline The problem Brief history The solutions Examples with SAS and R code 2 Background The adoption

### De-identification, defined and explained. Dan Stocker, MBA, MS, QSA Professional Services, Coalfire

De-identification, defined and explained Dan Stocker, MBA, MS, QSA Professional Services, Coalfire Introduction This perspective paper helps organizations understand why de-identification of protected

### Private Record Linkage with Bloom Filters

To appear in: Proceedings of Statistics Canada Symposium 2010 Social Statistics: The Interplay among Censuses, Surveys and Administrative Data Private Record Linkage with Bloom Filters Rainer Schnell,

### FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

### 6 Scalar, Stochastic, Discrete Dynamic Systems

47 6 Scalar, Stochastic, Discrete Dynamic Systems Consider modeling a population of sand-hill cranes in year n by the first-order, deterministic recurrence equation y(n + 1) = Ry(n) where R = 1 + r = 1

### Two General Methods to Reduce Delay and Change of Enumeration Algorithms

ISSN 1346-5597 NII Technical Report Two General Methods to Reduce Delay and Change of Enumeration Algorithms Takeaki Uno NII-2003-004E Apr.2003 Two General Methods to Reduce Delay and Change of Enumeration

Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden yi.cheng@ericsson.com, christian.schaefer@ericsson.com Abstract While big data analytics provides

### Technical Approaches for Protecting Privacy in the PCORnet Distributed Research Network V1.0

Technical Approaches for Protecting Privacy in the PCORnet Distributed Research Network V1.0 Guidance Document Prepared by: PCORnet Data Privacy Task Force Submitted to the PMO Approved by the PMO Submitted

### Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences

Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences Nicoletta Cibella 1, Gervasio-Luis Fernandez 2, Marco Fortini 1, Miguel Guigò 2, Francisco Hernandez 2,

### SESSION DEPENDENT DE-IDENTIFICATION OF ELECTRONIC MEDICAL RECORDS

SESSION DEPENDENT DE-IDENTIFICATION OF ELECTRONIC MEDICAL RECORDS A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Bachelor of Science with Honors Research Distinction in Electrical

### Regression analysis of probability-linked data

Regression analysis of probability-linked data Ray Chambers University of Wollongong James Chipperfield Australian Bureau of Statistics Walter Davis Statistics New Zealand 1 Overview 1. Probability linkage

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

Advanced record linkage methods: scalability, classification, and privacy Peter Christen Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University

### Quality and Complexity Measures for Data Linkage and Deduplication

Quality and Complexity Measures for Data Linkage and Deduplication Peter Christen and Karl Goiser Department of Computer Science, Australian National University, Canberra ACT 0200, Australia {peter.christen,karl.goiser}@anu.edu.au

### Data Driven Approaches to Prescription Medication Outcomes Analysis Using EMR

Data Driven Approaches to Prescription Medication Outcomes Analysis Using EMR Nathan Manwaring University of Utah Masters Project Presentation April 2012 Equation Consulting Who we are Equation Consulting

### Tomislav Križan Consultancy Poslovna Inteligencija d.o.o

In-Situ Anonymization of Big Data Tomislav Križan Consultancy Director @ Poslovna Inteligencija d.o.o Abstract Vast amount of data is being generated from versatile sources and organizations are primarily

### A Statistical Text Mining Method for Patent Analysis

A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, shjun@cju.ac.kr Abstract Most text data from diverse document databases are unsuitable for analytical

### Principles and Best Practices for Sharing Data from Environmental Health Research: Challenges Associated with Data-Sharing: HIPAA De-identification

Principles and Best Practices for Sharing Data from Environmental Health Research: Challenges Associated with Data-Sharing: HIPAA De-identification Daniel C. Barth-Jones, M.P.H., Ph.D Assistant Professor

### Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

CS346: Advanced Databases Alexandra I. Cristea A.I.Cristea@warwick.ac.uk Data Security and Privacy Outline Chapter: Database Security in Elmasri and Navathe (chapter 24, 6 th Edition) Brief overview of

### Mark Elliot October 2014

Final Report on the Disclosure Risk Associated with the Synthetic Data Produced by the SYLLS Team Mark Elliot October 2014 1. Background I have been asked by the SYLLS project, University of Edinburgh

### Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Samarjit Chakraborty Computer Engineering and Networks Laboratory Swiss Federal Institute of Technology (ETH) Zürich March

### Privacy Aspects in Big Data Integration: Challenges and Opportunities

Privacy Aspects in Big Data Integration: Challenges and Opportunities Peter Christen Research School of Computer Science, The Australian National University, Canberra, Australia Contact: peter.christen@anu.edu.au

### Network (Tree) Topology Inference Based on Prüfer Sequence

Network (Tree) Topology Inference Based on Prüfer Sequence C. Vanniarajan and Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology Madras Chennai 600036 vanniarajanc@hcl.in,

### International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

### Comparing Methods to Identify Defect Reports in a Change Management Database

Comparing Methods to Identify Defect Reports in a Change Management Database Elaine J. Weyuker, Thomas J. Ostrand AT&T Labs - Research 180 Park Avenue Florham Park, NJ 07932 (weyuker,ostrand)@research.att.com

### Challenges of Data Privacy in the Era of Big Data. Rebecca C. Steorts, Vishesh Karwa Carnegie Mellon University November 18, 2014

Challenges of Data Privacy in the Era of Big Data Rebecca C. Steorts, Vishesh Karwa Carnegie Mellon University November 18, 2014 1 Outline Why should we care? What is privacy? How do achieve privacy? Big

### ES ET DE LA VIE PRIVÉE E 29 th INTERNATIONAL CONFERENCE OF DATA PROTECTION AND PRIVACY COMMISSIONERS

ES ET DE LA VIE PRIVÉE E 29 th INTERNATIONAL CONFERENCE OF DATA PROTECTION AND PRIVACY COMMISS Data Mining Dr. Bradley A. Malin Assistant Professor Department of Biomedical Informatics Vanderbilt University

### Online Classification on a Budget

Online Classification on a Budget Koby Crammer Computer Sci. & Eng. Hebrew University Jerusalem 91904, Israel kobics@cs.huji.ac.il Jaz Kandola Royal Holloway, University of London Egham, UK jaz@cs.rhul.ac.uk

### December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

### 2. Basic Relational Data Model

2. Basic Relational Data Model 2.1 Introduction Basic concepts of information models, their realisation in databases comprising data objects and object relationships, and their management by DBMS s that

### A Brief Introduction to Property Testing

A Brief Introduction to Property Testing Oded Goldreich Abstract. This short article provides a brief description of the main issues that underly the study of property testing. It is meant to serve as

### Automata and Languages

Automata and Languages Computational Models: An idealized mathematical model of a computer. A computational model may be accurate in some ways but not in others. We shall be defining increasingly powerful

### EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

### Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

### BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 seema@iasri.res.in Genomics A genome is an organism s

### Protein Protein Interaction Networks

Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

### Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

### Compression algorithm for Bayesian network modeling of binary systems

Compression algorithm for Bayesian network modeling of binary systems I. Tien & A. Der Kiureghian University of California, Berkeley ABSTRACT: A Bayesian network (BN) is a useful tool for analyzing the

### 1. LINEAR EQUATIONS. A linear equation in n unknowns x 1, x 2,, x n is an equation of the form

1. LINEAR EQUATIONS A linear equation in n unknowns x 1, x 2,, x n is an equation of the form a 1 x 1 + a 2 x 2 + + a n x n = b, where a 1, a 2,..., a n, b are given real numbers. For example, with x and

### Securing Big Data Learning and Differences from Cloud Security

Securing Big Data Learning and Differences from Cloud Security Samir Saklikar RSA, The Security Division of EMC Session ID: DAS-108 Session Classification: Advanced Agenda Cloud Computing & Big Data Similarities

### Email Alias Detection Using Social Network Analysis

Email Alias Detection Using Social Network Analysis Ralf Hölzer Information Networking Institute Carnegie Mellon University Pittsburgh, PA 1513 rholzer@cmu.edu Bradley Malin School of Computer Science

### Do you know? "7 Practices" for a Reliable Requirements Management. by Software Process Engineering Inc. translated by Sparx Systems Japan Co., Ltd.

Do you know? "7 Practices" for a Reliable Requirements Management by Software Process Engineering Inc. translated by Sparx Systems Japan Co., Ltd. In this white paper, we focus on the "Requirements Management,"

### SART Data-Sharing Manual USAID Ed Strategy 2011 2015 Goal 1

SART Data-Sharing Manual USAID Ed Strategy 2011 2015 Goal 1 This manual was developed for USAID s Secondary Analysis for Results Tracking (SART) project by Optimal Solutions Group, LLC (Prime Contractor),

### Machine Learning Logistic Regression

Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

### A very brief introduction to genetic algorithms

A very brief introduction to genetic algorithms Radoslav Harman Design of experiments seminar FACULTY OF MATHEMATICS, PHYSICS AND INFORMATICS COMENIUS UNIVERSITY IN BRATISLAVA 25.2.2013 Optimization problems:

### RE-IDENTIFICATION RISK IN SWAPPED MICRODATA RELEASE

RE-IDENTIFICATION RISK IN SWAPPED MICRODATA RELEASE Krish Muralidhar, Department of Marketing and Supply Chain Management, University of Oklahoma, Norman, OK 73019, (405)-325-2677, krishm@ou.edu ABSTRACT

### Risk Knowledge Capture in the Riskit Method

Risk Knowledge Capture in the Riskit Method Jyrki Kontio and Victor R. Basili jyrki.kontio@ntc.nokia.com / basili@cs.umd.edu University of Maryland Department of Computer Science A.V.Williams Building

### A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms

A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms Ian Davidson 1st author's affiliation 1st line of address 2nd line of address Telephone number, incl country code 1st

### Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like

### Data, Measurements, Features

Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

### Online Failure Prediction in Cloud Datacenters

Online Failure Prediction in Cloud Datacenters Yukihiro Watanabe Yasuhide Matsumoto Once failures occur in a cloud datacenter accommodating a large number of virtual resources, they tend to spread rapidly

### Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

### Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research

Algorithmic Techniques for Big Data Analysis Barna Saha AT&T Lab-Research Challenges of Big Data VOLUME Large amount of data VELOCITY Needs to be analyzed quickly VARIETY Different types of structured

### Protecting Respondents' Identities in Microdata Release

1010 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 13, NO. 6, NOVEMBER/DECEMBER 2001 Protecting Respondents' Identities in Microdata Release Pierangela Samarati, Member, IEEE Computer Society

### Database security. André Zúquete Security 1. Advantages of using databases. Shared access Many users use one common, centralized data set

Database security André Zúquete Security 1 Advantages of using databases Shared access Many users use one common, centralized data set Minimal redundancy Individual users do not have to collect and maintain

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### Notes on Complexity Theory Last updated: August, 2011. Lecture 1

Notes on Complexity Theory Last updated: August, 2011 Jonathan Katz Lecture 1 1 Turing Machines I assume that most students have encountered Turing machines before. (Students who have not may want to look

### The Basics of Graphical Models

The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

### Time-Series Prediction with Applications to Traffic and Moving Objects Databases

Time-Series Prediction with Applications to Traffic and Moving Objects Databases Bo Xu Department of Computer Science University of Illinois at Chicago Chicago, IL 60607, USA boxu@cs.uic.edu Ouri Wolfson

### Elements of probability theory

The role of probability theory in statistics We collect data so as to provide evidentiary support for answers we give to our many questions about the world (and in our particular case, about the business

### Chapter 6: Episode discovery process

Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing

### DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

### Lecture 3: Linear Programming Relaxations and Rounding

Lecture 3: Linear Programming Relaxations and Rounding 1 Approximation Algorithms and Linear Relaxations For the time being, suppose we have a minimization problem. Many times, the problem at hand can

### Li Xiong, Emory University

Healthcare Industry Skills Innovation Award Proposal Hippocratic Database Technology Li Xiong, Emory University I propose to design and develop a course focused on the values and principles of the Hippocratic

### Linear Codes. Chapter 3. 3.1 Basics

Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

### Formal Methods for Preserving Privacy for Big Data Extraction Software

Formal Methods for Preserving Privacy for Big Data Extraction Software M. Brian Blake and Iman Saleh Abstract University of Miami, Coral Gables, FL Given the inexpensive nature and increasing availability

### Mining Signatures in Healthcare Data Based on Event Sequences and its Applications

Mining Signatures in Healthcare Data Based on Event Sequences and its Applications Siddhanth Gokarapu 1, J. Laxmi Narayana 2 1 Student, Computer Science & Engineering-Department, JNTU Hyderabad India 1

### In developmental genomic regulatory interactions among genes, encoding transcription factors

JOURNAL OF COMPUTATIONAL BIOLOGY Volume 20, Number 6, 2013 # Mary Ann Liebert, Inc. Pp. 419 423 DOI: 10.1089/cmb.2012.0297 Research Articles A New Software Package for Predictive Gene Regulatory Network

### MEASURING DISCLOSURE RISK AND AN EXAMINATION OF THE POSSIBILITIES OF USING SYNTHETIC DATA IN THE INDIVIDUAL INCOME TAX RETURN PUBLIC USE FILE

MEASURING DISCLOSURE RISK AND AN EXAMINATION OF THE POSSIBILITIES OF USING SYNTHETIC DATA IN THE INDIVIDUAL INCOME TAX RETURN PUBLIC USE FILE Sonya Vartivarian and John L. Czajka,ÃMathematica Policy Research,

### Competitive Analysis of On line Randomized Call Control in Cellular Networks

Competitive Analysis of On line Randomized Call Control in Cellular Networks Ioannis Caragiannis Christos Kaklamanis Evi Papaioannou Abstract In this paper we address an important communication issue arising

### INTRODUCTION TO CODING THEORY: BASIC CODES AND SHANNON S THEOREM

INTRODUCTION TO CODING THEORY: BASIC CODES AND SHANNON S THEOREM SIDDHARTHA BISWAS Abstract. Coding theory originated in the late 1940 s and took its roots in engineering. However, it has developed and

### The Trip Scheduling Problem

The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems

### DATA MINING - 1DL105, 1DL025

DATA MINING - 1DL105, 1DL025 Fall 2009 An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/ht09 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

### Establishing How Many VoIP Calls a Wireless LAN Can Support Without Performance Degradation

Establishing How Many VoIP Calls a Wireless LAN Can Support Without Performance Degradation ABSTRACT Ángel Cuevas Rumín Universidad Carlos III de Madrid Department of Telematic Engineering Ph.D Student

### GRADING. SCHEDULE (Tentative and Subject to Change) experience with, security principles is NOT a prerequisite for this course.

Data Privacy in Biomedicine (BMIF-380 / CS-396) Instructor: Bradley Malin Semester: Spring Website: http://people.vanderbilt.edu/~b.malin/bmif380/index.html DESCRIPTION The integration of information technology

### Review of Computer Engineering Research WEB PAGES CATEGORIZATION BASED ON CLASSIFICATION & OUTLIER ANALYSIS THROUGH FSVM. Geeta R.B.* Shobha R.B.

Review of Computer Engineering Research journal homepage: http://www.pakinsight.com/?ic=journal&journal=76 WEB PAGES CATEGORIZATION BASED ON CLASSIFICATION & OUTLIER ANALYSIS THROUGH FSVM Geeta R.B.* Department

### Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set

### SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

### The Health Insurance Portability and Accountability Act (HIPAA) Excerpted from the UTC IRB Policy

The Health Insurance Portability and Accountability Act (HIPAA) Excerpted from the UTC IRB Policy June 2008 Table of Contents PART V: The Health Insurance Portability and Accountability Act (HIPAA)...

### Solving nonlinear equations in one variable

Chapter Solving nonlinear equations in one variable Contents.1 Bisection method............................. 7. Fixed point iteration........................... 8.3 Newton-Raphson method........................

### Computational Disclosure Control A Primer on Data Privacy Protection

A Primer on Data Privacy Protection by Latanya Sweeney i Table of Contents Chapter 0 Preface... 13 0.1 Description of work... 13 0.1.1 Computational disclosure control... 13 0.1.2 Contributions of this

### Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

### 2.1 Complexity Classes

15-859(M): Randomized Algorithms Lecturer: Shuchi Chawla Topic: Complexity classes, Identity checking Date: September 15, 2004 Scribe: Andrew Gilpin 2.1 Complexity Classes In this lecture we will look

### Security in Outsourcing of Association Rule Mining

Security in Outsourcing of Association Rule Mining W. K. Wong The University of Hong Kong wkwong2@cs.hku.hk David W. Cheung The University of Hong Kong dcheung@cs.hku.hk Ben Kao The University of Hong

### Analysis of Uncertain Data: Tools for Representation and Processing

Analysis of Uncertain Data: Tools for Representation and Processing Bin Fu binf@cs.cmu.edu Eugene Fink e.fink@cs.cmu.edu Jaime G. Carbonell jgc@cs.cmu.edu Abstract We present initial work on a general-purpose

### Integrating Pattern Mining in Relational Databases

Integrating Pattern Mining in Relational Databases Toon Calders, Bart Goethals, and Adriana Prado University of Antwerp, Belgium {toon.calders, bart.goethals, adriana.prado}@ua.ac.be Abstract. Almost a

### Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Privacy Policy Effective Date: November 20, 2014 Welcome to the American Born Moonshine website (this Site ). This policy describes the Privacy Policy (this Policy ) for this Site and describes how Windy