Privacy Preserving Health Data Publishing using Secure Two Party Algorithm

IJIRST - International Journal for Innovative Research in Science & Technology, Volume 2, Issue 1, June 2015, ISSN (online): 2349-6010

Sameera K M, Assistant Professor; Meera R Nair, Student; Aparna Vinayan, Student; Shalini Balakrishnan, Student

Abstract - In this paper, we address the problem of private data publishing, where different attributes for the same set of individuals are held by two parties. Privacy-preserving data publishing addresses the problem of disclosing sensitive data when mining for useful information. To achieve this, we use two systems, a hospital and an insurance company, to which the two-party algorithm is applied to produce the resulting shared dataset. In our experiments, the results are compared with the k-anonymity algorithm and found to be more secure.

Keywords: Differential Privacy, secure data integration, secure data publishing

I. INTRODUCTION

The research topic of privacy-preserving data publishing has received a lot of attention in different research communities, from its economic implications to anonymization algorithms. Huge databases exist today due to rapid advances in communication and storage systems, each owned by a particular autonomous entity. Moreover, the emergence of new paradigms such as cloud computing increases the amount of data distributed between multiple entities. These distributed data can be integrated to enable better data analysis for making better decisions and providing higher-quality services. For example, data can be integrated to improve medical research, customer service, or homeland security. However, data integration between autonomous entities should be conducted in such a way that no more information than necessary is revealed between the participating entities.
The field of medical research and health data publishing requires compliance with health regulatory bodies and their rules by health information custodians, who are responsible for sharing electronic health records for health data mining and clinical research. Health records are by their nature very sensitive, and sharing even de-identified records may raise issues of patient privacy breach. Data privacy breach incidents not only damage the public reputation of the health service providers involved but can also result in civil lawsuits from patients claiming compensation. In the United States of America, the Health Insurance Portability and Accountability Act (HIPAA) requires patient consent before the disclosure of health information between health service providers. The Health Information Technology for Economic and Clinical Health (HITECH) Act builds on the HIPAA Act of 1996 to strengthen the privacy and security rules: it augments an individual's privacy protections, expands individuals' rights to their health information, and revises the penalties applied to each HIPAA violation category for healthcare data breaches.

In this paper, we propose an algorithm to securely integrate person-specific sensitive data from two data providers such that the integrated data still retain the essential information for supporting data mining tasks. A health information custodian (HIC) wants to share a person-specific data table with a health data miner, such as a medical practitioner or a health insurance company, for research purposes. A person-specific dataset for classification analysis typically contains four types of attributes: the explicit identifiers, the quasi-identifier (QID), the sensitive attribute, and the class attribute. Explicit identifiers (such as name, social security number, and telephone number) uniquely identify a person. The QID (such as birth date, sex, race, and postal code) is a set of attributes whose values may not be unique individually, but whose combination may reveal the identity of an individual. Sensitive attributes (such as disease, salary, and marital status) contain sensitive information about an individual. The class attribute is the attribute on which the health data miner wants to perform classification analysis. Let D(A1, ..., An, Sens, Class) be a data table with explicit identifiers removed, where {A1, ..., An} are quasi-identifier attributes that can be either categorical or numerical, Sens is a sensitive attribute, and Class is a class attribute.
All rights reserved by www.ijirst.org 33
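The four attribute types above determine how each column is treated before release. A minimal sketch (the column names and the sample record are illustrative assumptions) that strips explicit identifiers and keeps the record form (v1, ..., vn, s, cls):

```python
# Partition a record's attributes into the four types described above.
# The attribute lists below are hypothetical, not taken from the paper's dataset.
EXPLICIT_IDS = {"name", "ssn", "phone"}          # removed before release
QID = ["birth_date", "sex", "postal_code"]       # quasi-identifier attributes
SENSITIVE = "disease"                            # sensitive attribute (Sens)
CLASS = "class"                                  # class attribute (Class)

def to_release_form(record):
    """Drop explicit identifiers; return the tuple (v1, ..., vn, s, cls)."""
    qid_values = tuple(record[a] for a in QID)
    return qid_values + (record[SENSITIVE], record[CLASS])

record = {"name": "Jane Doe", "ssn": "123-45-6789", "phone": "555-0100",
          "birth_date": "1980-03-01", "sex": "F", "postal_code": "47677",
          "disease": "asthma", "class": "Y"}
print(to_release_form(record))
# ('1980-03-01', 'F', '47677', 'asthma', 'Y')
```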

A record in D has the form (v1, v2, ..., vn, s, cls), where vi is a value of Ai, s is a sensitive value of Sens, and cls is a class value of Class.

II. RELATED WORK

Data privacy has been an active research topic in the statistics, database, and security communities for the last three decades. An existing approach is k-anonymity.

A. K-Anonymity

k-anonymity is a property possessed by certain anonymized data. Given person-specific, field-structured data, the goal is to produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified, while the data remain practically useful. A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.

B. Goal of K-Anonymity

Each record is indistinguishable from at least k-1 other records; these k records form an equivalence class. (The original figure here showed a generalization example over the attributes ZIP CODE, AGE, and SEX: ZIP codes such as 47677, 47602, 47678 generalized to 476**, ages 29, 22, 27 generalized to 2*, and MALE/FEMALE generalized to *.)
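Whether a release satisfies k-anonymity can be verified directly from this definition: group the records by their QID values and check that every group contains at least k records. A minimal sketch over hypothetical, already-generalized records:

```python
from collections import Counter

def satisfies_k_anonymity(records, qid, k):
    """True iff every combination of QID values occurs at least k times."""
    groups = Counter(tuple(r[a] for a in qid) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical generalized release: ZIP truncated to 476**, age bucketed to 2*.
release = [
    {"zip": "476**", "age": "2*", "sex": "*"},
    {"zip": "476**", "age": "2*", "sex": "*"},
    {"zip": "476**", "age": "2*", "sex": "*"},
    {"zip": "479**", "age": "3*", "sex": "*"},  # singleton group breaks k >= 2
]
print(satisfies_k_anonymity(release, ["zip", "age", "sex"], k=2))       # False
print(satisfies_k_anonymity(release[:3], ["zip", "age", "sex"], k=3))   # True
```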

Let D(A1, ..., An) be a table and QID the quasi-identifier associated with it. D satisfies k-anonymity if and only if each record on QID in D appears with at least k-1 other records in D. k-anonymity does not provide privacy if the sensitive values in an equivalence class lack diversity, so it is subject to the attribute linkage attack. Furthermore, due to the curse of high dimensionality as discussed, enforcing k-anonymity on high-dimensional data would result in significant information loss.

Id   Name   Age   Sex   Zip     Occupation   Disease
10*  ****   >40   M     300**   P            migraine
10*  ****   <40   F     300**   P            hiv
10*  ****   >40   F     300**   A            asthma
10*  ****   >40   M     300**   P            migraine
10*  ****   <40   M     300**   A            migraine
11*  ****   >40   F     300**   A            asthma
11*  ****   <40   M     300**   P            hiv
11*  ****   <40   F     300**   A            asthma
11*  ****   <40   F     300**   P            migraine
11*  ****   >40   M     300**   A            hiv

III. SECURE TWO PARTY ALGORITHM

Differential privacy aims to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying their records. Differential privacy is a recent privacy definition that provides a strong privacy guarantee: an adversary learns nothing more about an individual, regardless of whether her record is present or absent in the data. A standard mechanism to achieve differential privacy is to add random noise to the true output of a function. The noise is calibrated according to the sensitivity of the function, which is the maximum difference of its outputs on two data sets that differ in only one record.

A. Two-Party Algorithm

In this section, we present our distributed differentially private anonymization algorithm based on generalization (DistDiffGen) for two parties, shown below. The algorithm first generalizes the raw data and then adds noise to achieve Ɛ-differential privacy.

Algorithm 1: Two-Party Algorithm (DistDiffGen)
Input: Raw data set D1, privacy budget Ɛ, and number of specializations h
Output: Anonymized data set D
1) Initialize Dg with one record containing the topmost values;
2) Initialize Cuti to include the topmost value;
3) Ɛ' ← Ɛ / (2(|A| + 2h));
4) Determine the split value for each vn ∈ ∪Cuti with probability ∝ exp((Ɛ' / 2Δu) u(D, vn));
5) Compute the score for each v ∈ ∪Cuti;
6) for l = 1 to h do
7)    Determine the winner candidate w;
8)    if w is local then
9)       Specialize w on Dg;
10)      Replace w with child(w) in the local copy of ∪Cuti;
11)      Instruct P2 to specialize w and update ∪Cuti;
12)      Determine the split value for each new vn ∈ ∪Cuti with probability ∝ exp((Ɛ' / 2Δu) u(D, vn));
13)      Compute the score for each new v ∈ ∪Cuti;
14)   else
15)      Wait for the instruction from P2;
16)      Specialize w and update ∪Cuti using the instruction;
17)   end if
18) end for
19) for each leaf node of Dg do
20)   Execute the SSPP protocol to compute the shares C1 and C2 of the true count C;
21)   Generate two Gaussian random variables Yi ~ N(0, 1/Ɛ) for i ∈ {1, 2};
22)   Compute X1 = C1 + Y1² − Y2²;
23)   Exchange X1 with P2 to compute C + Lap(2/Ɛ);
24) end for
25) return each leaf node with its noisy count C + Lap(2/Ɛ)
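Steps 20-23 rely on a standard trick: if Y1, ..., Y4 are independent Gaussians with variance 1/Ɛ, then Y1² + Y2² − Y3² − Y4² is distributed as Lap(2/Ɛ), so each party can add its own Y² terms to its additive share and the exchanged sum carries exactly the required Laplace noise without either party seeing the other's true share. A single-process simulation of this step (the SSPP share computation is not shown; the shares and budget are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1.0                        # illustrative privacy budget for the leaf counts
sigma = np.sqrt(1.0 / eps)       # each Yi ~ N(0, 1/eps), i.e. variance 1/eps

def party_noise():
    """One party's contribution: Y1^2 - Y2^2 with Yi ~ N(0, 1/eps)."""
    y1, y2 = rng.normal(0.0, sigma, size=2)
    return y1 * y1 - y2 * y2

# Additive shares of a hypothetical true leaf count C = 42 (as produced by SSPP).
c1, c2 = 30.0, 12.0
x1 = c1 + party_noise()          # P1 masks its share and sends X1
x2 = c2 + party_noise()          # P2 does the same
noisy_count = x1 + x2            # distributed as C + Lap(2/eps)

# Sanity check: combined noise should have the variance of Lap(2/eps) = 2*(2/eps)^2.
ys = rng.normal(0.0, sigma, size=(200_000, 4))
noise = ys[:, 0]**2 + ys[:, 1]**2 - ys[:, 2]**2 - ys[:, 3]**2
print(noise.var())               # close to 8.0 for eps = 1
```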

The general idea is to anonymize the raw data by a sequence of specializations starting from the topmost general state. A specialization, written v → child(v), where child(v) denotes the set of child values of v, replaces the parent value v with its child values. The specialization process can be viewed as pushing the cut of each taxonomy tree downwards. A cut of the taxonomy tree for an attribute Ai, denoted by Cuti, contains exactly one value on each root-to-leaf path. The specialization starts from the topmost cut and pushes the cut down iteratively by specializing a value in the current cut. The algorithm is executed by party P1 (and symmetrically by party P2) and can be summarized as follows.

Generalizing raw data. Each party keeps a copy of the current ∪Cuti and a generalized table Dg, in addition to its private table D1 or D2. Here, ∪Cuti is the set of all candidate values for specialization. Initially, all values in each attribute are generalized to the topmost value in their taxonomy trees, and Cuti contains the topmost value for each attribute. At each iteration, the algorithm uses the distributed exponential mechanism to select a candidate w ∈ ∪Cuti, owned by either P1 or P2, for specialization. Candidates are selected based on their score values, and different utility functions can be used to determine the scores of the candidates. Once a winner candidate is determined, both parties specialize the winner w on Dg by splitting their records into child partitions according to the provided taxonomy trees. If the winner w is one of P1's candidates, P1 specializes w on Dg, updates its local copy of ∪Cuti, and instructs P2 to specialize and update its local copy of ∪Cuti accordingly. P1 also calculates the scores of the new candidates arising from the specialization. If the winner w is not one of P1's candidates, P1 waits for an instruction from P2 to specialize w and to update its local copy of ∪Cuti. This process is repeated h times, the number of specializations.

The algorithm performs exactly the same sequence of operations as the single-party algorithm DiffGen, but in a distributed setting. DiffGen is Ɛ-differentially private; therefore, we prove the correctness of our algorithm by proving only the steps that differ from DiffGen. Candidate selection: the algorithm selects a candidate for specialization using the exponential mechanism correctly; therefore, the candidate selection step guarantees Ɛ'-differential privacy. Updating the tree Dg and ∪Cuti: each party has its own copy of Dg and ∪Cuti and updates them exactly as in DiffGen, either from local information or from the instruction provided by the other party. Computing the noisy count: the algorithm also outputs the noisy count of each leaf node, where the noise is equal to Lap(2/Ɛ); thus, this step guarantees (Ɛ/2)-differential privacy. In summary, the algorithm uses half of the privacy budget to generalize the data, where each individual operation is Ɛ'-differentially private, and uses the remaining half to ensure overall Ɛ-differential privacy.

As an example, we consider two tables, one for the hospital and one for the insurance company. The hospital and insurance databases contain the data tables given in Table 1 and Table 2, respectively, holding the data that need to be shared. We apply our algorithm to these tables to obtain the anonymous data table, which contains the shared fields without revealing the sensitive information. The Ɛ-differential privacy algorithm is thus shown to provide a higher level of security and performance. A major difference between the algorithms is that k-anonymity simply masks part of the sensitive fields in an attempt to provide privacy, which leaves scope for comparing and guessing the contents, whereas the Ɛ-differential privacy algorithm hides such fields entirely, giving higher security.

Fig. 2: Generalized data table (Dg). The distributed exponential mechanism is used for specializing the predictor attributes in a top-down manner using half of the privacy budget; Laplace noise is added at the leaf nodes to the true counts using the second half of the privacy budget to ensure an overall Ɛ-differentially private output.
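The candidate selection step can be sketched with the (single-party) exponential mechanism: each candidate v is chosen with probability proportional to exp(Ɛ'·u(v) / 2Δu), for a utility function u with sensitivity Δu. A minimal sketch, assuming Δu = 1 and hypothetical information-gain-style scores:

```python
import math
import random

def exponential_mechanism(candidates, utility, eps_prime, sensitivity=1.0, rng=random):
    """Pick a candidate with probability proportional to exp(eps' * u / (2 * du))."""
    weights = [math.exp(eps_prime * utility[c] / (2.0 * sensitivity)) for c in candidates]
    r = rng.random() * sum(weights)
    for c, w in zip(candidates, weights):
        r -= w
        if r <= 0:
            return c
    return candidates[-1]   # guard against floating-point rounding

# Hypothetical specialization candidates with illustrative utility scores.
scores = {"Occupation": 5.0, "Age": 3.0, "Sex": 1.0}
random.seed(1)
picks = [exponential_mechanism(list(scores), scores, eps_prime=1.0) for _ in range(10_000)]
# Higher-scoring candidates win more often, but low scorers keep a nonzero chance,
# which is what makes the selection differentially private.
print(picks.count("Occupation") > picks.count("Age") > picks.count("Sex"))
```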

IV. EXPERIMENTS

To evaluate the impact on classification quality, we applied the algorithm to the following dataset. It was observed that the algorithm offers a high degree of privacy, as seen in the result set. The algorithm is implemented in parallel at both parties using TCP. A set of predicates is chosen, based on which the sharing is performed. At the insurance party, we choose salary ([18-99]); at the hospital, we choose job (Any_Job) and sex (Any_Sex). We anonymize the data based on these predicates. For Any_Sex, there are two possible values, Male and Female. Any_Job is classified into Professional and Artist. Salary is generalized to the class 18-99 thousand. The anonymous data table is formed based on these predicates, excluding any sensitive fields of the data table.

Table 1: Hospital data
Slno  Pname               Age  Occupation  Sex  Disease   Class
101   Adarsh Rai          29   Doctor      M    migraine  N
102   Geetha Mehra        38   Cleaner     F    hiv       Y
103   Govind Ram          64   Welder      M    asthma    N
104   Jennifer Sarah      38   Painter     F    hiv       N
105   Hafiz Ali           56   Painter     M    migraine  N
106   Arpita Roy          24   Lawyer      F    migraine  N
107   Teena Thomas        36   Cleaner     F    hiv       Y
108   Rishikesh Mehta     61   Lawyer      M    asthma    Y
109   Irene D'cruz        39   Painter     F    hiv       N
110   Aditya Sharma       24   Technician  M    asthma    N
111   Sathish Shekhar     52   Painter     M    hiv       N
112   Smitha Abraham      41   Lawyer      F    asthma    N
113   Karthik Keshav      28   Lawyer      M    migraine  Y
114   Bimal Kumar         37   Cleaner     M    hiv       N
115   Shekhar Nair        66   Welder      M    asthma    N
116   Poornima Mohan      36   Painter     F    hiv       Y
117   Rajesh Roy          44   Painter     M    hiv       N
118   Ujjual Kumar        30   Lawyer      M    migraine  N
119   Sudha Prabhakaran   82   Cleaner     F    asthma    Y
120   Purushothaman P     71   Technician  M    hiv       Y

Table 2: Insurance data
Id   Name                Salary  Occupation  Sex  Class
101  Adarsh Rai          100000  Doctor      M    N
102  Geetha Mehra        15000   Cleaner     F    Y
103  Govind Ram          10000   Welder      M    Y
104  Jennifer Sarah      25000   Painter     F    N
105  Hafiz Ali           18000   Painter     M    N
106  Arpita Roy          30000   Lawyer      F    N
107  Teena Thomas        21000   Cleaner     F    Y
108  Rishikesh Mehta     95000   Lawyer      M    Y
109  Irene D'cruz        20000   Painter     F    Y
110  Aditya Sharma       50000   Technician  M    N
111  Sathish Shekhar     45000   Painter     M    Y
112  Smitha Abraham      60000   Lawyer      F    Y
113  Karthik Keshav      32000   Lawyer      M    N
114  Bimal Kumar         20000   Cleaner     M    Y
115  Shekhar Nair        8000    Welder      M    Y
116  Poornima Mohan      17500   Painter     F    Y
117  Rajesh Roy          45000   Painter     M    N
118  Ujjual Kumar        50000   Lawyer      M    Y
119  Sudha Prabhakaran   5000    Cleaner     F    Y
120  Purushothaman P     10000   Technician  M    Y

It can be observed from the anonymous table that the data has been published in a secure way. None of the private details of the patients, such as name and place, have been disclosed. Also, a distinguishing feature such as occupation has been classified into two classes, Professional and Artist, so the shared data is also masked to provide anonymity and privacy.
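The predicate-based anonymization described above amounts to mapping each attribute up its taxonomy tree. A sketch of this generalization step (which concrete occupations fall under Professional versus Artist is an illustrative assumption; the paper only names the two classes):

```python
# Hypothetical taxonomy: the assignment of occupations to the classes
# Professional ("P") and Artist ("A") is an illustrative assumption.
OCCUPATION_TAXONOMY = {
    "Doctor": "P", "Lawyer": "P", "Technician": "P",
    "Cleaner": "A", "Welder": "A", "Painter": "A",
}

def generalize(record):
    """Replace raw values with their generalized classes, dropping identifiers."""
    return {
        "occupation": OCCUPATION_TAXONOMY[record["occupation"]],
        "sex": record["sex"],
        "salary": "10-110",   # every salary generalized to one range (in thousands)
    }

row = {"id": 101, "name": "Adarsh Rai", "salary": 100, "occupation": "Doctor", "sex": "M"}
print(generalize(row))  # {'occupation': 'P', 'sex': 'M', 'salary': '10-110'}
```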

Table 3: Anonymized shared data
Occupation  Sex  Salary  Count
P           M    10-110  4
P           F    10-110  8
A           M    10-110  4
A           F    10-110  4

V. PERFORMANCE ANALYSIS

Here, we compare the algorithm with the existing k-anonymity algorithm to evaluate its efficiency. We can see that the two-party algorithm provides higher security than the k-anonymity algorithm.