Globe Tech, Inc. 76 Northeastern Blvd., Suite #30B Nashua, NH Fax PrivGuard an eprivacy Solution


Protecting Private Healthcare Information (PHI)

As a result of the widespread use of electronic health records (EHR), recent years have seen an explosion of digital patient data generated and collected by health-care organizations. In tandem with this unprecedented growth of digital data, data-mining techniques have gained popularity in a wide variety of domains. While the health-care industry has benefited from information sharing and data mining, patients are increasingly concerned about the invasion of their privacy by these practices. The public and civil libertarian groups have likewise been concerned about privacy protection as EHR becomes mainstream and increasingly popular with medical professionals and patients. These growing privacy concerns led to the passage of the Health Insurance Portability and Accountability Act (HIPAA).

HIPAA

HIPAA is designed to give patients more control over their personal medical information. It explicitly outlines how medical records can be given to third parties and carries stiff penalties for violations. The impact of HIPAA on medical research is beginning to surface in the research community, with some researchers fearing that it could jeopardize studies of drug safety, medical device validation, and disease prediction and prevention. While HIPAA was intended to protect patient privacy, it has a significant impact on medical studies that collect data from a variety of health-care organizations. Because HIPAA guidelines are so cumbersome and the penalties for violations so steep, many organizations, particularly small community hospitals and clinics, may decide it is safer and easier not to provide data for medical research. Because of this concern, the Association of American Medical Colleges plans to compile a database so it can document the effect of HIPAA on research activities.
PrivGuard was developed to address these concerns and to assist healthcare providers with HIPAA compliance. PrivGuard automates the process of quickly de-identifying and masking sensitive data while preserving overall data integrity, which permits high-quality data mining and research analysis. PrivGuard grew out of years of innovative research and integrates various techniques, such as decision trees, linear programming, Bayes estimation, kd-trees, and data masking, applying them to help protect patient privacy. PrivGuard's broader implication is that it allows safe sharing of patient data across health-care organizational boundaries while satisfying compliance requirements and providing quality data to analysts for data-mining research that benefits both medical research and society at large.

The PrivGuard Solution

The PrivGuard system provides several data masking algorithms. Data masking is very different from encryption: it does not change the data via ciphering, nor does it require any keys or digital certificates. Instead, data masking changes the data values using noise perturbation, data aggregation, or data swapping. The statistical properties of the data are generally maintained after masking, so the masked data remain usable for statistical analysis and data-mining research. Data masking is not as resource intensive as encryption. It is used to preserve the privacy of data before sharing with external organizations, whereas encryption is more useful for protecting data during transmission.
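To illustrate how masking can change individual values while keeping a statistical property intact, here is a minimal sketch of data aggregation, one of the three approaches named in The PrivGuard Solution. It assumes a simple scheme in which values are sorted, split into consecutive groups, and replaced by their group averages. The function name `aggregate_mask` and the `group_size` parameter are hypothetical and are not taken from PrivGuard.

```python
from statistics import mean


def aggregate_mask(values, group_size):
    """Replace each value with the average of its group of sorted neighbours.

    Illustrative only: grouping by sorted order and `group_size` are assumptions.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # Record indices ordered by value, so that similar values share a group.
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        group_mean = mean(values[i] for i in group)
        # Every record in the group gets the same masked value.
        for i in group:
            masked[i] = group_mean
    return masked


ages = [34, 71, 29, 65, 42, 58]
masked_ages = aggregate_mask(ages, group_size=2)
print(masked_ages)
print(mean(ages), mean(masked_ages))
```

No individual age survives in the masked column, yet the column mean is unchanged because each group keeps its own total. Larger groups hide more detail at the cost of flattening the distribution.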

Why PrivGuard?

While there are many privacy protection and data masking solutions on the market, PrivGuard is the only application designed to give data owners control over protecting their data from privacy attacks. The solution uses complex masking technology, yet it is simple to use and cost effective to implement. It does not require expensive and complex encryption technology, but it protects patient data and allows data analysts and researchers to do high-quality research and analysis. Here are the key features of PrivGuard:

Empowerment - allows data owners to take control over their data privacy
Powerful Technology - provides a choice of 10 data masking techniques developed from years of research & analysis
Scalable Solution - permits increasing or decreasing the level of masking depending on the security desired
Open Connectivity - ODBC or JDBC support for all databases & file formats
Seamless Integration - with applications, databases and file formats
Open Standards - multi-platform support for O/S (Windows, Unix/Linux & Apple)
Intuitive GUI - data masking tools require minimal user training

Data Masking Techniques used in PrivGuard

As mentioned earlier, PrivGuard uses powerful and flexible masking techniques for a wide variety of data formats. There are two categories of data masking algorithms in PrivGuard. The first set of algorithms focuses on masking numerical data, while the second set focuses on categorical (text or Boolean) data. This document provides a detailed description of the various algorithms available in PrivGuard, where they are useful, and the options or parameters available within each algorithm for increasing the level of data masking or preserving the originality of the data.

A. Numeric Data Masking

i) Simple Noise Perturbation: This is a univariate perturbation technique that adds random noise to the original data. It does not preserve the relationships between attributes when perturbing data.
Noise Type:
    Additive: Add random noise to the data. The noise follows a normal distribution with mean = 0 and a specified variance. Because the mean of the noise is zero, the mean of the data remains approximately the same after the noise is added.
    Multiplicative: Multiply the data values by random noise. The noise follows a normal distribution with mean = 1 and a specified variance. Because the mean of the noise is one, the mean of the data remains approximately the same after multiplication by the noise.
Column List: This option allows you to select the attributes (or fields) for perturbation.
Noise Multiple: This parameter is related to the variance of the noise. The larger the value, the higher the degree of noise in the masked data, which implies a lower disclosure risk but deteriorated data quality in the masked data.

ii) General Additive Data Perturbation (GADP): This multivariate perturbation technique adds random noise to the original data. It attempts to preserve the multivariate distribution of the data. There is no parameter (option) for this technique to control the degree of perturbation. The technique is ideal when the data exactly follow a multivariate normal distribution.
Type:
    GADP: Adds random noise to the data, based on multivariate normal distribution theory.
    Shuffle: This technique is a variant of GADP. With this technique, numeric values are swapped instead of being perturbed by random noise.
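The two Noise Type options of Simple Noise Perturbation described above can be sketched as follows. This is a minimal illustration, not PrivGuard's implementation. The source only says that Noise Multiple is "related to the variance of the noise", so the sketch assumes that the noise standard deviation equals `noise_multiple` times the column's standard deviation for additive noise, and equals `noise_multiple` itself for multiplicative noise. The function name `perturb` and its parameters are hypothetical.

```python
import random
from statistics import mean, pstdev


def perturb(values, noise_type="additive", noise_multiple=0.1, seed=None):
    """Mask one numeric column with normally distributed random noise.

    Assumption: how `noise_multiple` maps to the noise variance is chosen
    here for illustration; the source does not define it precisely.
    """
    rng = random.Random(seed)
    if noise_type == "additive":
        sigma = noise_multiple * pstdev(values)
        # Zero-mean noise keeps the column mean approximately unchanged.
        return [v + rng.gauss(0.0, sigma) for v in values]
    if noise_type == "multiplicative":
        # Noise centred on one keeps the column mean approximately unchanged.
        return [v * rng.gauss(1.0, noise_multiple) for v in values]
    raise ValueError("noise_type must be 'additive' or 'multiplicative'")


incomes = [float(40_000 + 500 * i) for i in range(200)]
additive = perturb(incomes, "additive", noise_multiple=0.1, seed=1)
multiplicative = perturb(incomes, "multiplicative", noise_multiple=0.05, seed=1)
print(round(mean(incomes)), round(mean(additive)), round(mean(multiplicative)))
```

Each column is perturbed on its own, which is why the technique is univariate: nothing in the noise depends on the other attributes of a record, so relationships between attributes are not preserved.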

Column List: To select attributes (or fields) for perturbation.

iii) MicroAggregation: This technique first divides the data into groups using sorting and clustering techniques and then masks the data by replacing original values with group averages. It is a non-parametric approach that does not require any knowledge about the statistical distribution of the original data.
Type:
    Univariate Microaggregation: Group the data for each attribute based on the sorted values of the attribute. The technique does not consider (or preserve) the relationships between attributes. It runs fast for large data sets.
    Multivariate Microaggregation: Group the data for each attribute based on clustering techniques. It attempts to preserve the relationships among all attributes. However, it is slow for large data sets.
Subset Size: The maximum number of records allowed in a group (subset). The larger the value, the higher the degree of masking in the masked data, which implies a lower disclosure risk but deteriorated data quality in the masked data.
Column List: To select attributes for masking.

iv) KD-Tree-Based Masking: This approach first divides the data into groups using kd-tree-based techniques. It then masks the data by replacing original values with group averages or by swapping data within the groups. It is a non-parametric approach that does not require any knowledge about the statistical distribution of the original data. It attempts to preserve the relationships among all attributes. It runs fast for large data sets (significantly faster than multivariate microaggregation).
Subset Size: The maximum number of records allowed in a group (subset). The larger the value, the higher the degree of masking in the masked data, which implies a lower disclosure risk but deteriorated data quality in the masked data.
Column List: To select attributes for masking.

B. Categorical Data Masking

i) Simple Data Swapping: This is a univariate swapping technique that randomly swaps the categorical (text) values of an attribute. It attempts to preserve the frequency distribution of the attribute, but does not consider the dependencies across different attributes.
Swapping Proportion: The proportion of the values in each attribute to be swapped. The larger the proportion, the more records are swapped, which implies a lower disclosure risk but deteriorated data quality in the masked data. Because the swapped values for different attributes may appear in different records, the total proportion of records that have at least one attribute value swapped will normally be larger than this proportion.

ii) Multivariate Data Swapping: A multivariate swapping technique that attempts to preserve the multivariate frequency distributions up to a certain order (see the description of the term order below).
Proportion: The proportion of the values in each attribute to be swapped. As with Simple Data Swapping, the larger the proportion, the more records are swapped, which implies a lower disclosure risk but deteriorated data quality in the masked data.
Order: The number of dimensions (attributes) whose joint distributions are to be preserved.
    Order = 1: To preserve univariate frequency distributions. This is equivalent to Simple Data Swapping.
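Simple Data Swapping, and therefore the Order = 1 case, can be sketched as follows. This is a minimal illustration under stated assumptions, not PrivGuard's implementation: a fraction of the records (the swapping proportion) is picked at random and their values are rotated among themselves, so each selected record receives the value of another selected record. The function name `simple_swap` and the rotation scheme are hypothetical.

```python
import random
from collections import Counter


def simple_swap(column, proportion, seed=None):
    """Swap values among a random subset of records in one categorical column.

    Assumption: swapping is done by rotating the values of the selected
    records by one position within a randomly ordered selection.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)
    k = int(round(proportion * len(column)))
    masked = list(column)
    if k < 2:
        return masked  # fewer than two records selected: nothing to swap with
    chosen = rng.sample(range(len(column)), k)
    # Each chosen position takes the value held by the next chosen position.
    for position, source in zip(chosen, chosen[1:] + chosen[:1]):
        masked[position] = column[source]
    return masked


genders = ["F", "M", "F", "F", "M", "M", "F", "M", "F", "M"]
masked = simple_swap(genders, proportion=0.4, seed=7)
print(masked)
print(Counter(genders) == Counter(masked))
```

Because the masked column is a rearrangement of the original values, the frequency distribution of the attribute is preserved exactly. Applying the function separately to each attribute ignores dependencies across attributes, which is the limitation the higher orders address.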

    Order = 2: To preserve bivariate frequency distributions. Take the life insurance data as an example. There are four categorical attributes: Age (A), Gender (G), Location (L) and Income (I). This technique swaps the data such that the joint counts for each value combination involving the following pairs of attributes are approximately preserved: A&G, A&L, A&I, G&L, G&I, and L&I. For example, the count for {A = & G = Female} will likely remain the same after swapping.
    Order = 3: To preserve trivariate frequency distributions. In the above example, the joint distributions involve the following triples of attributes: A&G&L, A&G&I, A&L&I, and G&L&I.
When the Order is greater than 3, the algorithm becomes very time consuming. Therefore, the algorithm is implemented only up to order 3.

iii) Bayesian-Based Data Swapping: This is a multivariate swapping technique that preserves the multivariate frequency distributions up to any order (see the description of the term order under Multivariate Data Swapping). The attributes are assumed to be conditionally independent (the Naïve Bayes assumption). The algorithm runs faster than Multivariate Data Swapping for higher order requirements. In addition, this technique is optimal in preserving univariate distributions (via a Linear Programming method).
Proportion: The proportion of the values in each attribute to be swapped. As with Simple Data Swapping, the larger the proportion, the more records are swapped, which implies a lower disclosure risk but deteriorated data quality in the masked data.

iv) Decision-Tree-Based Data Swapping: This approach first divides the data into groups using decision-tree-based techniques. It then masks the data by swapping the values within the groups. The attribute subject to masking must be categorical. However, the other attributes may be categorical or numeric, and the technique attempts to preserve the relationships among all attributes (categorical and numeric).
This is a key difference between this technique and both the other categorical data swapping techniques (which require all attributes to be categorical) and KD-Tree-Based Masking (which works for numeric attributes only). The algorithm runs fast for large data sets.
Random Seed: Used in swapping.
Proportion: The proportion of the values in each attribute to be swapped. As with Simple Data Swapping, the larger the proportion, the more records are swapped, which implies a lower disclosure risk but deteriorated data quality in the masked data.
Note: Currently, this algorithm can only mask one attribute at a time. Further work is needed to extend it to masking multiple attributes simultaneously.

PrivGuard Technical References

J.F. Traub, Y. Yemini, and H. Wozniakowski, The statistical security of a statistical database, ACM Transactions on Database Systems, vol. 9, no. 4.
C.K. Liew, U.J. Choi, and C.J. Liew, A data distortion by probability distribution, ACM Transactions on Database Systems, vol. 10, no. 3.
K. Muralidhar, R. Parsa, and R. Sarathy, A general additive data perturbation method for database security, Management Science, vol. 45, no. 10.
K. Muralidhar and R. Sarathy, Data shuffling: A new masking approach for numerical data, Management Science, vol. 52, no. 5, 2006.
D. Defays and P. Nanopoulos, Panels of enterprises and confidentiality: The small aggregates method, Proceedings of Statistics Canada Symposium 92 on Design and Analysis of Longitudinal Surveys, Ottawa, Canada, November.
J. Domingo-Ferrer and J.M. Mateo-Sanz, Practical data-oriented microaggregation for statistical disclosure control, IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 1.
X.-B. Li and S. Sarkar, A tree-based data perturbation approach for privacy-preserving data mining, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 9.
X.-B. Li and S. Sarkar, Protecting privacy against re-identification by record linkage, Proceedings of the 16th Annual Workshop on Information Technologies and Systems (WITS 2006), Milwaukee, WI, 2006.
S.P. Reiss, Practical data-swapping: The first steps, ACM Transactions on Database Systems, vol. 9, no. 1.
X.-B. Li and S. Sarkar, Privacy protection in data mining: A perturbation approach for categorical data, Information Systems Research, vol. 17, no. 3.
X.-B. Li and S. Sarkar, Protecting privacy against classification attacks in data mining, Proceedings of the 15th Annual Workshop on Information Technologies and Systems (WITS 2005), Las Vegas, NV.

Globe Tech, Inc. All Rights Reserved. PrivGuard is a trademark of Globe Tech, Inc. All other trademarks or service marks are the property of their respective owners.