COVER FEATURE

Privacy-Preserving Data Mining Systems

Nan Zhang, University of Texas at Arlington
Wei Zhao, Rensselaer Polytechnic Institute

Although successful in many applications, data mining poses special concerns for private data. An integrated architecture takes a systemic view of the problem, implementing established protocols for data collection, inference control, and information sharing.

Data mining successfully extracts knowledge to support a variety of domains (marketing, weather forecasting, medical diagnosis, and national security), but it is still a challenge to mine certain kinds of data without violating the data owners' privacy [1]. How to mine patients' private data, for example, is an ongoing problem in healthcare applications. In recognition of the growing privacy concern, directives such as the US Health Insurance Portability and Accountability Act (HIPAA) and the European Union Privacy Directive mandate privacy protection for data management and analysis systems.

As data mining becomes more pervasive, such concerns are increasing. Online data collection systems are an example of new applications that threaten individual privacy. Already companies are sharing data mining models to obtain a richer set of data about mutual customers and their buying habits. The computing community must address data mining privacy before data mining techniques become widespread and the threat to private information spirals out of control.

The sticking point is how to protect privacy while preserving the usefulness of data mining results. Much research is under way to address obstacles, but practical privacy-preserving data mining systems are largely in the research and prototyping stages. Many techniques for privacy-preserving data mining concentrate on algorithmic solutions and underlying mathematical tools [2,3] rather than focusing on system issues.
Our goal in investigating privacy preservation issues was to take a systemic view of architectural requirements and design principles and to explore possible solutions that would lead to guidelines for building practical privacy-preserving data mining systems.

Computer, April 2007. Published by the IEEE Computer Society.

FOUNDATIONAL DESIGN

As Figure 1 shows, privacy-preserving data mining usually has multiple steps that translate to a three-tiered architecture.

At the bottom tier are the data providers, the data owners, which are often physically distributed. The data providers submit their private data to the data warehouse server.

This server, which constitutes the middle tier, supports online analytical data processing to facilitate data mining by translating raw data from the data providers into aggregate data that the data mining servers can more quickly process. The data warehouse server stores the data collected in disciplined physical structures, such as a multidimensional data cube, and aggregates and precomputes the data in various forms, such as sum, average, max, and min. In an online survey system, for example, the survey respondents would be data providers who submit their data to the survey analyzer's data warehouse server; an aggregated data point might be the average age of all survey respondents. The aggregated data is more efficient to process than raw data from the providers.

At the top tier are the data mining servers, which perform the actual data mining. In a privacy-preserving data mining system, these servers do not have free access to all data in the data warehouse. In a hospital system, the accounting department can mine patients' financial data, for example, but cannot access patients' medical records. Developing and validating effective rules for the data mining servers' access to the data warehouse is an open research problem [4].

Besides constructing data mining models on its local data warehouse server, a data mining server might share information with data mining servers from other systems. The motivation for this sharing is to build data mining models that span systems. For example, several retail companies might opt to share their local data mining models on customer records to build a global data mining model about consumer behavior that would benefit all the companies. As Figure 1 shows, sharing occurs in the top tier, where each data mining server holds the data mining model of its own system. Thus, sharing means sharing local data mining models rather than raw data.

Minimum necessary design principle

Any design of a privacy-preserving data mining system requires a clear definition of privacy. The common interpretation is that a data point is private if its owner has the right to choose whether or not, to what extent, and for what purpose to disclose the data point to others. In privacy-preserving data mining literature, most authors assume (either implicitly or explicitly) that a data owner generally chooses not to disclose its private data unless data mining requires it. This assumption and the accepted information-privacy definition form the basis of the minimum necessary design principle: In a data mining system, disclosed private information (from one entity to another) should be the minimum necessary for data mining.

"Minimum" in this context is a qualitative, not a quantitative, measure.
Since the quantitative measure of privacy disclosure varies among systems, "minimum" captures the idea that all unnecessary private information (unnecessary in the context of how accurate the data mining results must be) should not be disclosed. Minimum thus means that privacy disclosure is on a need-to-know basis. Many privacy regulations, including HIPAA, mandate this minimum necessary rule.

Figure 1. Basic architecture for privacy-preserving data mining. The architecture typically has three tiers: data providers, which are the data owners; the data warehouse server, which supports online analytical processing; and the data mining servers that perform data mining tasks and share information. The figure shows two such systems (Data Mining System 1 and Data Mining System 2) linked by information sharing between their data mining servers. The challenge is to control private information transmitted among entities without impeding data mining.

Privacy protocols

On the basis of the architecture in Figure 1 and the minimum necessary design principle, we have evolved a basic strategy for building a privacy-preserving data mining system. Central to the strategy are three protocols that govern privacy disclosure among entities:

Data collection protects privacy during data transmission from the data providers to the data warehouse server.
Inference control manages privacy protection between the data warehouse server and data mining servers.
Information sharing controls information shared among the data mining servers in different systems.

Given the minimum necessary rule, a common goal of these protocols is to transmit the minimum private information necessary for data mining from one entity to another to build accurate data mining models. In reality, it is often difficult to build an efficient system that protects private information perfectly. Consequently, there are always tradeoffs between data privacy and data mining model accuracy.
These protocols are based on established methods that the system designer can tailor to particular requirements, choosing the most beneficial tradeoffs. The data collection protocol, for example, can draw from one of two established collection methods, each with its advantages and drawbacks.

DATA COLLECTION PROTOCOL

The data collection protocol lets data providers identify the minimum necessary part of private information (what must be disclosed to build accurate data mining models) and ensures that they transmit only that part of the information to the data warehouse server.

Several requirements shape the data collection protocol. First, it must be scalable, since a data warehouse server can deal with as many as hundreds of thousands of data providers, as in an online survey system. Second, the computational cost to data providers must be small because they have considerably lower computational power than the data warehouse server, and a higher cost could discourage them from participating in data mining. Finally, the protocol must be robust; it must deliver relatively accurate data mining results while protecting data providers' privacy, even if data providers behave erratically. For example, if some data providers in an online survey system deviate from the protocol or submit meaningless data, the data collection protocol must control the influence of such erroneous behavior and ensure that global data mining results remain sufficiently accurate.

Figure 2 shows a data collection protocol taxonomy based on two data collection methods.

Value-based method

With the value-based method [5], a data provider manipulates the value of each data attribute or item independently using one of two approaches. The perturbation-based approach [3] adds noise directly to the original data values, such as changing age 23 to 30 or Texas to California. The aggregation-based approach generalizes data according to the relevant domain hierarchy, such as changing age 23 to an age range or Texas to the US.
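The two value-based approaches can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not part of the article: the uniform noise distribution and its scale, the 10-year bucket width, and the tiny `REGION_HIERARCHY` table. The sketch only covers numeric perturbation; perturbing a categorical value such as a state would need a different noise mechanism.

```python
import random
from statistics import mean

# Hypothetical domain hierarchy; a real system would supply its own.
REGION_HIERARCHY = {"Texas": "US", "California": "US"}

def perturb_age(age, noise_scale, rng):
    """Perturbation-based: add zero-mean uniform noise to a numeric value."""
    return age + rng.uniform(-noise_scale, noise_scale)

def generalize_age(age, width=10):
    """Aggregation-based: replace an exact age with the range containing it."""
    low = (age // width) * width
    return (low, low + width - 1)

def generalize_region(region):
    """Aggregation-based: move one level up the (assumed) region hierarchy."""
    return REGION_HIERARCHY.get(region, "unknown")

rng = random.Random(42)  # fixed seed so the sketch is repeatable
ages = [rng.randint(18, 80) for _ in range(10_000)]
noisy_ages = [perturb_age(a, noise_scale=10, rng=rng) for a in ages]

age_range = generalize_age(23)       # (20, 29) with the assumed width of 10
region = generalize_region("Texas")  # "US"

# Individual values are distorted, but because the noise is zero-mean the
# aggregate (here, the mean age) stays close to the original.
mean_shift = abs(mean(noisy_ages) - mean(ages))
```

The final comparison hints at why the value-based method can still support mining: each submitted value is unreliable on its own, while aggregate statistics over many providers remain usable.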
The perturbation-based approach is highly suitable for arbitrary data, while the aggregation-based approach relies on knowledge of the domain hierarchy but can be effective in guaranteeing the data's anonymity [6]. k-anonymity, for example, means that each perturbed data record is indistinguishable from the perturbed values of at least k - 1 other data records.

The value-based method assumes that it would be difficult, if not impossible, for the data warehouse server to rediscover the original private data from the manipulated values but that the server would still be able to recover the original data distribution from the perturbed data, thereby supporting the construction of accurate data mining models [5].

Figure 2. Data collection protocol taxonomy. A designer can choose which of two methods (value- or dimension-based) and its attendant approaches best serve the design. The value-based method comprises the perturbation-based and aggregation-based approaches; the dimension-based method comprises the blocking-based and projection-based approaches.

Dimension-based method

The dimension-based method is so called because the data to be mined usually has many attributes, or dimensions. The basic idea is to remove part of the private information from the original data by reducing the number of dimensions. The blocking-based approach [3] accomplishes this by truncating some private attributes, which are never released to the data warehouse server. However, this could result in information loss, preventing data mining servers from constructing accurate data mining models. The more complicated projection-based approach [7] overcomes this problem by projecting the original data into a carefully designed, low-dimensional subspace in a way that retains only the minimum information necessary to construct accurate data mining models.

Advantages and drawbacks

Each method and attendant approach has pluses and minuses. The value-based method is independent of the data mining task, which makes it suitable for applications involving multiple data mining tasks or tasks unknown at data collection.
In contrast, the dimension-based method fits better with individual data mining tasks because the information to be retained after dimension reduction usually depends on the particular task.

So far, research has not defined an effective and universally applicable projection-based approach. Even so, the projection-based approach promises strong advantages over value-based methods in terms of the tradeoff between accuracy and privacy protection. Most value-based approaches treat different attributes independently and separately, so at least some attributes that are less necessary for data mining are always disclosed to the data warehouse server to the same extent as other attributes. Indeed, a recent study revealed that, with the perturbation-based randomization approach, the data warehouse server could use privacy intrusion techniques to filter noise from the perturbed data, thereby rediscovering part of the original private data [8]. The projection-based approach avoids this problem by exploiting the relationship among attributes and disclosing only those necessary for data mining.

Guiding data submission can also reduce unnecessary privacy disclosure, enhancing the performance of data perturbation. In earlier work [7], we and colleague Shengquan Wang proposed a guidance-based dimension reduction scheme for dynamic systems, such as online survey systems, in which data providers (survey respondents and so on) join the system and submit their data asynchronously. To guide data providers that have not yet submitted data, the scheme analyzes the data already collected and estimates the attributes necessary for data mining. The system then sends the estimated useful attributes to data providers as guidance. Our work shows that this guidance-based scheme is more effective than approaches without such guidance.

INFERENCE CONTROL PROTOCOL

Protecting private data in the data warehouse server requires controlling the information disclosed to the data mining servers, which is the aim of the inference control protocol. Following the minimum necessary rule, the inference control protocol ensures that the data warehouse server answers the queries necessary for data mining yet minimizes privacy disclosure.

Several requirements drive the inference control protocol's design and implementation. One is the need to block inferences. If a data mining server becomes an adversary, it will try to infer private information from the query answers it has already received. Figure 3 gives an example. Further, the inference control protocol must be efficient enough to satisfy the data warehouse server's required online response time (the time between issuing a query and answering it).
The time that an inference control protocol uses is part of that response time. It must be controlled so that the data warehouse server can maintain its reduced response time. To meet these requirements, inference control protocols must restrict the information included in the query answers so that the data mining server cannot infer private data from received query answers.

Figure 3. Inference that discloses private information.

Item    April      May        June       July       Sum
Book    10         Known      15         Known      Q5 = 25
CD      20         Known      27         Known      Q6 = 47
DVD     Known      ...        ...        ...        Q7 = 87
Game    Known      25         Known      14         Q8 = 39
Sum     Q1 = 30    Q2 = 60    Q3 = 58    Q4 = 50

If the data mining server becomes an adversary, it might be able to infer from the query answers and certain cells (Known) the number of DVDs a data provider sold in June (which is private and should not be disclosed) by computing $Q_1 + Q_3 - (Q_5 + Q_6) = 30 + 58 - (25 + 47) = 16$, where $Q_1$ to $Q_8$ are query answers.

Figure 4 shows an inference control protocol taxonomy based on two inference control methods.

Query-oriented method

The query-oriented method [4] is centered on the concept of a safe query set: a query set $\langle Q_1, Q_2, \ldots, Q_n \rangle$ is safe if a data mining server cannot infer private data from the answers to $Q_1, Q_2, \ldots, Q_n$. Thus, query-oriented inference control means that when the data warehouse server receives a query, it will answer the query only if the union of the query history (the set of all queries already answered) and the newly received query is safe. Otherwise, it will reject the query.

Relative to query-oriented inference control in statistical databases, inference control in data warehouses involves significantly more data. Consequently, the burden is on inference control protocols to process queries more efficiently. Because dynamically determining a query set's safety (online query history check) can be time-consuming, a static version of the query-oriented method might be more suitable.
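The dynamic answer-or-reject loop described above can be sketched in Python under one simplifying assumption: every query is a plain sum over private cells, and a query set counts as unsafe exactly when some single private cell can be derived as a linear combination of the answered sums. This is a stand-in safety test chosen for illustration, not the method of reference [4]; the class `QueryAuditor`, the helper names, and the cell labels (modeled loosely on Figure 3, with Known cells assumed to be already subtracted out) are hypothetical.

```python
from fractions import Fraction

def rref(rows):
    """Reduced row echelon form of a list of rows, using exact arithmetic."""
    m = [[Fraction(x) for x in r] for r in rows]
    pivot_row = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        if pivot_row == len(m):
            break
        pr = next((r for r in range(pivot_row, len(m)) if m[r][col] != 0), None)
        if pr is None:
            continue
        m[pivot_row], m[pr] = m[pr], m[pivot_row]
        pivot = m[pivot_row][col]
        m[pivot_row] = [x / pivot for x in m[pivot_row]]
        for r in range(len(m)):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
    return m[:pivot_row]  # only the nonzero rows

def is_safe(query_vectors):
    """Unsafe if some reduced row isolates exactly one private cell."""
    return all(sum(1 for x in row if x != 0) != 1 for row in rref(query_vectors))

class QueryAuditor:
    """Answers a sum query only if history plus the new query stays safe."""
    def __init__(self, cells):
        self.cells = list(cells)
        self.history = []

    def ask(self, query_cells):
        vec = [1 if c in query_cells else 0 for c in self.cells]
        if is_safe(self.history + [vec]):
            self.history.append(vec)
            return True   # the server would answer
        return False      # the server would reject

# Private cells touched by Q1, Q3, Q5, Q6 in a Figure 3-like layout (assumed).
auditor = QueryAuditor(["Book-Apr", "Book-Jun", "CD-Apr", "CD-Jun", "DVD-Jun"])
results = [
    auditor.ask({"Book-Apr", "CD-Apr"}),             # Q1: April column
    auditor.ask({"Book-Jun", "CD-Jun", "DVD-Jun"}),  # Q3: June column
    auditor.ask({"Book-Apr", "Book-Jun"}),           # Q5: Book row
    auditor.ask({"CD-Apr", "CD-Jun"}),               # Q6: CD row
]
```

In this run the first three queries are answered and the fourth is rejected, because answering it would let the server's client compute the DVD-June cell as Q1 + Q3 - (Q5 + Q6). The check reruns elimination over the whole history on every query, which illustrates why the online history check can be costly.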
The static version determines a safe set of queries offline (before any query is actually received). If a query set is safe, then any one of its subsets is also safe. At runtime, when the data warehouse server receives the query, it answers only if the query is in the predetermined safe set. Otherwise, it will reject the query. On the downside, the static method is conservative in selecting a safe set, which might cause it to reject some queries unnecessarily.

Figure 4. An inference control protocol taxonomy. A designer can choose which of two methods (query- or data-oriented) best serves the design. The query-oriented method either classifies safe and unsafe sets offline or checks the query history online; the data-oriented method performs perturbation either during data collection or online when a query is received.

Data-oriented method

With the data-oriented method of inference control [9], the data warehouse server perturbs the stored raw data and estimates the query answers as accurately as possible on the basis of the perturbed data. As Figure 4 shows, the data collection protocol can handle perturbation unless the application requires storing original data in the data warehouse server. In that case, the data warehouse server might have to perturb the data when processing the query.

The data-oriented method assumes that perturbation can protect private information from being disclosed, enabling the data warehouse server to answer all queries freely on the basis of the perturbed data. Research has shown that the query answers estimated from the perturbed data can still support the construction of accurate data mining models [5].

Advantages and disadvantages

The two methods have unique performance considerations. The data-oriented method offers query responsiveness, since the data warehouse server will answer all queries. The query-oriented method, in contrast, normally rejects a substantial number of queries [9], which means that some data mining servers might be unable to complete their data mining tasks.

On the plus side, the query-oriented method can provide more accurate answers than the data-oriented method. When the data warehouse server answers a query, its answer will always be precise. The data-oriented method, in contrast, answers queries with estimation, so it might not be accurate enough to support data mining, particularly when the construction of data mining models requires highly accurate query answers.
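The data-oriented idea of answering every query from perturbed data, at the cost of an estimated rather than precise answer, can be sketched as follows. The sketch assumes a single yes/no attribute and a retention-replacement style of perturbation (keep each value with probability p, otherwise replace it with a fair coin flip); this choice, the value p = 0.6, and the function names are illustrative assumptions rather than the scheme of any cited work.

```python
import random

def perturb_bits(bits, p, rng):
    """Keep each bit with probability p; otherwise replace it with a coin flip."""
    return [b if rng.random() < p else rng.randint(0, 1) for b in bits]

def estimate_count(perturbed_bits, p):
    """Estimate how many 1s the original data held, using only perturbed data.

    E[observed] = p * true_count + (1 - p) * n / 2, solved for true_count.
    """
    n = len(perturbed_bits)
    observed = sum(perturbed_bits)
    return (observed - (1 - p) * n / 2) / p

rng = random.Random(7)                   # fixed seed for repeatability
original = [1] * 3000 + [0] * 7000       # true COUNT answer is 3000
stored = perturb_bits(original, p=0.6, rng=rng)  # what the server would keep

estimate = estimate_count(stored, p=0.6)
error = abs(estimate - sum(original))
```

The estimate lands near the true count but is not exact, which mirrors the tradeoff in the text: no query needs to be rejected, yet each answer carries estimation error that shrinks only as the number of records grows or the perturbation level drops.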
Efficiency is an important advantage for the static version of the query-oriented method, which has the shortest response time because most of its computational cost is offline. The dynamic version must trade off efficiency and query responsiveness: To answer more queries, the data warehouse server must spend more time analyzing the query history. The data-oriented method also suffers from low efficiency, since the computational overhead for query estimation can be several orders of magnitude higher than for query answering.

One way to enhance inference control protocol performance is to integrate query- and data-oriented methods. Introducing the query answer-or-reject scheme to the data-oriented method would let the data warehouse server reject some privacy-divulging queries (such as $Q_3$ in Figure 3). This, in turn, would effectively downgrade the data perturbation level yet retain the same degree of privacy protection. Because the data is perturbed, the server would have to reject far fewer queries and could thus answer most queries fairly accurately while continuing to protect private information.

INFORMATION SHARING PROTOCOL

Because each data mining server constructs local data mining models in its own system, these servers are likely to share their local data mining models rather than the raw data in the data warehouses. Local data mining models can be sensitive, especially when the local models are not globally valid. To protect the privacy of individual data mining systems, some mechanism must control the disclosure of private information in local data mining models. This mechanism is the information sharing protocol, which again follows the minimum necessary rule.
The protocol's objective is to enable data mining servers across multiple systems to construct global data mining models while disclosing only the minimum private information about local data mining models necessary for information sharing.

Many information sharing protocols exist for applications other than data mining, such as database interoperation or data integration [10]. Information sharing is necessary for most distributed data mining systems, and much work has focused on designing specific information sharing protocols for data mining tasks.

A major design concern of the information sharing protocol is defending against adversaries that behave arbitrarily within the capability allocated to them. The defense strategy depends on the adversary model, that is, the set of assumptions about an adversary's intent and behavior. Two of the more popular adversary models are semihonest [10] and beyond semihonest.

Semihonest adversaries

An adversary is semihonest if it properly follows the designated protocol but records all intermediate computation and communication, thereby providing a way to derive private information.

Cryptographic encryption has proved effective in defending against semihonest adversaries [2,10,11]. In this method, each data mining server encrypts its local data mining model and exchanges the encrypted model with other data mining servers. Some encryption scheme properties, such as the Rivest-Shamir-Adleman (RSA) cryptosystem's commutative encryption property, make it possible to design algorithms for data mining servers to perform certain data mining tasks and set operations without knowing the private keys of other entities [2,10,11]. Tasks include classification, association rule mining, clustering, and collaborative filtering; set operations include set intersection, set union, and element reduction.

Because it is not possible to recover the original (local) data mining models from their encrypted values without knowing the private keys, this method is a secure defense against semihonest adversaries. Researchers have already evolved a detailed taxonomy and cryptographic encryption methods for various system settings [2,3].

Beyond semihonest adversaries

An adversary is considered beyond semihonest if it deviates from the designated protocol, changes its input data, or both. Because it is difficult if not impossible to defend against an adversary that is behaving arbitrarily, dealing with beyond semihonest adversaries requires more refined models.

One such model is the intent-based adversary model [12], which formulates an adversary's intent as combining the intent to obtain accurate data mining results with the intent to compromise other entities' private information. A game-theoretic method is then developed to defend against adversaries that weigh the accuracy of data mining results over compromising other parties' privacy [12]. The basic idea is to design the information sharing protocol in a way that no adversary can both obtain accurate data mining results and intrude on other servers' privacy. Adversaries that are more concerned with the accuracy of data mining results will be forced not to intrude on the privacy of others to get that accuracy.

OPEN RESEARCH ISSUES

Several issues require additional research to ensure the optimum performance of the techniques described.

Protocol integration

Many systems need a seamless integration of the three protocols, yet little research has addressed this need. Our proposed integrated architecture could serve as a platform for studying protocol interaction. Such insights can pave the way for effective and efficient integration.
Heterogeneous privacy requirements

Privacy-preserving data mining techniques depend on respecting the privacy protection levels that data providers require. Most existing studies assume homogeneous privacy requirements, that is, that all data owners need the same privacy level for all their data and its attributes. This assumption is unrealistic in practice and could even degrade system performance unnecessarily. Designing and implementing techniques that exploit heterogeneous privacy requirements is a challenge with much potential return.

Privacy measurements

The accuracy versus protection tradeoff inherent in privacy-preserving data mining means that some mechanism must accurately measure the degree of privacy protection. Although extensive work has focused on privacy measurement, as yet no one has proposed a commonly accepted measurement technique for generic privacy-preserving data mining systems.

Proper privacy protection measurement has three criteria: It must reflect system settings (adversaries might have different levels of interest in different data values, such as being more concerned with patients that have contagious diseases than other diseases), account for data providers' diverse privacy concerns (some might consider age as private information, while others are willing to disclose it publicly), and satisfy the minimum necessary rule.

A comprehensive study of privacy measurement for all three protocols would be a huge step toward improving the performance of privacy-preserving data mining techniques.

Anomaly detection

A common application of data mining is to detect data-set anomalies, as in mining log file data to detect intrusions. However, few researchers have considered privacy protection in detecting anomalies.
Research on anomaly detection is an important part of data mining and can contribute to multiple disciplines, such as security, biology, and finance. Thoroughly investigating issues related to the design of privacy-preserving data mining techniques for anomaly detection would be extremely beneficial.

Multiple protection levels

In some cases, multiple levels of private information must be protected. The first level might be a data point value, and the second level, the data point sensitivity (knowledge of whether or not a data point is private). Most existing studies focus on protecting the first level and assume that all entities already know the second level. Research has yet to answer how to protect the second level (and higher levels) of private information.

Our work is an important first step in addressing the critical systemic issues of privacy preservation in data mining. Much research remains to realize the potential of the architecture and design principles we have described. Much literature already addresses privacy-preserving data mining, but clearly the ideas must cross considerable ground to become practical systems. Studies are needed for the design of privacy-preserving data mining techniques in real-world scenarios, in which data owners can freely address their individual privacy concerns without the data miner's consent. Also critical is work that more closely incorporates designs with specialized applications such as healthcare, market analysis, and finance. Our hope is that others will continue efforts in this important area.

References

1. J. Han and M. Kamber, Data Mining Concepts and Techniques, Morgan Kaufmann.
2. C. Clifton et al., "Tools for Privacy Preserving Distributed Data Mining," SIGKDD Explorations, vol. 4, no. 2, 2003.
3. V.S. Verykios et al., "State-of-the-Art in Privacy Preserving Data Mining," SIGMOD Record, vol. 33, no. 1, 2004.
4. L. Wang, S. Jajodia, and D. Wijesekera, "Securing OLAP Data Cubes against Privacy Breaches," Proc. 25th IEEE Symp. Security and Privacy, IEEE Press, 2004.
5. R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining," Proc. 19th ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2000.
6. R.J. Bayardo and R. Agrawal, "Data Privacy through Optimal k-anonymization," Proc. 21st Int'l Conf. Data Eng., IEEE Press, 2005.
7. N. Zhang, S. Wang, and W. Zhao, "A New Scheme on Privacy-Preserving Data Classification," Proc. 11th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, ACM Press, 2005.
8. Z. Huang, W. Du, and B. Chen, "Deriving Private Information from Randomized Data," Proc. 24th ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2005.
9. R. Agrawal, R. Srikant, and D. Thomas, "Privacy-Preserving OLAP," Proc. 25th ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2005.
10. R. Agrawal, A. Evfimievski, and R. Srikant, "Information Sharing across Private Databases," Proc. 22nd ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2003.
11. Y. Lindell and B. Pinkas, "Privacy Preserving Data Mining," Proc. 12th Ann. Int'l Conf. Advances in Cryptology, Springer-Verlag, 2000.
12. N. Zhang and W. Zhao, "Distributed Privacy Preserving Information Sharing," Proc. 31st Int'l Conf. Very Large Data Bases, ACM Press, 2005.

Nan Zhang is an assistant professor of computer science and engineering at the University of Texas at Arlington. His research interests include databases and data mining, information security and privacy, and distributed systems. Zhang received a PhD in computer science from Texas A&M University. He is a member of the IEEE. Contact him at nzhang@cse.uta.edu.

Wei Zhao is a professor of computer science and the dean for the School of Science at Rensselaer Polytechnic Institute. His research interests include distributed computing, real-time systems, computer networks, and cyberspace security. Zhao received a PhD in computer and information sciences from the University of Massachusetts, Amherst. He is a Fellow of the IEEE and a member of the IEEE Computer Society and the ACM. Contact him at zhaow3@rpi.edu.