On the Performance Measurements for Privacy Preserving Data Mining



Nan Zhang, Wei Zhao, and Jianer Chen
Department of Computer Science, Texas A&M University, College Station, TX 77843, USA
{nzhang, zhao, chen}@cs.tamu.edu

Abstract. This paper establishes the foundation for the performance measurements of privacy preserving data mining techniques. The performance is measured in terms of the accuracy of data mining results and the privacy protection of sensitive data. On the accuracy side, we address the problem with previous measures and propose a new measure, named effective sample size, to solve this problem. We show that our new measure can be bounded without any knowledge of the data being mined and discuss when the bound can be met. On the privacy side, we identify a tacit assumption made by previous measures and show that the assumption is unrealistic in many situations. To solve the problem, we introduce a game theoretic framework for the measurement of privacy.

1 Introduction

In this paper, we address issues related to the performance measurements of privacy preserving data mining techniques. The purpose of data mining is to discover patterns and extract knowledge from large amounts of data. The objective of privacy preserving data mining is to enable data mining without invading the privacy of the data being mined. We consider a distributed environment where the data being mined are stored in multiple autonomous entities. We can classify privacy preserving data mining systems into two categories based on their infrastructures: Server-to-Server (S2S) and Client-to-Server (C2S), respectively.

In the first category (S2S), the data being mined are distributed across several servers. Each server holds numerous private data points. The servers collaborate with each other to enable data mining across all servers without letting any server learn the private data of the other servers. Since the number of servers in a system is usually small, the problem is often modeled as a variation of the secure multi-party computation problem, which has been extensively studied in cryptography [12]. Existing privacy preserving algorithms in this category serve a wide variety of data mining tasks including data classification [7, 14, 15, 20], association rule mining [13, 19], and statistical analysis [6].

In the second category (C2S), a system usually consists of a data miner (server) and numerous data providers (clients). Each data provider holds only one data point. The data miner performs data mining tasks on the aggregated (possibly perturbed) data provided by the data providers.

A typical example of this kind of system is an online survey, as the survey analyzer (data miner) collects data from thousands of survey respondents (data providers). Most existing privacy preserving algorithms in C2S systems use a randomization approach, which randomizes the original data to protect the privacy of data providers [1, 2, 5, 8-10, 18].

Both S2S and C2S systems have a broad range of applications. Nevertheless, we focus on studying C2S systems where the randomization approach is used. In particular, we establish the foundation for analyzing the tradeoff between the accuracy of data mining results and the privacy protection of sensitive data. Our contributions in this paper are summarized as follows. On the accuracy side, we address the problem with previous measures and propose a new accuracy measure named effective sample size to solve this problem. We show that our new measure can be upper bounded without any knowledge of the data being mined and discuss when the bound can be met. On the privacy protection side, we show that a tacit assumption made by previous measures is that all adversaries use the same intrusion technique to invade privacy. We address the problems with this assumption and propose a game theoretic formulation which takes the adversary behavior into consideration.

The rest of the paper is organized as follows. In Section 2, we introduce our models of data, data providers, and data miners. Based on these models, we briefly review the literature in Section 3. In Section 4, we propose our new accuracy measure; an analytical bound on the new measure is derived in this section. In Section 5, we propose a game theoretic formulation for the measurement of privacy and define our new privacy measure. Section 6 concludes the paper with some final remarks.

2 System Model

Let there be n data providers (clients) C_1, ..., C_n and one data miner (server) S in the system. Each client C_i has a private data point (e.g., a transaction or a data tuple) x_i. We view the original data values x_1, ..., x_n as n independent and identically distributed (i.i.d.) variables that have the same distribution as a random variable X. Let the domain of X (i.e., the set of all possible values of X) be V_X and the distribution of X be p_X. Each data point x_i is i.i.d. on V_X with distribution p_X.

Due to the privacy concerns of data providers, we classify the data miners into two categories. One category is honest data miners. These data miners always act honestly in that they only perform regular data mining tasks and have no intention to invade privacy. The other category is malicious data miners. These data miners would purposely compromise the privacy of data providers.

3 Related Work

To protect the data providers from privacy invasion, countermeasures must be implemented in the data mining system. Randomization is a commonly used approach. We briefly review it as follows.

The randomization approach is based on the assumption that accurate data mining results can be obtained from a robust estimation of the data distribution. Previous work showed that this assumption is reasonable in many situations [2]. Thus, the basic idea of the randomization approach is to distort the individual data values but keep a (statistically) accurate estimate of the original data distribution. Based on the randomization approach, the privacy preserving data mining process can be considered a two-step process.

In the first step, each data provider C_i perturbs its data x_i by applying a predetermined randomization operator R(·) to x_i, and then transfers the randomized data R(x_i) to the data miner. We note that the randomization operator is known by both the data providers and the data miner. Let the domain of R(x_i) be V_Y. The randomization operator R(·) is a function from V_X to V_Y with transition probability p[x → y]. In previous studies, several randomization operators have been proposed, including the random perturbation operator [2], the random response operator [8], the MASK distortion operator [18], and the select-a-size operator [10]. For example, the random perturbation operator and the random response operator are listed in (1) and (2), respectively:

R(x_i) = x_i + r_i.   (1)

R(x_i) = \begin{cases} \bar{x}_i, & \text{if } r_i \ge \theta_i, \\ x_i, & \text{if } r_i < \theta_i. \end{cases}   (2)

Here, x_i is the original data value (with \bar{x}_i denoting its complement), r_i is noise randomly generated from a predetermined distribution, and θ_i is a parameter set by each data provider individually. As we can see, the random response operator only applies to binary data.

In the second step, the honest data miner first employs a distribution reconstruction algorithm on the aggregate data, which intends to recover the original data distribution from the randomized data. Then, the honest data miner performs the data mining task on the reconstructed distribution. Several distribution reconstruction algorithms have been proposed [1, 2, 8, 10, 18]. In particular, the expectation maximization (EM) algorithm [1] reconstructs the distribution so that it converges to the maximum likelihood estimate of the original data distribution. For example, suppose that the data providers randomize their data using the random response operator in (2). Let the r_i be random variables uniformly distributed on [0, 1], and let θ_i be 0.3. The distribution reconstructed by the EM algorithm is stated as follows:

\Pr\{x_i = 0\} = \frac{7}{4} \Pr\{R(x_i) = 1\} - \frac{3}{4} \Pr\{R(x_i) = 0\},   (3)

\Pr\{x_i = 1\} = \frac{7}{4} \Pr\{R(x_i) = 0\} - \frac{3}{4} \Pr\{R(x_i) = 1\}.   (4)

Also in the second step, a malicious data miner may invade privacy by using a private data recovery algorithm. This algorithm is used to recover individual data values from the randomized data supplied by the data providers. Figure 1 depicts the architecture of the system. Clearly, any privacy preserving data mining system should be measured by its capacity for both producing accurate data mining results and protecting individual data values from being compromised by malicious data miners.
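To make the two steps above concrete, the following minimal sketch (an illustration, not part of the original paper) applies the random response operator of (2) with θ_i = 0.3 for every provider to a hypothetical binary attribute with Pr{x_i = 1} = 0.2, and then recovers the original distribution with the closed-form estimates (3) and (4).

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(x, theta=0.3):
    """Random response operator of Eq. (2): flip the bit when r_i >= theta_i."""
    r = rng.random(len(x))
    return np.where(r >= theta, 1 - x, x)

# Hypothetical data: n providers with true Pr{x_i = 1} = 0.2.
n = 200_000
x = (rng.random(n) < 0.2).astype(int)
y = randomized_response(x)

# Closed-form reconstruction of Eqs. (3)-(4), valid for theta_i = 0.3.
pr_y0, pr_y1 = np.mean(y == 0), np.mean(y == 1)
est_x0 = 7 / 4 * pr_y1 - 3 / 4 * pr_y0
est_x1 = 7 / 4 * pr_y0 - 3 / 4 * pr_y1

print(f"true Pr{{x=1}} = {np.mean(x):.3f}, reconstructed = {est_x1:.3f}")
```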

Fig. 1. System Model

4 Quantification of Accuracy

In this section, we study the measurement of the accuracy of data mining results. First, we briefly review previous accuracy measures and address their problem. Then, we propose a new accuracy measure named effective sample size and derive an analytical bound on it.

4.1 Previous Measures

In previous studies, several accuracy measures have been proposed. We classify these measures into two categories. One category is application-specific accuracy measures. Measures in this category are tied to particular data mining applications. For example, in the MASK system [18] for privacy preserving association rule mining, the measurement of accuracy includes two measures, named support error and identity error, respectively. Support error is the average error on the support of the identified frequent itemsets. Identity error measures the average probability that a frequent itemset is not identified. These measures are specific to association rule mining and cannot be applied to other data mining applications (e.g., data classification).

The other category is general accuracy measures. Measures in this category can be applied to any privacy preserving data mining system based on the randomization approach. An existing measure in this category is the information loss measure [1]. Let p̂ be the reconstructed distribution. The information loss measure I(p_X, p̂) is defined as

I(p_X, \hat{p}) = \frac{1}{2} E\left[ \int_{V_X} |p_X(x) - \hat{p}(x)| \, dx \right],   (5)

which is proportional to the expected error of the reconstructed distribution.
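For a discrete domain, the information loss of (5) reduces, for a single reconstruction (i.e., dropping the outer expectation), to half the L1 distance between the two probability vectors. A small sketch with made-up numbers:

```python
import numpy as np

def information_loss(p_x, p_hat):
    """Eq. (5) for a discrete domain and a single reconstruction:
    half the L1 distance between the original distribution p_x and
    the reconstructed distribution p_hat."""
    p_x = np.asarray(p_x, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return 0.5 * np.abs(p_x - p_hat).sum()

# Hypothetical example: a slightly distorted reconstruction of a binary distribution.
print(information_loss([0.2, 0.8], [0.25, 0.75]))  # 0.05
```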

4.2 Problem of Previous Measures

We remark that the ultimate goal of the performance measurements is to help the system designers choose the optimal randomization operator. As we can see from the privacy preserving data mining process in Section 3, the randomization operator has to be determined before any data is transferred from the data providers to the data miner. Thus, in order to reach its goal, a performance measure must be estimated or bounded without any knowledge of the data being mined.

As we can see, the application-specific accuracy measures depend on both the reconstructed data distribution and the performance of the data mining algorithm. The information loss measure depends on both the original distribution and the reconstructed distribution. Neither measure can be estimated or bounded when the data distribution is not known. Thus, previous measures cannot be used by the system designers to choose the optimal randomization operator.

4.3 Effective Sample Size

We now propose effective sample size as our new accuracy measure. Roughly speaking, given the number of randomized data points, the effective sample size is in proportion to the minimum number of original data points that can make an estimate of the data distribution as accurate as the distribution reconstructed from the randomized data points. The formal definition is stated as follows.

Definition 1. Suppose that the system consists of n data providers and one data miner. Given a randomization operator R: V_X → V_Y, let p̂ be the maximum likelihood estimate of the distribution of x_i reconstructed from R(x_1), ..., R(x_n). Recall that p_X is the original distribution of x_i. Let p̂_0(k) be the maximum likelihood estimate of the distribution based on k random variables generated from the distribution p_X. We define the effective sample size r as the minimum value of k/n such that

D_{Kol}(\hat{p}_0(k), p_X) \le D_{Kol}(\hat{p}, p_X),   (6)

where D_{Kol} is the Kolmogorov distance [16], which measures the distance between an estimated distribution and the theoretical distribution. (Other measures of such distance, e.g., the Kuiper distance and the Anderson-Darling distance, can also be used to define the effective sample size; the use of other measures does not influence the results in this paper.)

As we can see, the effective sample size is a general accuracy measure which measures the accuracy of the reconstructed distribution. The effective sample size is a function of three parameters: n, R, and p_X. As we can see from the simulation result in Figure 2, the minimum value of k is (almost) in proportion to n. Thus, we can reduce the effective sample size to a function of R and p_X. We now show that the effective sample size can be strictly bounded without any knowledge of p_X.

Theorem 1. Recall that p[x → y] is the probability transition function of R: V_X → V_Y. An upper bound on the effective sample size r is given as follows:

r \le 1 - \sum_{y \in V_Y} \min_{x \in V_X} p[x \to y].   (7)
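The bound in (7) depends only on the transition matrix of the randomization operator, so it can be evaluated before any data are collected. A minimal sketch, assuming a finite domain and the operator given as a |V_X| x |V_Y| row-stochastic matrix:

```python
import numpy as np

def effective_sample_size_bound(P):
    """Upper bound of Eq. (7): r <= 1 - sum_y min_x p[x -> y].
    P[i, j] is the transition probability p[x_i -> y_j]; each row sums to 1."""
    P = np.asarray(P, dtype=float)
    return 1.0 - P.min(axis=0).sum()

# The random response operator of Eq. (2) with theta_i = 0.3 (flip probability 0.7):
P = np.array([[0.3, 0.7],
              [0.7, 0.3]])
print(effective_sample_size_bound(P))  # 0.4
```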

Fig. 2. Relationship between min k and n (y-axis: min k; x-axis: number of data providers n)

Proof. We denote Pr{x_i = x} and Pr{R(x_i) = y} by p(x) and p(y), respectively. We have

p(y) = \sum_{x \in V_X} p(x) \, p[x \to y]   (8)
     = \min_{x' \in V_X} p[x' \to y] + \sum_{x \in V_X} p(x) \left( p[x \to y] - \min_{x' \in V_X} p[x' \to y] \right).   (9)

We separate R into two operators, R_1 and R_2, such that R(·) = R_2(R_1(·)). Let p_0 = \sum_{y \in V_Y} \min_{x \in V_X} p[x \to y]. Note that p_0 ≤ 1. Let e ∉ V_X ∪ V_Y be a symbol which represents a denial of service. Note that no private information can be inferred from e. R_1 and R_2 are stated as follows:

R_1(x) = \begin{cases} e, & \text{with probability } p_0, \\ y_1, & \text{with probability } p[x \to y_1] - \min_{x' \in V_X} p[x' \to y_1], \\ \vdots \\ y_{|V_Y|}, & \text{with probability } p[x \to y_{|V_Y|}] - \min_{x' \in V_X} p[x' \to y_{|V_Y|}], \end{cases}   (10)

R_2(z) = \begin{cases} z, & \text{if } z \ne e, \\ y_1, & \text{if } z = e, \text{ with probability } \min_{x \in V_X} p[x \to y_1] / p_0, \\ \vdots \\ y_{|V_Y|}, & \text{if } z = e, \text{ with probability } \min_{x \in V_X} p[x \to y_{|V_Y|}] / p_0. \end{cases}   (11)

Here, y_1, ..., y_{|V_Y|} are all the possible values in V_Y; that is, V_Y = {y_1, ..., y_{|V_Y|}}. We now show the equivalence between R(·) and R_2(R_1(·)). For all x ∈ V_X and y ∈ V_Y, we have

\Pr\{R_2(R_1(x)) = y\}   (12)
  = \Pr\{R_1(x) = e\} \Pr\{R_2(R_1(x)) = y \mid R_1(x) = e\} + \Pr\{R_1(x) = y\} \Pr\{R_2(R_1(x)) = y \mid R_1(x) = y\}   (13)
  = p_0 \cdot \frac{\min_{x' \in V_X} p[x' \to y]}{p_0} + p[x \to y] - \min_{x' \in V_X} p[x' \to y]   (14)
  = p[x \to y].   (15)
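The decomposition R(·) = R_2(R_1(·)) used in the proof can also be checked numerically. The sketch below (an illustration, not part of the paper) builds the transition matrices of (10) and (11), treating the extra column and row as the symbol e, and verifies that their composition reproduces p[x → y]:

```python
import numpy as np

def decompose(P):
    """Split R into R1 and R2 as in Eqs. (10)-(11).
    Returns (P1, P2): P1 maps V_X to V_Y plus {e}, and P2 maps V_Y plus {e}
    back to V_Y; the last column of P1 and last row of P2 play the role of e."""
    P = np.asarray(P, dtype=float)
    col_min = P.min(axis=0)                       # min_x p[x -> y] for each y
    p0 = col_min.sum()
    P1 = np.hstack([P - col_min, np.full((P.shape[0], 1), p0)])
    P2 = np.vstack([np.eye(P.shape[1]), col_min / p0])
    return P1, P2

P = np.array([[0.3, 0.7],
              [0.7, 0.3]])
P1, P2 = decompose(P)
print(np.allclose(P1 @ P2, P))  # True: Pr{R2(R1(x)) = y} = p[x -> y]
```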

Note that R_2 is determined only by p[x → y], which is the probability transition function of R. Suppose that the data providers use R_1 to randomize their data. The data miner can always construct R(x_i) from R_1(x_i) using its knowledge of R. Thus, the effective sample size when R is used is always less than or equal to the effective sample size when R_1 is used. That is,

r \le 1 - p_0 = 1 - \sum_{y \in V_Y} \min_{x \in V_X} p[x \to y].   (16)

This bound depends only on the randomization operator R. It is independent of the number of data providers n and the original data distribution p_X. As we can see, the bound can be met if and only if for any given x ∈ V_X, there exists no more than one y_i ∈ V_Y such that

p[x \to y_i] > \frac{p_0}{|V_Y|}.   (17)

5 Quantification of Privacy Protection

In this section, we address issues related to the measurement of privacy protection in privacy preserving data mining. First, we briefly review the previous measures of privacy protection. Then, we identify a tacit assumption made by previous measures which is unrealistic in practice. To solve the problem, we propose a new privacy measure based on a game theoretic framework.

5.1 Previous Measures

In previous studies, two kinds of privacy measures have been proposed. One kind is the information theoretic measure [1], which measures privacy by the mutual information between the original data x_i and the randomized data R(x_i) (i.e., I(x_i; R(x_i))). This measure is a statistical measurement of the privacy disclosure. In [9], the authors challenge the information theoretic measure and remark that there exist certain kinds of privacy disclosure that cannot be captured by this measure. For example, suppose that for a certain y ∈ V_Y, a data miner can almost certainly infer that x_i = y from R(x_i) = y (i.e., Pr{x_i = y | R(x_i) = y} ≈ 1). This privacy disclosure is serious because if a data provider knows about the disclosure, it will purposely change its randomized data if the randomized data value happens to be y. However, the information theoretic measure cannot capture this privacy disclosure if the occurrence of y has a fairly low probability (i.e., Pr{R(x_i) = y} ≈ 0). The reason is that the mutual information only measures the average information that is disclosed to the data miner.

The other kind of privacy measure is proposed to solve the problem of the information theoretic measure. Privacy measures of this kind include the privacy breach measure [9] and interval-based privacy measures [3, 21]. We use the privacy breach measure as an example. Under the privacy breach measure, the level of privacy protection is determined by

\max_{x, x' \in V_X} \frac{p[x \to y]}{p[x' \to y]}   (18)

for any given y ∈ V_Y.

This measure captures the worst-case privacy disclosure and can guarantee a bound on the level of privacy protection without any knowledge of the original data distribution. However, we remark that this measure solves the problem of the information theoretic measure by going to the opposite extreme: the privacy breach measure is (almost) independent of the average information disclosure and depends only on the privacy disclosure in the worst case. We show the problem with previous measures as follows.

5.2 Problem of Previous Measures

For the measurement of privacy, we first need to define the privacy of data providers. In the dictionary, privacy is defined as the capacity of the data providers to be free from unauthorized intrusion [17]. As we can see from this definition, the effectiveness of privacy protection depends on whether a malicious data miner can perform an unauthorized intrusion on the data providers. The privacy loss of the data providers is measured by the gain of the data miner from unauthorized intrusions. Thus, the privacy protection measure depends on two important factors: a) the privacy protection mechanism of the data providers, and b) the unauthorized intrusion technique of the data miner.

The data miner has the freedom to choose different intrusion techniques in different circumstances. Thus, the intrusion technique of the data miner should always be considered in the measurement of privacy. However, previous measures do not follow this principle. Neither the information theoretic measure nor the privacy breach measure addresses the variety of intrusion techniques. Instead, they make a tacit assumption that all data miners use the same intrusion technique. This assumption seems reasonable, as a (rational) data miner will always choose the intrusion technique that compromises the most private information. However, as we will show below, the optimal intrusion technique varies across circumstances. Thereby, the absence of consideration of intrusion techniques results in problems with the privacy measurement.

Example 1. Suppose that V_X = {0, 1}. The original data x_i are uniformly distributed on V_X. The system designer needs to determine which of the following two randomization operators, R_1 and R_2, discloses less private information:

R_1(x) = \begin{cases} x, & \text{with probability } 0.70, \\ \bar{x}, & \text{with probability } 0.30; \end{cases}   (19)

R_2(x) = \begin{cases} 0, & \text{if } x = 0, \\ 1, & \text{if } x = 1, \text{ with probability } 0.01, \\ 0, & \text{if } x = 1, \text{ with probability } 0.99. \end{cases}   (20)

In this example, the mutual information I(x; R_1(x)) is much greater than I(x; R_2(x)). That is, the average amount of private information disclosed by R_1 is much greater than that disclosed by R_2. According to the information theoretic measure, R_2 is therefore better than R_1 from the privacy protection perspective. The result is different when the privacy breach measure is used. As we can see, if the data miner receives R_2(x_i) = 1, then it can always infer that x_i = 1 with probability 1. Thus, the worst-case privacy loss of R_2 is much greater than that of R_1. According to the privacy breach measure, R_1 is better than R_2 from the privacy protection perspective.
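Both measures in Example 1 can be computed directly from the transition matrices of R_1 and R_2. A small sketch (binary domain, uniform prior); the numbers confirm that the two measures rank the operators in opposite order:

```python
import numpy as np

def mutual_information(p_x, P):
    """I(x; R(x)) in bits, given the prior p_x and the transition matrix P."""
    p_x = np.asarray(p_x, dtype=float)
    joint = p_x[:, None] * np.asarray(P, dtype=float)
    p_y = joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / np.outer(p_x, p_y)[nz])).sum())

def worst_case_breach(P):
    """Worst-case ratio max_{x,x'} p[x -> y] / p[x' -> y] over all y, as in Eq. (18)."""
    P = np.asarray(P, dtype=float)
    with np.errstate(divide="ignore"):
        return float((P.max(axis=0) / P.min(axis=0)).max())  # inf when some p[x -> y] = 0

p_x = [0.5, 0.5]
R1 = np.array([[0.70, 0.30], [0.30, 0.70]])   # Eq. (19)
R2 = np.array([[1.00, 0.00], [0.99, 0.01]])   # Eq. (20)
print(mutual_information(p_x, R1), mutual_information(p_x, R2))  # ~0.119 vs. ~0.005 bits
print(worst_case_breach(R1), worst_case_breach(R2))              # ~2.33 vs. inf
```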

We now show that whether R_1 or R_2 is better actually depends on the system setting. In particular, we consider the following two system settings.

1. The system is an online survey system where the survey analyzer and the survey respondents are the data miner and the data providers, respectively. The value of x_i indicates whether a survey respondent is interested in buying certain merchandise. The intrusion performed by a malicious data miner is to make unauthorized advertisements to data providers with such interest.

2. The system consists of n companies as the data providers and a management consulting firm as the data miner. The consulting firm performs statistical analysis on the financial data of the companies. The original data x_i contains the expected profit of the company, which has not been published yet. As the unauthorized intrusion, a malicious data miner may use x_i to make an investment in a high-risk stock market. The profit from a successful investment is great. However, a failed investment results in a loss five times greater than the profit the data miner may obtain from a successful investment.

In the first case, an advertisement to a wrong person costs the data miner little. A reasonable strategy for the data miner is to advertise to all data providers. In fact, if the expected loss from an incorrect estimate (i.e., an advertisement to a person without interest) is equal to 0, this is the optimal intrusion technique for the data miner. Comparing the two randomization operators, R_1 discloses the original data value with probability 0.7, which is greater than that of R_2 (0.501). Thus, R_2 is better than R_1 from the privacy protection perspective.

In the second case, the data miner will not perform the intrusion when R_1 is used by the data providers. The reason is that the loss from a failed investment (i.e., an incorrect estimate of x_i) is unaffordable. Even though the profit from a successful investment is fairly high, the loss from a wrong decision is too high to risk. That is, for the data miner, the expected net benefit from an unauthorized intrusion is less than 0. However, the data miner will perform the intrusion if a randomized data value R_2(x_i) = 1 is received. The reason is that the data miner then has a fairly high probability (99%) of making a successful investment. If a randomized data value R_2(x_i) = 0 is received, the data miner will simply ignore it. Thus, in this case, R_1 is better than R_2 from the privacy protection perspective.

As we can see from the example, the data miner will choose different privacy intrusion techniques in different system settings. This results in different performance of the randomization operators. Thus, the system setting and the privacy intrusion technique have to be considered in the measurement of privacy.

5.3 A Game Theoretic Framework

In order to introduce the system setting and the privacy intrusion technique into our privacy measure, we first propose a game theoretic framework to analyze the strategies of the data miner (i.e., the privacy intrusion techniques). Since we are studying the privacy protection performance of the randomization operator, we consider the randomization operator as the strategy of the data providers.

We model the privacy preserving data mining process as a non-cooperative game between the data providers and the data miner. There are two players in the game: one is the data providers, and the other is the data miner. Since we only consider the privacy measure, the game is zero-sum in that the benefit obtained by the server from unauthorized intrusions always results in an invasion of the privacy of the data providers.

Let S_c be the set of randomization operators that the data providers can choose from. Let S_s be the set of intrusion techniques that the data miner can choose from. Let u_c and u_s be the payoffs (i.e., expected benefits) of the data providers and the data miner, respectively. Since the game is zero-sum, we have u_c + u_s = 0. We remark that the payoffs depend on both the strategies of the players and the system setting.

We assume that both the data providers and the data miner are rational. That is, given a certain randomization operator, the data miner always chooses the optimal privacy intrusion technique that maximizes its payoff u_s. Given a certain privacy intrusion technique, the data providers always choose the optimal randomization operator that maximizes u_c. By game theory, if a Nash equilibrium exists in the game, it contains the optimal strategies for both the data providers and the data miner [11]. (Roughly speaking, a Nash equilibrium is a condition where no player can benefit by changing its own strategy unilaterally while the other player keeps its current strategy.)

5.4 Our Privacy Measure

We now define our privacy measure based on the game theoretic formulation.

Definition 2. Given a privacy preserving data mining system G = (S_s, S_c, u_s, u_c), we define the privacy measure l_p of a randomization operator R as

l_p(R) = u_c(R, L_0),   (21)

where L_0 is the optimal privacy intrusion technique for the data miner when R is used by the data providers, and u_c is the payoff of the data providers when R and L_0 are used.

As we can see, the smaller l_p is, the more benefit is obtained by the data miner from the unauthorized intrusion. We now use an example to illustrate the definition.

Example 2. Let V_X be {0, 1}. Suppose that the original data x_i are uniformly distributed on V_X. A system designer wants to compare the privacy preserving capacity of randomization operators R_1 and R_2, which are stated as follows:

R_1(x_i) = \begin{cases} x_i, & \text{with probability } 0.60, \\ \bar{x}_i, & \text{with probability } 0.40; \end{cases}   (22)

R_2(x_i) = \begin{cases} x_i, & \text{with probability } 0.01, \\ e, & \text{with probability } 0.99, \end{cases}   (23)

where e is a denial-of-service signal which satisfies e ∉ {0, 1}. As we can see, no private information can be inferred from e. Thus, without loss of generality, we suppose that the data miner ignores a data point if it has the value e. According to the information theoretic measure, R_2 is better than R_1. According to the privacy breach measure, R_1 is better than R_2.

We will analyze the problem based on our privacy measure in a game theoretic formulation. Since the comparison is between R_1 and R_2, we assume that the data providers can only choose the randomization operator from {R_1, R_2}; that is, S_c = {R_1, R_2}. For a given system setting, let the optimal intrusion technique for the data miner be L_0. We now propose a specific intrusion technique L_1. Roughly speaking, L_1 represents an intrusion technique that infers x_i = R(x_i) if and only if R(x_i) ≠ e. We have {L_0, L_1} ⊆ S_s. Since Pr{R_2(x_i) = \bar{x}_i} = 0, L_1 is the optimal intrusion technique for the data miner when R_2 is the randomization operator. That is, L_0 = L_1 when R_2 is used by the data providers.

The strategies and payoffs are listed in Table 1, where u_0, u_1, and u_2 are the payoffs of the data miner in the different circumstances.

Table 1. Strategies and Payoffs

            L_0           L_1
  R_1   -u_0 / u_0    -u_1 / u_1
  R_2   -u_2 / u_2    -u_2 / u_2

Due to the assumption that L_0 is the optimal intrusion technique, we always have u_0 ≥ u_1. The comparison between u_1 and u_2 depends on the system setting. Recall the two system settings in Example 1. In the online survey example, we have u_1 > u_2. In the stock market example, a reasonable estimation is u_1 < u_0 < u_2.

Let (C, S) be the strategies of the data providers and the data miner, respectively. We consider the comparison between R_1 and R_2 in the following cases.

1. u_1 > u_2. There are two Nash equilibria in the game: (C, S) = (R_2, L_0) and (C, S) = (R_2, L_1). Thus, R_2 is the better choice for the data providers from the privacy protection perspective.

2. u_1 < u_2 and u_0 > u_2. Only one Nash equilibrium, (C, S) = (R_2, L_0), exists in the game. Thus, R_2 is the better choice for the data providers from the privacy protection perspective.

3. u_1 < u_2 and u_0 < u_2. Only one Nash equilibrium, (C, S) = (R_1, L_0), exists in the game. Thus, R_1 is the better choice for the data providers from the privacy protection perspective.

As we can see, the comparison between the privacy preserving capacity of R_1 and R_2 depends on the comparison between u_1 and u_2, which is determined by the ratio between the benefit from a correct estimate and the loss from an incorrect estimate. Let this ratio be σ. In the above case, we have

\sigma = \frac{\text{gain from a correct estimate}}{\text{loss from an incorrect estimate}} = \frac{40 u_2}{60 u_2 - u_1}.   (24)
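The equilibrium analysis of the three cases above can be checked mechanically on the 2x2 zero-sum game of Table 1. A minimal sketch; the numeric payoffs are hypothetical and chosen only to satisfy the stated orderings of u_0, u_1, and u_2:

```python
import itertools
import numpy as np

def pure_nash(miner_payoff):
    """Pure-strategy Nash equilibria of the zero-sum game of Table 1.
    miner_payoff[i][j] is u_s when the providers play row i (R1 or R2) and
    the miner plays column j (L0 or L1); the providers receive -u_s."""
    A = np.asarray(miner_payoff, dtype=float)
    equilibria = []
    for i, j in itertools.product(range(A.shape[0]), range(A.shape[1])):
        miner_ok = A[i, j] >= A[i, :].max()          # miner cannot gain by switching columns
        providers_ok = -A[i, j] >= (-A[:, j]).max()  # providers cannot gain by switching rows
        if miner_ok and providers_ok:
            equilibria.append((f"R{i + 1}", f"L{j}"))
    return equilibria

# Rows: R1, R2; columns: L0, L1; the miner payoff matrix is [[u0, u1], [u2, u2]].
print(pure_nash([[3.0, 2.0], [1.0, 1.0]]))   # u1 > u2          -> [('R2', 'L0'), ('R2', 'L1')]
print(pure_nash([[3.0, 1.0], [2.0, 2.0]]))   # u1 < u2 < u0     -> [('R2', 'L0')]
print(pure_nash([[1.5, 1.0], [2.0, 2.0]]))   # u1 < u2, u0 < u2 -> [('R1', 'L0')]
```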

A useful theorem is provided as follows.

Theorem 2. Suppose that in the original data distribution, we have

\max_{x_0 \in V_X} \Pr\{x_i = x_0\} = p_m.   (25)

If the randomization operator R: V_X → V_Y satisfies

\max_{y \in V_Y} \frac{\max_{x \in V_X} p[x \to y]}{\min_{x \in V_X} p[x \to y]} \le \frac{1 - p_m}{\sigma p_m},   (26)

then the privacy measure l_p(R) = 0.

The proof of Theorem 2 is omitted due to the space limit.

6 Conclusion

In this paper, we establish the foundation for the measurements of accuracy and privacy protection in privacy preserving data mining. On the accuracy side, we address the problem with previous accuracy measures and solve it by introducing the effective sample size measure. On the privacy protection side, we first identify an unrealistic assumption tacitly made by previous measures. After that, we present a game theoretic formulation of the system and propose a privacy protection measure based on the formulation. We conclude this paper with some future research directions.

- Design of the optimal randomization operator based on the new accuracy and privacy protection measures.

- Further analysis of the performance of data mining algorithms. Most existing theoretical analysis of the performance of privacy preserving data mining techniques is based on the assumption of an ideal data mining algorithm. The performance of practical data mining algorithms has only been analyzed through heuristic results. However, as shown in [4], the difference between practical and ideal data mining algorithms can be nontrivial. Further analysis on this issue is needed to measure the performance of randomization operators more precisely.

References

1. D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 247-255. ACM Press, 2001.
2. R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the 26th ACM SIGMOD International Conference on Management of Data, pages 439-450. ACM Press, 2000.
3. R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the 26th ACM SIGMOD Conference on Management of Data, pages 439-450. ACM Press, 2000.
4. C. Clifton. Using sample size to limit exposure to data mining. Journal of Computer Security, 8(4):281-307, 2000.
5. W. Du and M. Atallah. Privacy-preserving cooperative statistical analysis. In Proceedings of the 17th Annual Computer Security Applications Conference, page 102. IEEE Computer Society, 2001.

6. W. Du, Y. S. Han, and S. Chen. Privacy-preserving multivariate statistical analysis: Linear regression and classification. In Proceedings of the 4th SIAM International Conference on Data Mining, pages 222-233. SIAM Press, 2004.
7. W. Du and Z. Zhan. Building decision tree classifier on private data. In Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, pages 1-8. Australian Computer Society, Inc., 2002.
8. W. Du and Z. Zhan. Using randomized response techniques for privacy-preserving data mining. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 505-510. ACM Press, 2003.
9. A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 211-222. ACM Press, 2003.
10. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217-228. ACM Press, 2002.
11. R. Gibbons. A Primer in Game Theory. Harvester Wheatsheaf, New York, 1992.
12. O. Goldreich. Secure Multi-Party Computation. Cambridge University Press, 2004.
13. M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Transactions on Knowledge and Data Engineering, 16(9):1026-1037, 2004.
14. M. Kantarcioglu and J. Vaidya. Privacy preserving naïve Bayes classifier for horizontally partitioned data. In Workshop on Privacy Preserving Data Mining, held in association with the 3rd IEEE International Conference on Data Mining, 2003.
15. Y. Lindell and B. Pinkas. Privacy preserving data mining. In Proceedings of the 20th Annual International Cryptology Conference on Advances in Cryptology, pages 36-54. Springer-Verlag, 2000.
16. F. J. Massey. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68-78, 1951.
17. Merriam-Webster. Merriam-Webster's Collegiate Dictionary. Merriam-Webster, Inc., 1998.
18. S. J. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th International Conference on Very Large Data Bases, pages 682-693. Morgan Kaufmann, 2002.
19. J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 639-644. ACM Press, 2002.
20. J. Vaidya and C. Clifton. Privacy preserving naïve Bayes classifier for vertically partitioned data. In Proceedings of the 4th SIAM Conference on Data Mining, pages 330-334. SIAM Press, 2004.
21. Y. Zhu and L. Liu. Optimal randomization for privacy preserving data mining. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 761-766. ACM Press, 2004.