Preserving Privacy in Data Preparation for. Association Rule Mining

Transcription

1 Preserving Privacy in Data Preparation for Association Rule Mining Nan Zhang, Shengquan Wang, and Wei Zhao, Fellow, IEEE Abstract We address the privacy preserving association rule mining problem in a distributed system with one data miner and multiple data providers, each holds one transaction. The literature has tacitly assumed that randomization is the only effective approach to preserve privacy in such circumstances. We challenge this assumption by introducing an algebraic techniques based scheme in the data preparation phase. Compared to previous approaches, our new scheme can identify association rules more accurately but disclose less private information. Furthermore, our new scheme can be readily integrated as a middleware with existing systems. Index Terms Data mining; clustering, classification, and association rules; privacy; singular value decomposition. I. INTRODUCTION The goal of data mining is to extract interesting knowledge from large amounts of data [1]. Traditional data mining algorithms deal with centralized data. Recently, a number of applications on the Internet lead to a need for mining distributed data. In this circumstance, a privacy concern arises from the distributed The authors are with the Department of Computer Science, Texas A&M University, College Station, TX {nzhang, swang, zhao}@cs.tamu.edu. A preliminary version of this paper is to be presented at the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), September 2004.

2 2 data providers. In this paper, we address issues related to production of accurate data mining results, while preserving the private information in the data being mined. We will focus on association rule mining, which will be briefly reviewed in the next section. Since Agrawal, Imielinski, and Swami addressed this problem in [2], association rule mining has been an active research area due to its wide applications and the challenges it presents. Many algorithms have been proposed and analyzed [3] [5]. However, few of them have addressed the issue of privacy protection. We can classify privacy preserving association rule mining systems into two classes based on their infrastructures, named Server to Server (S2S) and Client-to-Server (C2S), respectively. In the first category (S2S), data are distributed across several autonomous entities (servers) [6], [7]. Each server holds a private database that contains numerous data points (i.e., transactions). The servers collaborate with each other to identify association rules spanning multiple databases. Since usually only a few servers are involved in a system (e.g., less than 10), the problem can be modeled as a variation of the secured multi-party computation [8], which has been extensively studied in cryptography [9]. In the second category (C2S), a system consists of one data miner (server) and large amounts of data providers (clients) [10], [11]. Each data provider holds only one transaction. Association rule mining is performed by the data miner on the aggregate transactions provided by the data providers. Online survey is a typical example of this kind of system, as the system can be modeled as consisting of one data miner (i.e., the survey analyzer) and thousands of data providers (i.e., the survey respondents). To ensure the effectiveness of the survey results (e.g., to block multiple votes from a unique IP), the identity of the data providers cannot be hidden from the data miner. Thus, privacy is of particular concern in this kind of system. In fact, there has been wide media coverage of the public debate of protecting privacy in online surveys [12]. Both S2S and C2S systems have wide applications. Nevertheless, we will focus on studying C2S systems in this paper. Several studies have been carried out on privacy preserving association rule mining in C2S systems. Most of them have tacitly assumed that an effective approach to preserving privacy is

3 3 randomization. If we consider the data mining as a two-phase process, which consists of the the data prepartion phase and the data mining phase, the randomization approach is involved in both phases. We Fig phase Data Mining challenge this assumption by introducing a new scheme that only occurs in the data preparation phase. Our new scheme integrates algebraic techniques with random noise perturbation. It has the following important features that distinguish it from previous approaches: Our scheme is easy to implement and flexible. Our scheme is involved only in the data preparation phase and does not require a support recovery procedure in the data mining phase. Thus, our new scheme is transparent to the data mining process. It can be readily integrated as a middleware with existing systems. Our scheme can identify association rules more accurately but disclose less private information. Roughly speaking, our scheme selects the most useful features for asssociation rule mining from the original data and transmits only these features to the data miner. Our simulation data show that for the same level of accuracy, our system discloses private information about five times less than the randomization approach. We allow explicit negotiation between the data providers and the data miner in terms of the tradeoff between the accuracy of data mining results and the privacy of data providers. Instead of following the rules set by the data miner, a data provider can play a role in determining the tradeoff between accuracy and privacy. This is an important feature because people have a wide variety of attitudes

4 4 towards privacy. Due to survey results in [13], the net users attritudes towards privacy can be divided into three categories: 7% privacy fundamentalists who are extremely concerned about the use of their private data, 56% pragmatic majority who are generally willing to provide data if privacy protection measures can be offered; and 27% marginally concerned who barely care about privacy. The negotiation feature in our scheme can help the data miner to collaborate with both hard-core privacy fundamentalists and people comfortable with limited privacy disclosure. The rest of this paper is organized as follows: In Section II, we brief review previous approaches. We present our models and introduce our new scheme in Section III. The communication protocol of our new scheme and its basic components are also provided in this section. In Section IV, we present the theoretical analysis of the tradeoff between accuracy and privacy in our scheme. The simulation results are presented in Section V. In Section VI, we present the experimental results on the performance evaluation of our scheme. Implementation and overhead issues are discussed in Section VII, followed by a final remark in Section VIII. II. APPROACHES In this section, we first overview the definition of association rule mining. Then, we introduce our models of the data miners. Based on the model, we review the randomization approach, which has been widely used in the literature of privacy preserving data mining. We will address the problems associated with the randomization approach, which motivates us to design a new privacy preserving scheme. A. Association Rule Mining A motivating example for association rule mining is a survey of hobbies. Each survey respondent chooses arbitrary number of hobbies from five options: football, soccer, beauty, video games and PC games. These options are called items. For a survey respondent, the answer to the survey is called a transaction. As we can see, a transaction is a set of items (e.g., t ={football, soccer}). Given a number of transactions, association rule mining finds interesting correlation (relationship) between items [1]. For

5 5 example, suppose that there are 5, 000 survey respondents. The survey analyzer finds that within the 1, 000 respondents who choose video games as their hobbies, there are 800 respondents who also choose PC games as their hobbies. Based on the survey results, the survey analyzer may infer an association rule that can be represented as follows. video games PC games [support = 16%, confidence = 80%]. (1) The support of this rule is the percentage of the respondents who choose both video games and PC games. Roughly speaking, the confidence is the probability that a respondent who chooses video games also chooses PC games. Both support and confidence are measures of the validity and trustworthiness of the association rules. Technically speaking, the main task of association rule mining is to find all frequent itemsets, which are itemsets that have support larger than a threshold determined by the data miner. B. Model of Data Miners Due to the privacy concern introduced to the system, we classify the data miners into two categories. One category is legal data miners. These data miners always act legally in that they only perform regular data mining tasks and would never intentionally invade the privacy of the data providers. The other category is illegal data miners. These data miners would purposely compromise the privacy of the data providers. Like adversaries in distributed systems, illegal data miners come in many forms. In most forms, their behavior is restricted from arbitrarily deviating from the protocol. In this paper, we focus on a particular sub-class of illegal miners. That is, in our system, illegal data miners are honest but curious: they follow proper protocols (i.e., they are honest), but they analyze all intermediate communications and received transactions (i.e., they are curious) to discover private information [9]. Even though it is a relaxation from the Byzantine behavior, this kind of honest but curious (nevertheless illegal) behavior is common and has been widely used as the adversary model in the literature.

6 6 C. Randomization Approach To prevent invasion of privacy due to the existence of illegal data miners, countermeasures must be implemented in the data mining system. We briefly review the randomization approach, which is currently used to preserve privacy in association rule mining. Based on the randomization approach, the entire data mining process is a two-step process. The first step is in the data preparation phase. In this step, a data provider first applies the randomization algorithm to the transaction it holds. Then, the data provider transforms the randomized transaction to the data miner. In previous studies, several randomization algorithms have been proposed including the cut-andpaste operator [10] and MASK operator [11]. For example, when the cut-and-paste operator is used, the data provider first randomly choose an integer j as the number of items that occur in both the original transaction t and the randomized transaction R(t). After that, the data provider randomly choose j items from the t and place these items into R(t). Then, for every item a t, the data provider tosses a coin with probability ρ to place a into R(t). In the second step, the data miner performs association rule mining on the aggregate data. With the randomization approach, the data miner must first employ a support recovery algorithm which intends to reconstruct the support of candidate itemsets. Also in the second step, an illegal data miner may invade privacy by using a privacy data recovery algorithm on the randomized transactions supplied by the data providers. Figure 1 depicts the privacy preserving association rule mining system with the randomization approach. Clearly, any such system should be measured by its capability of both generating accurate association rules and preventing invasion of privacy. D. Problems of the Randomization Approach While the randomization approach is intuitive, researchers have recently identified some problems of the randomization approach as follows.

7 7 Fig. 2. randomization approach In [10], the authors remarked that if the cut-and-paste operator is applied to a transaction with 10 or more items, it is difficult, if not impossible, for the data provider to contribute to the association rule mining with its privacy preserved. Furthermore, large itemsets have exceedingly high variances on the recovered support. Similar problems exist with other randomization operators as they share the similar scheme on the randomization of the original data. An approach was proposed in [10] to solve this problem. In the approach, all data providers that hold transactions with 10 or more items do not transfer the randomized transaction to the data miner. Unfortunately, this approach prevents many frequent itemsets that contain 4 or more items from being discovered by the data miner. In [14], the authors showed that the spectral properties of randomized data could help curious data miners to separate noise from private data. Based on random matrix theory, they proposed a filtering method to reconstruct private data from the randomized data set. They demonstrated that the randomization approach preserves very little privacy in many cases. Although their work is based

8 8 on the randomization approach for privacy preserving data classification, we believe that the similarity between randomization operators in association rule mining and data classification makes the problem inherent in the randomization approach. The randomization approach also suffers in efficiency. Since the privacy preserving mechanism is not restricted to the data preparation phase, it puts a heavy load on (legal) data miners at run time (because of the support recovery) [15]. It is shown that the cost of mining randomized data set is well within an order of magnitude in respect to that of mining original data set. We explore the reasons behind these problems as given below. We note that previous randomization approaches are transaction-invariant. That is, the same perturbation algorithm is applied to all data. Since more items in the original transaction always result in more real items to be included in the randomized transaction, privacy protection on transactions with a large size (e.g., t > 10) are doomed to failure. Previous randomization approaches are item-invariant. All items in the original transaction have the same probability of being included in the perturbed transaction. No specific operation is performed to preserve the correlation between different items. Thus, a lot of real items in the perturbed transactions may not appear in any frequent itemset. That is, the disclosure of these items does not contribute to the mining of association rules. We remark that the transaction-invariant and item-invariant properties are inherent in the randomization approach. The reason is that in a system using randomization approach, the communication is one-way: from the data providers to the data miner. As such, a data provider cannot obtain any specific guidance on the perturbation of its transaction from the (legal) data miner, nor can the data providers learn the correlation between the items. Thus, a data provider has no choice but to use a transaction-invariant and item-invariant approach. This observation motivates us to develop a new scheme that allows two-way communication between the data miner and the data providers. The two-way communication helps preserving privacy while not

9 9 incurring too much overhead. Thereby, we significantly improve the performance in terms of accuracy, privacy, and efficiency. We describe the new scheme in the next section. III. COMMUNICATION PROTOCOL AND RELATED COMPONENTS In this section, we introduce our new scheme including the communication protocol and its basic components. A. Description of Our New Scheme Fig. 3. our new scheme Figure 2 depicts the infrastructure of a system using our new scheme. Our scheme has two key components, perturbation guidance (PG) in the data miner server side and perturbation in the data provider client side. Compared to the randomization approach, our scheme does not have the support recovery component. Instead, the association rule mining is performed on the perturbed transactions (R(t)) directly. Thus, our scheme is restricted to the data preparation stage and does not put a heavy load on the data miner by recovering the support at run time.

10 10 Our scheme is a three-step process. In the first step, the data miner negotiates different perturbation level k with different data providers. The larger k is, the more contribution R(t) will make to the association rule mining task. The smaller k is, the more private information is preserved. Thus, a privacy fundamentalist can choose a small k to preserve its privacy while a privacy unconcerned data provider can choose a large k to contribute to the association rule mining. The second step is to transmit the perturbed transactions from the data providers to the data miner. Since each data provider comes at a different time (e.g., different survey respondents take the survey at different time), this step can be considered as an iterative process. In each stage, the data miner dispatches a reference (perturbation guidance) V k to a data provider P i. Here V k depends on the perturbation level kthat is negotiated by the data miner and the data provider P i in the first step. Based on the received V k, the perturbation component of P i computes the perturbed transaction R(t) from its original transaction t. Then, P i transmits R(t) to the perturbation guidance (PG) component of the data miner. The PG component then updates V k based on R(t) and forwards R(t) to the association rule mining process. A curious data miner can also obtain R(t). In this case, the curious data miner uses private data recovery algorithm to compromise privacy in R(t). In the third step, the perturbed transactions received by the data miner are used by the association rule mining process. Association rules are identified and delivered to the data miner. The key here is to properly design V k so that correct guidance can be given to the data providers on how to perturb the transactions. In our scheme, we let V k be an algebraic quantity derived from the currently received, yet perturbed transactions. The details of computing V k will be presented as a basic component of our scheme. B. Notions of Transactions Before presenting the details of the communication protocol and its basic components, we first introduce some notions of the data set. Let I be a set of n items (i.e., I = {a 1,..., a n }. Suppose that there are m data providers in the system. Each data provider C i holds a transaction t i, which is a subset of I. We

11 11 represent the data set by an m n matrix T = [a 1,..., a n ] = [t 1 ;... ; t m ] 1. An example of the transaction matrix T is shown in Table I. Let T ij be the element of T with indices i and j. The element T ij represents whether item j is in transaction t i. Suppose that transaction t 1 contains items a 1 and a 2. The first row of the matrix has T 1,1 = T 1,2 = 1. TABLE I AN EXAMPLE OF A TRANSACTION MATRIX a 1 a 2 a n t t m An itemset B I is an h-itemset if and only if B contains h items (i.e., B = h). The support of B is the percentage of the transactions in the data set that contain B. That is, supp(b) = {t T B t}. (2) m An h-itemset B is frequent if supp(b) min supp, where min supp is a predefined minimum threshold of support. Refer to the survey of hobbies example in the above subsection, {video games, PC games} is a frequent 2-itemset with support The set of frequent h-itemsets is denoted by L h. C. The Communication Protocol We now describe the communication protocol of our new scheme. The negotiation between the data miner and data providers is shown in Protocol 1. On the side of data miner server, there are two threads that perform the operations in Protocol 2 and Protocol 3 iteratively after the negotiation. For a data provider, it performs the operations in Protocol 4 to perturb and transmit its transaction to the data miner. 1 Here t i and a i are used somewhat ambiguously. In the context of association rule mining, t i is a transaction and a i is an item. In the context of matrix, t i represents a row vector in T and a i represents a column vector in T.

12 12 Protocol 1 Negotiation NM1. Based on the SVD of T (T = U Σ V ), the data miner calculates S = Σ Σ 2 nn; NM2. Find the smallest k [1, n] such that k i=1 Σ 2 ii µ S; NM3. The data miner dispatches k to registered data providers; NP1. For a data provider C i, if C i receives k K t (K t is the threshold of truncation level set by C i ) then C i sends ready message to the data miner; end if Protocol 2 Thread of registering data provider R1. Negotiate on the truncation level k with a data provider; R2. Wait for a ready message from a data provider; R3. Upon receiving the ready message from a data provider, Register the data provider; Send the data provider current V k ; R4. Go to Step R1; D. Basic Components There are three key components in the communication protocol of our scheme: (a) the method of computing V k, (b) the perturbation function R( ), and (c) the negotiation on truncation level k. 1) Computation of V k : Recall that V k carries information from the data miner to data providers on how to perturb the original transactions to preserve privacy. In our scheme, V k is an estimate of the eigenvectors of A = T T (i.e., the right singular vectors of T ). The justification of V k on providing accurate mining results and preserving privacy is presented in Appendix I. As we are considering dynamic cases where the perturbed transactions are dynamically fed to the data miner, the data miner keeps a copy of all received (perturbed) transactions and updates it when a new perturbed transaction is received. Assume that the initial set of received (perturbed) transactions T is

13 13 Protocol 3 Thread of receiving transaction T1. Wait for a perturbed transaction R(t) from a data provider; T2. Upon receiving the transaction from a registered data provider, Update V k based on the recently received perturbed transaction; Deregister the data provider; T3. Go to Step T1; Protocol 4 Transaction perturbation P1. Negotiate on the truncation level k with the data miner; P2. Send the data miner a ready message indicating that this provider is ready to contribute to the mining process; P3. Wait for a message that contains V k from the data miner; P4. Upon receiving the message from the data miner, Compute R(t) based on t and V k ; P5. Transfer R(t) to the data miner; empty 2. Every time when a perturbed transaction R(t) is received, T is updated by appending R(t) to the bottom of T. Thus, T is the matrix of currently received (perturbed) transactions. We derive V k from T. In particular, the computation of V k is done in the following steps. Using singular value decomposition (SVD) [16], we decompose T as (3), where Σ = diag(s 1,..., s n ) is a diagonal matrix with s 1 s n. T = U Σ V. (3) The numbers s 2 i make up the eigenvalues of A = T T. V is an n n unitary matrix composed of the eigenvectors of A. V k is composed of the k eigenvectors of A that are associated with the largest k eigenvalues of A. If 2 T may also contain some transactions provided by privacy-careless data providers

14 14 V = [v 1,..., v n ], we have V k = [v 1,..., v k ]. (4) Thus, we call V k as the k-truncation of V. Several incremental algorithms have been proposed to update V k when a perturbed transaction is received by the data miner [17], [18]. The computing cost of updating V k is addressed in Sect. VII. As we will see in Sect. IV, k plays a critical role in balancing accuracy and privacy. We will also show that by using V k in conjunction with R( ), which is to be discussed next, we can achieve both desired accuracy and privacy protection. 2) Perturbation function: Recall that once a data provider receives V k from the data miner, the data provider applies a perturbation function R( ) to its transaction t. The result is a perturbed transaction R(t) that will be transmitted to the data miner. The computation of R(t) is defined as follows. First, for a given V k, the transaction t is transformed by t = tv k V k. Note that the elements of the vector t may not be integers. Algorithm 5 is employed to round t to 0 or 1. In this algorithm, ρ t is a pre-defined parameter. Finally, for the completeness of our work, we introduce an optional procedure to enhance the privacy preserving capability of the system, which is shown in Algorithm 6. A data provider may use the protocol to insert additional noise into R(t). In the protocol, ρ m is a parameter determined by the data provider. The higher ρ m is, the more noise is inserted into the perturbed transaction. We remark that this protocol is optional and is only needed by privacy fundamentalists. An example of the perturbation process is provided in Appendix II. 3) Negociation on truncation level: In order to retain enough information for association rule mining after the transformation from t to R(t), a textbook heuristic is to make the sum of the k eigenvalues associated with the retained eigenvectors of A larger than 85% of the sum of the eigenvalues of A (i.e., µ = 85%) [16], [19]. The perturbation level k is usually large at the beginning but decreases and stabilizes to be fairly small (e.g., less than 1% of n) soon. Thus, in the theoretical analysis of our scheme, we consider the perturbation level k as a predetermined parameter rather than a variable updated throughout the data

15 15 Algorithm 5 Mapping Let t i be the element of vector t with index i. Similar notations apply to other vectors. for every element t i in t do if t i 1 ρ t then else R(t) i = 1 R(t) i = 0 end if end for Algorithm 6 Random-noise perturbation for every item a i / t do Choose a real number j uniformly at random on [0, 1] if j 1 ρ m then R(t) i = 1 end if end for preparation process. We have described the communication protocol of our scheme and its key components. We now discuss the accuracy and privacy measures of our scheme. IV. ANALYSIS ON ACCURACY AND PRIVACY In this section, we analyze our new scheme. We define measures for accuracy and privacy and derive their bounds, in order to provide guidelines for the tradeoff between these two measures and hence help system managers setting parameter. We also show the simulation and experimental results of our scheme on real datasets. For the simplicity of our discussion, we do not consider the optional random noise insertion procedure in Algorithm 6.

16 16 A. Accuracy Measure An accuracy measure should reflect the capability of the system that can correctly identify association rules in a given dataset. We define the accuracy measure as the error of the support of frequent itemsets. This is because that the main task of association rule mining is to identify frequent itemsets with support larger than a threshold min supp. There are two kinds of errors: a) false drops, which are unidentified frequent itemsets, and b) false positives, which are itemsets incorrectly identified to be frequent. We now formally define our accuracy measure. Given itemset I j, let supp(i j ) and supp (I j ) be the support of I j in the original transactions T and the perturbed transactions R(T ), respectively. Recall that the set of frequent h-itemsets in T is L h. We define the errors on false drops and false positives as follows. Definition 4.1: Given itemset size h, the error on false drops, ρ h 1, and the error on false positives, ρh 2, are defined as ρ h 1 = max I j L h (supp(i j ) supp (I j )), (5) ρ h 2 = max I j L h (supp (I j ) supp(i j )). (6) We define our accuracy measure, degree of accuracy, as the maximum value of ρ h 1 and ρh 2 itemsets. on all sizes of Definition 4.2: The degree of accuracy in a privacy preserving association rule mining system is defined as γ = max h 1 max(ρ h 1, ρh 2 ). Based on the definition, we derive an upper bound on the accuracy measure. Theorem 4.3: In our system, the degree of accuracy γ satisfies γ σ2 k+1 m ( { 1 (1 ρt ) max, 1 ρ }) t, (7) (1 ρ t ) 2 ρ t where σ 2 k+1 is the (k +1)th largest eigenvalue of A = T T. In particular, when ρ t = (3 5)/2, γ reaches its lowest upper bound at γ 2.618σ 2 k+1 /m. The proof of Theorem 4.3 can be found in Appendix III. Our bound on accuracy measure is fairly small when the number of transactions (m) is sufficiently large. This is usually the case in reality. Actually, our

17 17 scheme tends to enlarge the support of frequent itemsets and reduce the support of infrequent itemsets. Thus, the upper bound is not always tight. We can observe this trend from the experimental results, which will be presented in Section VI. B. Privacy Measure In our system, the data miner cannot infer the original transaction t from a perturbed transaction R(t) deterministically because V k V k is a singular matrix with its determinant det(v kv k ) = 0 (i.e., it does not have an inverse matrix). To measure the probability that an item in t can be identified from R(t), we need a privacy measure. Due to survey results in [12], a data provider (e.g., net user) always has a strong will of filtering out unwanted data (i.e., data not contribute to the mining of association rules) before transmitting its private data to the data miner. Given transaction t, an item a i in t is unwanted if a i does not appear in any frequent itemset (i.e., h 1, a i {L h L h t}). That is, the disclosure of a i (i.e., a i R(t)) does not contribute to the mining of association rules. We measure privacy by the probability that an unwanted item is included in the perturbed transaction. Formally speaking, our privacy measure, named level of privacy, is defined as follows. Definition 4.4: Given transaction t, an item a i t is unwanted if and only if there does not exist any frequent itemset I t such that a i I. We define the level of privacy as the probability that an infrequent item in t is included in the perturbed transaction R(t). That is, the level of privacy is defined as δ = Pr{a i R(t) a i is unwanted in t}. (8) As we can see from the definition, a higher level of privacy results in a higher probability of privacy invasion. With an approximation, we derive an upper bound on the level of privacy in our scheme. Theorem 4.5: With properly set parameters, the level of privacy in our system is bounded by σk+1 δ< σ2 n, (9) σ σn 2 where σ 2 i is the ith largest eigenvalue of A = T T.

18 18 The proof of Theorem 4.5 can be found in Appendix IV. Because of the uncertainty on the rounding off operation in Algorithm 5, this bound is not tight. As we can see from Theorem 4.3 and Theorem 4.5, there is a tradeoff between accuracy and privacy in our system. The upper bound of the degree of accuracy is in proportion to σk+1 2. The level of privacy is a decreasing function of σk+1 2. The larger the perturbation level k is, the more unwanted items are disclosed to the data miner, and the more frequent itemsets can be correctly identified. V. SIMULATION RESULTS In this section, we present the simulation results of our scheme on a randomly generated dataset. The experimental results on a real dataset will be presented in the next section. The randomly generated dataset consists of 2, 000 transactions. Let there are 20 items in the dataset. Each transaction is a subset of the 20 items. We represent the dataset by a 2, matrix T. The first 19 columns (items) of T, a 1,..., a 19, are independently generated. The 20th column is set to be the same as the 10th column (i.e., a 10 = a 20 ). The first item a 1 has a probability of 0.6 to be included in a transaction. All other columns have probability of 0.1 to be included in a transaction. Thus, the expected support of a 1 is 0.6. The expected support of every other item is 0.1. Since a 10 = a 20, the expected support of {a 10, a 20 } is 0.1. We set min supp = 0.09 as the threshold (lower bound) on the support of a frequent itemset. In the randomly generated dataset, the support of {a 1 } and {a 10, a 20 } are listed as follows. support of {a 1 } = , support of {a 10, a 20 } = (10) As we can see, these two itemsets have support much higher than other 1-itemsets and 2-itemsets, respectively. The left part of Figure 3 shows the original dataset. In the original dataset, the total number of appearance of all items is 4929 times including 1217 times of a 1, 197 times of a 10, 197 times of a 20, and 3318 times of the other items.

19 19 In our simulation, we set the parameters as µ = 0.85 (in Protocol 1), ρ t = 0.8 (in Algorithm 5), and µ m = 0 (in Algorithm 6). The data miner updates V k when every 10 transactions are received. The truncation level k calculated from Protocol 1 is listed as follows. The right part of Fig. 3 shows the TABLE II TRUNCATION LEVEL k k Transaction No dataset after perturbation. In the perturbed dataset, the total number of appearance of all items is 1646 times including 1217 times of a 1, 197 times of a 10, 197 times of a 20, and 35 times of the other items. The association rule mining results are stated as follows. {a 1 }, with support and {a 10, a 20 }, with support (11) As we can see, the support of interesting itemsets are perfectly recovered with the degree of accuracy γ = 0. The privacy is well preserved with the level of privacy δ = 1.05%. Fig. 4. comparison of between original and perturbed transactions

20 20 VI. EXPERIMENTAL RESULTS ON A REAL DATASET In this section, we make a comparison between our approach and the cut-and-paste randomization approach [10] by experimental results on a real dataset. We use a real world dataset named BMS Webview 1 [20], which contains web click stream data of several months from the e-commerce website of a leg-care company. The dataset contains 59, 602 transactions and 497 items. We randomly choose 10, 871 transactions from the dataset as our test band. The maximum transaction size is 181. The average transaction size is There are 325 transactions (2.74%) that contains 10 or more items. We set the support threshold min supp to be 0.2%. There are 798 frequent itemsets including itemset, itemsets, itemsets, 37 4-itemsets and two 5-itemsets. As a compromise between privacy and accuracy, the cutoff parameter K m of cut-and-paste randomization operator is set to 7. The truncation level k of our approach is set to 6. Since both our approach and the cut-and-paste operator use the same method to add random noise, we compare the results before noise is added. Thus we set ρ m = 0 for both our approach and the cut-and-paste randomization operator Our perturbation algorithm Previous approach: cut and paste degree of accuracy (%) ρ t Fig. 5. accuracy The solid line in Figure 4 shows the change of degree of accuracy (max{ρ 1, ρ 2 }) of our approach with the parameter ρ t. The dotted line shows the degree of accuracy when the cut-and-paste randomization operator is used. As we can see, our approach always identifies association rules more accurately than

21 21 the cut-and-paste approach Our perturbation algorithm Previous approach: cut and paste 0.6 Degree of Privacy ρ t Fig. 6. privacy The level of privacy in the same setting is presented in Fig. 5. As we can see, our approach preserves privacy better than the cut-and-paste operator when ρ t > 0.1. Thus, our approach is better on both privacy and accuracy issues when 0.1 ρ t 1. From the figures, we observe a recommendation for the data providers as that ρ t [0.7, 0.8] is suitable for hard-core privacy fundamentalists and ρ t [0.2, 0.3] is recommended to privacy marginally concerned people. In particular, we analyze all 2-itemsets in the real dataset to show that our scheme tends to enlarge the support of frequent itemsets and reduce the support of infrequent itemsets. The result is shown in Fig. 6. The x-axis is the support of 2-itemsets in the original dataset. The y-axis is the support of itemsets in the perturbed dataset. The figure intends to show how effectively our system blocks the unwanted items from being divulged. If a system preserves privacy perfectly, we should have y equal to zero when x is less than min supp. The data in Fig. 6 shows that almost all 2-itemsets with support less than 0.2% (233, 368 unwanted 2-itemsets) have been blocked. That is, the privacy has been successfully protected. Meanwhile, the supports of frequent 2-itemsets are enlarged. This should help the data miner to identify frequent itemsets from additional noises.

22 22 Support in Perturbed Transactions k = 6 out of 497, ρ t = 0.4 Support in R(T) ( 10 4 ) Support in T ( 10 4 ) 233,368 2 itemsets itemsets Support in Original Transactions Fig. 7. comparison of 2-itemsets supports between original and perturbed transactions VII. IMPLEMENTATION A prototypical system for privacy preserving mining of association rules has been implemented using our new scheme. The system is designed on web browsers and servers for the application of online surveys. Visitors taking surveys through web browsers are considered to be data providers. The web server conducting surveys is considered to be the data miner. We implement the data perturbation algorithm as custom code on the web browsers. The PG (perturbation guidance) part of the data miner is implemented as custom code on the web server. All custom codes are component-based plug-ins that one can easily install to existing systems. The components required for building the system is shown in Figure 7. Fig. 8. system implementation The infrastructure of our scheme is shown in Figure 8. There are three separate layers in our system: user interface layer, perturbation layer, and web layer. The top layer, named user interface layer, provides

23 23 interface to data providers and the data miner. The middle layer, named perturbation layer, realizes our privacy preserving scheme and exploits the bottom layer to transfer information. The bottom layer, named web layer, consists of web servers and web browsers. As an important feature of our system, the details of data perturbation on the middle layer are transparent to both data providers and the data miner. The Fig. 9. system infrastructure overhead of our implementation is substantially smaller than the randomization approach in the context of online survey. The time-consuming part of the cut-and-paste randomization approach, which is the support recovery procedure, occurs in the association rule mining process. The support recovery algorithm needs to compute the partial support of all candidate itemsets for each transaction size, which will result in a significant overhead. In our system, the only overhead (possibly) incurred on the data miner is to update the perturbation guidance V k. Many SVD updating algorithms have been proposed including SVD-updating, folding-in and recomputing the SVD [17], [18]. Since T is usually a sparse matrix, the complexity of updating SVD can be considerably reduced to O(n). Besides, this overhead is not on the critical time path of the mining process. It occurs during data collection instead of data mining process. Note that the transmitted perturbation guidance V k is of length k n. Since k is always a small number (e.g., k 10), the communication overhead incurred by our introduction of two-way communication is not significant. VIII. FINAL REMARKS In this paper, we propose a new scheme on privacy preserving mining of association rules. Compared with previous approaches, we introduce a two-way communication mechanism between the data miner

24 24 and data providers with little overhead. In particular, we let the data miner send a perturbation guidance to the data providers. Using this intelligence, the data providers distort the data transactions to be transmitted to the miner. As a result, our scheme identifies association rules more precisely than previous approaches and at the same time protects privacy more effectively. Our work is preliminary and many extensions can be made. In addition to using a similar approach in data classification [21], we are currently investigating how to apply the approach to clustering problems. We would like to investigate a new behavior model that is stronger than the honest-but-curious model, and can be dealt with by our scheme. REFERENCES [1] J. Han and M. Kamber, Data Mining Concepts and Techniques. Morgan Kaufmann, [2] R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases, in Proc. ACM SIGMOD Int. Conf. on Management of Data, 1993, pp [3] R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, in Proc. Int. Conf. on Very Large Data Bases, 1994, pp [4] J. S. Park, M. S. Chen, and P. S. Yu, An effective hash-based algorithm for mining association rules, in Proc. ACM SIGMOD Int. Conf. on Management of Data, 1995, pp [5] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman, Computing Iceberg queries efficiently, in Proc. Int. Conf. on Very Large Data Bases, 1998, pp [6] J. Vaidya and C. Clifton, Privacy preserving association rule mining in vertically partitioned data, in Proc. ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, 2002, pp [7] M. Kantarcioglu and C. Clifton, Privacy-preserving distributed mining of association rules on horizontally partitioned data, in Proc. ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 2002, pp [8] Y. Lindell and B. Pinkas, Privacy preserving data mining, Advances in Cryptology, vol. 1880, pp , [9] O. Goldreich, Secure Multi-Party Computation. Cambridge Univeristy Press, [10] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, Privacy preserving mining of association rules, in Proc. ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2002, pp [11] S. J. Rizvi and J. R. Haritsa, Maintaining data privacy in association rule mining, in Proc. Int. Conf. on Very Large Data Bases, 2002, pp [12] J. Hagel and M. Singer, Net Worth. Harvard Business School Press, 1999.

25 25 [13] L. F. Cranor, J. Reagle, and M. S. Ackerman, Beyond concern: Understanding net users attitudes about online privacy, AT&T Labs-Research, Tech. Rep. TR , [14] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, On the privacy preserving properties of random data perturbation techniques, in Proceedings of the 3rd IEEE International Conference on Data Mining. IEEE Press, 2003, pp [15] S. Agrawal, V. Krishnan, and J. R. Haritsa, On addressing efficiency concerns in privacy-preserving mining, in Proceedings of the 9th International Conference on Database Systems for Advanced Applications. Springer Verlag, 2004, pp [16] G. H. Golub and C. F. V. Loan, Matrix Computations. Johns Hopkins University Press, [17] J. R. Bunch and C. P. Nielsen, Updating the singular value decomposition, Numerische Mathematik, vol. 31, pp , [18] M. Gu and S. C. Eisenstat, A stable and fast algorithm for updating the singular value decomposition, Yale University, Tech. Rep. YALEU/DCS/RR-966, [19] I. T. Jolliffe, Principle Component Analysis. Springer Verlag, [20] Z. Zheng, R. Kohavi, and L. Mason, Real world performance of association rule algorithms, in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2001, pp [21] N. Zhang, S. Wang, and W. Zhao, On a new scheme on privacy preserving data classification, Texas A&M University, Tech. Rep. TAMU/CS , [22] B. Nobel and J. W. Daniel, Applied linear algebra. Prentice-Hall, 1988.

26 26 APPENDIX I APPENDIX: JUSTIFICATION OF V k The main part of a privacy preserving association rule mining system is the data perturbation mechanism employed by the data providers. In current techniques, randomization operator is used to perturb original transactions. As we described in Section II, an item-invariant randomization operator leaves the correlation between different items unconsidered. We investigate the item correlation to improve accuracy of mining results. Specifically, our data perturbation algorithm preserves the support of 2-itemsets. The intuition behind our approach can be stated as follows. The PG part of the data miner maintains an estimation of the support of all 1-itemsets and 2-itemsets. Based on the estimation, PG tells the data providers which itemsets are more likely to be frequent itemsets. We know that no superset of an infrequent itemset is frequent (anti-monotone). A data provider may safely remove all infrequent 1-itemsets and 2-itemsets because they cannot appear in any frequent itemset 3 or more items either. Readers may raise a question that why we choose to maintain the support of 1-itemsets and 2-itemsets instead of 1, 2, and 3-itemsets or 1-itemsets only? For the first question, the main reason is that the large number of 3-itemsets (n 3 ) could impose serious overhead on the system. Besides, as we can see in Appendix III, our algorithm also preserves the support of itemsets with 3 or more items. On the other hand, if we choose to maintain the support of 1-itemsets only, we cannot remove many items from original transactions to protect privacy. Roughly speaking, if {a i, a j }, {a j, a k }, and {a i, a k } are all frequent, we have a fairly high probability that {a i, a j, a k } is also frequent. However, if both a i and a j are frequent, the probability that {a i, a j } is frequent is much smaller. Thus, in order to guarantee that all frequent 2-itemsets can be preserved, we cannot remove many items from the original transactions. Now we will show that our data perturbation algorithm successfully preserves the support of 2-itemsets. Here we consider the case when V k is the k-truncation of the accurate eigenvectors of T T. In reality, the value of V k has to be estimated from the current copy of T. Fortunately, the value of V k converges well

27 27 to its accurate value. Besides, the convergence is fairly fast in most cases. The proof of convergence is presented in Appendix V. Consider A = T T a 1 a 1 a 1 a n A = T T =. a i a j. a n a 1 a n a n (I.12) where a i a j is the dot product of a i and a j. Note that a i a j = m a i h a j h (I.13) h=1 a i a j m = support of {a i, a j }. (I.14) Thus, we may preserve the support of 2-itemsets by a precise approximation of A. In particular, we use truncated eigenvectors of A to perturb the original transactions T into T. Given transaction matrix T, define the perturbed transaction T = T V k V k. T = [t 1, t 2,... t m ] V k = [v 1, v 2,..., v k ] T = T V k V k = [t 1V k V k, t 2V k V k,..., t mv k V k ]. (I.15) (I.16) (I.17) With some algebriac manupulation we have Ã = T T = Vk V k T T V k V k = V k V k V (Σ Σ)V V k V k = V k Σ ΣV k (I.18) (I.19) (I.20) = σ 2 1 v 1v σ2 k v kv k (I.21) Thus, Ã is the optimal rank-k approximation of A [16]. In other words, we preserve the support of 2-itemsets while cutting off its eigenvectors.

28 28 Our data perturbation algorithm also preserves the privacy of data providers. The data miner cannot deduce the original t from t because V k V k is a singular matrix with det(v kv k ) = 0, which means that V k V k does not have an inverse matrix. APPENDIX II APPENDIX: EXAMPLE OF TRANSACTION PERTUBATION As we have shown in Appendix I, we can reach a precise approximation of A = T T by perturbing t to t = tv k V k. Besides this truncated eigenvectors perturbation, we need a function to map the values of t [0, 1] to integer values R(t) {0, 1}. Although the truncation of eigenvectors has already inserted some noise items into the transactions, we still need to add more noise items. These tasks are done by the Alg. 5 and Alg. 6. Here we use an example to illustrate the algorithms involved in the perturbation process. Example 2.1: A data provider C i holds a transaction t = [1, 1, 0, 1, 1, 1, 0, 0]. After negotiating with the data miner, C i receives V k (k = 2) from the data miner as follows. V k = (II.22) Using truncated eigenvectors V k as perturbation guidance, C i first calculates t by t = tv k V k. Now the perturbed transaction t is t = [0.54, 0.75, 0.67, 0.48, 0.56, 0.76, 0.63, 0.46]. (II.23) The step of mapping t to integer values can be stated as: 1) As stated in Alg. 5, C i rounds off t to integer. Let ρ t be 0.3. C i transforms t to R(t) = [1, 0, 0, 1, 0, 0, 0, 0] = {a 1, a 4 }; (II.24) 2) After that, as stated in Alg. 6, C i inserts some items into R(t) as random noise. ρ m is the probability of a false item to be placed into R(t). Now R(t) may become R(t) = [1, 0, 0, 1, 0, 0, 1, 0] = {a 1, a 4, a 7 }. (II.25)

29 29 TABLE III MAPPING FUNCTION t 1 i t 2 j t 1 i t 2 j R(t 1) i R(t 2) j 0 0 (1 ρ t) (1 ρ t) (1 ρ t) (1 ρ t) As shown in the above example, there are three parameters, k, ρ t, and ρ m, in the perturbation process. They all supply data providers controls to preserve privacy. The first parameter, k, is determined by the negotiation between the data provider and the data miner. The other parameters are determined by the data provider due to its sensitivity of privacy. APPENDIX III APPENDIX: PROOF OF THEOREM 4.3 We first consider the error on support of 2-itemsets. Then we will generalize the result to itemsets with other sizes. First, we consider the error introduced by the truncated eigenvectors perturbation (i.e., t = tv k V k ). As shown in (I.21) in Appendix I, Ã = T T is a rank-k approximation of A = T T. Since no entry of a matrix can exceeds the 2-norm of the matrix [22], we have max Ã A ij Ã A 2 = σk+1 2 i,j (III.26) For 1-itemsets, assume that the error on support of itemset {a i } introduced by the perturbation is ɛ T i. We have ɛ T i = Ã A ii σ 2 k+1 /m. For 2-itemsets, assume that the error on support of itemset {a i, a j } introduced by the perturbation is ɛ T ij. We have ɛt ij = Ã A ij σ 2 k+1 /m. Second, we consider the error introduced by the rounding off operation in Alg. 5. Let ɛ M ij be the error on support of {a i, a j } introduced by the rounding off operation. As we may can from Tab. III, for every