AN IMPROVED PRIVACY PRESERVING ALGORITHM USING ASSOCIATION RULE MINING (27-32)

Ravindra Kumar Tiwari
Ph.D Scholar, Computer Sc.
AISECT University, Bhopal

Abstract - The recent advancement in data mining technology for analyzing vast amounts of data has played an important role in several areas of business processing. Data mining also opens new threats to privacy and information security if it is not done or used properly. The main problem is that, from non-sensitive data, one is able to infer sensitive information, including personal information, facts, or even patterns generated by any data mining algorithm. Focusing on privacy-preserving association rule mining, this paper presents a simple solution to the privacy problem. The approach is to survey the different aspects discussed in several research papers and, after analyzing those papers, to derive a new solution that is best in efficiency and performance. Before the algorithms are analyzed, the data structure of the database and the set of sensitive association rules are analyzed in order to build a more effective model.

Keywords - Data Mining, Association Rule Mining, Privacy Preserving

1. INTRODUCTION

Data mining services alone are not sufficient. Data mining services play an important role in the communication industry. The recent advancement in data mining technology for analyzing vast amounts of data has played an important role in several areas of business processing. Data mining also opens new threats to privacy and information security if it is not done or used properly. The main problem is how to hide sensitive information, including personal information and even patterns generated by any data mining algorithm. This paper therefore focuses on privacy-preserving association rule mining.

In sequential pattern mining, the statistical significance of a pattern (called its support) was measured as the percentage of data sequences that contain the pattern.
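The support definition above can be made concrete with a small sketch. It assumes that a data sequence is an ordered list of itemsets and that a pattern is contained in a sequence when its elements can be matched, in order, to elements of the sequence that include them. The function names `contains` and `sequence_support` are hypothetical and chosen only for illustration.

```python
from typing import List, Set

Sequence = List[Set[str]]  # assumption: a data sequence is an ordered list of itemsets


def contains(sequence: Sequence, pattern: Sequence) -> bool:
    """Return True if every pattern element is a subset of some sequence
    element, with the matched sequence elements appearing in order."""
    pos = 0
    for element in pattern:
        # Advance to the next sequence element that includes this pattern element.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a strictly later position
    return True


def sequence_support(database: List[Sequence], pattern: Sequence) -> float:
    """Support as the percentage of data sequences containing the pattern."""
    if not database:
        return 0.0
    hits = sum(1 for seq in database if contains(seq, pattern))
    return 100.0 * hits / len(database)


# Illustrative data only (not taken from the paper).
database = [
    [{"a"}, {"b"}, {"c"}],
    [{"a", "b"}, {"c"}],
    [{"b"}, {"a"}],
    [{"a"}, {"c"}, {"b"}],
]
support_ab = sequence_support(database, [{"a"}, {"b"}])
```

In this toy database, only the first and last sequences contain "a" followed later by "b", so the pattern is supported by half of the sequences.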
The problem was then generalized by adding a taxonomy (is-a hierarchy) on items and time constraints such as the minimum and maximum gap between adjacent elements of a pattern. Discovered patterns (called episodes) could have different types of ordering: full (serial episodes), none (parallel episodes), or partial, and had to appear within a user-defined time window. The episodes were mined over a single event sequence, and their statistical significance was measured as the percentage of windows containing the episode (frequency) or as a number of occurrences. Efficient algorithms were presented for serial and parallel episodes. The model was further extended to handle events described by a set of attributes. Episodes mined in sequences of such events were built from a set of unary and binary predicates on event attributes. To make the discovery of such complex episodes feasible, it was assumed that a user has to specify a class of interesting patterns by providing a template. A language capable of specifying episodes of interest based on logical predicates was also presented, and a few further extensions to the model were added.

1.1 Hiding Purposes

Privacy preserving data mining (PPDM) algorithms [4] are classified into two types according to the purpose of hiding: data hiding and rule hiding. Data hiding refers to the cases where sensitive data from the original database, such as identity, name, and address, that can be linked directly or indirectly to an individual person are hidden. In rule hiding, in contrast, the sensitive knowledge (rules) derived from the original database by applying data mining is hidden. The majority of PPDM algorithms use data hiding techniques, and most of them hide sensitive patterns by modifying data.

Currently, PPDM algorithms are mainly applied to the tasks of classification, association rule mining, and clustering. Association analysis involves the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data.
Classification is the process of finding a set of models that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. Clustering analysis concerns the problem of decomposing or partitioning a (usually multivariate) data set into groups so that the points in one group are similar to each other and as different as possible from the points in other groups.

1.2 Goal of Privacy Preservation

The goal of privacy preservation [5] is to mine the raw data without leaking private information. Current technology mainly approaches this from two aspects:

1) The sensitive raw data in the database, such as names, certificate numbers, addresses, and hobbies, can be modified or removed to avoid leaking personal private information. That is to say, correct results can be obtained by data mining algorithms without accessing private data.
2) Sensitive rules contained in data mining results can be eliminated through rule hiding algorithms. That is, potentially sensitive rules are protected during the mining process so that they cannot be obtained by a party with ill intent who would use them for malicious reasoning.

Vol. 1(1), January 2014 (ISSN: ) Page

1.3 Privacy Preservation Techniques

Several privacy-preserving techniques [13] for association rule mining have been proposed in the past few years. Some proposals and algorithms have been developed for centralized data, while others address a distributed data scenario. Distributed data scenarios can be further classified into horizontal and vertical data distribution. The purpose of privacy preservation [13] is to discover accurate patterns without precise access to the original data.

An association rule mining algorithm mines association rules based on a given minimal support and minimal confidence. Therefore, the most direct method of hiding an association rule is to reduce its support or confidence below the minimal support or minimal confidence. Many implementations [2] of data and knowledge confidentiality have been applied in the association rule mining process. According to the privacy protection technology used, privacy-preserving association rule mining algorithms can commonly be divided into three categories:

i) Heuristic-based techniques
ii) Reconstruction-based techniques
iii) Cryptography-based techniques

Heuristic-based techniques are used for centralized data sets, while cryptography-based techniques are designed to protect privacy in a distributed data set by using encryption.

Heuristic-based techniques [2] address how to select the appropriate data sets for data modification. Since optimal selective data modification, or sanitization, is an NP-hard problem, heuristics can be used to address the complexity issues. The methods of heuristic-based modification include perturbation, which is accomplished by replacing an attribute value with a new value (i.e., changing a 1-value to a 0-value, or adding noise), and blocking, which is the replacement of an existing attribute value with a "?".
The basic principle for choosing the transaction, or the item of an itemset, to be modified is to reduce the impact on the original database as far as possible.

2. MOTIVATION

Successful applications of data mining techniques have been demonstrated in many areas that benefit commercial, social, and human activities. Along with their success, these techniques pose a threat to privacy: one can easily disclose others' sensitive information or knowledge by using them. So, before a database is released, sensitive information or knowledge must be hidden from unauthorized access. To solve this privacy problem, PPDM has become a hot topic in the data mining and database security fields.

Focusing on privacy-preserving association rule mining, this paper presents a simple solution to the privacy problem. To overcome these problems, an Improved Privacy Preserving Algorithm Using Association Rule Mining is proposed, which is based on the random perturbation technique and gives the best result in terms of efficiency and performance. The proposed algorithm is a good way to apply data mining techniques with security, hiding logical instances from others.

Data mining is an interactive and iterative process. A user formulates a data mining task as a KDD query in a high-level language. The query is sent to the Knowledge Discovery Management System, which retrieves the data from the database, chooses the right data mining algorithm, and returns the result to the user in the form of frequent patterns, association rules, and pruning results. The system should provide a mechanism for storing discovered knowledge in a database for further selective analyses. An SQL-like language has been proposed for specifying all tasks concerning the discovery of frequent patterns, association rules, and the pruning of resulting databases. The language is MineSQL, an extension of SQL proposed to handle association rule queries.
This approach seems reasonable because association rules and sequential patterns are very often mined in the same datasets. MineSQL is designed as a query language for advanced users, but it can also serve as an Application Programming Interface (API) for building business applications that deal with knowledge discovery. MineSQL provides mechanisms for storing patterns in relational tables by offering new complex data types, and it allows a user to specify various constraints defining the requested class of patterns. Current algorithms either do not handle item constraints at all or require overly detailed information on the structure of patterns. In this dissertation, an algorithm using item constraints in the mining process will be presented. Special emphasis will be laid on the fact that the source data is likely to be stored in relational tables.

3. PRIVACY PROTECTION TECHNIQUE

Various privacy protection techniques [7] apply to centralized data, such as the reconstruction technique, random response technique, random perturbation technique, heuristic technology, and isometric transformation technology. Other privacy protection techniques apply to distributed data, such as the switching encryption technique and secure multiparty computation. Among them, the random perturbation technique converts the raw data randomly according to a set probability, which is a great advantage in privacy-preserving data mining.

3.1 Data Distribution

PPDM algorithms [13] can first be divided into two major categories, centralized and distributed, based on the distribution of data. In a centralized database environment, all data are stored in a single database, while in a distributed database environment, data are stored in different databases. Distributed data scenarios can be further classified into horizontal and vertical data distributions. In a horizontal distribution, different records with the same data attributes reside in different places. In a vertical distribution, different attributes of the same record reside in different places.

Earlier research has predominantly focused on privacy preservation in a centralized database. The difficulties of applying PPDM algorithms to a distributed database can be attributed to two factors: first, the data owners have privacy concerns, so they may not be willing to release their own data to others; second, even if they are willing to share data, the communication cost between the sites is too high.

3.2 Randomization method

The randomization method [6] provides an effective yet simple way of preventing the user from learning sensitive data. It can easily be implemented at the data collection phase of privacy-preserving data mining, because the noise added to a given record is independent of the behaviour of other data records.
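A minimal sketch of the randomization idea follows, using a randomized-response scheme on a single binary attribute as the assumed noise model (the paper does not specify one). Each data provider keeps the true bit with probability `keep_prob` and flips it otherwise, independently of all other records. The data receiver then estimates the original proportion of 1-values from the randomized data alone. The names `randomize` and `reconstruct_proportion` are hypothetical.

```python
import random
from typing import List


def randomize(bits: List[int], keep_prob: float, rng: random.Random) -> List[int]:
    """Provider side: report the true bit with probability keep_prob,
    otherwise report the flipped bit. Each record is treated independently."""
    return [b if rng.random() < keep_prob else 1 - b for b in bits]


def reconstruct_proportion(observed_fraction: float, keep_prob: float) -> float:
    """Receiver side: invert
        observed = p * keep_prob + (1 - p) * (1 - keep_prob)
    to estimate the original proportion p of 1-values."""
    if keep_prob == 0.5:
        raise ValueError("keep_prob = 0.5 destroys all information")
    return (observed_fraction - (1.0 - keep_prob)) / (2.0 * keep_prob - 1.0)


# Illustrative data only: 30% of 20000 records carry the sensitive value 1.
rng = random.Random(42)
true_bits = [1] * 6000 + [0] * 14000
reported = randomize(true_bits, keep_prob=0.8, rng=rng)
observed = sum(reported) / len(reported)
estimate = reconstruct_proportion(observed, keep_prob=0.8)
```

The receiver never sees `true_bits`. Only `reported` and the publicly known `keep_prob` are needed to recover an estimate of the original distribution, and the estimate becomes more accurate as the number of records grows.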
When the randomization method is carried out, the data collection process consists of two steps. In the first step, the data providers randomize their data and transmit the randomized data to the data receiver. In the second step, the data receiver estimates the original distribution of the data by employing a distribution reconstruction algorithm. The model of randomization is shown in Figure 3.2.

Figure 3.2: The Model of Randomization

3.3 Random Perturbation Technique

This method [7] can deal with discrete data of character, Boolean, and numeric types. To facilitate the conversion of data sets, it is necessary to preprocess the original data set. The data preprocessing is divided into three parts: discretizing the data, attribute coding, and encoding the data set.

\[ \frac{A(\max) - A(\min)}{n} = \text{length} \]

Here A is a continuous attribute, n is the number of discrete intervals, and length is the length of each discrete interval. When the interval length is a decimal, it is rounded to the nearest integer. The first discrete interval begins at A(min), and the last interval ends at A(max). In this paper, numeric attributes are treated as continuous attributes. Taking Table I as an example, the continuous attributes are age, resting blood pressure, and maximum heart rate.

TABLE I CARDIOLOGY DATA SET

Age   Sex      Blood pressure   ECG      Maximum heart rate   Result
39    Male     128              Hyp      130                  Healthy
60    Male     135              Hyp      170                  Sick
58    Female   137              Hyp      147                  Healthy
45    Female   142              Normal   163                  Sick
62    Male     140              Normal   151                  Sick
70    Male     146              Normal   148                  Healthy

When n is 5, the discrete data sets are shown in Table II.

Table II DISCRETE DATA SET (columns: Age, Sex, Blood pressure, ECG, Maximum heart rate, Result)

Attribute coding finds the different values in each attribute domain by querying the discrete data set, and then uses natural numbers to encode these attribute values to generate the attribute coding sheet (as shown in Tables III and IV).

Table III ATTRIBUTE DOMAIN CODE
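The preprocessing steps can be sketched on the Table I values as follows. This is an illustrative reading of the description rather than the paper's exact procedure: it assumes intervals are numbered from 0, that values beyond the last boundary fall into the last interval (so that it ends at A(max)), and that codes are assigned as natural numbers in sorted order. The names `discretize` and `attribute_codes` are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def discretize(values: Sequence[int], n: int) -> List[int]:
    """Map each value of a continuous attribute to an interval index 0..n-1.

    length = (A(max) - A(min)) / n, rounded to the nearest integer.
    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    lo, hi = min(values), max(values)
    length = max(1, round((hi - lo) / n))
    # The last interval absorbs everything up to A(max).
    return [min(int((v - lo) // length), n - 1) for v in values]


def attribute_codes(values: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Assign natural-number codes (1, 2, ...) to the distinct values."""
    return {v: code for code, v in enumerate(sorted(set(values)), start=1)}


# Columns taken from Table I.
ages = [39, 60, 58, 45, 62, 70]
sexes = ["Male", "Male", "Female", "Female", "Male", "Male"]
pressures = [128, 135, 137, 142, 140, 146]

age_bins = discretize(ages, n=5)            # interval index per record
pressure_bins = discretize(pressures, n=5)

sex_code_table = attribute_codes(sexes)     # the attribute coding sheet for Sex
age_code_table = attribute_codes(age_bins)

# Encoding the data set: replace each value with its code.
encoded_sex = [sex_code_table[s] for s in sexes]
encoded_age = [age_code_table[b] for b in age_bins]
```

For age, A(min) = 39 and A(max) = 70 give a length of 6.2, which rounds to 6, so with n = 5 the intervals start at 39, 45, 51, 57, and 63. The specific codes produced here depend on the sorted-order assumption and need not match those in the paper's tables.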

Table IV ATTRIBUTE DOMAIN CODE (columns: Age, Sex, Blood pressure, ECG, Maximum heart rate, and Result, each paired with its coding; the listed attribute values include Female, Male, Hyp, Normal, Healthy, and Sick)

Encoding the data set replaces the attribute values of the discrete data set with the corresponding codes according to the attribute coding table, forming the encoded data set (as shown in Table V).

Table V PERTURBATION DATA SET (columns: Age, Sex, Blood pressure, ECG, Maximum heart rate, Result)

The Apriori algorithm is a two-step process.

Step 1: To find L_k, a set of candidate k-itemsets is generated by joining L_(k-1) with itself. This set of candidates is denoted C_k.

Step 2 (Prune Step): C_k is a superset of L_k; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in C_k. A scan of the database to determine the count of each candidate in C_k results in the determination of L_k (i.e., all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to L_k).

4. PROPOSED WORK

This paper proposes an algorithm named Improved Privacy Preserving Mining (IPPM). The proposed algorithm is a good way to apply data mining techniques with security, hiding our logical instances from others. The entire system architecture consists of five phases:

1) Check for authentication.
2) Reading.
3) Association rule mining.
4) Encoding and decoding the data by using the random perturbation technique.
5) Perform pruning.

Data mining techniques [4] are used to discover user behavior patterns using several algorithms. Data mining can find interesting, valuable patterns or relationships describing the data, and can predict or classify behavior with a model based on the available data.
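The two-step Apriori process described in Section 3.3 can be sketched as follows. This is a compact textbook-style illustration, not the IPPM implementation. In addition to the database scan mentioned in the text, the prune step here uses the standard Apriori property that every (k-1)-subset of a frequent k-itemset must itself be frequent. The function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set

Itemset = FrozenSet[str]


def count(transactions: List[Itemset], itemset: Itemset) -> int:
    """Number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def join(prev_frequent: Set[Itemset], k: int) -> Set[Itemset]:
    """Step 1: build the candidates C_k by joining L_(k-1) with itself."""
    return {a | b for a in prev_frequent for b in prev_frequent
            if len(a | b) == k}


def prune(candidates: Set[Itemset], prev_frequent: Set[Itemset],
          transactions: List[Itemset], min_count: int) -> Set[Itemset]:
    """Step 2: drop candidates that have an infrequent (k-1)-subset, then
    scan the database and keep those meeting the minimum support count."""
    kept = set()
    for cand in candidates:
        subsets_ok = all(frozenset(s) in prev_frequent
                         for s in combinations(cand, len(cand) - 1))
        if subsets_ok and count(transactions, cand) >= min_count:
            kept.add(cand)
    return kept


def apriori(raw: Iterable[Iterable[str]], min_count: int) -> Dict[Itemset, int]:
    """Return every frequent itemset together with its support count."""
    transactions = [frozenset(t) for t in raw]
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if count(transactions, frozenset([i])) >= min_count}
    frequent: Dict[Itemset, int] = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = count(transactions, itemset)
        k += 1
        level = prune(join(level, k), level, transactions, min_count)
    return frequent


# Illustrative data only (not taken from the paper).
demo = apriori([{"I1", "I2"}, {"I1", "I2", "I3"}, {"I2", "I3"}, {"I1"}],
               min_count=2)
```

The loop stops as soon as a level produces no frequent itemsets, since no larger itemset can then be frequent.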
Data mining uses automated tools that employ several methodologies and algorithms to discover mainly hidden patterns, associations, and frequent structures in large amounts of data stored in data warehouses or other information repositories, and to filter the necessary information out of these big datasets.

The telecommunications industry is a typical data-intensive industry in which competition is becoming increasingly fierce. Compared with other industries, the telecommunications industry holds more crucial personal user data, which can be analyzed accurately to obtain useful knowledge. To maintain and win the competition, companies should find more interactive business opportunities and provide users with better service in a shorter time. As a result, data warehousing and data mining have important value in the telecommunications industry. This paper proposes an efficient data mining algorithm named Improved Privacy Preserving Mining (IPPM).

4.1 Proposed Method: IPPM

Some terminology is important for understanding the new technique.

1) Frequent Pattern - A frequent pattern is an itemset that is used frequently by customers. For example, if item I1 is purchased by 10 customers and item I2 is purchased by 5 customers, then item I1 is the more frequently used. The owner should therefore concentrate on I1, because it is visited by a larger number of customers.
2) Minimum Support - For an item to be a frequent member, a minimum support count is chosen, which determines whether or not the item is in the list of frequent patterns. For example, if the minimum support is 2, then any item whose count (number of visiting customers) is greater than or equal to 2 is frequent and will be considered for pruning.
3) Data Pruning - The act of removing those itemsets that are not necessary is called data pruning.

4) Encryption/Decryption - Encryption and decryption are provided at four levels: transaction, frequent item, association rule, and pruning result.

Working Procedure

Our module is divided into two parts. We can log in as a normal user or as the Admin. If we enter as a normal user, the Improved Privacy Preserving Mining (IPPM) model can be subdivided into five phases:

1) Check for authentication.
2) Reading.
3) Association rule mining.
4) Encoding and decoding the data by using the random perturbation technique.
5) Perform pruning.

5. RESULT ANALYSIS

The result analysis is based on the IPPM and SPADE algorithms. The graphs show that the new method takes less time than older methods such as SPADE, so it is more efficient. The SPADE algorithm and the IPPM technique are compared on several aspects, such as memory and computation time. On large inputs, a program may take a very long time to complete its work and give a sign of life again, so it sometimes makes sense to be able to estimate the running time before starting a program. Obviously, the running time depends on the size of the input, for example the number n of strings to be sorted.

The key features of the SPADE (Sequential Pattern Discovery using Equivalence classes) algorithm for discovering the set of all frequent sequences are:

1. It uses a vertical id-list database format, in which each sequence is associated with a list of the objects in which it occurs, along with the time-stamps. All frequent sequences can be enumerated via simple temporal joins (or intersections) on id-lists.
2. It uses a lattice-theoretic approach to decompose the original search space (lattice) into smaller pieces (sublattices) that can be processed independently in main memory.
3. It usually requires three database scans, or only a single scan with some pre-processed information, thus minimizing the I/O costs in comparison with Generalized Sequential Pattern.

SPADE not only minimizes I/O costs by reducing database scans, but also minimizes computational costs by using efficient search schemes. The vertical id-list based approach is also insensitive to data skew. An extensive set of experiments shows that SPADE outperforms previous approaches by a factor of two, and by an order of magnitude if some additional off-line information is available. Furthermore, SPADE scales linearly in the database size and in a number of other database parameters. In SPADE, the main steps include the computation of the frequent 1-sequences and 2-sequences, the decomposition into prefix-based parent equivalence classes, and the enumeration of all other frequent sequences via BFS or DFS search within each class.

The proposed algorithm computes only the pre subsets, so that only one side of the subsets is included rather than the whole, and it does not perform candidate generation.

Time efficiency estimates depend on what is defined to be a step. For the analysis to correspond usefully to the actual execution time, the time required to perform a step must be guaranteed to be bounded above by a constant. One must be careful here; for instance, some analyses count an addition of two numbers as one step, and this assumption may not be warranted in certain contexts. The graphs show that the proposed method performs better than SPADE.

5.1 Memory Based graph

Figure 5.1 Memory Based graph (axes: Memory (MB) versus Min Support)

The figure above shows that the proposed algorithm IPPM takes less memory than the SPADE algorithm. At min support 1, IPPM requires <= 500 MB of memory for storing the frequent itemsets, while SPADE requires 1000 MB, because the proposed algorithm works on either a pre or a post basis while SPADE works on both pre and post.

5.2 Time Based graph

Figure 5.2 Time Based graph (axes: Time (ms) versus Min Support)

The figure above shows that the proposed algorithm IPPM takes less computation time than the SPADE algorithm. At min support 1, IPPM requires <= 500 milliseconds of computation time, while SPADE requires 1000 milliseconds, because the proposed algorithm works on either a pre or a post basis while SPADE works on both pre and post.

6. CONCLUSION

The recent advancement in data mining technology for analysing vast amounts of data has played an important role in several areas of business processing. Data mining also opens new threats to privacy and information security if it is not done or used properly. The main problem is that, from non-sensitive data, one is able to infer sensitive information, including personal information, facts, or even patterns generated by any data mining algorithm. Focusing on privacy-preserving association rule mining, a simple solution is presented that is best in terms of efficiency and performance, because the proposed algorithm takes just half the computation time and memory of the SPADE algorithm.

FUTURE WORK

In future work, simulation results can be included to show that the proposed method is better than other traditional methods. This limitation can be overcome by providing one additional key for security purposes at the time of accessing highly confidential data.

REFERENCES

[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, 20th International Conference on Very Large Data Bases.
[2] Vassilios S. Verykios, Elisa Bertino, et al., State-of-the-art in Privacy Preserving Data Mining, SIGMOD Record, Vol. 33, pp. 50-57, March.
[3] Alan F. Karr, Xiaodong Lin, Ashish P. Sanil and Jerome P. Reiter, Privacy-Preserving Analysis of Vertically Partitioned Data Using Secure Matrix Products, Journal of Official Statistics, Vol. 25.
[4] J. Han and M. Kamber, Data Mining: Concepts and Techniques.
[5] Yanguang Shen, Junrui Han and Hui Shao, Research on Privacy-Preserving Technology of Data Mining, IEEE, Second International Conference on Intelligent Computation Technology and Automation, Vol. 2.
[6] Pingshui Wang, Survey on Privacy Preserving Data Mining, International Journal of Digital Content Technology and its Applications, Vol. 4, No. 9, 2010.
[7] Brian C.S. Loh and Patrick H.H. Then, Ontology-Enhanced Interactive Anonymization in Domain-Driven Data Mining Outsourcing, IEEE, Second International Symposium on Data, Privacy, and E-Commerce, 2010.
[8] Chirag N. Modi, Udai Pratap Rao and Dhiren R. Patel, Maintaining privacy and data quality in privacy preserving association rule mining, IEEE, International Conference on Advances in Communication, Network, and Computing.
[9] Wang Yan, Le Jiajin and Huang Dongmei, A Method for Privacy Preserving Mining of Association Rules Based on Web Usage Mining, IEEE, International Conference on Web Information Systems and Mining, Vol. 1.