A Survey on Privacy Preserving Decision Tree Classifier

A Survey on Classifier Tejaswini Pawar*, Prof. Snehal Kamalapur ** * Department of Computer Enineerin, K.K.W.I.E.E&R, Nashik, ** Department of Computer Enineerin, K.K.W.I.E.E&R, Nashik, ABSTRACT In recent year s privacy preservation in minin has become an important issue. A new class of minin method called privacy preservin minin alorithm has been developed. The aim of these alorithms is protectin the sensitive information in while extractin knowlede from lare amount of. The extracted knowlede is enerally expressed in the form of cluster, decision tree or association rule allow one to mine the information. Several modification technique like randomization method, anonymization method, distributed privacy technique have been developed to incorporatin privacy mechanism and allow to hide sensitive pattern or itemset before minin process is executed. This paper mainly focuses on eneral classification technique decision tree classifier for preservin privacy. It presents a survey on decision tree learnin on various privacy techniques. Keywords - Data Minin, Classifiers, Perturbation, Data Minin, Splittin Criteria I. INTRODUCTION Data minin is a recently emerin field, connectin the three worlds of bases, statistics and artificial intellience. Data minin is the process of extractin knowlede or pattern from lare amount of. It is widely used by researchers for science and business process. Data collected from information providers are important for pattern reoranization and decision makin. The collection process takes time and efforts hence sample sets are sometime stored for reuse. However attacks are attempted to steal these sample sets and private information may be leaked from these stolen sets. Therefore privacy preservin minin are developed to convert sensitive sets into sanitized version in which private or sensitive information is hidden from unauthorized retrievers. preservin minin refers to the area of minin that seeks to safeuard sensitive information from unsanctioned or unsolicited disclosure. Preservation Data Minin [1] [2] was introducin to preserve the privacy durin minin process to enable conventional minin technique. Many privacy preservation approaches were developed to protect private information of sample set. The rest of this paper is structured as follows: the next section describes tree classifier and ives brief introduction about decision tree alorithm. Section 3 introduces different privacy preservation technique throuh decision tree approach. Section 4 provides an overall summary of this paper, and suests directions for further research on this topic. II. DECISION TREE CLASSIFIER A decision tree[3][4][5] is defined as a predictive modelin technique from the field of machine learnin and statistics that builds a simple tree-like structure to model the underlyin pattern of. tree is one of the popular methods is able to handle both cateorical and numerical and perform classification with less computation. trees are often easier to interpret. tree is a classifier which is a directed tree with a node havin no incomin edes called root. All the nodes except root have exactly one incomin ede. Each non-leaf node called internal node or splittin node contains a decision and most appropriate taret value assined to one class is represented by leaf node. tree classifier is able to break down a complex decision makin process into collection of simpler decision. The complex decision is subdivided into simpler decision on the basis of splittin criteria. It divides whole trainin set into smaller subsets. Information ain, ain ratio, ini index are three basic splittin criteria to select attribute as a splittin point. trees can be built from historical they are often used for explanatory analysis as well as a form of supervision learnin. The alorithm is desin in such a way that it works on all the that is available and as perfect as possible. Accordin to Breiman et al. [6] the tree complexity has a crucial effect on its accuracy performance. The tree complexity is explicitly controlled by the prunin method employed and the stoppin criteria used. Usually, the tree complexity is measured by one of the followin metrics: The total number of nodes; Total number of leaves; depth; 843 P a e

Number of attributes used. tree induction is closely related to rule induction. Each path from the root of a decision tree to one of its leaves can be transformed into a rule simply by conjoinin the tests alon the path to form the antecedent part, and takin the leaf s class prediction as the class value. The resultin rule set can then be simplified to improve its accuracy and comprehensibility to a human user [7]. Flowchart for tree based classification is shown in Fi.1 Start Sample set If all instance into same class Return the class Fi. 2 A framework for privacy preservin decision tree minin The splittin ceases when the number of instances to be split is below a certain threshold. C4.5 can handle numeric attributes. It perform error based prunin after rowin phase. It can use corrected ain ratio induce from a trainin set that incorporates missin values. III. PRIVACY PRESERVING DECISION TREE LEARNING This section explores the privacy preservation decision tree learnin techniques in which firstly is modified by usin different modification and perturbation-based approaches and then decision tree minin is applied to modified or sanitized set.fi.2 shows a framework for privacy preservin decision tree minin Select hihest information attribute Divide the set usin that attribute Fi.1. Flowchart for tree based classification End Hyafil and Rivest proved that ettin the optimal tree is NP-complete [8]. Most alorithms employ the reedy search and the divide-and-conquer approach to row a tree. In particular, the trainin set continues to be split in small. ID3 [4] and C4.5 [4] [9] adopt a reedy approach in which decision trees are constructed in top down recursive divide and conquer manner. ID3 was one of the first tree alorithms. It works on wide variety of problems in both academia and industry and has been modified improved and borrowed from many times over. ID3 picks splittin value and predicators on the basis of ain in information that the split or splits provide. Gain represents difference between the amount of information that is needed to correctly make a prediction both before and after the split has been made. Information ain is defined as the difference between the entropy of oriinal sement and accumulated entropies of the resultin split sement. C4.5 [9] is an extension of ID3, presented by the same author (Quinlan, 1993). It uses ain ratio as splittin criteria. A. - Minin Based on Random Substitutions: Inspired by the fact that the pioneerin privacypreservin decision tree minin method of [1] was flawed [11], Jim Dowd, Shouhuai Xu, and Weinin Zhan [10] explored a random substitution perturbation technique for privacy-preservin decision tree minin methods. This perturbation technique is based on a different privacy measure called ρ1-to-ρ2 privacy breachin [12] and a special type of perturbation matrix called the γ-diaonal matrix [13].They makes several contributions towards privacy-preservin decision tree minin. A novel error reduction technique is introduced for reconstruction, so that it not only prevents a critical problem caused by a lare perturbation matrix, but also uarantees a strictly better accuracy. In addition, the resultin privacy-preservin decision tree minin method has immune to the relevant recovery and repeated-perturbation attacks which were not accommodated in the model of [12]. It ensures that the decision trees learned from the perturbed are quite accurate compared with those learned from the oriinal. The parameters include: (1) assurance metric γ, (2) Dimension size N of perturbation matrix, which affects the accuracy of reconstructed as well as performance, and (3) Entropy of a perturbation matrix, which, to some extent, can encompass the effect of both γ and N. It is showed that one can achieve the desired trade-off between accuracy, privacy, and performance by selectin an appropriate N. Oriinal Dataset Perturbation Perturbed Dataset Reconstruction Reconstruct ed Dataset Minin Perturbation Matrix 844 P a e

B. A Noise Addition Scheme in for Data Minin [14]: Mohammad Ali Kadampur, Somayajulu D.V.L.N. propose a stratey that protects the privacy durin decision tree analysis of minin process. It is basically a noise addition framework specifically tailored toward classification task in minin. They propose to add specific noise to the numeric attributes after explorin the decision tree of the oriinal. The obfuscated then is presented to the second party for decision tree analysis. The decision tree obtained on the oriinal and the obfuscated are similar but by usin this method the proper is not revealed to the second party durin the minin process and hence the privacy will be preserved. The method also preserves averaes and few other statistical parameters thus makin the modified set useful for both minin and statistical purposes. This paper uses Quinlan's [15] C5.0 decision tree builder on the selected set [16] and obtain the decision tree of the oriinal set and a unique method of listin the nodes (attributes) that we touch in the path from the root of the tree to the leaf, then use a noise addition stratey for each of the attributes. The approach taken in this paper interates both cateorical and numeric types and focuses on privacy preservin durin decision tree analysis. The noise addition methods used are effective in preservin the privacy of the proper and producin prediction accuracies on par with the oriinal set. Noise addition technique is the ability to maintain ood quality and ensure individual privacy. This approach deals only with the classification task so this approach is addressin the issue of privacy preservin minin partially. More experiments are to be conducted on security level measurement and quality. C. - s over Vertically Partitioned Data: Jaideep Vaidya, Chris Clifton, Murat Kantarciolu and A. Scott Patterson [17] introduce a - s over Vertically Partitioned Data, eneralized privacy-preservin variant of the ID3 alorithm for vertically partitioned distributed over two or more parties. The alorithm as presented in this paper is workin proram is a sinificant step forward in creatin usable, distributed, privacy preservin -minin alorithms. This paper presents a new protocol to construct a decision tree on vertically partitioned with an arbitrary number of parties where only one party has the class attribute. It presents a eneral framework for constructin a system in which distributed classification would work. It serves to show that the methods can actually be built and are feasible. This work provides an upper bound on the complexity of buildin privacy preservin decision trees. Sinificant work is required to find a tiht upper bound on the complexity. D. Minin from Perturbed Data: In this paper, Li Liu, Murat Kantarciolu and Bhavani Thuraisinham [18] propose a new perturbation based technique to modify the minin alorithms. They build a classifier for the oriinal set from the perturbed trainin set. This paper proposed a modified C4.5 [9] decision tree classifier which is suitable for privacy preservin minin can be used to classify both the oriinal and the perturbed. This technique considers the splittin point of the attribute as well as the bias of the noise set as well. It calculates the bias whenever try to find the best attribute, the best split point and partition the trainin. This alorithm is based on the perturbation scheme, but skips the steps of reconstructin the oriinal distribution. The proposed technique has increased the privacy protection with less computation time. as a security issue in minin area is still a challene. E. Learnin Usin Unrealized Data Sets: Pui K. Fon and Jens H. Weber-Jahnke [19] introduce a new perturbation and randomization based approach that protects centralized sample sets utilized for decision tree minin. They introduced a new privacy preservin approach via set complementation. Dataset complementation confirms the utility of trainin sets for decision tree learnin. This approach converts the oriinal sample sets into a roup of unreal sets. The oriinal samples cannot be reconstructed without the entire roup of unreal sets. An accurate decision tree can be built directly from those unreal sets. This novel approach can be applied directly to the storae as soon as the first sample is collected and applied at any time durin the collection process. In order to mitiate the threat of their inadvertent disclosure or theft, privacy preservation is applied to sanitize the samples prior to their release to third parties. In contrast to other sanitization methods, this technique does not affect the accuracy of minin results. The decision tree can be built directly from the sanitized sets, no need to be reconstructin the oriinal set. The set reconstruction alorithm is eneric hence privacy preservation via set complementation fails if all trainin sets were leaked. This limitation need to be overcome by applyin technique like encryption. This technique requires extra storae for storin perturbed and complement of sample set. So optimizin the storae size of the unrealized samples needs to be explored. 845 P a e

IV. CONCLUSION In this paper, we present a decision tree classification technique. We have surveyed different approaches used in evaluatin the effectiveness of privacy preservin minin alorithms usin decision tree classifier. The work presents in this paper indicates the ever increasin interest of researchers in the area of securin sensitive and knowlede from malicious users. Many privacy preservin alorithms of decision tree minin are proposed by researchers however, privacy preservin technoloy needs to be further researched because of the complexity of the privacy problem. We conclude privacy preservin decision tree minin alorithms by analyzin the existin work in the table.1and make some remark. Future research need to be developed to work on these remarks. Sr. No 1 2 3 Technique Purpose Remark - Minin Based on Random Substituti ons A Noise Addition Scheme in for Data Minin - s over Vertically Partitione d Data Author presented a perturbation technique based on random substitutions & showed that the resultin privacypreservin decision tree minin method is immune to attacks that are seeminly relevant. Propose to add specific noise to the numeric attributes after explorin the decision tree of the oriinal. Proper is not revealed to the second party durin the minin process Tackled the problem of classification & introduced a eneralized privacy preservin variant of the ID3 alorithm for vertically partitioned distributed Extra storae required for storin perturbati on matrix, do not make strict tradeoffs between preservati on of sample utility and privacy It works on numerical only, need to work on quality and security level measurem ent. Required to find a tiht upper bound on the complexit y 4 5 Minin from Perturbed Data Learnin Usin Unrealize d Data Sets over two or more parties A new perturbation based technique proposed a modified C4.5 decision tree classifier New perturbation and randomization based approach via set complementation Need various ways to build classifiers which can be used to classify the perturbed set. Requires extra storae for storin perturbed and compleme nt of sample set and fails if all trainin were leaked sets REFERENCES [1] R. Arawal and R. Srikant. preservin minin. In ACM SIGMOD International Conference on Manaement of Data, paes 439 450. ACM, 2000. [2] Y. Lindell and B. Pinkas. preservin minin. In Advances in Cryptoloy, volume 1880 of Lecture Notes in Computer Science, paes 36 53. Spriner-Verla, 2000. [3] Lior Rokach and Oded Maimon Top- Down Induction of s Classifiers A Survey, IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS: PART C, VOL. 1, NO. 11, NOVEMBER 2002 [4] Jiawei Han, Micheline Kamber, Data Minin: Concepts and Techniques, Moran Kaufmann, 2001. [5] Lior Rokach, Oded Maimon Data Minin and Knowlede Discovery Handbook Second edition paes 167-192, Spriner Science + Business Media,2010 [6] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Reression s. Wadsworth Int. Group, 1984. [7] J.R. Quinlan, Simplifyin decision trees, International Journal of Man- Machine Studies, 27, 221-234, 1987. 846 P a e

[8] L. Hyafil and R. L. Rivest, "Constructin Optimal Binary s is {NP}- Complete," Inf. Process. Lett, vol. 5, pp. 15-17, 1976. [9] J. R. Quinlan. C4.5: Prorams for Machine Learnin. Moran Kaufmann, 1993. [10] J. Dowd, S. Xu, and W. Zhan, - Minin Based on Random Substitions, Proc. Int l Conf. Emerin Trends in Information and Comm. Security (ETRICS 06), pp. 145-159, 2006. [11] Hillol Karupta, Souptik Datta, Qi Wan, and Krishnamoorthy Sivakumar. On the privacy preservin properties of random perturbation techniques. In IEEE International Conference on Data Minin, 2003. [12] A. Evfimievski, J. Gehrke, and R. Srikant. Limitin privacy breachin in privacy preservin minin. In ACM Symposium on Principles of Database Systems, paes 211 222. ACM, 2003. [13] Shipra Arawal and Jayant R. Haritsa. A framework for hih-accuracy privacypreservin minin. In IEEE International Conference on Data Enineerin, 2005. [14] Mohammad Ali Kadampur, Somayajulu D.V.L.N., A Noise Addition Scheme in for Data Minin, Journal of Computin, Vol 2, Issue 1, January 2010, ISSN 2151-9617 [15] Lian Liu.,Jie Wan.,Jun Zhan., ``Wavelet based perturbation for simultaneous privacy preservin and statistics preservin.,'' In Proceedins of IEEE International Conference on Data Minin workshop., 2008. [16] Charu C Arwal.,Philip S Yu., preservin minin models and Alorithms., Spriner Science+Business media.,llc..2008 [17] J. Vaidya and C. Clifton. preservin decision trees over vertically partitioned. In Proceedins of the 19th Annual IFIP WG 11.3 Workin Conference on Data and Applications Security, Storrs,Connecticut, 2005. Spriner. L. Liu, M. Kantarciolu, and B. Thuraisinham, Minin from Perturbed Data, Proc. 42 nd Hawaii Int l Conf. System Sciences (HICSS 09), 2009. [18] Pui K. Fon and Jens H. Weber-Jahnke, Learnin Usin Unrealized Data Sets. IEEE Transl. on knowlede and enineerin, vol. 24, no. 2, February2012. 847 P a e