Optimization of C4.5 Decision Tree Algorithm for Data Mining Application
Gaurav L. Agrawal 1, Prof. Hitesh Gupta 2
1 PG Student, Department of CSE, PCST, Bhopal, India
2 Head of Department, CSE, PCST, Bhopal, India

Abstract-- Data mining has been successfully applied in many fields; the overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Data mining is mainly used for classification and prediction: classification is a form of data analysis that extracts models describing important data classes. C4.5 is one of the classic classification algorithms in data mining, but when it is used in mass calculations its efficiency is very low. In this paper we propose an improved C4.5 algorithm that uses L'Hospital's rule to simplify the calculation process and improve the efficiency of the decision-making algorithm. We aim to implement the algorithms in a space-effective manner, with the application's response time as the performance measure. Our system implements these algorithms and graphically compares their complexities and efficiencies.

Keywords-- C4.5 algorithm, data mining, decision tree, ID3 algorithm, L'Hospital's rule.

I. INTRODUCTION

Decision trees are built of nodes, branches and leaves that represent variables, conditions, and outcomes, respectively. The most predictive variable is placed at the top node of the tree. The operation of the decision trees considered here is based on the C4.5 algorithm, which makes the clusters at each node gradually purer by progressively reducing the disorder (impurity) in the original data set. Disorder and impurity can be measured by the well-established measures of entropy and information gain. One of the most significant advantages of decision trees is that the extracted knowledge can be represented in the form of classification (if-then) rules.
Each rule represents a unique path from the root to a leaf. In operations research, specifically in decision analysis, a decision tree (or tree diagram) is a decision support tool used to identify the strategy most likely to reach a goal. Another use of trees is as a descriptive means for calculating conditional probabilities. A decision tree is a flow-chart-like tree structure, where each branch represents an outcome of a test and each leaf node represents a class. The attribute with the highest information gain is chosen as the test attribute for the current node; this attribute minimizes the information needed to classify the samples. In this paper, we analyze several decision tree classification algorithms currently in use, including the ID3 [4] and C4.5 [2] algorithms as well as some of the improved algorithms [3] [5] [6] derived from them. When these classification algorithms are used in data processing, we find that their efficiency is very low and that they can cause excessive consumption of memory. On this basis, and with large quantities of data in mind, we put forward an improvement of C4.5 efficiency that uses L'Hospital's rule to simplify the calculation process with an approximation. The improved algorithm has no essential impact on the outcome of decision-making, but it greatly improves efficiency and reduces memory use, so it is better suited to processing large data collections. The rest of the paper is organized as follows.

A. Decision Tree Induction: Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
Suppose we want to predict whether a customer will buy a computer. A tree representing the concept buys_computer predicts whether a customer is likely to purchase a computer. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two other nodes), whereas others can produce non-binary trees. C4.5 adopts a greedy (i.e., non-backtracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner. Most algorithms for decision tree induction follow such a top-down approach, which starts with a training set of tuples and their associated class labels. The training set is recursively partitioned into smaller subsets as the tree is being built.
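The top-down, divide-and-conquer construction described above can be sketched as follows. This is a minimal illustrative sketch, not the full C4.5 implementation (no gain ratio, continuous attributes, or pruning); the function names and the toy dataset are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """I(S) = -sum of p_i * log2(p_i) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_split(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    base = entropy([r[target] for r in rows])
    def gain(a):
        groups = {}
        for r in rows:
            groups.setdefault(r[a], []).append(r[target])
        return base - sum(len(g) / len(rows) * entropy(g)
                          for g in groups.values())
    return max(attributes, key=gain)

def build_tree(rows, attributes, target):
    """Top-down, recursive, divide-and-conquer induction."""
    labels = [r[target] for r in rows]
    # Base cases: the partition is pure, or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = best_split(rows, attributes, target)
    remaining = [a for a in attributes if a != best]
    children = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        children[value] = build_tree(subset, remaining, target)
    return {"attribute": best, "children": children}

# A toy buys_computer-style training set:
rows = [
    {"age": "young", "income": "high", "buys": "no"},
    {"age": "young", "income": "low",  "buys": "no"},
    {"age": "old",   "income": "high", "buys": "yes"},
    {"age": "old",   "income": "low",  "buys": "yes"},
]
print(build_tree(rows, ["age", "income"], "buys"))
# splits on 'age', yielding pure 'yes'/'no' leaves
```

Each recursive call handles one partition of the training set, mirroring the recursive partitioning described above.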
II. METHODOLOGY

Steps of the system:
1. Select a dataset as input to the algorithm for processing.
2. Select the classifiers.
3. Calculate the entropy, information gain and gain ratio of the attributes.
4. Process the given input dataset according to the C4.5 data mining algorithm.
5. Process the given input dataset according to the improved C4.5 data mining algorithm.
6. The data fed into the tree generation mechanism is supplied by the C4.5 and improved C4.5 processors. The tree generator generates the tree for both the C4.5 and the improved C4.5 decision tree algorithms.

III. DATA MINING AND KNOWLEDGE DISCOVERY

A. Attribute Selection Measure: The attribute selection measure provides a ranking for each attribute describing the given training tuples. The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples. If the splitting attribute is continuous-valued, or if we are restricted to binary trees, then a split point or a splitting subset, respectively, must also be determined as part of the splitting criterion. The tree node created for a partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly. The two most popular attribute selection measures are information gain and gain ratio [12]. Let S be a set of data samples. Suppose the class label attribute has m distinct values defining m distinct classes Ci (for i = 1, ..., m), and let Si be the number of samples of S in class Ci. The expected information needed to classify a given sample is given by the equation

I(S1, S2, ..., Sm) = -Σ(i=1..m) pi log2(pi)

where pi is the probability that an arbitrary sample belongs to class Ci, estimated by Si/|S|. Note that a log function to base 2 is used since the information is encoded in bits.

B. Classifiers: In order to mine the data, the well-known data mining tool WEKA was used [17].
The data has numeric attributes with only the classification as nominal, making it a labeled data set, so supervised data mining must be performed on the target data. This narrows the choice of classifiers to the few that can handle numeric data as well as output a classification (from a predefined set of classifications); selecting C4.5 decision tree learning therefore became obvious. Attribute evaluation was also performed in order to find the gain ratio and ranking of each attribute in the decision tree learning. In cases where data mining could not produce a suitable result for some data set, the correlation coefficient was computed to investigate possible relations between attributes.

C. Entropy: The expected information above is the minimum number of bits needed to encode the classification of an arbitrary member of S. Let attribute A have v distinct values a1, ..., av. Attribute A can be used to partition S into v subsets S1, S2, ..., Sv, where Sj contains those samples in S that have value aj of A. If A were selected as the test attribute, these subsets would correspond to the branches grown from the node containing the set S. Let Sij be the number of samples of class Ci in subset Sj. The entropy, or expected information, based on the partitioning into subsets by A is given by the equation

E(A) = Σ(j=1..v) ((S1j + S2j + ... + Smj) / |S|) · I(S1j, ..., Smj)

The first factor acts as the weight of the jth subset: the number of samples in the subset divided by the total number of samples in S. The smaller the entropy value, the greater the purity of the subset partitions. Here

I(S1j, S2j, ..., Smj) = -Σ(i=1..m) pij log2(pij)

where pij is the probability that a sample in Sj belongs to class Ci.

D. Information Gain: Information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute. More precisely, the information gain Gain(A) of an attribute A, relative to a collection of examples S, is given by the following equation.
Gain(A) = I(S1, S2, ..., Sm) − E(A)

In other words, Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A. The algorithm computes the information gain of each attribute, and the attribute with the highest information gain is chosen as the test attribute for the given set.

E. Gain Ratio: The gain ratio [12] differs from information gain in that it applies a normalization, using a split-information value, to the information gain measured on the same partitioning.
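The two equations above can be checked with a small sketch. The class counts below are the classic 9-positive/5-negative weather example and are chosen purely for illustration; `info` and `gain` are hypothetical helper names.

```python
import math

def info(counts):
    """I(S1, ..., Sm) = -sum of p_i * log2(p_i), with p_i = S_i / |S|."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(class_counts, partition):
    """Gain(A) = I(S1, ..., Sm) - E(A); `partition` lists the per-branch
    class counts produced by splitting on attribute A."""
    total = sum(class_counts)
    e_a = sum(sum(branch) / total * info(branch) for branch in partition)
    return info(class_counts) - e_a

# 9 positive / 5 negative samples; a three-valued attribute splits them
# into branches with class counts [2,3], [4,0] and [3,2]:
print(round(info([9, 5]), 3))                             # 0.94
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247
```

A split that leaves every branch with the same class mix as the whole set (e.g. `gain([9, 5], [[9, 5]])`) yields a gain of zero, matching the intuition that such a split reduces no entropy.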
The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo(A)

The attribute with the maximum gain ratio is selected as the splitting attribute. Note, however, that as the split information approaches 0, the ratio becomes unstable. A constraint is added to avoid this, whereby the information gain of the test selected must be large: at least as great as the average gain over all tests examined.

IV. C4.5 ALGORITHM

C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan. Many scholars have made various improvements to the decision tree algorithm, but these algorithms require scanning and sorting the data collection several times during the construction of the decision tree, and processing speed drops greatly when the data set is too large to fit in memory. The literature on improving the efficiency of decision tree classification includes, for example, Wei Zhao and Jiaming Su [7], who proposed improvements to the ID3 algorithm that simplify the information gain using Taylor's formula. That improvement is more suitable for small amounts of data, so it is not particularly effective on large data sets. Since we deal with large amounts of data, a variety of decision tree classification algorithms were considered. The advantages of the C4.5 algorithm are significant, so it was chosen; but its efficiency must be improved to meet the dramatic increase in demand for processing large amounts of data.

A. Pseudo Code [16]:
1. Check for the base cases.
2. For each attribute a, calculate the normalized information gain from splitting on a.
3. Select the attribute a_best with the highest normalized information gain.
4. Create a decision node that splits on a_best, as the root node.
5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node.

B.
Improvements from the ID3 Algorithm: C4.5 made a number of improvements to ID3. Some of these are as follows:
1. Handling both continuous and discrete attributes. To handle continuous attributes, C4.5 creates a threshold and then splits the list into those records whose attribute value is above the threshold and those whose value is less than or equal to it.
2. Handling training data with missing attribute values. C4.5 allows attribute values to be marked as "?" for missing; missing attribute values are simply not used in gain and entropy calculations.
3. Handling attributes with differing costs.
4. Pruning trees after creation. C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help, replacing them with leaf nodes.

V. THE IMPROVEMENT OF THE C4.5 ALGORITHM

A. The Improvement: The C4.5 algorithm [8] [9] generates a decision tree by learning from a training set, in which each example is structured as attribute-value pairs. The current attribute node is the one with the maximum information gain rate among those calculated, and the root node of the decision tree is obtained in this way. Studying the process carefully, we find that selecting the test attribute for each node requires logarithmic calculations, and that these calculations are repeated each time; this hurts the efficiency of decision tree generation when the dataset is large. We also find that the antilogarithm in the logarithmic calculations is usually small, so the process can be simplified using L'Hospital's rule, as follows. If f(x) and g(x) satisfy:
(1) lim f(x) and lim g(x) as x → x0 are both zero or both infinite;
(2) in a deleted neighbourhood of the point x0, both f'(x) and g'(x) exist and g'(x) ≠ 0;
(3) lim(x→x0) f'(x)/g'(x) exists or is infinite;
then

lim(x→x0) f(x)/g(x) = lim(x→x0) f'(x)/g'(x)
Applying the rule to ln(1 − x) as x approaches zero:

lim(x→0) ln(1 − x) / x = lim(x→0) [ln(1 − x)]' / (x)' = lim(x→0) −1/(1 − x) = −1

viz. ln(1 − x) ≈ −x when x is sufficiently small.

Suppose c = 2; that is, there are only two categories in the basic setting of the C4.5 algorithm. Each candidate attribute's information gain is calculated, and the one with the largest information gain is selected as the root. Suppose that in the sample set S the number of positive examples is p and the number of negative examples is n, and let N = p + n. Then

E(S, A) = Σ(j=1..v) ((pj + nj) / (p + n)) · I(S1j, S2j)

where pj and nj are, respectively, the numbers of positive and negative examples in the jth subset. So GainRatio(A) = Gain(A) / I(A) = (E(S) − E(S, A)) / I(A) can be written, for a binary test attribute A, as

GainRatio(A) = { I(p, n) − (S1/N) · I(S11, S12) − (S2/N) · I(S21, S22) } / I(S1, S2)

where
S1: the number of examples for which A is positive;
S2: the number of examples for which A is negative;
S11: the number of examples for which A is positive and the class is positive;
S12: the number of examples for which A is positive and the class is negative;
S21: the number of examples for which A is negative and the class is positive;
S22: the number of examples for which A is negative and the class is negative.

Continuing the simplification by writing out the expected-information terms, we get:

GainRatio(A) = { −(p/N) log2(p/N) − (n/N) log2(n/N) − (S1/N) [ −(S11/S1) log2(S11/S1) − (S12/S1) log2(S12/S1) ] − (S2/N) [ −(S21/S2) log2(S21/S2) − (S22/S2) log2(S22/S2) ] } / { −(S1/N) log2(S1/N) − (S2/N) log2(S2/N) }

In the equation above, every item in both the numerator and the denominator contains a logarithmic calculation and a factor 1/N. Divide the numerator and the denominator by log2(e) simultaneously (turning the base-2 logarithms into natural logarithms), and multiply both by N simultaneously.
We then get the equation:

GainRatio(S, A) = { −p ln(p/N) − n ln(n/N) + S11 ln(S11/S1) + S12 ln(S12/S1) + S21 ln(S21/S2) + S22 ln(S22/S2) } / { −S1 ln(S1/N) − S2 ln(S2/N) }

Because N = p + n, we can replace p/N and n/N with 1 − n/N and 1 − p/N respectively (and likewise S11/S1 = 1 − S12/S1, S12/S1 = 1 − S11/S1, S21/S2 = 1 − S22/S2, S22/S2 = 1 − S21/S2, S1/N = 1 − S2/N, S2/N = 1 − S1/N). Then we get the equation:

GainRatio(S, A) = { −p ln(1 − n/N) − n ln(1 − p/N) + S11 ln(1 − S12/S1) + S12 ln(1 − S11/S1) + S21 ln(1 − S22/S2) + S22 ln(1 − S21/S2) } / { −S1 ln(1 − S2/N) − S2 ln(1 − S1/N) }

Because we know that ln(1 − x) ≈ −x, each term simplifies (e.g. −p ln(1 − n/N) ≈ pn/N), and after cancelling the common factor of 2 that appears in both numerator and denominator we get:

GainRatio(S, A) ≈ { p·n/N − S11·S12/S1 − S21·S22/S2 } / { S1·S2/N }
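A quick numerical sketch of this simplification, comparing the exact gain ratio (computed with logarithms) against the approximate form above. The sample counts are invented for illustration, and the function names are hypothetical.

```python
import math

def exact_gain_ratio(p, n, s11, s12, s21, s22):
    """GainRatio = (I(p,n) - (S1/N) I(S11,S12) - (S2/N) I(S21,S22)) / I(S1,S2)."""
    def info(a, b):
        t = a + b
        return -sum((x / t) * math.log2(x / t) for x in (a, b) if x > 0)
    s1, s2, big_n = s11 + s12, s21 + s22, p + n
    num = (info(p, n)
           - (s1 / big_n) * info(s11, s12)
           - (s2 / big_n) * info(s21, s22))
    return num / info(s1, s2)

def approx_gain_ratio(p, n, s11, s12, s21, s22):
    """Simplified form: (p*n/N - S11*S12/S1 - S21*S22/S2) / (S1*S2/N),
    using only addition, subtraction, multiplication and division."""
    s1, s2, big_n = s11 + s12, s21 + s22, p + n
    num = p * n / big_n - s11 * s12 / s1 - s21 * s22 / s2
    return num / (s1 * s2 / big_n)

# e.g. 9 positive / 5 negative samples split by a binary attribute:
print(exact_gain_ratio(9, 5, 6, 2, 3, 3))
print(approx_gain_ratio(9, 5, 6, 2, 3, 3))
```

On such small counts the two values differ somewhat (the approximation ln(1 − x) ≈ −x is tighter as the probabilities involved get smaller), but the approximate form avoids every logarithm.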
In the expression above, GainRatio(S, A) involves only addition, subtraction, multiplication and division, with no logarithmic calculation, so the computing time is much shorter than for the original expression. What is more, the simplification can be extended to the multi-class case.

B. Reasonable Arguments for the Improvement: In the improvement of C4.5 above, no term is added or removed; only an approximate calculation is used when computing the information gain rate. The antilogarithm in each logarithmic calculation is a probability, which is less than 1. To facilitate the presentation there are only two categories in this article, where the probabilities are somewhat larger than in the multi-class case; the probabilities become smaller as the number of categories grows, which further supports the rationality of the approximation. Furthermore, the approximate calculation is guaranteed by L'Hospital's rule, so the improvement is reasonable.

C. Comparison of the Complexity: To calculate GainRatio(S, A), the C4.5 algorithm's complexity is concentrated mainly in E(S) and E(S, A). When computing E(S), each probability value must be calculated first, which needs O(n) time; then each is multiplied by its logarithm and accumulated, which needs O(log2 n) time, so the complexity is O(n log2 n). Again, in the calculation of E(S, A), the complexity is O(n (log2 n)^2), so the total complexity of GainRatio(S, A) is O(n (log2 n)^2). The improved C4.5 algorithm involves only the original data and only addition, subtraction, multiplication and division operations, so it needs just one scan to obtain the totals followed by some simple calculations; its total complexity is O(n).

VI.
CONCLUSION AND FUTURE WORK

In this paper we study the C4.5 algorithm and an improved C4.5 algorithm that improves the performance of the existing algorithm in terms of time, through the use of L'Hospital's rule, and increases efficiency considerably. We can not only speed up the growth of the decision tree, but also generate better rule information. We will verify the algorithm on different large datasets that are publicly available in the UCI Machine Learning Repository. With the improved algorithm we can get faster and more effective results without a change in the final decision, and the presented algorithm constructs a decision tree that is clearer and more understandable. Efficiency and classification are greatly improved.

REFERENCES
[1] I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. China Machine Press.
[2] S. F. Chen, Z. Q. Chen, Artificial Intelligence in Knowledge Engineering [M]. Nanjing: Nanjing University Press.
[3] Z. Z. Shi, Senior Artificial Intelligence [M]. Beijing: Science Press, 1998.
[4] D. Jiang, Information Theory and Coding [M]. Science and Technology of China University Press.
[5] M. Zhu, Data Mining [M]. Hefei: China University of Science and Technology Press.
[6] A. P. Engelbrecht, A new pruning heuristic based on variance analysis of sensitivity information [J]. IEEE Trans. on Neural Networks, 2001, 12(6).
[7] N. Kwad, C. H. Choi, Input feature selection for classification problems [J]. IEEE Trans. on Neural Networks, 2002, 13(1).
[8] J. R. Quinlan, Induction of decision trees [J]. Machine Learning, 1986.
[9] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[10] UCI Repository of machine learning databases. University of California, Department of Information and Computer Science, http://www.ics.uci.edu/~mlearn/MLRepository.html
[11] UCI Machine Learning Repository.
[12] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition. Morgan Kaufmann Publishers.
[13] Chen Jin, Luo De-lin, Mu Fen-xiang, An improved ID3 decision tree algorithm. Xiamen University, 2009.
[14] Rong Cao, Lizhen Xu, Improved C4.5 decision tree algorithm for the analysis of sales. Southeast University, Nanjing 211189, China, 2009.
[15] Huang Ming, Niu Wenying, Liang Xu, An improved decision tree classification algorithm based on ID3 and the application in score analysis. Dalian Jiaotong University, 2009.
[16] Surbhi Hardikar, Ankur Shrivastava and Vijay Choudhary, Comparison between ID3 and C4.5 in contrast to IDS. VSRD-IJCSIT, Vol. 2 (7).
[17] Khalid Ibnal Asad, Tanvir Ahmed, Md. Saiedur Rahman, Movie popularity classification based on inherent movie attributes using C4.5, PART and correlation coefficient. IEEE/OSA/IAPR International Conference on Informatics, Electronics & Vision.
www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University
More informationPREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS
PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS Kalpesh Adhatrao, Aditya Gaykar, Amiraj Dhawan, Rohit Jha and Vipul Honrao ABSTRACT Department of Computer Engineering, Fr.
More informationData Mining with R. Decision Trees and Random Forests. Hugh Murrell
Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge
More informationIntroduction to Learning & Decision Trees
Artificial Intelligence: Representation and Problem Solving 5-38 April 0, 2007 Introduction to Learning & Decision Trees Learning and Decision Trees to learning What is learning? - more than just memorizing
More informationHow To Solve The Kd Cup 2010 Challenge
A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn
More informationEFFICIENT DATA PRE-PROCESSING FOR DATA MINING
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College
More informationDecision Trees for Mining Data Streams Based on the Gaussian Approximation
International Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 Decision Trees for Mining Data Streams Based on the Gaussian Approximation S.Babu
More informationRandom forest algorithm in big data environment
Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest
More informationEmail Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
More informationInformation Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding
More informationUnit 4 DECISION ANALYSIS. Lesson 37. Decision Theory and Decision Trees. Learning objectives:
Unit 4 DECISION ANALYSIS Lesson 37 Learning objectives: To learn how to use decision trees. To structure complex decision making problems. To analyze the above problems. To find out limitations & advantages
More informationData Mining: Foundation, Techniques and Applications
Data Mining: Foundation, Techniques and Applications Lesson 1b :A Quick Overview of Data Mining Li Cuiping( 李 翠 平 ) School of Information Renmin University of China Anthony Tung( 鄧 锦 浩 ) School of Computing
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
More informationData Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification
More informationWeb Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
More informationData Mining: A Preprocessing Engine
Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,
More informationBinary Coded Web Access Pattern Tree in Education Domain
Binary Coded Web Access Pattern Tree in Education Domain C. Gomathi P.G. Department of Computer Science Kongu Arts and Science College Erode-638-107, Tamil Nadu, India E-mail: kc.gomathi@gmail.com M. Moorthi
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationIndian Agriculture Land through Decision Tree in Data Mining
Indian Agriculture Land through Decision Tree in Data Mining Kamlesh Kumar Joshi, M.Tech(Pursuing 4 th Sem) Laxmi Narain College of Technology, Indore (M.P) India k3g.kamlesh@gmail.com 9926523514 Pawan
More informationImplementation of Data Mining Techniques for Weather Report Guidance for Ships Using Global Positioning System
International Journal Of Computational Engineering Research (ijceronline.com) Vol. 3 Issue. 3 Implementation of Data Mining Techniques for Weather Report Guidance for Ships Using Global Positioning System
More informationDATA MINING USING INTEGRATION OF CLUSTERING AND DECISION TREE
DATA MINING USING INTEGRATION OF CLUSTERING AND DECISION TREE 1 K.Murugan, 2 P.Varalakshmi, 3 R.Nandha Kumar, 4 S.Boobalan 1 Teaching Fellow, Department of Computer Technology, Anna University 2 Assistant
More informationChapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
More informationCustomer Classification And Prediction Based On Data Mining Technique
Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor
More informationStatic Data Mining Algorithm with Progressive Approach for Mining Knowledge
Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive
More informationPrediction of Stock Performance Using Analytical Techniques
136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University
More informationAn Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset
P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang
More informationComparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
More informationData Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
More informationHealthcare Data Mining: Prediction Inpatient Length of Stay
3rd International IEEE Conference Intelligent Systems, September 2006 Healthcare Data Mining: Prediction Inpatient Length of Peng Liu, Lei Lei, Junjie Yin, Wei Zhang, Wu Naijun, Elia El-Darzi 1 Abstract
More informationThe Optimality of Naive Bayes
The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most
More information131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10
1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom
More informationMining the Software Change Repository of a Legacy Telephony System
Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,
More informationManjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India
Volume 5, Issue 6, June 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Multiple Pheromone
More informationInternational Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET
DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can t understand
More informationData Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
More informationAn Empirical Study of Application of Data Mining Techniques in Library System
An Empirical Study of Application of Data Mining Techniques in Library System Veepu Uppal Department of Computer Science and Engineering, Manav Rachna College of Engineering, Faridabad, India Gunjan Chindwani
More informationDATA MINING METHODS WITH TREES
DATA MINING METHODS WITH TREES Marta Žambochová 1. Introduction The contemporary world is characterized by the explosion of an enormous volume of data deposited into databases. Sharp competition contributes
More informationEnhanced Boosted Trees Technique for Customer Churn Prediction Model
IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V5 PP 41-45 www.iosrjen.org Enhanced Boosted Trees Technique for Customer Churn Prediction
More informationBuilding A Smart Academic Advising System Using Association Rule Mining
Building A Smart Academic Advising System Using Association Rule Mining Raed Shatnawi +962795285056 raedamin@just.edu.jo Qutaibah Althebyan +962796536277 qaalthebyan@just.edu.jo Baraq Ghalib & Mohammed
More informationReference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors
Classification k-nearest neighbors Data Mining Dr. Engin YILDIZTEPE Reference Books Han, J., Kamber, M., Pei, J., (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann
More informationHow To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
More informationNew Matrix Approach to Improve Apriori Algorithm
New Matrix Approach to Improve Apriori Algorithm A. Rehab H. Alwa, B. Anasuya V Patil Associate Prof., IT Faculty, Majan College-University College Muscat, Oman, rehab.alwan@majancolleg.edu.om Associate
More informationAn Overview and Evaluation of Decision Tree Methodology
An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com
More informationMobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
More informationPLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo.
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo. VLDB 2009 CS 422 Decision Trees: Main Components Find Best Split Choose split
More informationCOURSE RECOMMENDER SYSTEM IN E-LEARNING
International Journal of Computer Science and Communication Vol. 3, No. 1, January-June 2012, pp. 159-164 COURSE RECOMMENDER SYSTEM IN E-LEARNING Sunita B Aher 1, Lobo L.M.R.J. 2 1 M.E. (CSE)-II, Walchand
More informationMapReduce Approach to Collective Classification for Networks
MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty
More informationComparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
More informationFeature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde
More informationMethod of Fault Detection in Cloud Computing Systems
, pp.205-212 http://dx.doi.org/10.14257/ijgdc.2014.7.3.21 Method of Fault Detection in Cloud Computing Systems Ying Jiang, Jie Huang, Jiaman Ding and Yingli Liu Yunnan Key Lab of Computer Technology Application,
More information