Data Mining based on Rough Set and Decision Tree Optimization

College of Information Engineering, North China University of Water Resources and Electric Power, China

Abstract

This paper presents a decision tree classification algorithm based on rough set theory. First, decision tree growing and pruning algorithms are analyzed and compared, and the decision tree algorithm is optimized in two respects: attribute reduction and pruning. Second, an attribute reduction algorithm based on attribute dependency, called ER for short, and a rough-set-based pruning algorithm for decision trees are presented. Finally, the proposed algorithm is applied in a supplier evaluation system, and its validity is verified by comparison with the C4.5 algorithm.

Keywords: Data Mining, Decision Tree, Rough Set, Pruning

1. Introduction

With the development of computer technology in recent years, the volume of stored information and data keeps growing. How to find the information hidden in these data, and how to use it to guide the behavior of an industry, is an important problem. Data mining [1] addresses this problem: it identifies potential links in the data, supports higher-level analysis, and helps make sound decisions and predict future trends. After many years of development, data mining has shown strong vitality. The core of data mining is building a model of the data set, and different data mining methods construct this model in different ways. Many methods can be used in data mining, such as neural networks, decision trees, genetic algorithms, and visualization technology. Data classification is an important task in data mining, and there are likewise many classification methods, such as decision trees, Bayesian networks, genetic algorithms, association-based classification, rough sets, and the k-nearest neighbor method.
Among them, the decision tree method is one of the most commonly used classification methods. Compared with other classification methods, the decision tree method [2] has the following significant advantages: high speed, high accuracy, ease of understanding, and strong scalability. Decision tree technology has therefore attracted the attention of researchers working on many data mining systems, and most of the data mining systems launched by domestic and foreign companies adopt it. SAS Enterprise Miner from SAS, described in [3], is a generic data mining tool. It collects and analyzes a variety of statistical data and customer buying patterns, which helps users find business trends, explain known facts, predict future results, and identify the key factors needed to complete a task. IBM's Intelligent Miner offers automatic generation of typical data sets, association discovery, sequential pattern discovery, conceptual classification, and visualization; it automates data selection, data conversion, data mining, and the presentation of results, and has proved to be a capable data mining tool. Clementine from Solution Inc. provides a visual rapid modeling environment composed of data acquisition, mining, finishing, modeling, and reporting components. KnowledgeSEEKER from Angoss is an analysis program based on decision trees, with fairly complete classification tree analysis functions. DataCruncher from RightPoint, described in [4], is a data mining engine based on the client/server model. It can analyze huge amounts of data in a data warehouse and connects directly to many of today's mainstream relational databases and data mining tools.
To address the deficiencies of existing decision tree algorithms, many researchers have tried to control the size of the decision tree and to improve its accuracy. They have studied a variety of pre-pruning and post-pruning algorithms to control tree size, modified the test attribute space, improved the test attribute selection method, limited the data set, and changed the data structure, and in doing so have put forward a number of new algorithms and criteria. This paper presents an improved attribute reduction algorithm based on attribute dependency in rough set theory. The time complexity of the algorithm is greatly improved while the classification ability is preserved, and the optimal attribute reduct is found without extensive computation. After studying the existing post-pruning algorithms based on rough set theory, and aiming to remedy their inefficiency, the paper also presents an improved rough-set-based post-pruning algorithm for decision trees, which reduces the time complexity.

International Journal of Digital Content Technology and its Applications (JDCTA), Volume 6, Number 12, July 2012, doi: /jdcta.vol6.issue

2. Related concepts

2.1. Data Mining

Data mining is the process of extracting potentially useful information and knowledge, not known in advance, from large amounts of incomplete, noisy, fuzzy, and random data [5]. It is an interdisciplinary field that attracts researchers from many areas and draws on database technology, statistics, artificial intelligence, machine learning, pattern recognition, high-performance computing, visualization technology, and information science. The data mining process consists of several steps; the main ones are as follows:
(1) Data cleaning removes noise and data that are clearly unrelated to the mining topic.
(2) Data integration combines the data from multiple data sources.
(3) Data transformation converts the data into a form that is easy to mine.
(4) Data mining, the fundamental step of knowledge discovery, uses intelligent methods to extract data models or regularities.
(5) Pattern evaluation selects meaningful patterns from the mining results according to given evaluation criteria.
(6) Knowledge presentation shows the mined knowledge to users through visualization and knowledge representation technology.

Data mining can provide users with many kinds of knowledge for decision-making. In many cases, users do not know in advance which information and knowledge are valuable.
Therefore, a data mining system should be able to search for and discover several kinds of patterns at the same time in order to meet user expectations and actual needs. In addition, it should be able to discover patterns at several levels. Commonly used data mining techniques include decision trees, neural networks, genetic algorithms, and rough set methods.

2.2. Rough set theory

Rough set theory has become one of the important basic theories of data mining, and methods that combine decision trees with rough set theory are widely used. The most significant advantage of rough set theory is its ability to deal with incomplete, inaccurate, and inconsistent data. It can be used for attribute reduction of decision trees, removing redundant attributes while preserving the classification ability. Rough set theory, a mathematical theory for dealing with imprecise and incomplete data, was originally proposed by the Polish mathematician Pawlak and has attracted the attention of scholars worldwide since the early 1990s. The theory is built on classification: knowledge is understood as a partition of the data, induced by equivalence relations on a particular universe. Rough set theory is widely used in pattern recognition, machine learning, data mining, and intelligent control because it extracts implicit knowledge without requiring any prior knowledge beyond the data being processed. Its basic idea is to derive the classification rules of a concept through knowledge reduction while preserving the classification ability. Rough set theory has been applied successfully in many fields.
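The view of knowledge as a partition induced by an equivalence relation can be made concrete with a short sketch. The Python below is only an illustration and is not taken from the paper: it groups objects that are indiscernible on a chosen attribute set and computes the standard lower and upper approximations of a target set, which the paper does not spell out. The toy table, the attribute names, and the function names `partition` and `approximations` are hypothetical.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes: rows that agree on every
    attribute in attrs are indiscernible and fall into the same block."""
    blocks = defaultdict(set)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())


def approximations(rows, attrs, target):
    """Return the (lower, upper) approximations of the index set target."""
    lower, upper = set(), set()
    for block in partition(rows, attrs):
        if block <= target:   # block lies entirely inside the target
            lower |= block
        if block & target:    # block overlaps the target
            upper |= block
    return lower, upper


# Hypothetical toy table: rows 0 and 1 are indiscernible on (price, quality)
# but disagree on "keep", so the concept keep == "yes" is rough.
rows = [
    {"price": "low", "quality": "good", "keep": "yes"},
    {"price": "low", "quality": "good", "keep": "no"},
    {"price": "high", "quality": "good", "keep": "yes"},
    {"price": "high", "quality": "poor", "keep": "no"},
]
target = {i for i, r in enumerate(rows) if r["keep"] == "yes"}
lower, upper = approximations(rows, ["price", "quality"], target)
print(lower, upper)
```

Here row 2 is the only object that certainly belongs to the concept, while rows 0 and 1 belong to it only possibly; the gap between the two approximations is what makes the set rough.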
Reference [6] introduces specific applications of rough set theory in many fields, such as knowledge discovery, expert systems, pattern recognition, stock data analysis, earthquake prediction, rough control, medical diagnosis, artificial neural networks, and decision analysis. Current applications of rough set theory include decision assessment, data mining, and rule generation. Decision evaluation based on rough set theory improves the objectivity of evaluation. It also transforms a complex, ambiguous, and subjective evaluation process into a series of objective, quantifiable, and procedural problem-solving activities, so that a scientific evaluation can provide decision makers with sound advice. Data mining and rule generation is the most important practical application of rough set theory. Rough set theory can search for a minimal set of data, use both qualitative and quantitative data, and derive decisions from the data. Pattern recognition is another main application, in which rough sets can be used for feature selection, feature representation, classification, and clustering. Feature selection techniques based on the rough set method avoid loss of information and mitigate the dimensionality problem of the data set. Rough set theory is very important to artificial intelligence and cognitive science and has attracted much attention since it emerged.

2.3. Decision tree

A decision tree is a tree structure similar to a flow chart. Every internal node represents a test on an attribute, namely a logical judgment of the form $a_i = v_i$, where $a_i$ is an attribute and $v_i$ is one of its values. Each branch represents an outcome of the test; that is, the branches correspond one-to-one with the possible values. Each leaf node represents a category [7]. The input of a decision tree algorithm is a set of data with category labels, and the result is a binary or multiway tree. The nodes of the tree fall into two categories: decision nodes and leaf nodes. Decision tree induction is a commonly used supervised learning method. First, a subset of instances is selected from the training set. A decision tree is then built from this subset, and the remaining training instances are used to test its accuracy. If all of these instances are classified correctly, the process ends. If any instance is misclassified, it is added to the selected subset and a new tree is built; this repeats until the tree correctly classifies all training instances outside the subset.
The basic algorithm for generating a decision tree, Generate_decision_tree, is as follows.

Input: the training samples, in which every attribute takes discrete values, and the set of available candidate attributes, attribute_list.
Output: a decision tree.

(1) Create a node N; the root node corresponds to all the training samples.
(2) If all samples at N belong to the same class C, return N as a leaf node labeled with class C.
(3) If attribute_list is empty, return N as a leaf node labeled with the class that has the most samples at N.
(4) Select the attribute with the greatest information gain from attribute_list, call it test_attribute, and label N with it.
(5) For each known value $a_i$ of test_attribute, grow a branch from N for the condition test_attribute = $a_i$, and let $s_i$ be the set of samples at N that satisfy this condition.
(6) If $s_i$ is empty, attach a leaf labeled with the class that has the most samples at N. Otherwise, attach the node returned by a recursive call of Generate_decision_tree on $s_i$.

The recursion terminates when one of the following conditions holds:
(1) All samples at a node belong to the same class.
(2) No remaining attributes can be used to divide the samples further.
(3) No samples satisfy the condition test_attribute = $a_i$.

The basic decision tree algorithm is a greedy algorithm that constructs the tree recursively in a top-down, divide-and-conquer manner. Generate_decision_tree is a basic version of the well-known decision tree algorithm ID3.

2.4. Classification algorithm with decision tree

While a decision tree is being built, many of its branches are constructed from abnormal data in the training sample set. Branch pruning is proposed to solve this noise problem.
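The Generate_decision_tree procedure listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: samples are dictionaries, the tree is a nested dictionary, the recursive call drops the chosen attribute from the candidate list (as in ID3), and the `domains` argument, the helper names, and the supplier-style toy data are all hypothetical.

```python
import math
from collections import Counter


def entropy(rows, target):
    """Shannon entropy (in bits) of the class labels in rows."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def information_gain(rows, attr, target):
    """Entropy reduction obtained by splitting rows on attr."""
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder


def majority(rows, target):
    """Most common class label among rows."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]


def generate_decision_tree(rows, attribute_list, target, domains):
    """Return a class label (leaf) or a dict describing a decision node."""
    classes = {r[target] for r in rows}
    if len(classes) == 1:              # all samples share one class
        return classes.pop()
    if not attribute_list:             # no attribute left to test
        return majority(rows, target)
    best = max(attribute_list, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attribute_list if a != best]
    node = {"attribute": best, "branches": {}}
    for value in domains[best]:        # every known value of the test attribute
        subset = [r for r in rows if r[best] == value]
        if not subset:                 # no sample satisfies the condition
            node["branches"][value] = majority(rows, target)
        else:
            node["branches"][value] = generate_decision_tree(
                subset, remaining, target, domains)
    return node


# Hypothetical toy data: delivery alone determines the grade.
rows = [
    {"delivery": "on_time", "price": "low", "grade": "A"},
    {"delivery": "on_time", "price": "high", "grade": "A"},
    {"delivery": "late", "price": "low", "grade": "B"},
    {"delivery": "late", "price": "high", "grade": "B"},
]
domains = {"delivery": ["on_time", "late"], "price": ["low", "high"]}
tree = generate_decision_tree(rows, ["delivery", "price"], "grade", domains)
print(tree)
```

In this toy table the split on `delivery` has an information gain of one bit and the split on `price` has none, so the sketch tests `delivery` at the root and stops with two pure leaves.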
There are many post-pruning algorithms, such as REP (Reduced Error Pruning), PEP (Pessimistic Error Pruning), MEP (Minimum Error Pruning), CCP (Cost-Complexity Pruning), and EBP (Error-Based Pruning). At present, research on decision trees focuses mainly on dimensionality reduction (test attribute reduction), attribute test criteria, and pruning.

The data to be mined may contain hundreds of condition attributes, each of which is treated as a dimension. Some of these attributes are key to the mining task, but many others are irrelevant, redundant, or even harmful. Reducing the number of attributes used to build the decision tree therefore not only makes the tree better suited to large-scale, high-dimensional data and more practical, but is also an effective means of filtering out harmful and redundant attributes and improving prediction accuracy.

In the tree-building process, how to choose the condition attribute tested at the root node and at each internal node is one of the core issues of the decision tree algorithm. Information entropy is an important metric in information theory for analyzing uncertainty: from a statistical point of view, it measures the degree of uncertainty by the amount of information required to resolve it.

When a decision tree is created, many branches reflect anomalies in the training data that are caused by noise and isolated points. Because of noisy, erroneous, or interfering data in the training set, the resulting tree often contains some wrong information. Existing pruning methods can be divided into pre-pruning and post-pruning. Pre-pruning stops the construction of the tree early, and the node at which construction stops becomes a leaf node. Post-pruning removes unsuitable branches from a fully grown decision tree.

3. Data Mining based on rough sets and decision tree optimization

Rough set theory has become one of the important basic theories of data mining, and methods that combine decision trees with rough set theory are widely used in data mining [8]. The most significant advantage of rough set theory is its ability to deal with incomplete, inaccurate, and inconsistent data. It can be used to perform attribute reduction for decision trees and to remove redundant attributes while maintaining the same classification ability.
This paper presents an algorithm that combines decision tree attribute reduction based on attribute dependency with post-pruning based on rough set theory.

3.1. Attribute reduction based on attribute dependency

Attribute reduction is the core of rough set theory. It simplifies the original system by deleting irrelevant or unimportant condition attributes without affecting its classification ability. Experimental results show that the computational cost of a decision tree is proportional to the number of attributes used in its construction. Generally speaking, the fewer attributes the reduct contains, the fewer rules are generated and the lower the cost of classifying new objects. The purpose of the improved attribute reduction algorithm in this paper is to ensure that the algorithm remains effective while the attributes are reduced.

Attribute reduction algorithm commonly used

The attribute reduction algorithm commonly used in rough set theory starts from the core and gradually adds the more important attributes until the condition $POS_{Reduct}(D) = POS_C(D)$ is met, where Reduct denotes the reduct, C the condition attribute set, and D the decision attribute set [9]. In the algorithm, the entire condition attribute set C is first treated as a candidate reduct, and unnecessary attributes are removed step by step using heuristic information from the positive region, so that the resulting reduct satisfies the equation above. The algorithm generally takes two steps:

Input: a decision table
Output: a relative reduct of the decision table

(1) Calculating the core
    Core = C;                                  // C is the condition attribute set
    For (i = 0; i < K; i++) {                  // K is the number of condition attributes
        P = C - {C_i};                         // C_i is the i-th condition attribute
        Dependence_P(D) = |POS_P(D)| / |U|;    // the dependence of D on P
        If (Dependence_P(D) == 1)
            Core = Core ∩ P;
    }

(2) Reduction
    R = Core_D(C); P = C - Core_D(C);
    Do
        Select the attribute a_i from P that maximizes Pos = POS_P(D) - POS_{P - {a_i}}(D);
        R = R ∪ {a_i};
        P = P - {a_i};
    Until POS_R(D) = POS_C(D)
    Return R

In this paper, an improved reduction algorithm based on attribute dependence, called ER for short, is proposed. It finds a better attribute reduct without extensive computation. The algorithm first calculates the core. It then adds to the core an attribute chosen so that the dependence of the new attribute set is greater than the dependence of the set before the attribute was added. This process is repeated until the dependence of the reduct equals the dependence of the full condition attribute set in the original information table. The algorithm is described as follows.

ER(C, D), where C denotes the condition attribute set and D denotes the decision attribute set.
Input: a decision table.
Output: a relative reduct of the decision table.

(1) Calculating the core
    Core = C;                                  // C is the condition attribute set
    For (i = 0; i < K; i++) {                  // K is the number of condition attributes
        P = C - {C_i};                         // C_i is the i-th condition attribute
        Dependence_P(D) = |POS_P(D)| / |U|;    // the dependence of D on P
        If (Dependence_P(D) == 1)
            Core = Core ∩ P;
    }

(2) Reduction
    R ← Core;
    Do
        T ← R;
        For each x ∈ (C - R)
            If Dependence_{R ∪ {x}}(D) > Dependence_T(D)
                T ← R ∪ {x};
        R ← T;
    Until Dependence_R(D) == Dependence_C(D)
    Return R

Like the Reduct algorithm, ER first calculates the core, in the same way, and then performs the reduction step. In ER, however, the reduction only has to reach the condition Dependence_R(D) == Dependence_C(D), and the algorithm does not need to calculate over all the attributes of the set T, which greatly reduces the time complexity.
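The two steps of ER can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the decision table is a list of dictionaries, the function names and the toy table are hypothetical, and two details are assumptions of this sketch. First, the core test compares against the dependence of the full condition set instead of the constant 1, which coincides with the listing above when the table is consistent. Second, if no single attribute raises the dependence, the sketch adds the first remaining attribute so that the loop always terminates.

```python
from collections import defaultdict


def positive_region(rows, attrs, decision):
    """Indices of rows whose equivalence class under attrs has one decision value."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    pos = set()
    for block in blocks.values():
        if len({rows[i][decision] for i in block}) == 1:
            pos.update(block)
    return pos


def dependence(rows, attrs, decision):
    """Dependence of the decision on attrs: |POS_attrs(D)| / |U|."""
    return len(positive_region(rows, attrs, decision)) / len(rows)


def core(rows, cond, decision):
    """Attributes whose removal lowers the dependence of the full condition set."""
    full = dependence(rows, cond, decision)
    return [a for a in cond
            if dependence(rows, [b for b in cond if b != a], decision) < full]


def er_reduct(rows, cond, decision):
    """Start from the core and add attributes until the dependence of the
    reduct equals the dependence of the full condition attribute set."""
    full = dependence(rows, cond, decision)
    reduct = core(rows, cond, decision)
    while dependence(rows, reduct, decision) < full:
        remaining = [x for x in cond if x not in reduct]
        best, best_dep = None, dependence(rows, reduct, decision)
        for x in remaining:
            dep = dependence(rows, reduct + [x], decision)
            if dep > best_dep:         # keep R ∪ {x} only if it is strictly better
                best, best_dep = x, dep
        # Assumption of this sketch: fall back to the first remaining attribute
        # when no single attribute improves the dependence.
        reduct = reduct + [best if best is not None else remaining[0]]
    return reduct


# Hypothetical decision table: attribute c duplicates attribute a.
rows = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "yes"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
]
print(core(rows, ["a", "b", "c"], "d"))
print(er_reduct(rows, ["a", "b", "c"], "d"))
```

In the toy table only `b` is indispensable, so it forms the core; adding `a` already restores full dependence, and the redundant copy `c` is never added.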
The UCI data sets are commonly used standard test data; the collection contains a large number of databases used with a variety of machine learning methods. We chose five discrete databases and ran experiments with the attribute reduction algorithm based on attribute dependency. The experimental results are shown in Table 1.

Table 1. UCI Test Data Sets
Database name | Number of original attributes | Number of attributes after reduction (Reduct) | Number of attributes after reduction (ER)
Standardized Audiology Database | | |
German Credit Data | | |
Kinship Domain | | |
Chess End-Game | | |
Mushroom Database Domain | | |

The test results on the UCI data sets show that the proposed algorithm improves not only the time complexity but also the reduction results.

3.2. Decision tree optimization based on rough set

Decision tree pruning is one of the main topics in current research on decision tree optimization. Pruning techniques are divided into pre-pruning and post-pruning. Pre-pruning considers only local information of the tree and is therefore somewhat blind: it may stop the growth of the tree prematurely, and it is difficult to determine whether the child nodes that were cut off would have been worth keeping. In general, the optimal decision tree cannot be obtained by pre-pruning. Post-pruning, by contrast, uses global information of the decision tree, so it often performs better than pre-pruning and is commonly used in practice. Based on a study of rough set theory, this paper presents an improved pruning algorithm for decision trees based on rough set theory.

Post-pruning method of decision tree

When a decision tree has just been built, many of its branches are constructed from abnormal data in the training sample set (due to noise, etc.). Branch pruning is proposed to solve this noise problem. The post-pruning method trims the excess branches from a "fully grown" tree [10-13]. Most existing post-pruning algorithms are improvements that take the REP algorithm as a benchmark. REP, first proposed by Quinlan, is one of the simplest pruning methods. It needs an independent test set (the pruning set) to calculate the accuracy of each subtree.
Every non-leaf node of the tree is treated as a candidate for pruning. The procedure is as follows: working bottom-up, for each subtree S of tree T, replace S with a leaf node to generate a new tree. If the new tree has an equal or smaller classification error on the test set, and S contains no subtree with the same property, then S is deleted and replaced by the leaf node. This is repeated until replacing any further subtree with a leaf node would increase the classification error on the test set. Nodes that were added because of coincidental regularities in the training set are likely to be removed, because the same coincidences are unlikely to appear in the test set. The error rates are compared repeatedly, and the node whose deletion most improves the accuracy of the tree on the test set is always pruned first, until further pruning would reduce that accuracy. The decision tree obtained by REP is the most accurate subtree on the test set and is the smallest such tree. In addition, its computational complexity is linear, because each non-leaf node is visited only once to assess whether its subtree should be pruned. Furthermore, because an independent test set is used, the pruned tree is less biased in predicting future examples than the original tree. However, the method has shortcomings: it is biased toward excessive pruning. Branches that correspond to instances which occur in the training data but rarely appear in the test set are deleted during pruning. This problem is particularly prominent when the test set is much smaller than the training set. When the training data set is small, this method is usually not considered.

Improved post-pruning algorithm based on rough set theory

In this paper, the improved post-pruning algorithm is as follows. First, calculate the core of the attributes using the method above. Core attributes are usually the more important ones for classification, so the nodes of the decision tree that correspond to core attributes are called important nodes. Next, for each non-leaf node A, let T' be the corresponding subtree; calculate the error rate e of the root node of T' and the error rates of the important nodes contained in T', and let e' be the minimum of the latter. The subtree is then pruned if one of the following conditions is met:
(1) T' contains no important nodes;
(2) T' contains important nodes and e ≤ e'.

Algorithm: Prune(T)
Input: a fully grown decision tree T
Output: the pruned tree T_P

    Starting from the root of the tree T
    For every subtree T' of T {
        e: the classification error rate of the root node of T';
        e': the minimum classification error rate of the important nodes in the branches of T';
        If no important node is found or e ≤ e' {
            Prune the subtree to a leaf node labeled with the class of the majority of the instances of T';
        }
    }

Compared with the commonly used pruning algorithms, the pruning method proposed in this paper calculates only the error rates of the subtree root nodes and of the important nodes contained in each subtree. The error rates of non-critical nodes do not have to be calculated during pruning, which greatly reduces the computational complexity. In addition, when this pruning method is combined with the attribute reduction method described above to construct decision trees, the core attributes already calculated during attribute reduction can be reused directly to identify the important nodes. This reduces the complexity of the decision tree algorithm and improves the efficiency of tree construction.
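One possible reading of Prune(T) is sketched below in Python. It is an illustration under several assumptions that the text does not settle: the error rate of a node is taken to be the share of the training instances reaching it that its majority class would misclassify, the important nodes of a subtree are the nodes strictly below its root that test a core attribute, and the pruning test is e ≤ e'. The `Node` class, the function names, and the toy tree are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    labels: list                       # class labels of the instances reaching this node
    attribute: Optional[str] = None    # attribute tested here; None marks a leaf
    children: dict = field(default_factory=dict)  # attribute value -> child Node

    def majority(self):
        """Class predicted if this node is a leaf."""
        return Counter(self.labels).most_common(1)[0][0]

    def error_rate(self):
        """Share of instances misclassified by predicting the majority class."""
        return 1 - Counter(self.labels).most_common(1)[0][1] / len(self.labels)


def important_errors(node, core_attrs):
    """Error rates of all descendants of node that test a core attribute."""
    errors = []
    for child in node.children.values():
        if child.attribute in core_attrs:
            errors.append(child.error_rate())
        errors.extend(important_errors(child, core_attrs))
    return errors


def prune(node, core_attrs):
    """Top-down pruning: collapse a subtree when it holds no important node
    or when its root error e is not larger than the best important error e'."""
    if node.attribute is None:
        return node
    errors = important_errors(node, core_attrs)
    if not errors or node.error_rate() <= min(errors):  # assumed test: e <= e'
        node.attribute, node.children = None, {}        # becomes a majority leaf
        return node
    for child in node.children.values():
        prune(child, core_attrs)
    return node


# Hypothetical tree: "b" is the only core attribute.
b_node = Node(["yes"] * 3 + ["no"], "b", {0: Node(["yes"] * 3), 1: Node(["no"])})
left = Node(["yes"] * 3 + ["no"] * 3, "c", {0: b_node, 1: Node(["no"] * 2)})
right = Node(["no"] * 3 + ["yes"], "d", {0: Node(["no"] * 3), 1: Node(["yes"])})
root = Node(["yes"] * 4 + ["no"] * 6, "a", {0: left, 1: right})
prune(root, {"b"})
print(root.attribute, left.attribute, right.attribute, b_node.attribute)
```

In the toy tree the root and the `c` node survive because the important `b` node below them has a lower error rate, while the `d` subtree and the `b` node itself contain no important nodes and collapse to majority leaves. Under this reading only the root of each subtree and the important nodes need an error computation, which matches the cost argument in the text.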
The REP pruning method, a relatively simple pruning method, was introduced above. In this section, the decision tree is built using the training set.

4. Experiments

Data mining is a very complex process. Each data mining technique has its own characteristics and implementation steps. Differences in the required form and structure of the input and output data, in parameter settings, and in training, testing, and model evaluation methods reflect differences in the purpose of each algorithm and in the application areas it suits. Data mining is closely tied to the specific application: the goal, the data collection, the scale of the problem, and the choice of algorithm differ from one application to another. We selected service provider information from a SQL Server 2000 database table as the object of data mining. The supplier evaluation system mines the supplier information and discriminates the importance of each supplier to guide the company's decision-making. The process of creating a decision tree model in this paper is shown in Figure 1.

Figure 1. Flow chart of decision tree modeling

We extract part of the 500 raw records, after preprocessing, for data mining. A decision tree is then constructed using the information gain criterion of the ID3 algorithm; the resulting tree is shown in Figure 2. After pruning, the final decision tree is obtained, as shown in Figure 3.

Figure 2. Initial Decision Tree

Figure 3. Final Decision Tree

To assess the proposed algorithm further, four databases from the public UCI databases were selected for simulation tests. The results obtained with the proposed decision tree algorithm are compared with the corresponding results of the C4.5 algorithm with EBP pruning. Basic information on the four databases is given in Table 2, and the comparison results are given in Table 3.

Table 2. Database information
Database | Australian | German | Sonar | Sat
Sample number | | | |
Attribute number | | | |
Category number | | | |

Table 3. Test Results
Measure | Database | Decision tree | C4.5
Number of condition attributes for building tree | Australian | |
Number of condition attributes for building tree | German | |
Number of condition attributes for building tree | Sonar | |
Number of condition attributes for building tree | Sat | |
Prediction accuracy | Australian | 87.2% | 83.3%
Prediction accuracy | German | 75.1% | 73.2%
Prediction accuracy | Sonar | 81.3% | 74.1%
Prediction accuracy | Sat | 81.9% | 85.8%

As Table 3 shows, the proposed decision tree algorithm uses significantly fewer attributes to build a tree than C4.5, because attribute reduction with the ER algorithm is performed before the tree is built. Since the computational cost of a decision tree is proportional to the number of attributes used to build it, the proposed algorithm significantly reduces the computational cost. The post-pruning algorithm also has lower complexity, which further improves the efficiency of tree construction. The experiments show that on most data sets (three of the four in Table 3) the prediction accuracy of the proposed algorithm is higher than that of C4.5. In addition, the two algorithms build trees of roughly the same size.

5. Conclusions

The paper proposes an attribute reduction method based on attribute dependence (ER), developed from a study of data mining and rough set theory, and compares it with the commonly used rough set attribute reduction methods. A post-pruning method for decision trees based on rough set theory is also proposed. The experimental results show that the decision tree produced by this post-pruning method is smaller than the tree obtained with REP and has high accuracy. The decision tree built with ER and the rough-set-based post-pruning method is applied in a supplier evaluation system. Practice has shown that a decision tree constructed by this method is relatively small and has high prediction accuracy.

6. References

[1] Pawlak Z, Skowron A, "Rudiments of Rough Sets", Information Sciences, vol. 117, no. 1, pp. 3-37.
[2] Fan Ming, Meng Xiao Feng, "Data mining: Concept and technique", Beijing: Machinery Industry Publication, China.
[3] Manish Mehta, Jorma Rissanen, Rakesh Agrawal, "MDL-based Decision Tree Pruning", International Conference on Knowledge Discovery in Databases and Data Mining.
[4] J. Mingers, "An Empirical Comparison of Pruning Methods for Decision Tree Induction", Machine Learning, vol. 4, no. 2.
[5] Agrawal R, Imielinski T, Swami A, "Database Mining: A Performance Perspective", IEEE Trans. on Knowledge and Data Engineering, vol. 5, no. 6.
[6] Shang Zhi, "Algorithm of Attribute Value Reduction and Its Application Based on Rough Sets", Computer Applications and Software, vol. 26, no. 2.
[7] J.R. Quinlan, "Induction of Decision Trees", Machine Learning, vol. 1, no. 1.
[8] Han J W, Kamber M, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, San Francisco.
[9] Supriya K.D., Krishna R, "Clustering Web Transactions using Rough Approximation", Fuzzy Sets and Systems, vol. 148, no. 1.
[10] Jinmao Wei, "Rough set based Approach to Selection of Node", International Journal of Computational Cognition, vol. 1, no. 2.
[11] Xuelei Xu, Chunwei Lou, "Applying Decision Tree Algorithms in English Vocabulary Test Item Selection", IJACT: International Journal of Advancements in Computing Technology, vol. 4, no. 4.
[12] Sudheep Elayidom.M, Sumam Mary Idikkula, Joseph Alexander, "Design and Performance analysis of Data mining techniques Based on Decision trees and Naive Bayes classifier For", JCIT: [13] Journal of Convergence Information Technology, vol. 6, no. 5.