Data Mining of Machine Learning Performance Data

Remzi Salih Ibrahim

Master of Applied Science (Information Technology), 1999

RMIT University

Abstract

With the development and penetration of data mining within different fields and industries, many data mining algorithms have emerged. Selecting a good data mining algorithm to obtain the best result on a particular data set has become very important: what works well for one data set may not work well on another. The goal of this thesis is to find associations between classification algorithms and characteristics of data sets by first building a file of data sets, their characteristics and the performance of a number of algorithms on each data set, and second applying unsupervised clustering analysis to this file to analyze the generated clusters and determine whether there are any significant patterns. Six classification algorithms were applied to 59 data sets, and three clustering algorithms were then applied to the data generated. The patterns and properties of the clusters formed were then studied. The six classification algorithms used were OneR (1R), Kernel Density, Naïve Bayes, C4.5, Rule Learner and IBK. The clustering algorithms used were k-means clustering, Kohonen Vector Quantization, and Autoclass Bayesian clustering. The major discovery made by analyzing the generated clusters is that the clusters were formed based on the accuracy of the algorithms: the data sets were grouped into clusters with error rates lower than average, about average, or higher than average relative to the population. This suggests that there are three kinds of data sets among the 59 considered: easy-to-learn, moderate-to-learn, and hard-to-learn data sets. Another discovery made by this thesis is that the number of instances in a data set was not useful for clustering analysis of the machine learning performance data. It was the only significant variable in clustering the data sets, and it prevented analysis based on other variables, including the variables that contain values for the accuracy of each classification algorithm. While not directly relevant to clustering, it was also found that the number of instances and the number of attributes in the data sets do not have a strong influence on the performance of the data mining algorithms on the 59 data sets considered, as high error rates were obtained both for small data sets with a small number of attributes and for large data sets with a large number of attributes. Experiments performed for this thesis also allowed the comparison of the performance of the 6 classification algorithms with their default parameter settings. It was discovered that, in terms of performance, the top three algorithms were Kernel Density, C4.5, and Naïve Bayes, followed by Rule Learner, IBK and OneR.

Declaration

I certify that all work on this thesis was carried out between June 1998 and June 2000 and that it has not been submitted for any academic award at any other college, institute or university. The work presented was carried out under the supervision of Dr. Vic Ciesielski. All other work in the thesis is my own except where acknowledged in the text.

Signed, Remzi Salih Ibrahim, June

Table of Contents

List of Tables
List of Graphs
Acknowledgements
Chapter 1. Introduction
  1.1 Goals
  1.2 Scope
Chapter 2. Literature Survey
  2.1 Supervised Learning
    Supervised Algorithms Used in This Thesis
      C4.5
      Rule Learner (PART)
      OneR (1R)
      IBK
      Naïve Bayes
      Kernel Density
  2.2 Unsupervised Learning
    Unsupervised Data Mining Algorithms Used in This Thesis
      K-means Clustering
      Kohonen Vector Quantization
      Autoclass (Bayesian Classification System)
  2.3 Related Work on Comparison of Classifiers
Chapter 3. Data Generation
  3.1 Collection of Data Sets
  3.2 Selection of Data Mining Algorithms
  3.3 Generating Data
Chapter 4. Clustering and Pattern Analysis
  4.1 Results from K-means Clustering
  Results from K-means Clustering (without Number of Instances)
  Results from Kohonen Vector Quantization Clustering
  Results from Autoclass (Bayesian Classification System)
  Analysis Comparison
    Comparison of Significant Variables
    Comparison of Data Sets in Different Clusters
    Influence of Characteristics of Data Sets on Performance of Classification Algorithms
Chapter 5. Conclusion
Appendix A. About WEKA
Appendix B. About Enterprise Miner (Commercial Software)
Appendix C. Detailed Results from K-means Analysis
Appendix D. Detailed Results from Kohonen Vector Quantization
Appendix E. Detailed Results from Autoclass Clustering Analysis
Appendix F. Sample Code Used to Run Multiple Algorithms on Multiple Data Sets
Appendix G. Sample Output from Data Generation
References

List of Tables

Table 1: Sample iris data
Table 2: Results obtained from applying 6 data mining algorithms to 59 data sets; blanks indicate situations where algorithms gave no result
Table 3: Definition of variables used
Table 4: Summary of the data gathered from running the 6 data mining algorithms on 59 data sets
Table 5: Importance level of variables in determining clusters
Table 6: Properties of the 5 clusters
Table 7: Importance of variables in determining clusters
Table 8: General properties of the clusters from k-means analysis
Table 9: General properties of clusters and significant variables
Table 10: Complete list of data sets in each cluster and the values of the significant clustering variables
Table 11: Importance of variables in determining clusters in Kohonen Vector Quantization analysis
Table 12: General properties of the clusters from Kohonen Vector Quantization analysis
Table 13: General properties of clusters: mean error rates of the significant variables
Table 14: Complete list of data sets in each cluster and the values of the significant clustering variables
Table 15: Significance level of variables in Autoclass clustering
Table 16: Properties of clusters from Autoclass analysis
Table 17: Complete list of data sets in each cluster and the values of the significant clustering variables
Table 18: Comparison of significant variables found by the three clustering algorithms
Table 19: Summary of the significance of each variable for the 3 clustering algorithms
Table 20: Data sets in each column were grouped into one cluster by all three algorithms

List of Graphs

Figure 1: A decision tree produced for the iris data set
Figure 2: Rule output produced from the SAS Enterprise Miner software
Figure 3: Partial output from the WEKA OneR program

Acknowledgements

I would like to thank Dr. Vic Ciesielski for being supportive and very patient during the progress of my thesis. He has been very understanding of the problems that arise from working full time, studying part time and still having to fulfill family commitments. I would also like to thank Dr. Isaac Balbin for his support and his flexibility with the deadline. My thanks also go to the WEKA support team, the staff from SAS Institute Australia for their support, and all staff from the RMIT AI (Artificial Intelligence) group who inspired me to research in this field. I would like to thank all members of my family and my friends for their patience during the progress of my thesis.

Chapter 1. Introduction

In this current age of technology, data has become more readily available than ever. Using technologies like data warehousing, data is being stored in large quantities. The availability of such data has opened the door for new data analysis techniques to emerge. As Weiss and Indurkhya [55] explain, as the amount of data stored in existing information systems mushroomed, a new set of objectives for data management emerged. Mining data has become one of the important means of obtaining useful information.

The term data mining is defined by Fayyad, Piatetsky-Shapiro and Smyth [19] as the part of the Knowledge Discovery in Databases (KDD) process relating to methods for extracting patterns from data. The KDD process involves the complete steps of obtaining knowledge from data and includes selection, pre-processing, transformation and mining of data, followed by interpretation and evaluation of patterns.

Data mining has many advantages across different industries. It allows large historical data to be used as the background for prediction. The interpretation and evaluation of the patterns obtained by data mining produces new knowledge that decision-makers can act upon [42]. Data mining provides a means to obtain information that can support decision making and predict new business opportunities. For example, telecommunications companies, stock exchanges, and credit card and insurance companies use data mining to detect fraudulent use of their services; the medical industry uses data mining to predict the effectiveness of surgical procedures, medical tests, and medications; and retailers use data mining to assess the effectiveness of coupons and special events [41].

With the development and penetration of data mining within different fields, many data mining algorithms have emerged. The selection of a good data mining algorithm to obtain the best result on a particular data set has become very important: what works well for a particular data set may not work well on another. Furthermore, the No Free Lunch theorem by Wolpert and Macready [61, page 2] has established that it is impossible to say that any technique is better than another over the space of all problems. In particular, if algorithm A outperforms algorithm B on some cost functions then, loosely speaking, there must exist exactly as many other functions where B outperforms A. An example of the No Free Lunch theorem encountered in this thesis is the performance of the C4.5 and Rule Learner algorithms on the EchoMonths and Hungarian data sets (table 2). While C4.5 obtained an error rate of only 0.6 percent on EchoMonths, Rule Learner obtained a much higher error rate; yet on the Hungarian data set, Rule Learner outperformed C4.5.

While the No Free Lunch theorem has established that there can be no one 'best' learning algorithm, the question of 'What kinds of algorithms are best suited to what kinds of data?' remains open. While there has been some work comparing different algorithms on a range of data sets (the STATLOG project [12], Lim and Loh [35]), there has been little work on trying to characterize data sets (for example, big,

small, numeric, symbolic, mixed) and matching algorithms to data characteristics. With the emergence of hundreds of data mining algorithms today, such information will help data mining analysts make intelligent decisions in choosing an appropriate data mining algorithm for particular types of data mining files.

1.1 Goals

The major goal of this thesis is to find associations between classification algorithms and characteristics of data sets by a two-step process:

1. Build a file of data set names, their characteristics and the performance of a number of algorithms on each data set.
2. Apply unsupervised clustering to the file built in step 1, analyze the generated clusters and determine whether there are any significant patterns.

1.2 Scope

Due to time limitations for an MBC minor thesis, the scope of this thesis is restricted to:

- 6 supervised learning algorithms
- 59 small to medium size data sets, with the number of attributes ranging from 7 to 76
- running the 6 supervised algorithms on the 59 data sets using only the default settings of the algorithms
- using 3 unsupervised learning algorithms for cluster analysis
- characteristics of the data sets limited to only the number of attributes and the number of instances

Chapter 2. Literature Survey

Concepts and papers that are relevant to this thesis are discussed in this chapter. First, both supervised and unsupervised learning techniques are discussed, followed by descriptions of all the algorithms used in this thesis. Finally, three papers that are related to this thesis are discussed in detail.

Machine learning is described by Witten and Frank [58] as the acquisition of knowledge and the ability to use it. They explain that learning in data mining involves finding and describing structural patterns in data for the purpose of helping to explain that data and make predictions from it. For example, the data could contain examples of customers in the telecommunications industry who have switched to another service provider and some who have not; the output of learning could be a prediction of whether a particular customer will switch to another service provider. There are two common types of learning: supervised and unsupervised.

2.1 Supervised Learning

Learning or adaptation is supervised when there is a desired response that can be used by the system to guide the learning. Decision trees and neural nets are two common types of supervised learning. This type of learning always requires a target variable to predict. Supervised learning algorithms have been used in many applications, for example in seismic phase identification in the field of nuclear science [28] and for the prediction of tornados [36].

Supervised learning involves gathering the data to be used for data mining, identifying the target variable, breaking the data up into training and test data, and developing the classifier. The training data is used by the data mining algorithm to learn the data and build a classifier. The test data is used to evaluate the performance of the classifier on new data. The performance of a classifier is commonly measured by the percentage of incorrectly classified instances on the data used: the train error rate refers to the percentage of incorrectly classified instances on the training data, and the test error rate refers to the percentage of incorrectly classified instances on the test data.

One of the problems of supervised learning is overfitting [58]: the classifier works well on the training data but not on the test data. This happens when the model learns the training data too well. To get an indication of the amount of overfitting, the model should be tested using a test data set or cross validation. If, after training, the test error rate is approximately equal to the training error rate, the test error rate is an indication of the kind of generalization that will occur.

Cross validation is a method for estimating how well the classifier will perform on new data and is based on "resampling" [33]. Cross validation is useful when the amount of data is small, as it allows all of the data to be used for training. In k-fold cross validation, the data is divided into k subsets of equal size. The model is trained k times, each time leaving out one of the subsets from training and using only the omitted subset to compute the error rate. If k equals the sample size, this is called leave-one-out cross validation. Leave-one-out cross validation often works well for continuous error functions such as the mean squared error, but it may perform poorly for non-continuous error functions such as the number of misclassified cases [33]. A value of 10 for k is commonly used and is also used for this thesis.

Some data mining algorithms do not support continuous target variables. In such cases, binning or discretization is used. Binning is a method of converting continuous values into categorical values. For instance, if one of the independent variables is age, the values must be transformed from specific values into ranges such as "less than 20 years", "21 to 30 years", "31 to 40 years" and so on [4].
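To make the k-fold procedure above concrete, the following is a minimal Python sketch of cross validation as an error estimator. It assumes a classifier factory with fit and predict methods and NumPy arrays X (inputs) and y (targets); the names are illustrative and are not code from this thesis.

    import numpy as np

    def cross_validation_error(make_classifier, X, y, k=10, seed=1):
        """Estimate the test error rate (%) by k-fold cross validation."""
        rng = np.random.default_rng(seed)
        indices = rng.permutation(len(y))       # shuffle the data once
        folds = np.array_split(indices, k)      # k near-equal subsets
        errors = []
        for i in range(k):
            test_idx = folds[i]                                  # held-out fold
            train_idx = np.concatenate(folds[:i] + folds[i+1:])  # remaining k-1 folds
            model = make_classifier()
            model.fit(X[train_idx], y[train_idx])
            predictions = model.predict(X[test_idx])
            errors.append(np.mean(predictions != y[test_idx]))   # fold error rate
        return 100.0 * np.mean(errors)          # percent incorrectly classified

Setting k equal to len(y) turns this into leave-one-out cross validation; k=10 is the setting used throughout this thesis.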

Supervised Algorithms Used in This Thesis

The basic theory behind each of the 6 classification algorithms and the details of how each algorithm works are discussed in this section. The parameters that affect the performance of each algorithm are also discussed and, where possible, papers that describe successful applications are cited.

C4.5

C4.5 is a decision tree algorithm devised by Quinlan [43]. Decision trees are used to classify instances into different categories and are a common type of classification algorithm. First, what a decision tree is will be discussed, followed by the properties of the C4.5 algorithm.

The iris data set will be used to explain how a decision tree works. A sample of the iris data set is shown in table 1. The data contains the petal length, petal width, sepal length and sepal width of iris plants. There are three different categories of this plant: Iris-versicolor, Iris-virginica and Iris-setosa. This is shown as the class variable in table 1 and is the target variable for the iris data set. There are 50 cases of each category in the data set. The goal is to determine what distinguishes each category of iris plant from the others, so that it is possible to know which category an iris plant belongs to given the four input variables. An example of a decision tree produced from analysis of this data is shown in figure 1. The root node (top node) of the tree in figure 1 shows how many of each category are found before any analysis is made. There are three leaves in the tree. Each leaf is assigned the class of the majority of the instances in the leaf. For example, the second leaf in the tree in figure 1 is considered a leaf of class Iris-versicolor because the majority class in this leaf is Iris-versicolor, with 48 instances. The other class in this leaf has only 4 observations, and the node has an error rate of 7.7%.

The decision tree can be interpreted as follows: an iris plant with petalwidth less than 0.8 is classified as Iris-setosa, and an iris plant with petalwidth greater than 0.8 and less than 1.65 is categorized as Iris-versicolor. All the rest (with petalwidth greater than 1.65) are classified as Iris-virginica. Based on this tree, an unknown iris plant with petalwidth of 1.4 would be classified as Iris-versicolor. Note that only petalwidth is used to classify the instances; all the other input variables have been determined to be irrelevant. The classification from the tree made an error of 6 out of 150, which is 4%. Therefore, it can be said that the error rate for this tree on the training data is 4%.

(Table 1 lists the sepal length, sepal width, petal length, petal width and class of a sample of instances from each of the three iris categories; the numeric values are not reproduced here.)

Table 1: Sample iris data.
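The tree in figure 1 is small enough to write out directly as a prediction function, which makes the interpretation above explicit. This is a sketch built from the thresholds in the figure, not code from the thesis:

    def classify_iris(petalwidth):
        """Apply the decision tree of figure 1 to a single instance."""
        if petalwidth < 0.8:
            return "Iris-setosa"       # leaf 1: 50 of 50 correct on training data
        elif petalwidth < 1.65:
            return "Iris-versicolor"   # leaf 2: 48 of 52, 7.7% node error
        else:
            return "Iris-virginica"    # leaf 3: 46 of 48 are the majority class

    print(classify_iris(1.4))  # -> Iris-versicolor, as in the example above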

Root (150 instances): Iris-virginica 33.3% (50), Iris-versicolor 33.3% (50), Iris-setosa 33.3% (50)
  Petalwidth < 0.8        -> Leaf (50): Iris-setosa 100.0% (50)
  0.8 <= Petalwidth < 1.65 -> Leaf (52): Iris-versicolor 92.3% (48), Iris-virginica 7.7% (4)
  Petalwidth >= 1.65       -> Leaf (48): Iris-virginica 95.8% (46), Iris-versicolor 4.2% (2)

Figure 1: A decision tree produced for the iris data set.

According to [43], the first task for C4.5 is to decide which of the non-target variables is the best variable on which to split the instances; in the example above, the petalwidth variable was chosen. To choose this attribute, at a node the decision tree algorithm considers each attribute field in turn (for example petalwidth, petallength, sepallength and sepalwidth in the case of the iris data), and every possible split is tried. C4.5 uses a criterion called the information gain ratio to compare the value of potential splits. The gain ratio provides an estimate of how likely a split on a variable is to lead to a leaf which contains few errors or has low disorder. Disorder is a measure of how pure a given node is: a node with high disorder contains instances of multiple target classes, while a node with low disorder contains instances of mostly one target class. The gain ratio is calculated for all the variables, and the variable with the largest gain ratio is chosen as the split variable. The tree grows in a similar manner: for each child node of the root node, the decision tree algorithm examines all the remaining attributes to find candidates for splitting. If a field takes on only one value, it is eliminated from consideration, since there is no way it can be used to make a split. The best split for each of the remaining attributes is determined. When all cases in a node are of the same type, the node is a leaf node.
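As an illustration of the split criterion, the sketch below computes the gain ratio for a categorical attribute: the information gain of the split divided by the entropy of the partition itself. It is an illustrative implementation written for this discussion, not code from the thesis or from C4.5 itself.

    import math
    from collections import Counter

    def entropy(labels):
        """Disorder of a node: 0 when pure, high when classes are mixed."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain_ratio(attribute_values, labels):
        """Information gain of splitting on the attribute, divided by split info."""
        n = len(labels)
        groups = {}
        for v, label in zip(attribute_values, labels):
            groups.setdefault(v, []).append(label)
        remainder = sum(len(g) / n * entropy(g) for g in groups.values())
        gain = entropy(labels) - remainder
        split_info = entropy(attribute_values)   # entropy of the partition itself
        return gain / split_info if split_info > 0 else 0.0

At each node the attribute (and, for numeric attributes, the threshold) with the largest gain ratio wins and becomes the split.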

But how good is this tree at classifying unknown data? Perhaps not very good, as it was built using training data only, which could lead to overfitting. So how does C4.5 avoid this problem? C4.5 uses a method called pruning. There are two types of pruning: prepruning and postpruning.

Postpruning refers to building a complete tree and pruning it afterwards. Postpruning makes the tree less complex, and probably more general, by replacing a subtree with a leaf or with the most common branch. When this is done, the leaf will correspond to several classes, but its label will be the most common class in the leaf (as was the case in figure 1). A parameter that affects postpruning is the confidence level: lower confidence values cause more drastic pruning. The default confidence value is 25%.

Prepruning involves deciding when to stop developing subtrees during the tree building process. For example, specifying the minimum number of observations in a leaf can determine the size of the tree. The default value for the minimum number of instances is 2. By default, C4.5 uses postpruning only, but it can use prepruning.

After a tree is constructed, the C4.5 rule induction program can be used to produce a set of equivalent rules. The rules are formed by writing a rule for each path in the tree and then eliminating any unnecessary antecedents and rules. An example of the rules produced from the decision tree in figure 1 is shown in figure 2. Rule 1, for example, shows that if petalwidth is less than 0.8 then the instance belongs in node 2, which has 50 observations and is classified as Iris-setosa.

IF Petalwidth < 0.8
THEN NODE: 2, N: 50
  IRIS-VIRGINICA: 0.0%
  IRIS-VERSICOLOR: 0.0%
  IRIS-SETOSA: 100.0%

IF 0.8 <= Petalwidth < 1.65
THEN NODE: 3, N: 52
  IRIS-VIRGINICA: 7.7%
  IRIS-VERSICOLOR: 92.3%
  IRIS-SETOSA: 0.0%

IF 1.65 <= Petalwidth
THEN NODE: 4, N: 48
  IRIS-VIRGINICA: 95.8%
  IRIS-VERSICOLOR: 4.2%
  IRIS-SETOSA: 0.0%

Figure 2: Rule output produced from the SAS Enterprise Miner software.

C4.5 is currently one of the most commonly used data mining algorithms and is available in many commercial data mining products. The ease of its interpretability, as well as its methods for dealing with numeric attributes, missing values and noisy data and for generating rules from trees, make it a very good choice for practical classification. C4.5 was successfully used in the automated identification of bat calls using 160 reference calls from eight bat species. The automated identification of pulse parameters led to good results for species with distinct differences in calls, with four out of eight species classified correctly in 95% of attempts [24].

Rule Learner (PART)

The PART algorithm forms rules from pruned partial decision trees built using C4.5's heuristics. According to Witten and Frank [58], the main advantage of PART over C4.5 is that, unlike C4.5, the rule learner algorithm does not need to perform global optimization to produce accurate rule sets. To make a single rule, a pruned decision tree is built, the leaf with the largest coverage is made into a rule, and the tree is discarded. This avoids overfitting by generalizing only once the implications are known. For example, going back to figure 1, PART would consider the first branch in the tree, build the rule "if petalwidth is less than 0.8 then the plant is Iris-setosa", and discard all the Iris-setosa instances from consideration. It continues with similar rules for the rest of the tree.

As for C4.5, the parameters that affect the performance of the algorithm are the minimum number of instances in each leaf and the confidence threshold for pruning. Frank and Witten [20] describe the results of an experiment performed on multiple data sets: PART outperformed the C4.5 algorithm on 9 occasions, whereas C4.5 outperformed PART on others.

OneR (1R)

OneR is one of the simplest classification algorithms. As described by Holte [26], OneR produces simple rules based on one attribute only. It generates a one-level decision tree, which is expressed as a set of rules that all test one particular attribute. It is a simple, cheap method that often comes up with quite good rules for characterizing the structure in data [59], and it often achieves reasonable accuracy on many tasks by simply looking at one attribute. An example of a classification performed by OneR on the iris data set is shown in figure 3. As can be seen from the figure, OneR produced rules stating that when the petallength is less than 2.45 the iris plant is classified as Iris-setosa, when the petallength is at least 2.45 and less than 4.75 it is classified as Iris-versicolor, and when the petallength is greater than or equal to 4.75 it is classified as Iris-virginica. This gave 143 correct classifications out of 150 on the training data, an error rate of 4.7%.

## 1R Rule Output
% rule for 'petallength':
'class'('iris-setosa') :- 'petallength'(X), X < 2.45.     % 50/50
'class'('iris-versicolor') :- 'petallength'(X), X < 4.75. % 44/50
'class'('iris-virginica') :- 'petallength'(X), 4.75 =< X. % 48/50
% 1Rw Error Rate 4.7% (143/150) (on training set)

Figure 3: Partial output from the WEKA OneR program.

A comprehensive study of the performance of the OneR algorithm by Holte [26] covered sixteen data sets frequently used by machine learning researchers to evaluate their algorithms. Cross validation was used to ensure that the results were representative of what would be obtained on independent test sets. The research found that OneR performed very well in comparison with other, more complex algorithms, and Holte encourages the use of simple data mining algorithms like OneR to establish a performance baseline before progressing to more sophisticated learning algorithms.
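OneR is simple enough to sketch in full for categorical attributes: for each attribute, predict the majority class of each attribute value, then keep the attribute whose rules make the fewest training errors. (Numeric attributes such as petallength must first be discretized into intervals, as in figure 3; that step is omitted here.) The sketch below is illustrative, not the WEKA implementation:

    from collections import Counter

    def one_r(rows, labels):
        """rows: list of attribute-value tuples. Returns the chosen attribute
        index, its value-to-class rules, and the training error rate (%)."""
        best = None
        for a in range(len(rows[0])):
            # For attribute a, map each observed value to its majority class.
            by_value = {}
            for row, label in zip(rows, labels):
                by_value.setdefault(row[a], []).append(label)
            rules = {v: Counter(ls).most_common(1)[0][0]
                     for v, ls in by_value.items()}
            errors = sum(rules[row[a]] != label
                         for row, label in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, a, rules)
        errors, a, rules = best
        return a, rules, 100.0 * errors / len(labels)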

IBK

IBK is an implementation of the k-nearest-neighbors classifier. Each case is considered as a point in multi-dimensional space, and classification is done based on the nearest neighbors. The value of k determines how many cases are considered as neighbors when deciding how to classify an unknown instance. For example, for the iris data, IBK would consider the 4-dimensional space formed by the four input variables. A new instance is classified as belonging to the class of its closest neighbor using the Euclidean distance measure. If 5 is used as the value of k, then the 5 closest neighbors are considered, and the class of the new instance is taken to be the class of the majority of those instances: if 3 of the 5 closest neighbors are of type Iris-setosa, then the class of the test instance is assigned as Iris-setosa.

The time taken to classify a test instance with a nearest-neighbor classifier increases linearly with the number of training instances kept in the classifier, and it has a large storage requirement [59]. Its performance degrades quickly with increasing noise levels, and it also performs badly when different attributes affect the outcome to different extents. One parameter that can affect the performance of the IBK algorithm is the number of nearest neighbors used; by default it uses just one. IBK has been used for gesture recognition, as discussed by Kadous [30]. With 95 signs collected from 5 people, for a total of 6650 instances, the accuracy obtained was approximately 80 percent. The signs used were very similar to each other, and an accuracy of 80 percent was considered very high. This research also found that instance-based learning was better than C4.5 at the gesture tasks tested.

Naïve Bayes

The Naïve Bayes classification algorithm is based on Bayes' rule, which is used to compute the probabilities that are used to make predictions. Naïve Bayes assumes that the input attributes are statistically independent. It analyses the relationship between each input attribute and the dependent attribute to derive a conditional probability for each relationship [11]. These conditional probabilities are then combined to classify new cases. An advantage of the Naïve Bayes algorithm over some other algorithms is that it requires only one pass through the training set to generate a classification model. Naïve Bayes works very well when tested on many real world data sets [58] and can obtain results that are much better than those of other, more sophisticated algorithms. However, if a particular attribute value does not occur in the training set in conjunction with every class value, Naïve Bayes may not perform very well. It can also perform poorly on some data sets because the attributes are treated as though they were independent, whereas in reality they are correlated.
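A minimal sketch of the Naïve Bayes computation for categorical attributes follows. Training is the single counting pass mentioned above; prediction multiplies the class prior by the per-attribute conditional probabilities, and an add-one smoothing term (an assumption of this sketch, one common way to handle the zero-frequency problem noted above) keeps unseen attribute values from zeroing out a class:

    from collections import Counter, defaultdict

    def train_naive_bayes(rows, labels):
        """One pass over the training data: count classes and
        (class, attribute, value) co-occurrences."""
        class_counts = Counter(labels)
        value_counts = defaultdict(Counter)   # (class, attr index) -> value counts
        for row, label in zip(rows, labels):
            for a, v in enumerate(row):
                value_counts[(label, a)][v] += 1
        return class_counts, value_counts

    def predict_naive_bayes(model, row):
        class_counts, value_counts = model
        n = sum(class_counts.values())
        best_class, best_score = None, 0.0
        for c, count in class_counts.items():
            score = count / n                  # class prior P(c)
            for a, v in enumerate(row):
                counts = value_counts[(c, a)]
                # Add-one smoothing avoids zero conditional probabilities.
                score *= (counts[v] + 1) / (count + len(counts) + 1)
            if best_class is None or score > best_score:
                best_class, best_score = c, score
        return best_class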

Kernel Density

The Kernel Density algorithm works in a very similar fashion to Naïve Bayes. The main difference is that, unlike Naïve Bayes, Kernel Density does not assume a normal distribution of the data; instead it tries to fit a combination of kernel functions. According to Beardah and Baxter [2], kernel density estimates are similar to histograms but provide a smoother representation of the data. Beardah and Baxter illustrate some of the advantages of kernel density estimates for data presentation in archaeology, showing that they can be used as a basis for producing contour plots of archaeological data, which leads to a useful graphical representation of the data.
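As an illustration of the "smoother histogram" idea, a one-dimensional Gaussian kernel density estimate places a small bell curve on every training point and averages them. The sketch below is illustrative only; the bandwidth h is an assumed tuning parameter, not a value from the thesis:

    import math

    def kde(x, samples, h=0.5):
        """Estimated density at x: the average of Gaussian kernels,
        each centred on one training sample and scaled by bandwidth h."""
        k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
        return sum(k((x - s) / h) for s in samples) / (len(samples) * h)

A classifier of this kind fits one such estimate per class and attribute and combines them in the same way Naïve Bayes combines its conditional probabilities.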

2.2 Unsupervised Learning

Unsupervised learning deals with finding clusters of records that are similar in some way. As discussed earlier, unsupervised learning does not require a target variable for analysis. According to Berry and Gordon [4], unsupervised learning is often useful when there are many competing patterns in the data, making it hard to spot any single pattern. Building clusters of similar records reduces the complexity within clusters so that other data mining techniques are more likely to succeed. In unsupervised learning, the main concern is obtaining clusters in the data that have useful patterns.

Unsupervised Data Mining Algorithms Used in This Thesis

K-means Clustering

K-means clustering works as follows: first the desired number of clusters (k) is specified, then the algorithm selects k cluster seeds (centers) located approximately uniformly in a multi-dimensional space. Each observation is assigned to the nearest cluster mean to form temporary clusters. The cluster mean positions are then recalculated and used as new cluster centers, and the observations are reallocated to clusters according to the new centers. This is repeated until no further change in the cluster centers occurs. The observations are assigned to clusters so that every observation belongs to at most one cluster [57].

According to Weiss and Indurkhya [56], not all variables are equally important in determining the clusters. For each variable, an importance value between 0 and 1 is computed to represent the relative importance of that variable to the formation of the clusters. Variables that have the greatest contribution to the cluster profile have importance values closer to 1. A decision tree analysis can be used to calculate the relative importance values from a selected sample of the training data; the first split is the most important. It has been found that variables having large variance tend to have more effect on the resulting clusters than variables with small variance. Some implementations of k-means clustering use these importance values when assigning cases to clusters [48].

Kohonen Vector Quantization

Kohonen Vector Quantization is a clustering method invented by Kohonen [48]. The algorithm is similar to the k-means clustering algorithm, but the original seeds, called codebook vectors, are totally random. The algorithm finds the seed closest to each training case in the multidimensional space and moves that "winning" seed closer to the training case. The seed is moved a certain proportion of the distance between it and the training case, where the proportion is specified by the learning rate [48].
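A minimal sketch of the k-means loop just described: assign each observation to its nearest center, recompute the centers as cluster means, and stop when the centers no longer move. This is an illustrative NumPy implementation (with random rather than uniform seeding), not the software used in the thesis:

    import numpy as np

    def k_means(X, k, seed=0, max_iter=100):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # initial seeds
        for _ in range(max_iter):
            # Assign every observation to the nearest cluster center.
            distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assignment = distances.argmin(axis=1)
            # Recompute each center as the mean of its temporary cluster.
            new_centers = np.array([
                X[assignment == j].mean(axis=0) if np.any(assignment == j)
                else centers[j]
                for j in range(k)])
            if np.allclose(new_centers, centers):                # no further change
                break
            centers = new_centers
        return centers, assignment

Kohonen Vector Quantization replaces the batch mean update with an incremental one: for each training case, only the winning seed moves, by learning_rate * (case - seed).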

Autoclass (Bayesian Classification System)

Autoclass is an unsupervised Bayesian classification system that infers classes based on Bayesian statistics [14]. It divides the problem into two parts: the calculation of the number of classes and the estimation of the classification parameters. It uses the Expectation Maximization (EM) algorithm to estimate the parameter values that best fit the data for a given number of classes; EM is an approximation algorithm that finds a local maximum of the likelihood. By default, Autoclass fits a normal probability distribution for numeric data and a multinomial distribution for symbolic data. According to Cheeseman and Stutz [14], Autoclass can consider different underlying probability distribution types for the numeric attributes and is computationally intensive. Autoclass was developed at NASA and has been used for extracting useful information from databases [14], for example from Infrared Astronomical Satellite (IRAS) data [21].

2.3 Related Work on Comparison of Classifiers

Lim and Loh [35] compare the prediction accuracy, complexity and training time of different classification algorithms. The paper discusses the results of a comparison of twenty-two decision tree, nine statistical and two neural network algorithms on thirty-two data sets in terms of classification accuracy, training time and (in the case of trees) number of leaves. Some of the twenty-two decision tree algorithms compared are CART, S-Plus tree, C4.5, FACT (Fast Classification Tree), QUEST, IND, OC1, LMDT, CAL5 and T1. The statistical algorithms compared include LDA (Linear Discriminant Analysis), QDA (Quadratic Discriminant Analysis), NN (Nearest Neighbor), LOG (Logistic Discriminant Analysis), FDA (Flexible Discriminant Analysis), PDA (Penalized LDA), MDA (Mixture Discriminant Analysis) and POL (the POLYCLASS algorithm). The neural network algorithms compared include LVQ (Learning Vector Quantization) and RBF (Radial Basis Function).

The paper revealed that POLYCLASS, which provides estimates of conditional class probabilities, performed better than the other algorithms, although its accuracy was not statistically significantly different from that of twenty other algorithms. Another statistical algorithm, logistic regression, was second with respect to the two accuracy criteria. The most accurate decision tree algorithm was QUEST with linear splits, which was ranked fourth. It was noted that although spline-based statistical algorithms tend to have good accuracy, they also require relatively long training times; POLYCLASS, for example, was third last in terms of median training time.

The research found that, among decision tree algorithms with univariate splits, C4.5, IND-CART and QUEST had the best combinations of error rate and speed, though C4.5 tends to produce trees with twice as many leaves as those from IND-CART and QUEST. The main conclusion from this research was that the mean error rates of many algorithms are sufficiently similar that their differences are statistically insignificant and probably also insignificant in practical terms. However, as will be discussed later, using default settings, this thesis found that there were significant differences in error rates among the different algorithms used.

The STATLOG project [12] presented the results of an evaluation of the performance of machine learning, neural and statistical algorithms on large-scale, complex commercial and industrial problems. The overall aim was to give an objective assessment of the potential of classification algorithms for solving significant commercial and industrial problems. Some of the twenty-four algorithms compared in the STATLOG project are Alloc80, Ac2, BayTree, NewId, Dipol92, C4.5, Cart, Cal5, Kohonen, Bayes and Cascade. The data sets used for the STATLOG project are from the UCI repository. On test data, the algorithm Alloc80, followed by Ac2 and BayTree, performed better than the rest. Alloc80 and BayTree are statistical classifier algorithms, whereas Ac2 is a decision tree algorithm.

Salzberg [46] cautions that care is required when comparing different algorithms, and discusses the dangers to avoid and a recommended approach for comparing data mining algorithms. The main claims made by the paper are:

- Finding a good classification algorithm requires very careful thought about experimental design. If not done carefully, comparative studies of classification and other types of algorithms can easily result in statistically invalid conclusions; this is especially true when data mining techniques are used to analyze very large databases, which inevitably contain some statistically unlikely data.
- Comparative analysis is more important in evaluating some types of algorithms than others.

The key recommendations made by Salzberg [47] regarding the comparison of algorithms are:

- Data miners must be careful not to rely too heavily on stored repositories such as the UCI repository, because it is difficult to produce major new results using well studied and widely shared data.
- Data miners should follow a proper methodology that allows the designer of a new algorithm to establish the new algorithm's comparative merits.

Chapter 3. Data Generation

This chapter discusses the data generation phase of this thesis, which involved collecting data sets and applying each of 6 supervised algorithms to each of 59 data sets.

3.1 Collection of Data Sets

To achieve the goal of applying multiple data mining algorithms to multiple data sets, a search for data sets was necessary. Data sets were mainly obtained through the Internet, particularly from the UCI data set collection. Fifty-nine data sets were collected and used to perform the experiments. The number of attributes of the data sets used ranged from 3 to 76, while the number of observations ranged from as few as 13.

3.2 Selection of Data Mining Algorithms

Selecting types of data mining algorithms that could be run on all the data sets collected was very important, to minimize missing values in the file produced by the data generation phase. The 6 data mining algorithms chosen for this experiment are Rule Learner, OneR, Kernel Density, IBK, C4.5 and Naïve Bayes. These algorithms are described in detail in chapter 2.

3.3 Generating Data

Once the data sets and algorithms to use for the experiment were chosen, the actual data generation was conducted by running the 6 data mining algorithms on the 59 data sets. Default settings were used for all algorithms, and cross validation with 10 folds was used for testing. Once all runs were completed, the results were stored in one file, which was later used in the clustering analysis. The percentage of incorrectly classified instances for each algorithm on each data set, for both training and cross validation, was stored in this file, together with the size and the number of attributes of each data set. Table 2 shows the complete result of the data generation process. For example, it shows that for the anneal data set, which has 38 attributes and 898 instances, the IBK error rate on training data was 5.90 percent, whereas on the test data it was 5.57 percent. The definition of each variable is given in table 3.
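Appendix F contains the actual script used for these runs. As an illustration only, a driver of this kind might look like the sketch below, which assumes WEKA-style classifiers invocable from the command line with a -t (training file) and -x (number of cross validation folds) option; the class names, file names and output handling are hypothetical placeholders, not the thesis code:

    import subprocess

    # Hypothetical classifier class names; real names depend on the WEKA version.
    ALGORITHMS = {"C45": "weka.classifiers.j48.J48",
                  "OR": "weka.classifiers.OneR"}
    DATASETS = ["anneal.arff", "audiology.arff"]   # ... all 59 data sets

    with open("results.txt", "w") as out:
        for data in DATASETS:
            for name, cls in ALGORITHMS.items():
                # Default settings; 10-fold cross validation via the assumed -x flag.
                run = subprocess.run(["java", cls, "-t", data, "-x", "10"],
                                     capture_output=True, text=True)
                # Store the raw output; error rates are parsed from output
                # of the kind shown in Appendix G.
                out.write(f"=== {data} {name} ===\n{run.stdout}\n")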

(Table 2 spans two pages and lists, for each of the 59 data sets, the number of attributes, the number of instances, and the training and cross validation error rates of IBK, C4.5, Rule Learner, Naïve Bayes, OneR and Kernel Density. The data sets are: Anneal, Audiology, Balance-scale, Breast-cancer, Breast-w, Colic, Credit-a, Credit-g, Diabetes, Glass, Heart-c, Heart-statlog, Iris, Kr-vs-kp, Labor, Segment, Sick, Sonar, Soybean, Autos, Heart-h, Hepatitis, Lymph, Mushroom, Primary-tumor, Splice, vehicle, vote, vowel, Waveform, AutoPrice, baskball, bodyfat, bolts, BreastTumor, Cleveland, cloud, cpu, detroit, EchoMonths, elusage, Fishcatch, Gascons, housing, Hungarian, longley, lowbwt, Mbagrade, meta, Pharynx, Pollution, PwLinear, quake, Schlvote, servo, sleep, strike, veteran and Vineyard. The numeric values are not reproduced here.)

Table 2: Results obtained from applying 6 data mining algorithms to 59 data sets. Blanks indicate situations where algorithms gave no result.

Note that the word TRAIN in the table indicates the percentage of incorrectly classified training cases, whereas TEST indicates the percentage of incorrectly classified cases under cross validation. For example, NB_TRAIN indicates the percentage of incorrectly classified instances (error rate) for the Naïve Bayes algorithm on training data. The definition of each variable is shown below.

Name | Definition
NB_TRAIN | Naive Bayes Training Error (%)
NB_TEST | Naive Bayes Testing Error (%)
C45_TRAIN | C4.5 Training Error (%)
C45_TEST | C4.5 Testing Error (%)
OR_TRAIN | OneR Training Error (%)
OR_TEST | OneR Testing Error (%)
RL_TRAIN | Rule Learner Training Error (%)
RL_TEST | Rule Learner Testing Error (%)
KR_TRAIN | Kernel Density Training Error (%)
KR_TEST | Kernel Density Testing Error (%)
IBK_TRAIN | IBK Training Error (%)
IBK_TEST | IBK Testing Error (%)
NUM_INS | Number of Instances
NUM_ATTR | Number of Attributes

Table 3: Definition of variables used.

Table 4 summarizes table 2, providing the minimum, maximum, mean, standard deviation and missing percentage of each numeric variable. For example, it shows that for the Kernel Density algorithm on training data, the minimum error rate was 0 percent and the mean 2.53 percent. It also shows that 5% of the values were missing for this algorithm on training data, indicating that no result was found for some data sets. Table 4 also shows the overall performance of the 6 algorithms in classifying the 59 data sets; the table is sorted by the mean error rates of each algorithm for both the train and the test cases. The training results indicate that Kernel Density (KR_TRAIN), with an average error rate of 2.53 percent, followed by Rule Learner (RL_TRAIN), with an average error rate of 8.79 percent, and C4.5 (C45_TRAIN) had lower training errors than the other algorithms. More importantly, the cross validation results show that Kernel Density (KR_TEST), followed by C4.5 and Naïve Bayes, performed better than the other algorithms.

(Table 4 lists the mean, minimum, maximum, standard deviation and missing percentage for each variable, in the order KR_TRAIN, RL_TRAIN, C45_TRAIN, IBK_TRAIN, NB_TRAIN, OR_TRAIN, KR_TEST, C45_TEST, NB_TEST, RL_TEST, IBK_TEST, OR_TEST, NUM_INS, NUM_ATTR; the numeric values are not reproduced here.)

Table 4: Summary of the data gathered from running the 6 data mining algorithms on 59 data sets.

Chapter 4. Clustering and Pattern Analysis

To analyze the data generated by applying the 6 data mining algorithms to the 59 data sets (table 2), unsupervised learning algorithms were run on it. The 3 algorithms used are k-means clustering using least squares, Kohonen Vector Quantization, and Autoclass Bayesian analysis; these algorithms are described in section 2.2.1. The results of the unsupervised clustering analysis are discussed in the next four sections, followed by a summary and comparison of the results.

4.1 Results from K-means Clustering

Table 5 shows the ranking of variables resulting from the application of the k-means algorithm to the data generated (table 2). A value of 5 was used for the maximum number of clusters. This value was chosen because (1) more than 5 clusters in a data set of 59 cases are unlikely to be useful, and (2) preliminary runs of the algorithm suggested there were 3 to 5 clusters. As shown in table 5, only the number of instances is significant in determining the clusters. This model gives five clusters with 52, 3, 2, 1 and 1 observations. The table shows the name, importance, measurement type and label of each variable. For example, it indicates that the NUM_INS (number of instances) variable has an importance level of 1 and that it is a numeric interval variable. Numeric variables containing values that vary across a continuous range are shown as interval variables.

NAME | IMPORTANCE | MEASUREMENT | TYPE | LABEL
NUM_INS | 1 | interval | Num | Number of Instances
KR_TEST | 0 | interval | Num | Kernel Density Test
KR_TRAIN | 0 | interval | Num | Kernel Density Train
OR_TEST | 0 | interval | Num | OneR Test
OR_TRAIN | 0 | interval | Num | OneR Train
NB_TEST | 0 | interval | Num | Naïve Bayes Test
NB_TRAIN | 0 | interval | Num | Naïve Bayes Train
RL_TEST | 0 | interval | Num | Rule Learner Test
RL_TRAIN | 0 | interval | Num | Rule Learner Train
C45_TEST | 0 | interval | Num | C45 Test
C45_TRAIN | 0 | interval | Num | C45 Train
IBK_TEST | 0 | interval | Num | IBK Test
IBK_TRAIN | 0 | interval | Num | IBK Train
NUM_ATTR | 0 | interval | Num | Number of Attributes

Table 5: Importance level of variables in determining clusters.
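The dominance of NUM_INS is an instance of the variance effect noted in section 2.2.1: instance counts are typically orders of magnitude larger than the error-rate percentages, so squared Euclidean distances are dominated by that one variable. One standard remedy (an illustration here, not the approach taken in the thesis, which instead re-ran the analysis without the number of instances, as the following section shows) is to standardize each variable before clustering:

    import numpy as np

    def standardize(X):
        """Scale each column to zero mean and unit variance so that no single
        variable (such as NUM_INS) dominates the distance computation."""
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        std[std == 0] = 1.0   # guard against constant columns
        return (X - mean) / std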


Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

How To Do Data Mining In R

How To Do Data Mining In R Data Mining with R John Maindonald (Centre for Mathematics and Its Applications, Australian National University) and Yihui Xie (School of Statistics, Renmin University of China) December 13, 2008 Data

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

Data Mining Solutions for the Business Environment

Data Mining Solutions for the Business Environment Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com

More information

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of

More information

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Data Mining: Overview. What is Data Mining?

Data Mining: Overview. What is Data Mining? Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin Data Mining for Customer Service Support Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin Traditional Hotline Services Problem Traditional Customer Service Support (manufacturing)

More information

Data Mining Techniques for Prognosis in Pancreatic Cancer

Data Mining Techniques for Prognosis in Pancreatic Cancer Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree

More information

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1 Data Mining 1 Introduction 2 Data Mining methods Alfred Holl Data Mining 1 1 Introduction 1.1 Motivation 1.2 Goals and problems 1.3 Definitions 1.4 Roots 1.5 Data Mining process 1.6 Epistemological constraints

More information

Data mining and statistical models in marketing campaigns of BT Retail

Data mining and statistical models in marketing campaigns of BT Retail Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next

More information

Studying Auto Insurance Data

Studying Auto Insurance Data Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

More information

Decision-Tree Learning

Decision-Tree Learning Decision-Tree Learning Introduction ID3 Attribute selection Entropy, Information, Information Gain Gain Ratio C4.5 Decision Trees TDIDT: Top-Down Induction of Decision Trees Numeric Values Missing Values

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association

More information

Data Mining Techniques and its Applications in Banking Sector

Data Mining Techniques and its Applications in Banking Sector Data Mining Techniques and its Applications in Banking Sector Dr. K. Chitra 1, B. Subashini 2 1 Assistant Professor, Department of Computer Science, Government Arts College, Melur, Madurai. 2 Assistant

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA Welcome Xindong Wu Data Mining: Updates in Technologies Dept of Math and Computer Science Colorado School of Mines Golden, Colorado 80401, USA Email: xwu@ mines.edu Home Page: http://kais.mines.edu/~xwu/

More information

Performance Analysis of Decision Trees

Performance Analysis of Decision Trees Performance Analysis of Decision Trees Manpreet Singh Department of Information Technology, Guru Nanak Dev Engineering College, Ludhiana, Punjab, India Sonam Sharma CBS Group of Institutions, New Delhi,India

More information

COLLEGE OF SCIENCE. John D. Hromi Center for Quality and Applied Statistics

COLLEGE OF SCIENCE. John D. Hromi Center for Quality and Applied Statistics ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM COLLEGE OF SCIENCE John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE: COS-STAT-747 Principles of Statistical Data Mining

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

More information

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

More information

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI Data Mining Knowledge Discovery, Data Warehousing and Machine Learning Final remarks Lecturer: JERZY STEFANOWSKI Email: Jerzy.Stefanowski@cs.put.poznan.pl Data Mining a step in A KDD Process Data mining:

More information

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.

More information

Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition

Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition Brochure More information from http://www.researchandmarkets.com/reports/2170926/ Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.7 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Linear Regression Other Regression Models References Introduction Introduction Numerical prediction is

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information

A Review of Missing Data Treatment Methods

A Review of Missing Data Treatment Methods A Review of Missing Data Treatment Methods Liu Peng, Lei Lei Department of Information Systems, Shanghai University of Finance and Economics, Shanghai, 200433, P.R. China ABSTRACT Missing data is a common

More information

A Review of Data Mining Techniques

A Review of Data Mining Techniques Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

Impact of Boolean factorization as preprocessing methods for classification of Boolean data Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,

More information

Overview. Background. Data Mining Analytics for Business Intelligence and Decision Support

Overview. Background. Data Mining Analytics for Business Intelligence and Decision Support Mining Analytics for Business Intelligence and Decision Support Chid Apte, PhD Manager, Abstraction Research Group IBM TJ Watson Research Center apte@us.ibm.com http://www.research.ibm.com/dar Overview

More information

REVIEW OF ENSEMBLE CLASSIFICATION

REVIEW OF ENSEMBLE CLASSIFICATION Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

More information

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University

More information