Data Mining of Machine Learning Performance Data

Remzi Salih Ibrahim

Master of Applied Science (Information Technology), 1999

RMIT University

Abstract

With the development and penetration of data mining within different fields and industries, many data mining algorithms have emerged. Selecting a good data mining algorithm to obtain the best result on a particular data set has become very important: what works well for one data set may not work well on another. The goal of this thesis is to find associations between classification algorithms and characteristics of data sets by first building a file of data sets, their characteristics and the performance of a number of algorithms on each data set, and second applying unsupervised clustering analysis to this file to analyze the generated clusters and determine whether there are any significant patterns. Six classification algorithms were applied to 59 data sets, and three clustering algorithms were then applied to the data generated. The patterns and properties of the clusters formed were then studied. The six classification algorithms used were OneR (1R), Kernel Density, Naïve Bayes, C4.5, Rule Learner and IBK. The clustering algorithms used were k-means clustering, Kohonen Vector Quantization, and Autoclass Bayesian clustering. The major discovery made by analyzing the generated clusters is that the clusters were formed based on the accuracy of the algorithms: the data sets were grouped into clusters with error rates lower than average, about average, or higher than average relative to the population. This suggests that there are three kinds of data sets among the 59 considered: easy-to-learn, moderate-to-learn, and hard-to-learn data sets. Another discovery made by this thesis is that the number of instances in a data set was not useful for clustering analysis of the machine learning performance data. It was the only significant variable in clustering the data sets, and it prevented analysis based on other variables, including the variables that contain values for the accuracy of each classification algorithm. While not directly relevant to clustering, it was also found that the number of instances and the number of attributes in the data sets do not have a strong influence on the performance of the data mining algorithms on the 59 data sets considered, as high error rates were obtained both for small data sets with a small number of attributes and for large data sets with a large number of attributes. Experiments performed for this thesis also allowed the comparison of the performance of the 6 classification algorithms with their default parameter settings. It was discovered that, in terms of performance, the top three algorithms were Kernel Density, C4.5, and Naïve Bayes, followed by Rule Learner, IBK and OneR.

Declaration

I certify that all work on this thesis was carried out between June 1998 and June 2000 and that it has not been submitted for any academic award at any other college, institute or university. The work presented was carried out under the supervision of Dr. Vic Ciesielski. All other work in the thesis is my own except where acknowledged in the text.

Signed, Remzi Salih Ibrahim, June

Table of Contents

List of Tables
List of Graphs
Acknowledgements
Chapter 1. Introduction
  1.1 Goals
  1.2 Scope
Chapter 2. Literature Survey
  2.1 Supervised Learning
    Supervised Algorithms Used in This Thesis
      C4.5
      Rule Learner (PART)
      OneR (1R)
      IBK
      Naïve Bayes
      Kernel Density
  2.2 Unsupervised Learning
    Unsupervised Data Mining Algorithms Used in This Thesis
      K-means Clustering
      Kohonen Vector Quantization
      Autoclass (Bayesian Classification System)
  2.3 Related Work on Comparison of Classifiers
Chapter 3. Data Generation
  3.1 Collection of Data Sets
  3.2 Selection of Data Mining Algorithms
  3.3 Generating Data
Chapter 4. Clustering and Pattern Analysis
  4.1 Results from K-means Clustering
  Results from K-means Clustering (without Number of Instances)
  Results from Kohonen Vector Quantization Clustering
  Results from Autoclass (Bayesian Classification System)
  Analysis Comparison
    Comparison of Significant Variables
    Comparison of Data Sets in Different Clusters
    Influence of Characteristics of Data Sets on Performance of Classification Algorithms
Chapter 5. Conclusion
Appendix A. About WEKA
Appendix B. About Enterprise Miner (Commercial Software)
Appendix C. Detailed Results from K-means Analysis
Appendix D. Detailed Results from Kohonen Vector Quantization
Appendix E. Detailed Results from Autoclass Clustering Analysis
Appendix F. Sample Code Used to Run Multiple Algorithms on Multiple Data Sets
Appendix G. Sample Output from Data Generation
References

List of Tables

Table 1: Sample iris data
Table 2: Results obtained from applying 6 data mining algorithms to 59 data sets; blanks indicate situations where algorithms gave no result
Table 3: Definition of variables used
Table 4: Summary of the data gathered from running the 6 data mining algorithms on 59 data sets
Table 5: Importance level of variables in determining clusters
Table 6: Properties of the 5 clusters
Table 7: Importance of variables in determining clusters
Table 8: General properties of the clusters from k-means analysis
Table 9: General properties of clusters and significant variables
Table 10: Complete list of data sets in each cluster and the values of the significant clustering variables
Table 11: Importance of variables in determining clusters in Kohonen Vector Quantization analysis
Table 12: General properties of the clusters from Kohonen Vector Quantization analysis
Table 13: General properties of clusters: mean error rates of the significant variables
Table 14: Complete list of data sets in each cluster and the values of the significant clustering variables
Table 15: Significance level of variables in Autoclass clustering
Table 16: Properties of clusters from Autoclass analysis
Table 17: Complete list of data sets in each cluster and the values of the significant clustering variables
Table 18: Comparison of significant variables found by the three clustering algorithms
Table 19: Summary of the significance of each variable for the 3 clustering algorithms
Table 20: Data sets in each column were grouped into one cluster by all three algorithms

List of Graphs

Figure 1: A decision tree produced for the iris data set
Figure 2: Rule output produced from the SAS Enterprise Miner software
Figure 3: Partial output from the WEKA OneR program

Acknowledgements

I would like to thank Dr. Vic Ciesielski for being supportive and very patient during the progress of my thesis. He has been very understanding of the problems that arise from working full time, studying part time and still having to fulfill family commitments. I would also like to thank Dr. Isaac Balbin for his support and his flexibility with the deadline. My thanks also go to the WEKA support team, the staff from SAS Institute Australia for their support, and all staff from the RMIT AI (Artificial Intelligence) group who inspired me to research in this field. I would like to thank all members of my family and my friends for their patience during the progress of my thesis.

Chapter 1. Introduction

In this current age of technology, data has become more readily available than ever. Using technologies like data warehousing, data is being stored in large quantities. The availability of such data has opened the door for new data analysis techniques to emerge. As Weiss and Indurkhya [55] explain, as the amount of data stored in existing information systems mushroomed, a new set of objectives for data management emerged. Mining data has become one of the important means of obtaining useful information.

The term data mining is defined by Fayyad, Piatetsky-Shapiro and Smyth [19] as the part of the Knowledge Discovery in Databases (KDD) process relating to methods for extracting patterns from data. The KDD process involves the complete steps of obtaining knowledge from data and includes selection, pre-processing, transformation and mining of data, followed by interpretation and evaluation of patterns.

Data mining has many advantages across different industries. It allows large historical data to be used as the background for prediction. The interpretation and evaluation of the patterns obtained by data mining produces new knowledge that decision-makers can act upon [42]. Data mining provides a means to obtain information that can support decision making and predict new business opportunities. For example, telecommunications companies, stock exchanges, and credit card and insurance companies use data mining to detect fraudulent use of their services; the medical industry uses data mining to predict the effectiveness of surgical procedures, medical tests, and medications; and retailers use data mining to assess the effectiveness of coupons and special events [41].

With the development and penetration of data mining within different fields, many data mining algorithms have emerged. The selection of a good data mining algorithm to obtain the best result on a particular data set has become very important: what works well for a particular data set may not work well on another. Furthermore, the No Free Lunch theorem by Wolpert and Macready [61, page 2] has established that it is impossible to say that any technique is better than another over the space of all problems. In particular, if algorithm A outperforms algorithm B on some cost functions then, loosely speaking, there must exist exactly as many other functions where B outperforms A. An example of the No Free Lunch theorem encountered in this thesis is the performance of the C4.5 and Rule Learner algorithms on the EchoMonths and Hungarian data sets (table 2). While C4.5 obtained an error rate of only 0.6 percent on EchoMonths, Rule Learner obtained a much higher error rate; yet on the Hungarian data set, Rule Learner outperformed C4.5.

While the No Free Lunch theorem has established that there can be no one 'best' learning algorithm, the question of 'What kinds of algorithms are best suited to what kinds of data?' remains open. While there has been some work comparing different algorithms on a range of data sets (the STATLOG project [12], Lim and Loh [35]), there has been little work on trying to characterize data sets (for example, big,

small, numeric, symbolic, mixed) and matching algorithms to data characteristics. With the emergence of hundreds of data mining algorithms today, such information will help data mining analysts make intelligent decisions in choosing an appropriate data mining algorithm for particular types of data mining files.

1.1 Goals

The major goal of this thesis is to find associations between classification algorithms and characteristics of data sets by a two-step process:

1. Build a file of data set names, their characteristics and the performance of a number of algorithms on each data set.
2. Apply unsupervised clustering to the file built in step 1, analyze the generated clusters and determine whether there are any significant patterns.

1.2 Scope

Due to time limitations for an MBC minor thesis, the scope of this thesis is restricted to:

- 6 supervised learning algorithms
- 59 small to medium size data sets, with the number of attributes ranging from 7 to 76
- running the 6 supervised algorithms on the 59 data sets using only the default settings of the algorithms
- using 3 unsupervised learning algorithms for cluster analysis
- characteristics of the data sets limited to only the number of attributes and the number of instances

Chapter 2. Literature Survey

Concepts and papers that are relevant to this thesis are discussed in this chapter. First, both supervised and unsupervised learning techniques are discussed, followed by descriptions of all the algorithms used in this thesis. Finally, three papers that are related to this thesis are discussed in detail.

Machine learning is described by Witten and Frank [58] as the acquisition of knowledge and the ability to use it. They explain that learning in data mining involves finding and describing structural patterns in data for the purpose of helping to explain that data and make predictions from it. For example, the data could contain examples of customers in the telecommunications industry who have switched to another service provider and some who have not; the output of learning could be a prediction of whether a particular customer will switch to another service provider. There are two common types of learning: supervised and unsupervised.

2.1 Supervised Learning

Learning or adaptation is supervised when there is a desired response that can be used by the system to guide the learning. Decision trees and neural nets are two common types of supervised learning. This type of learning always requires a target variable to predict. Supervised learning algorithms have been used in many applications, for example in seismic phase identification in the field of nuclear science [28] and for the prediction of tornados [36].

Supervised learning involves gathering the data to be used for data mining, identifying the target variable, breaking the data up into training and test data, and developing the classifier. The training data is used by the data mining algorithm to learn the data and build a classifier. The test data is used to evaluate the performance of the classifier on new data. The performance of a classifier is commonly measured by the percentage of incorrectly classified instances on the data used: the train error rate refers to the percentage of incorrectly classified instances on the training data, and the test error rate refers to the percentage of incorrectly classified instances on the test data.

One of the problems of supervised learning is overfitting [58]: the classifier works well on the training data but not on the test data. This happens when the model learns the training data too well. To get an indication of the amount of overfitting, the model should be tested using a test data set or cross validation. If, after training, the test error rate is approximately equal to the training error rate, the test error rate is an indication of the kind of generalization that will occur.

Cross validation is a method for estimating how well the classifier will perform on new data and is based on "resampling" [33]. Cross validation is useful when the amount of data is small, as it allows all of the data to be used for training. In k-fold cross validation, the data is divided into k subsets of equal size. The model is trained k times, each time leaving out one of the subsets from training and using only the omitted subset to compute the error rate. If k equals the sample size, this is called leave-one-out cross validation. Leave-one-out cross validation often works well for continuous error functions such as the mean squared error, but it may perform poorly for non-continuous error functions such as the number of misclassified cases [33]. A value of 10 for k is commonly used and is also used for this thesis.

Some data mining algorithms do not support continuous target variables. In such cases, binning or discretization is used. Binning is a method of converting continuous values into categorical values. For instance, if one of the independent variables is age, the values must be transformed from specific values into ranges such as "less than 20 years", "21 to 30 years", "31 to 40 years" and so on [4].
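To make the k-fold procedure above concrete, the following is a minimal Python sketch of cross validation as an error estimator. It assumes a classifier factory with fit and predict methods and NumPy arrays X (inputs) and y (targets); the names are illustrative and are not code from this thesis.

    import numpy as np

    def cross_validation_error(make_classifier, X, y, k=10, seed=1):
        """Estimate the test error rate (%) by k-fold cross validation."""
        rng = np.random.default_rng(seed)
        indices = rng.permutation(len(y))       # shuffle the data once
        folds = np.array_split(indices, k)      # k near-equal subsets
        errors = []
        for i in range(k):
            test_idx = folds[i]                                  # held-out fold
            train_idx = np.concatenate(folds[:i] + folds[i+1:])  # remaining k-1 folds
            model = make_classifier()
            model.fit(X[train_idx], y[train_idx])
            predictions = model.predict(X[test_idx])
            errors.append(np.mean(predictions != y[test_idx]))   # fold error rate
        return 100.0 * np.mean(errors)          # percent incorrectly classified

Setting k equal to len(y) turns this into leave-one-out cross validation; k=10 is the setting used throughout this thesis.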

Supervised Algorithms Used in This Thesis

The basic theory behind each of the 6 classification algorithms and the details of how each algorithm works are discussed in this section. The parameters that affect the performance of each algorithm are also discussed and, where possible, papers that describe successful applications are cited.

C4.5

C4.5 is a decision tree algorithm devised by Quinlan [43]. Decision trees are used to classify instances into different categories and are a common type of classification algorithm. First, what a decision tree is will be discussed, followed by the properties of the C4.5 algorithm.

The iris data set will be used to explain how a decision tree works. A sample of the iris data set is shown in table 1. The data contains the petal length, petal width, sepal length and sepal width of iris plants. There are three different categories of this plant: Iris-versicolor, Iris-virginica and Iris-setosa. This is shown as the class variable in table 1 and is the target variable for the iris data set. There are 50 cases of each category in the data set. The goal is to determine what distinguishes each category of iris plant from the others, so that it is possible to know which category an iris plant belongs to given the four input variables. An example of a decision tree produced from analysis of this data is shown in figure 1. The root node (top node) of the tree in figure 1 shows how many of each category are found before any analysis is made. There are three leaves in the tree. Each leaf is assigned the class of the majority of the instances in the leaf. For example, the second leaf in the tree in figure 1 is considered a leaf of class Iris-versicolor because the majority class in this leaf is Iris-versicolor, with 48 instances. The other class in this leaf has only 4 observations, and the node has an error rate of 7.7%.

The decision tree can be interpreted as follows: an iris plant with petalwidth less than 0.8 is classified as Iris-setosa, and an iris plant with petalwidth greater than 0.8 and less than 1.65 is categorized as Iris-versicolor. All the rest (with petalwidth greater than 1.65) are classified as Iris-virginica. Based on this tree, an unknown iris plant with petalwidth of 1.4 would be classified as Iris-versicolor. Note that only petalwidth is used to classify the instances; all the other input variables have been determined to be irrelevant. The classification from the tree made an error of 6 out of 150, which is 4%. Therefore, it can be said that the error rate for this tree on the training data is 4%.

(Table 1 lists the sepal length, sepal width, petal length, petal width and class of a sample of instances from each of the three iris categories; the numeric values are not reproduced here.)

Table 1: Sample iris data.
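The tree in figure 1 is small enough to write out directly as a prediction function, which makes the interpretation above explicit. This is a sketch built from the thresholds in the figure, not code from the thesis:

    def classify_iris(petalwidth):
        """Apply the decision tree of figure 1 to a single instance."""
        if petalwidth < 0.8:
            return "Iris-setosa"       # leaf 1: 50 of 50 correct on training data
        elif petalwidth < 1.65:
            return "Iris-versicolor"   # leaf 2: 48 of 52, 7.7% node error
        else:
            return "Iris-virginica"    # leaf 3: 46 of 48 are the majority class

    print(classify_iris(1.4))  # -> Iris-versicolor, as in the example above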

Root (150 instances): Iris-virginica 33.3% (50), Iris-versicolor 33.3% (50), Iris-setosa 33.3% (50)
  Petalwidth < 0.8        -> Leaf (50): Iris-setosa 100.0% (50)
  0.8 <= Petalwidth < 1.65 -> Leaf (52): Iris-versicolor 92.3% (48), Iris-virginica 7.7% (4)
  Petalwidth >= 1.65       -> Leaf (48): Iris-virginica 95.8% (46), Iris-versicolor 4.2% (2)

Figure 1: A decision tree produced for the iris data set.

According to [43], the first task for C4.5 is to decide which of the non-target variables is the best variable on which to split the instances; in the example above, the petalwidth variable was chosen. To choose this attribute, at a node the decision tree algorithm considers each attribute field in turn (for example petalwidth, petallength, sepallength and sepalwidth in the case of the iris data), and every possible split is tried. C4.5 uses a criterion called the information gain ratio to compare the value of potential splits. The gain ratio provides an estimate of how likely a split on a variable is to lead to a leaf which contains few errors or has low disorder. Disorder is a measure of how pure a given node is: a node with high disorder contains instances of multiple target classes, while a node with low disorder contains instances of mostly one target class. The gain ratio is calculated for all the variables, and the variable with the largest gain ratio is chosen as the split variable. The tree grows in a similar manner: for each child node of the root node, the decision tree algorithm examines all the remaining attributes to find candidates for splitting. If a field takes on only one value, it is eliminated from consideration, since there is no way it can be used to make a split. The best split for each of the remaining attributes is determined. When all cases in a node are of the same type, the node is a leaf node.
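As an illustration of the split criterion, the sketch below computes the gain ratio for a categorical attribute: the information gain of the split divided by the entropy of the partition itself. It is an illustrative implementation written for this discussion, not code from the thesis or from C4.5 itself.

    import math
    from collections import Counter

    def entropy(labels):
        """Disorder of a node: 0 when pure, high when classes are mixed."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain_ratio(attribute_values, labels):
        """Information gain of splitting on the attribute, divided by split info."""
        n = len(labels)
        groups = {}
        for v, label in zip(attribute_values, labels):
            groups.setdefault(v, []).append(label)
        remainder = sum(len(g) / n * entropy(g) for g in groups.values())
        gain = entropy(labels) - remainder
        split_info = entropy(attribute_values)   # entropy of the partition itself
        return gain / split_info if split_info > 0 else 0.0

At each node the attribute (and, for numeric attributes, the threshold) with the largest gain ratio wins and becomes the split.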

But how good is this tree at classifying unknown data? Perhaps not very good, as it was built using training data only, which could lead to overfitting. So how does C4.5 avoid this problem? C4.5 uses a method called pruning. There are two types of pruning: prepruning and postpruning.

Postpruning refers to building a complete tree and pruning it afterwards. Postpruning makes the tree less complex, and probably more general, by replacing a subtree with a leaf or with the most common branch. When this is done, the leaf will correspond to several classes, but its label will be the most common class in the leaf (as was the case in figure 1). A parameter that affects postpruning is the confidence level: lower confidence values cause more drastic pruning. The default confidence value is 25%.

Prepruning involves deciding when to stop developing subtrees during the tree building process. For example, specifying the minimum number of observations in a leaf can determine the size of the tree. The default value for the minimum number of instances is 2. By default, C4.5 uses postpruning only, but it can use prepruning.

After a tree is constructed, the C4.5 rule induction program can be used to produce a set of equivalent rules. The rules are formed by writing a rule for each path in the tree and then eliminating any unnecessary antecedents and rules. An example of the rules produced from the decision tree in figure 1 is shown in figure 2. Rule 1, for example, shows that if petalwidth is less than 0.8 then the instance belongs in node 2, which has 50 observations and is classified as Iris-setosa.

IF Petalwidth < 0.8
THEN NODE: 2, N: 50
  IRIS-VIRGINICA: 0.0%
  IRIS-VERSICOLOR: 0.0%
  IRIS-SETOSA: 100.0%

IF 0.8 <= Petalwidth < 1.65
THEN NODE: 3, N: 52
  IRIS-VIRGINICA: 7.7%
  IRIS-VERSICOLOR: 92.3%
  IRIS-SETOSA: 0.0%

IF 1.65 <= Petalwidth
THEN NODE: 4, N: 48
  IRIS-VIRGINICA: 95.8%
  IRIS-VERSICOLOR: 4.2%
  IRIS-SETOSA: 0.0%

Figure 2: Rule output produced from the SAS Enterprise Miner software.

C4.5 is currently one of the most commonly used data mining algorithms and is available in many commercial data mining products. The ease of its interpretability, as well as its methods for dealing with numeric attributes, missing values and noisy data and for generating rules from trees, make it a very good choice for practical classification. C4.5 was successfully used in the automated identification of bat calls using 160 reference calls from eight bat species. The automated identification of pulse parameters led to good results for species with distinct differences in calls, with four out of eight species classified correctly in 95% of attempts [24].

Rule Learner (PART)

The PART algorithm forms rules from pruned partial decision trees built using C4.5's heuristics. According to Witten and Frank [58], the main advantage of PART over C4.5 is that, unlike C4.5, the rule learner algorithm does not need to perform global optimization to produce accurate rule sets. To make a single rule, a pruned decision tree is built, the leaf with the largest coverage is made into a rule, and the tree is discarded. This avoids overfitting by generalizing only once the implications are known. For example, going back to figure 1, PART would consider the first branch in the tree, build the rule "if petalwidth is less than 0.8 then the plant is Iris-setosa", and discard all the Iris-setosa instances from consideration. It continues with similar rules for the rest of the tree.

As for C4.5, the parameters that affect the performance of the algorithm are the minimum number of instances in each leaf and the confidence threshold for pruning. Frank and Witten [20] describe the results of an experiment performed on multiple data sets: PART outperformed the C4.5 algorithm on 9 occasions, whereas C4.5 outperformed PART on others.

OneR (1R)

OneR is one of the simplest classification algorithms. As described by Holte [26], OneR produces simple rules based on one attribute only. It generates a one-level decision tree, which is expressed as a set of rules that all test one particular attribute. It is a simple, cheap method that often comes up with quite good rules for characterizing the structure in data [59], and it often achieves reasonable accuracy on many tasks by simply looking at one attribute. An example of a classification performed by OneR on the iris data set is shown in figure 3. As can be seen from the figure, OneR produced rules stating that when the petallength is less than 2.45 the iris plant is classified as Iris-setosa, when the petallength is at least 2.45 and less than 4.75 it is classified as Iris-versicolor, and when the petallength is greater than or equal to 4.75 it is classified as Iris-virginica. This gave 143 correct classifications out of 150 on the training data, an error rate of 4.7%.

## 1R Rule Output
% rule for 'petallength':
'class'('iris-setosa') :- 'petallength'(X), X < 2.45.     % 50/50
'class'('iris-versicolor') :- 'petallength'(X), X < 4.75. % 44/50
'class'('iris-virginica') :- 'petallength'(X), 4.75 =< X. % 48/50
% 1Rw Error Rate 4.7% (143/150) (on training set)

Figure 3: Partial output from the WEKA OneR program.

A comprehensive study of the performance of the OneR algorithm by Holte [26] covered sixteen data sets frequently used by machine learning researchers to evaluate their algorithms. Cross validation was used to ensure that the results were representative of what would be obtained on independent test sets. The research found that OneR performed very well in comparison with other, more complex algorithms, and Holte encourages the use of simple data mining algorithms like OneR to establish a performance baseline before progressing to more sophisticated learning algorithms.
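OneR is simple enough to sketch in full for categorical attributes: for each attribute, predict the majority class of each attribute value, then keep the attribute whose rules make the fewest training errors. (Numeric attributes such as petallength must first be discretized into intervals, as in figure 3; that step is omitted here.) The sketch below is illustrative, not the WEKA implementation:

    from collections import Counter

    def one_r(rows, labels):
        """rows: list of attribute-value tuples. Returns the chosen attribute
        index, its value-to-class rules, and the training error rate (%)."""
        best = None
        for a in range(len(rows[0])):
            # For attribute a, map each observed value to its majority class.
            by_value = {}
            for row, label in zip(rows, labels):
                by_value.setdefault(row[a], []).append(label)
            rules = {v: Counter(ls).most_common(1)[0][0]
                     for v, ls in by_value.items()}
            errors = sum(rules[row[a]] != label
                         for row, label in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, a, rules)
        errors, a, rules = best
        return a, rules, 100.0 * errors / len(labels)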

IBK

IBK is an implementation of the k-nearest-neighbors classifier. Each case is considered as a point in multi-dimensional space, and classification is done based on the nearest neighbors. The value of k determines how many cases are considered as neighbors when deciding how to classify an unknown instance. For example, for the iris data, IBK would consider the 4-dimensional space formed by the four input variables. A new instance is classified as belonging to the class of its closest neighbor using the Euclidean distance measure. If 5 is used as the value of k, then the 5 closest neighbors are considered, and the class of the new instance is taken to be the class of the majority of those instances: if 3 of the 5 closest neighbors are of type Iris-setosa, then the class of the test instance is assigned as Iris-setosa.

The time taken to classify a test instance with a nearest-neighbor classifier increases linearly with the number of training instances kept in the classifier, and it has a large storage requirement [59]. Its performance degrades quickly with increasing noise levels, and it also performs badly when different attributes affect the outcome to different extents. One parameter that can affect the performance of the IBK algorithm is the number of nearest neighbors used; by default it uses just one. IBK has been used for gesture recognition, as discussed by Kadous [30]. With 95 signs collected from 5 people, for a total of 6650 instances, the accuracy obtained was approximately 80 percent. The signs used were very similar to each other, and an accuracy of 80 percent was considered very high. This research also found that instance-based learning was better than C4.5 at the gesture tasks tested.

Naïve Bayes

The Naïve Bayes classification algorithm is based on Bayes' rule, which is used to compute the probabilities that are used to make predictions. Naïve Bayes assumes that the input attributes are statistically independent. It analyses the relationship between each input attribute and the dependent attribute to derive a conditional probability for each relationship [11]. These conditional probabilities are then combined to classify new cases. An advantage of the Naïve Bayes algorithm over some other algorithms is that it requires only one pass through the training set to generate a classification model. Naïve Bayes works very well when tested on many real world data sets [58] and can obtain results that are much better than those of other, more sophisticated algorithms. However, if a particular attribute value does not occur in the training set in conjunction with every class value, Naïve Bayes may not perform very well. It can also perform poorly on some data sets because the attributes are treated as though they were independent, whereas in reality they are correlated.
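A minimal sketch of the Naïve Bayes computation for categorical attributes follows. Training is the single counting pass mentioned above; prediction multiplies the class prior by the per-attribute conditional probabilities, and an add-one smoothing term (an assumption of this sketch, one common way to handle the zero-frequency problem noted above) keeps unseen attribute values from zeroing out a class:

    from collections import Counter, defaultdict

    def train_naive_bayes(rows, labels):
        """One pass over the training data: count classes and
        (class, attribute, value) co-occurrences."""
        class_counts = Counter(labels)
        value_counts = defaultdict(Counter)   # (class, attr index) -> value counts
        for row, label in zip(rows, labels):
            for a, v in enumerate(row):
                value_counts[(label, a)][v] += 1
        return class_counts, value_counts

    def predict_naive_bayes(model, row):
        class_counts, value_counts = model
        n = sum(class_counts.values())
        best_class, best_score = None, 0.0
        for c, count in class_counts.items():
            score = count / n                  # class prior P(c)
            for a, v in enumerate(row):
                counts = value_counts[(c, a)]
                # Add-one smoothing avoids zero conditional probabilities.
                score *= (counts[v] + 1) / (count + len(counts) + 1)
            if best_class is None or score > best_score:
                best_class, best_score = c, score
        return best_class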

Kernel Density

The Kernel Density algorithm works in a very similar fashion to Naïve Bayes. The main difference is that, unlike Naïve Bayes, Kernel Density does not assume a normal distribution of the data; instead it tries to fit a combination of kernel functions. According to Beardah and Baxter [2], kernel density estimates are similar to histograms but provide a smoother representation of the data. Beardah and Baxter illustrate some of the advantages of kernel density estimates for data presentation in archaeology, showing that they can be used as a basis for producing contour plots of archaeological data, which leads to a useful graphical representation of the data.
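As an illustration of the "smoother histogram" idea, a one-dimensional Gaussian kernel density estimate places a small bell curve on every training point and averages them. The sketch below is illustrative only; the bandwidth h is an assumed tuning parameter, not a value from the thesis:

    import math

    def kde(x, samples, h=0.5):
        """Estimated density at x: the average of Gaussian kernels,
        each centred on one training sample and scaled by bandwidth h."""
        k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
        return sum(k((x - s) / h) for s in samples) / (len(samples) * h)

A classifier of this kind fits one such estimate per class and attribute and combines them in the same way Naïve Bayes combines its conditional probabilities.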

2.2 Unsupervised Learning

Unsupervised learning deals with finding clusters of records that are similar in some way. As discussed earlier, unsupervised learning does not require a target variable for analysis. According to Berry and Gordon [4], unsupervised learning is often useful when there are many competing patterns in the data, making it hard to spot any single pattern. Building clusters of similar records reduces the complexity within clusters so that other data mining techniques are more likely to succeed. In unsupervised learning, the main concern is obtaining clusters in the data that have useful patterns.

Unsupervised Data Mining Algorithms Used in This Thesis

K-means Clustering

K-means clustering works as follows: first the desired number of clusters (k) is specified, then the algorithm selects k cluster seeds (centers) located approximately uniformly in a multi-dimensional space. Each observation is assigned to the nearest cluster mean to form temporary clusters. The cluster mean positions are then recalculated and used as new cluster centers, and the observations are reallocated to clusters according to the new centers. This is repeated until no further change in the cluster centers occurs. The observations are assigned to clusters so that every observation belongs to at most one cluster [57].

According to Weiss and Indurkhya [56], not all variables are equally important in determining the clusters. For each variable, an importance value between 0 and 1 is computed to represent the relative importance of that variable to the formation of the clusters. Variables that have the greatest contribution to the cluster profile have importance values closer to 1. A decision tree analysis can be used to calculate the relative importance values from a selected sample of the training data; the first split is the most important. It has been found that variables having large variance tend to have more effect on the resulting clusters than variables with small variance. Some implementations of k-means clustering use these importance values when assigning cases to clusters [48].

Kohonen Vector Quantization

Kohonen Vector Quantization is a clustering method invented by Kohonen [48]. The algorithm is similar to the k-means clustering algorithm, but the original seeds, called codebook vectors, are totally random. The algorithm finds the seed closest to each training case in the multidimensional space and moves that "winning" seed closer to the training case. The seed is moved a certain proportion of the distance between it and the training case, where the proportion is specified by the learning rate [48].
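A minimal sketch of the k-means loop just described: assign each observation to its nearest center, recompute the centers as cluster means, and stop when the centers no longer move. This is an illustrative NumPy implementation (with random rather than uniform seeding), not the software used in the thesis:

    import numpy as np

    def k_means(X, k, seed=0, max_iter=100):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # initial seeds
        for _ in range(max_iter):
            # Assign every observation to the nearest cluster center.
            distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assignment = distances.argmin(axis=1)
            # Recompute each center as the mean of its temporary cluster.
            new_centers = np.array([
                X[assignment == j].mean(axis=0) if np.any(assignment == j)
                else centers[j]
                for j in range(k)])
            if np.allclose(new_centers, centers):                # no further change
                break
            centers = new_centers
        return centers, assignment

Kohonen Vector Quantization replaces the batch mean update with an incremental one: for each training case, only the winning seed moves, by learning_rate * (case - seed).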

Autoclass (Bayesian Classification System)

Autoclass is an unsupervised Bayesian classification system that infers classes based on Bayesian statistics [14]. It divides the problem into two parts: the calculation of the number of classes and the estimation of the classification parameters. It uses the Expectation Maximization (EM) algorithm to estimate the parameter values that best fit the data for a given number of classes; EM is an approximation algorithm that finds a local maximum of the likelihood. By default, Autoclass fits a normal probability distribution for numeric data and a multinomial distribution for symbolic data. According to Cheeseman and Stutz [14], Autoclass can consider different underlying probability distribution types for the numeric attributes and is computationally intensive. Autoclass was developed at NASA and has been used for extracting useful information from databases [14], for example from Infrared Astronomical Satellite (IRAS) data [21].

2.3 Related Work on Comparison of Classifiers

Lim and Loh [35] compare the prediction accuracy, complexity and training time of different classification algorithms. The paper discusses the results of a comparison of twenty-two decision tree, nine statistical and two neural network algorithms on thirty-two data sets in terms of classification accuracy, training time and (in the case of trees) number of leaves. Some of the twenty-two decision tree algorithms compared are CART, S-Plus tree, C4.5, FACT (Fast Classification Tree), QUEST, IND, OC1, LMDT, CAL5 and T1. The statistical algorithms compared include LDA (Linear Discriminant Analysis), QDA (Quadratic Discriminant Analysis), NN (Nearest Neighbor), LOG (Logistic Discriminant Analysis), FDA (Flexible Discriminant Analysis), PDA (Penalized LDA), MDA (Mixture Discriminant Analysis) and POL (the POLYCLASS algorithm). The neural network algorithms compared include LVQ (Learning Vector Quantization) and RBF (Radial Basis Function).

The paper revealed that POLYCLASS, which provides estimates of conditional class probabilities, performed better than the other algorithms, although its accuracy was not statistically significantly different from that of twenty other algorithms. Another statistical algorithm, logistic regression, was second with respect to the two accuracy criteria. The most accurate decision tree algorithm was QUEST with linear splits, which was ranked fourth. It was noted that although spline-based statistical algorithms tend to have good accuracy, they also require relatively long training times; POLYCLASS, for example, was third last in terms of median training time.

The research found that, among decision tree algorithms with univariate splits, C4.5, IND-CART and QUEST had the best combinations of error rate and speed, though C4.5 tends to produce trees with twice as many leaves as those from IND-CART and QUEST. The main conclusion from this research was that the mean error rates of many algorithms are sufficiently similar that their differences are statistically insignificant and probably also insignificant in practical terms. However, as will be discussed later, using default settings, this thesis found that there were significant differences in error rates among the different algorithms used.

The STATLOG project [12] presented the results of an evaluation of the performance of machine learning, neural and statistical algorithms on large-scale, complex commercial and industrial problems. The overall aim was to give an objective assessment of the potential of classification algorithms for solving significant commercial and industrial problems. Some of the twenty-four algorithms compared in the STATLOG project are Alloc80, Ac2, BayTree, NewId, Dipol92, C4.5, Cart, Cal5, Kohonen, Bayes and Cascade. The data sets used for the STATLOG project are from the UCI repository. On test data, the algorithm Alloc80, followed by Ac2 and BayTree, performed better than the rest. Alloc80 and BayTree are statistical classifier algorithms, whereas Ac2 is a decision tree algorithm.

Salzberg [46] cautions that care is required when comparing different algorithms, and discusses the dangers to avoid and a recommended approach for comparing data mining algorithms. The main claims made by the paper are:

- Finding a good classification algorithm requires very careful thought about experimental design. If not done carefully, comparative studies of classification and other types of algorithms can easily result in statistically invalid conclusions; this is especially true when data mining techniques are used to analyze very large databases, which inevitably contain some statistically unlikely data.
- Comparative analysis is more important in evaluating some types of algorithms than others.

The key recommendations made by Salzberg [47] regarding the comparison of algorithms are:

- Data miners must be careful not to rely too heavily on stored repositories such as the UCI repository, because it is difficult to produce major new results using well studied and widely shared data.
- Data miners should follow a proper methodology that allows the designer of a new algorithm to establish the new algorithm's comparative merits.

Chapter 3. Data Generation

This chapter discusses the data generation phase of this thesis, which involved collecting data sets and applying each of 6 supervised algorithms to each of 59 data sets.

3.1 Collection of Data Sets

To achieve the goal of applying multiple data mining algorithms to multiple data sets, a search for data sets was necessary. Data sets were mainly obtained through the Internet, particularly from the UCI data set collection. Fifty-nine data sets were collected and used to perform the experiments. The number of attributes of the data sets used ranged from 3 to 76, while the number of observations ranged from as few as 13.

3.2 Selection of Data Mining Algorithms

Selecting types of data mining algorithms that could be run on all the data sets collected was very important, to minimize missing values in the file produced by the data generation phase. The 6 data mining algorithms chosen for this experiment are Rule Learner, OneR, Kernel Density, IBK, C4.5 and Naïve Bayes. These algorithms are described in detail in chapter 2.

3.3 Generating Data

Once the data sets and algorithms to use for the experiment were chosen, the actual data generation was conducted by running the 6 data mining algorithms on the 59 data sets. Default settings were used for all algorithms, and cross validation with 10 folds was used for testing. Once all runs were completed, the results were stored in one file, which was later used in the clustering analysis. The percentage of incorrectly classified instances for each algorithm on each data set, for both training and cross validation, was stored in this file, together with the size and the number of attributes of each data set. Table 2 shows the complete result of the data generation process. For example, it shows that for the anneal data set, which has 38 attributes and 898 instances, the IBK error rate on training data was 5.90 percent, whereas on the test data it was 5.57 percent. The definition of each variable is given in table 3.
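Appendix F contains the actual script used for these runs. As an illustration only, a driver of this kind might look like the sketch below, which assumes WEKA-style classifiers invocable from the command line with a -t (training file) and -x (number of cross validation folds) option; the class names, file names and output handling are hypothetical placeholders, not the thesis code:

    import subprocess

    # Hypothetical classifier class names; real names depend on the WEKA version.
    ALGORITHMS = {"C45": "weka.classifiers.j48.J48",
                  "OR": "weka.classifiers.OneR"}
    DATASETS = ["anneal.arff", "audiology.arff"]   # ... all 59 data sets

    with open("results.txt", "w") as out:
        for data in DATASETS:
            for name, cls in ALGORITHMS.items():
                # Default settings; 10-fold cross validation via the assumed -x flag.
                run = subprocess.run(["java", cls, "-t", data, "-x", "10"],
                                     capture_output=True, text=True)
                # Store the raw output; error rates are parsed from output
                # of the kind shown in Appendix G.
                out.write(f"=== {data} {name} ===\n{run.stdout}\n")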

(Table 2 spans two pages and lists, for each of the 59 data sets, the number of attributes, the number of instances, and the training and cross validation error rates of IBK, C4.5, Rule Learner, Naïve Bayes, OneR and Kernel Density. The data sets are: Anneal, Audiology, Balance-scale, Breast-cancer, Breast-w, Colic, Credit-a, Credit-g, Diabetes, Glass, Heart-c, Heart-statlog, Iris, Kr-vs-kp, Labor, Segment, Sick, Sonar, Soybean, Autos, Heart-h, Hepatitis, Lymph, Mushroom, Primary-tumor, Splice, vehicle, vote, vowel, Waveform, AutoPrice, baskball, bodyfat, bolts, BreastTumor, Cleveland, cloud, cpu, detroit, EchoMonths, elusage, Fishcatch, Gascons, housing, Hungarian, longley, lowbwt, Mbagrade, meta, Pharynx, Pollution, PwLinear, quake, Schlvote, servo, sleep, strike, veteran and Vineyard. The numeric values are not reproduced here.)

Table 2: Results obtained from applying 6 data mining algorithms to 59 data sets. Blanks indicate situations where algorithms gave no result.

Note that the word TRAIN in the table indicates the percentage of incorrectly classified training cases, whereas TEST indicates the percentage of incorrectly classified cases under cross validation. For example, NB_TRAIN indicates the percentage of incorrectly classified instances (error rate) for the Naïve Bayes algorithm on training data. The definition of each variable is shown below.

Name | Definition
NB_TRAIN | Naive Bayes Training Error (%)
NB_TEST | Naive Bayes Testing Error (%)
C45_TRAIN | C4.5 Training Error (%)
C45_TEST | C4.5 Testing Error (%)
OR_TRAIN | OneR Training Error (%)
OR_TEST | OneR Testing Error (%)
RL_TRAIN | Rule Learner Training Error (%)
RL_TEST | Rule Learner Testing Error (%)
KR_TRAIN | Kernel Density Training Error (%)
KR_TEST | Kernel Density Testing Error (%)
IBK_TRAIN | IBK Training Error (%)
IBK_TEST | IBK Testing Error (%)
NUM_INS | Number of Instances
NUM_ATTR | Number of Attributes

Table 3: Definition of variables used.

Table 4 summarizes table 2, providing the minimum, maximum, mean, standard deviation and missing percentage of each numeric variable. For example, it shows that for the Kernel Density algorithm on training data, the minimum error rate was 0 percent and the mean 2.53 percent. It also shows that 5% of the values were missing for this algorithm on training data, indicating that no result was found for some data sets. Table 4 also shows the overall performance of the 6 algorithms in classifying the 59 data sets; the table is sorted by the mean error rates of each algorithm for both the train and the test cases. The training results indicate that Kernel Density (KR_TRAIN), with an average error rate of 2.53 percent, followed by Rule Learner (RL_TRAIN), with an average error rate of 8.79 percent, and C4.5 (C45_TRAIN) had lower training errors than the other algorithms. More importantly, the cross validation results show that Kernel Density (KR_TEST), followed by C4.5 and Naïve Bayes, performed better than the other algorithms.

(Table 4 lists the mean, minimum, maximum, standard deviation and missing percentage for each variable, in the order KR_TRAIN, RL_TRAIN, C45_TRAIN, IBK_TRAIN, NB_TRAIN, OR_TRAIN, KR_TEST, C45_TEST, NB_TEST, RL_TEST, IBK_TEST, OR_TEST, NUM_INS, NUM_ATTR; the numeric values are not reproduced here.)

Table 4: Summary of the data gathered from running the 6 data mining algorithms on 59 data sets.

Chapter 4. Clustering and Pattern Analysis

To analyze the data generated by applying the 6 data mining algorithms to the 59 data sets (table 2), unsupervised learning algorithms were run on it. The 3 algorithms used are k-means clustering using least squares, Kohonen Vector Quantization, and Autoclass Bayesian analysis; these algorithms are described in section 2.2.1. The results of the unsupervised clustering analysis are discussed in the next four sections, followed by a summary and comparison of the results.

4.1 Results from K-means Clustering

Table 5 shows the ranking of variables resulting from the application of the k-means algorithm to the data generated (table 2). A value of 5 was used for the maximum number of clusters. This value was chosen because (1) more than 5 clusters in a data set of 59 cases are unlikely to be useful, and (2) preliminary runs of the algorithm suggested there were 3 to 5 clusters. As shown in table 5, only the number of instances is significant in determining the clusters. This model gives five clusters with 52, 3, 2, 1 and 1 observations. The table shows the name, importance, measurement type and label of each variable. For example, it indicates that the NUM_INS (number of instances) variable has an importance level of 1 and that it is a numeric interval variable. Numeric variables containing values that vary across a continuous range are shown as interval variables.

NAME | IMPORTANCE | MEASUREMENT | TYPE | LABEL
NUM_INS | 1 | interval | Num | Number of Instances
KR_TEST | 0 | interval | Num | Kernel Density Test
KR_TRAIN | 0 | interval | Num | Kernel Density Train
OR_TEST | 0 | interval | Num | OneR Test
OR_TRAIN | 0 | interval | Num | OneR Train
NB_TEST | 0 | interval | Num | Naïve Bayes Test
NB_TRAIN | 0 | interval | Num | Naïve Bayes Train
RL_TEST | 0 | interval | Num | Rule Learner Test
RL_TRAIN | 0 | interval | Num | Rule Learner Train
C45_TEST | 0 | interval | Num | C45 Test
C45_TRAIN | 0 | interval | Num | C45 Train
IBK_TEST | 0 | interval | Num | IBK Test
IBK_TRAIN | 0 | interval | Num | IBK Train
NUM_ATTR | 0 | interval | Num | Number of Attributes

Table 5: Importance level of variables in determining clusters.
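The dominance of NUM_INS is an instance of the variance effect noted in section 2.2.1: instance counts are typically orders of magnitude larger than the error-rate percentages, so squared Euclidean distances are dominated by that one variable. One standard remedy (an illustration here, not the approach taken in the thesis, which instead re-ran the analysis without the number of instances, as the following section shows) is to standardize each variable before clustering:

    import numpy as np

    def standardize(X):
        """Scale each column to zero mean and unit variance so that no single
        variable (such as NUM_INS) dominates the distance computation."""
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        std[std == 0] = 1.0   # guard against constant columns
        return (X - mean) / std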


Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

How To Do Data Mining In R

How To Do Data Mining In R Data Mining with R John Maindonald (Centre for Mathematics and Its Applications, Australian National University) and Yihui Xie (School of Statistics, Renmin University of China) December 13, 2008 Data

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

Data Mining Solutions for the Business Environment

Data Mining Solutions for the Business Environment Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com

More information

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of

More information

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Data Mining: Overview. What is Data Mining?

Data Mining: Overview. What is Data Mining? Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin Data Mining for Customer Service Support Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin Traditional Hotline Services Problem Traditional Customer Service Support (manufacturing)

More information

Data Mining Techniques for Prognosis in Pancreatic Cancer

Data Mining Techniques for Prognosis in Pancreatic Cancer Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree

More information

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1 Data Mining 1 Introduction 2 Data Mining methods Alfred Holl Data Mining 1 1 Introduction 1.1 Motivation 1.2 Goals and problems 1.3 Definitions 1.4 Roots 1.5 Data Mining process 1.6 Epistemological constraints

More information

Data mining and statistical models in marketing campaigns of BT Retail

Data mining and statistical models in marketing campaigns of BT Retail Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next

More information

Studying Auto Insurance Data

Studying Auto Insurance Data Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

More information

Decision-Tree Learning

Decision-Tree Learning Decision-Tree Learning Introduction ID3 Attribute selection Entropy, Information, Information Gain Gain Ratio C4.5 Decision Trees TDIDT: Top-Down Induction of Decision Trees Numeric Values Missing Values

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association

More information

Data Mining Techniques and its Applications in Banking Sector

Data Mining Techniques and its Applications in Banking Sector Data Mining Techniques and its Applications in Banking Sector Dr. K. Chitra 1, B. Subashini 2 1 Assistant Professor, Department of Computer Science, Government Arts College, Melur, Madurai. 2 Assistant

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA Welcome Xindong Wu Data Mining: Updates in Technologies Dept of Math and Computer Science Colorado School of Mines Golden, Colorado 80401, USA Email: xwu@ mines.edu Home Page: http://kais.mines.edu/~xwu/

More information

Performance Analysis of Decision Trees

Performance Analysis of Decision Trees Performance Analysis of Decision Trees Manpreet Singh Department of Information Technology, Guru Nanak Dev Engineering College, Ludhiana, Punjab, India Sonam Sharma CBS Group of Institutions, New Delhi,India

More information

COLLEGE OF SCIENCE. John D. Hromi Center for Quality and Applied Statistics

COLLEGE OF SCIENCE. John D. Hromi Center for Quality and Applied Statistics ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM COLLEGE OF SCIENCE John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE: COS-STAT-747 Principles of Statistical Data Mining

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

More information

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

More information

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI Data Mining Knowledge Discovery, Data Warehousing and Machine Learning Final remarks Lecturer: JERZY STEFANOWSKI Email: Jerzy.Stefanowski@cs.put.poznan.pl Data Mining a step in A KDD Process Data mining:

More information

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.

More information

Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition

Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition Brochure More information from http://www.researchandmarkets.com/reports/2170926/ Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.7 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Linear Regression Other Regression Models References Introduction Introduction Numerical prediction is

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information

A Review of Missing Data Treatment Methods

A Review of Missing Data Treatment Methods A Review of Missing Data Treatment Methods Liu Peng, Lei Lei Department of Information Systems, Shanghai University of Finance and Economics, Shanghai, 200433, P.R. China ABSTRACT Missing data is a common

More information

A Review of Data Mining Techniques

A Review of Data Mining Techniques Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

Impact of Boolean factorization as preprocessing methods for classification of Boolean data Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,

More information

Overview. Background. Data Mining Analytics for Business Intelligence and Decision Support

Overview. Background. Data Mining Analytics for Business Intelligence and Decision Support Mining Analytics for Business Intelligence and Decision Support Chid Apte, PhD Manager, Abstraction Research Group IBM TJ Watson Research Center apte@us.ibm.com http://www.research.ibm.com/dar Overview

More information

REVIEW OF ENSEMBLE CLASSIFICATION

REVIEW OF ENSEMBLE CLASSIFICATION Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

More information

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University

More information