Suitability analysis of data mining tools and methods


MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Suitability analysis of data mining tools and methods

BACHELOR'S THESIS

Samuel Kováč

Brno, fall 2012


Declaration

I hereby declare that this paper is an original work of my own, to which I hold all authorship rights. All sources, references and literature that I have used are properly cited and listed with complete reference to their source.

Samuel Kováč

Abstract

The goal of this thesis is to provide a theoretical introduction to data mining and a suitability analysis of freely available software tools. The first part of the thesis describes the data types, methods and algorithms used in data mining and machine learning. The second part analyzes the WEKA, KNIME, Rapid Miner, Orange and jhepwork tools and compares their suitability for different purposes.

Keywords: data mining, machine learning, analysis of software tools, WEKA, KNIME, Rapid Miner, Orange, jhepwork

Contents

1 Introduction
2 Data
   2.1 Definition of Data
   2.2 Experimental and Observational Data
   2.3 Qualitative and Quantitative Data
   2.4 Discrete and Continuous Data
3 Data Mining
   3.1 Approaches to Data Mining
      3.1.1 Predictive Approach
      3.1.2 Descriptive Approach
   3.2 Data Mining Steps
   3.3 Data Preprocessing
4 Data Mining Methods
   4.1 Classification and Regression
   4.2 Clustering
   4.3 Anomaly Detection
   4.4 Association Rule Learning
   4.5 Summarization
5 Machine Learning
   5.1 Meta Learning, Bagging and Boosting
   5.2 Bayesian Learning and Naive Bayesian Classifier
   5.3 Lazy Learning and k-nearest Neighbor Algorithm
   5.4 Decision Tree Algorithm
      Information Entropy and C4.5 Algorithm
   Neural Networks
   Support Vector Machines
6 Freely Available Software Tools
   WEKA
   KNIME
   Rapid Miner
   Orange
   jhepwork
   Freely Available Libraries and Add-Ons
   Comparison of Software Usage
7 Conclusion
Bibliography

Chapter 1

Introduction

Recent advances in the field of information technology have made the usage of data mining increasingly simple and affordable. This has also had an impact on the availability of tools to work with, and the number of freely available tools has grown rapidly. As a result, it has become rather difficult for an inexperienced user to choose the best possible software solution for their work. My motivation for writing this thesis was to create an introduction to data mining that would not only present the basic principles and algorithms, but also provide the reader with knowledge of the pros and cons of the most widespread freely available software tools.

The work begins with a light introduction to data types and their differences in chapter 2. The following chapter 3 offers the reader insight into the different approaches that can be taken towards the analysis of data in data mining, and how data mining functions as a process. This chapter also deals with the issue of preprocessing. The methods used in data mining are the topic of chapter 4, where all standard data mining tasks are described. Directly connected to this is chapter 5, machine learning, which offers insight into the inner workings of the individual algorithms that power the data mining tools. Chapter 6 marks the analytical part of the thesis; here, the five most popular freely available data mining tools are analyzed and compared to each other. The thesis closes with chapter 7, the conclusion.

Chapter 2

Data

2.1 Definition of Data

The term data refers to qualitative or quantitative attributes of a variable or set of variables [4]. In the case of data used for data mining, such variables are stored inside objects within the given data set. A single object can simply be understood as an instance of a given kind of entity or event, where the differentiation between individual objects is derived from the difference in the attributes of their corresponding variables. In the standard table form used in most current database systems, rows represent objects, while columns represent variable types.

2.2 Experimental and Observational Data

Experimental data

Experimental data [6] encompass data that were collected while exercising strict control over all variables. The standard way of obtaining them is by conducting an experiment, in which the change in one or more variables is induced and observed by purposefully altering the attributes of another variable. Such data make it possible to draw definite conclusions, i.e. to establish causality, based on the knowledge obtained by mining them. That is because the cause of the effect being observed is produced by the experimenter and can be replicated with ease. On the other hand, the downside of experimental data is the cost at which they are obtained, which limits both their quantity and availability.

Observational data

This data type describes data that were collected with no control over their variables [6]. It is the prevalent data type in data mining. The collection process

usually consists of nothing more than the implementation of means to gather information that is already being produced, at low to no cost. Such data may not offer insight into the reasoning behind the results that data mining brings, yet they still come with valuable information. For example, data gathered from collecting information about customer purchases at a supermarket can be used to determine purchase patterns, which in turn offer statistics about commonly bought products. Commonly bought items can be placed closer to each other in order to increase customer satisfaction, or prepared as sale packages.

2.3 Qualitative and Quantitative Data

Qualitative data

Qualitative data carry attributes that represent distinct categories rather than numbers [5]. Values of such attributes exhibit a certain form of quality, rather than an interval or ratio. For example, an attribute of the type eye color can carry the quality of blue, green, brown or black. There are also data such as zip postal codes or IP addresses that consist of numbers, yet mathematical operations such as addition or division make no sense on them. Therefore, rather than being treated like numbers, they are placed under the qualitative data type. This data type further splits into nominal and ordinal categories.

Nominal data

Nominal data cannot be ordered in a meaningful manner [6]. An arbitrary order might be imposed on them, but they carry no innate characteristic trait that would provide information about their object's placement inside an ordered sequence.

Ordinal data

A meaningful order can be established for these data, yet mathematical operations produce no usable result [6]. Examples of this data type are rankings of commercial products, test results, and letter shirt sizes.

Quantitative data

Data numeric in origin are called quantitative [1]. Attributes of quantitative data are numbers and should be treated as such. Examples are the dimensions of an object, counts, temperatures or occurrences of an event per hour. Quantitative data further split into interval and ratio categories.

Interval data

According to [6], this type carries two distinctive traits: there exists no true zero for data of the interval type, and division cannot be applied to such data. If a temperature is taken in both the Fahrenheit and Celsius measurement systems, the value describing the same amount of heat would be expressed by two different numbers. That is because neither of the two measurement systems has a true zero; the zero of each is merely a point inside an interval that has been arbitrarily appointed the role of a zero. Furthermore, division makes no sense in this case, for dividing the temperature equivalents in both systems by the same number produces different physical results. Dividing 50 degrees Celsius by two produces 25 degrees Celsius, while dividing its equivalent of 122 degrees Fahrenheit by two produces 61 degrees Fahrenheit, which is not the same temperature.

Ratio data

For ratio data there exists a true zero, and division makes sense. The concept of a true zero comes from understanding the meaning behind the number 0: it is generally accepted that zero refers to nothing in terms of measurement. For this very reason, the Kelvin temperature scale is of the ratio data type rather than interval. Zero in Kelvin is placed at the very beginning of the scale, so that it is impossible to go any lower. The existence of a true zero allows the division operation to make sense within the number domain, for it always produces the same result regardless of the measurement system.
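The interval/ratio distinction can be made concrete with a short sketch. The conversion formulas are standard; the temperature values are arbitrary:

```python
# Demonstrates why division is meaningless for interval data (Celsius,
# Fahrenheit) but well defined for ratio data (Kelvin, which has a true zero).

def c_to_f(c):
    return c * 9 / 5 + 32

def c_to_k(c):
    return c + 273.15

celsius = 50.0
fahrenheit = c_to_f(celsius)        # 122.0 °F -- the same physical temperature

# Halving each interval-scale value yields temperatures that no longer agree:
half_c_in_f = c_to_f(celsius / 2)   # 25 °C  -> 77.0 °F
half_f = fahrenheit / 2             # 61.0 °F
print(half_c_in_f, half_f)          # 77.0 vs 61.0 -- division is not meaningful

# On the ratio-scale Kelvin, halving is consistent regardless of starting point:
kelvin = c_to_k(celsius)            # 323.15 K
print(kelvin / 2)                   # a well-defined physical quantity
```

Halving the "same" temperature gives two different physical temperatures on the two interval scales, while the Kelvin result is unambiguous.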

2.4 Discrete and Continuous Data

The difference between the discrete and continuous data types comes from the precision of measurement that each of them allows.

Discrete data

Measurements that require only a finite or countably infinite set of values belong to this type of data. There exists no midpoint between two discrete values, so they are often seen as having gaps between each other. Discrete data are usually represented in the form of integer variables. Examples of discrete data are counts, syllables in a word, and zip codes. Binary data are a special case of discrete data with only two values.

Continuous data

Continuous data can be measured as accurately as instruments allow. Their attributes contain real numbers as values, and the precision of measurement is only limited by the number of digits that the variable is provided with. They are often represented by floating-point variables. Examples of continuous data are height, weight or velocity.

Qualitative (categorical) data are always discrete. Quantitative (numeric) data can be either discrete or continuous.

Chapter 3

Data Mining

In this day and age, information is being produced in the form of data by the most trivial everyday human actions and interactions. Such data are more often than not stored and kept for further processing. Each data set holds the possibility of containing hidden knowledge that may not be explicitly stored inside the data structures, but can be derived from real-world relations and processes. Data mining is used for the extraction of such hidden information. The abundance of data, coupled with the need for powerful data analysis tools, has been described as a "data rich but information poor" situation [2].

3.1 Approaches to Data Mining

3.1.1 Predictive Approach

The aim of the predictive approach to data mining is to acquire the ability to predict values of attributes without those being explicitly observed prior to evaluation [2]. This is achieved by employing machine learning (further referred to as ML) algorithms, which are explained in detail in chapter 5. A set of training data is provided to these algorithms. It contains examples and labels associated with them. If a sufficient number of examples has been provided, the algorithm is expected to acquire the ability to predict the label.

3.1.2 Descriptive Approach

The descriptive approach to data mining strives to find characteristic attributes of data and discover relationships between them [2]. These are often returned in the form of data sets that contain the attributes deemed most characteristic of the mined data set, visual representations of the data set, or association rules, which are explained in further detail in section 4.4 of this work.

3.2 Data Mining Steps

Data mining is a process that involves a number of steps:

Data Cleaning

In the first step, data that contain corrupted or empty records are removed.

Data Integration

In order to proceed with data mining, data need to be collected and integrated into a single formatted structure. Different sources of data usually do not provide uniform structures and interpretations of data, therefore integration into a single format needs to take place.

Data Selection

For data mining to provide results of high quality, data of high quality need to be supplied. Not all of the collected data are needed, though. Data selection allows for choosing only such data that are relevant to the task to be performed.

Data Transformation

The data that have passed the cleaning step are still not ready for data mining purposes, for they need to be transformed into a format accepted by the data mining algorithm. This is achieved via the application of techniques such as smoothing, aggregation or normalization.

Data Mining

In this step, various algorithms may be applied to the data in order to discover potential knowledge hidden within them. Some of the algorithms applied might be classification, clustering or association analysis (chapter 4).
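The cleaning and selection steps above can be sketched in a few lines; the records and the choice of relevant attributes here are hypothetical:

```python
# Hypothetical raw records; None marks a corrupted/empty field.
records = [
    {"age": 25, "income": 48000, "favourite_colour": "blue"},
    {"age": None, "income": 52000, "favourite_colour": "green"},  # corrupted
    {"age": 40, "income": 91000, "favourite_colour": "red"},
]

# Data cleaning: remove records that contain empty values.
cleaned = [r for r in records if all(v is not None for v in r.values())]

# Data selection: keep only the attributes relevant to the task at hand
# (here we assume favourite_colour is irrelevant to an income analysis).
relevant = ("age", "income")
selected = [{k: r[k] for k in relevant} for r in cleaned]

print(selected)  # [{'age': 25, 'income': 48000}, {'age': 40, 'income': 91000}]
```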

Pattern Evaluation

The importance of the results provided by data mining needs to be evaluated, for not all of the findings may be of interest to the inquiry. Some results may display lower levels of prediction precision, or may be hard for a human being to interpret. Redundant patterns are therefore removed.

Knowledge Presentation

The results that appear to be the most important undergo transformation and visualization in order to be presented in the most understandable form.

3.3 Data Preprocessing

Data preprocessing is a key element in improving the accuracy of data mining algorithms [7]. Not all data that are collected for the purposes of data mining are necessarily suited for the task. A wrong selection of data may heavily influence the results of data mining, and therefore certain treatments need to be applied to the data before they are supplied to the algorithms.

Data cleaning

This method of data preprocessing deals with the removal of imperfect data, such as noise in continuous attributes. Another of this method's responsibilities is replacing missing data records [8].

Feature selection and feature reduction

Feature selection and feature reduction are data preprocessing methods that aim to improve the performance of the learning model by eliminating redundant features and keeping the important ones. Reducing the number of features in the model helps alleviate the curse of dimensionality [8], which is one of the major problems with regard to data used for machine learning and data mining. An increase in the dimensionality of an object results in a major increase in its volume, which causes the available data to become sparse. This

sparsity is rather problematic for any method that requires statistical significance, for the amount of data that needs to be supplied in order for that significance to be achieved grows exponentially with each dimension of the data. By removing irrelevant features, dimensionality is decreased and the performance of the model is further improved: its generalization capability is enhanced, the learning process is sped up, the model's interpretability increases, and the user acquires a better understanding of their data through the highlighting of the important features and their relations.

Data Transformation

Some data may need normalization or discretization [8] to be usable by data mining algorithms. Normalization is a process applied to quantitative attributes in order to eliminate the effect of different scales of measurement. Discretization is used to transform quantitative attributes into qualitative ones. Additionally, data transformation might assign weights to attributes in order to highlight significant ones.
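The two transformations above can be sketched with min-max scaling and simple binning; the attribute values and bin boundaries are hypothetical:

```python
# Min-max normalization: rescale a quantitative attribute to [0, 1] so that
# attributes measured on different scales contribute comparably.
def min_max_normalise(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Discretization: map a quantitative attribute onto qualitative bins.
def discretise(value, bins):
    # bins: list of (upper_bound, label) pairs, checked in ascending order
    for upper, label in bins:
        if value <= upper:
            return label

heights_cm = [150, 165, 180, 195]
print(min_max_normalise(heights_cm))   # [0.0, 0.333..., 0.666..., 1.0]

bins = [(160, "short"), (185, "medium"), (float("inf"), "tall")]
print([discretise(h, bins) for h in heights_cm])
# ['short', 'medium', 'medium', 'tall']
```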

16 Chapter 4 Data Mining Methods 4.1 Classification and Regression Classification [2] is one of the supervised learning methods for data mining, that uses predictive approach. The purpose of classificaton is to aid in construction of models that describe different classes of data. Different examples are separated into their respective classes according to their differentiating patterns. Watanabe described a pattern as the opposite of chaos; it is an entity, vaguely defined, that could be given a name [9]. Therefore, entities carrying a common trait usually form patterns. Voice recordings, human fingerprints, or even DNA sequences that carry a specific trait are examples of this. In order to classify an exemple, training data is supplied to the algorithm in order to establish a pattern. This pattern is then used to find functional mapping of the input data to their specific class label, so that the class label is the functional value of the input data. Each of the examples presented to the algorithm consists of a set of attributes, and membership of a class is determined according to one of these attributes, the target attribute. Data, for which the target attribute is the same are placed within the same class, and according to this placement a pattern is established. Since this pattern does not change once the training phase is over, the number of classes for classification tasks is known ahead of time. A method that deals with uncertain number of classes is called clustering, and it is the topic of section 4.2. Regression Regression is a form of prediction. Prediction is a process that is very similar to classification, with the single difference that target attribute is continuous or ordinal in value rather than discrete. As noted in sections 4.3 and 4.4, many algorithms used for classification can be easily adjusted to perform regression tasks as well. 10

Application Domains

Classification and regression analysis have many possible applications, from hard-science research fields such as speech recognition, biological classification and robotics to document classification tasks. The only limitation on their area of usage appears to be the need to provide training data in order to achieve a certain degree of statistical significance, which effectively limits their usage to scientific fields rather than the analysis of constantly changing real-world data.

4.2 Clustering

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering [2]. A cluster is a collection of objects that carry similarity to each other, and at the same time dissimilarity to objects contained within other clusters. Classification is capable of discovering these similarities; however, to do so it requires the costly preparation of a large number of training data examples. The approach that clustering takes is therefore often more desirable. Since clustering is an unsupervised learning model, it does not require such extensive preparation. The data are first divided into a smaller number of partitions, to which labels are assigned. This approach helps to single out useful features that can be used to differentiate between individual clusters.

This learning method works on the basis of measuring distance within a feature space. Each attribute of an object corresponds to one dimension within the space. Objects that are close to each other within this space are therefore treated as similar, and form clusters. On the other hand, objects that are distant from each other carry little similarity in their attributes, and therefore belong to separate clusters.

Clustering has found its usage in a vast area of applications. Among the most prominent ones are business marketing, image processing and medical science.
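The feature-space intuition above can be sketched as follows. The cluster centres and objects are hypothetical, and a real clustering algorithm would also have to discover the centres rather than being given them:

```python
import math

# Each object is a point in feature space: one coordinate per attribute.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Assign each object to the nearest of two hypothetical cluster centres.
centres = {"A": (0.0, 0.0), "B": (10.0, 10.0)}
objects = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.0)]

labels = [min(centres, key=lambda c: euclidean(centres[c], o)) for o in objects]
print(labels)  # ['A', 'B', 'A'] -- close points share a cluster
```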

Due to its nature, clustering is especially well suited for the initial analysis of a data set in data mining. Once the clusters have been found within the set of data being analyzed, more sophisticated methods such as anomaly detection or association rule learning usually take place.

Application Domains

Clustering is a method used in virtually every possible application domain. It has found its uses in fields ranging from medicine, through marketing, the world wide web, computer science and social science, to mathematical and ecological domains. Since clustering does not have any learning phase, it achieves outstanding results when applied to real-world data. It is capable of adapting to changes very well and therefore fits both static and dynamic models.

4.3 Anomaly Detection

Anomaly detection [12] can be described as the opposite of clustering. The aim of this method is to find data that do not fit within any established pattern. There are three general categories of anomaly detection: supervised, semi-supervised and unsupervised anomaly detection. The difference between supervised and unsupervised learning techniques is a problem addressed in chapter 5 of this work. In semi-supervised anomaly detection, some training data are supplied to the algorithm: these are examples of normal data, which lack the labeled abnormal examples that would be present in the case of supervised anomaly detection.

Application Domains

The most important of this method's applications is fraud detection. Although this may appear to be a task for prediction models, the speed with which criminals develop new types of fraud poses a considerable problem for models that rely on learned patterns. Those quickly lose their accuracy and become

obsolete. Anomaly detection instead focuses on modeling normal data in order to identify data that differ from them. Anomaly detection was also proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986 [10].

4.4 Association Rule Learning

Association analysis [12] is the process of discovering association rules, which describe relationships and dependencies between attributes and their values. Different association rules express different regularities that underlie the dataset, and they generally predict different things [2]. For the purposes of this work, the definition of an association rule from [11] is used: "By an association rule, we mean an implication of the form X => I_j, where X is a set of some items in I, and I_j is a single item in I that is not present in X."

The idea of association rules comes from market basket analysis, where the point of the analysis was to discover which products are bought often, and which products are bought together. Information acquired in this way has proved to serve well for enhancing the placement of products in a market, or for preparing sale packages.

Association rules consist of two parts. The first part is the antecedent, which is the conditional part of the rule and can be found in the data. The second part is the consequent, which is described as the item found in combination with the antecedent. The form these rules usually take is an implication, where the part on the left side of the implication corresponds to the antecedent and the part on the right side corresponds to the consequent.

Application Domains

Although Agrawal et al. [11] introduced the concept of association rules in the context of large-scale transaction data from supermarkets, it has found its usage in the areas of web usage mining, intrusion detection and bioinformatics.
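The antecedent/consequent structure can be illustrated on a toy market-basket data set; the transactions and the rule below are hypothetical:

```python
# Toy market-basket transactions; each is a set of purchased items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]

# A hypothetical association rule {bread, butter} => milk:
antecedent = {"bread", "butter"}   # conditional part, found in the data
consequent = "milk"                # item found in combination with it

# Transactions in which the conditional part of the rule holds:
matching = [t for t in transactions if antecedent <= t]

# Fraction of those that also contain the consequent
# (commonly called the confidence of the rule):
confidence = sum(consequent in t for t in matching) / len(matching)
print(len(matching), confidence)  # 2 0.5
```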

4.5 Summarization

Summarization [12] is one of the descriptive methods of data mining. One of the characteristic traits of data is that they usually belong to some cluster, group or class. The aim of the summarization method is to provide a short yet precise description of a specified class of data. There are several approaches to solving this problem, but the most widespread of them is generalization and summarization-based characterization. The attributes of the data are generalized onto a higher level of abstraction, and attributes that are not relevant to the class being characterized are left out. Numeric attributes associated with an attribute that has been generalized are aggregated. The output of this method consists of various graphs or other kinds of visualizations of the data that help represent the characteristics of the class.

Application Domains

Regardless of what data are supplied to this data mining method, it will be able to offer an acceptable result. Although its potential usage is not limited to any single application domain, education and research are where it thrives.

Chapter 5

Machine Learning

Machine learning (further referred to as ML) is a field that has evolved from the study of artificial intelligence [3]. In simplified terms, the goal of ML is to mimic the principles of human cognitive processes by machines. The human brain determines the identity of an object based on a set of its characteristic attributes. This principle applies to ML as well, as cluster analysis works on a similar principle (section 4.2). However, the most important question is how to make machines able to learn. In the context of ML, the process of learning can be understood as inductive inference, where the machine is presented with examples that contain incomplete information about some statistical phenomenon. From here on, either supervised or unsupervised learning takes place. Supervised learning usually takes the predictive approach (explained in sub-section 3.1.1 and further expanded on in section 4.1), while unsupervised learning tends to focus on the descriptive approach (sections 4.2, 4.4, 4.5).

5.1 Meta Learning, Bagging and Boosting

Meta learning [13] is a field of study in ML where automatic learning algorithms are applied to meta-data about machine learning experiments. In the context of machine learning algorithms, the term meta-data refers to data that contain information about other data. The main goal of meta learning is to increase the understanding of how to improve the performance of existing learning algorithms.

Bootstrap aggregating, or so-called bagging, is a meta learning algorithm that strives to improve classification and regression models in terms of stability and classification accuracy. This algorithm is usually used with decision trees; however, its area of application covers all ML models. The algorithm works by taking a given set of training samples and creating a set of new ones that are filled with bootstrap samples. A bootstrap sample is a statistical sample taken

uniformly and with replacement, which means that the resulting sample set may contain duplicates. For m models, bagging generates m new training data sets. Each model is then fitted on one of the m bootstrap samples, and their outputs are combined: by averaging for regression, and by voting for classification.

Boosting is a meta learning algorithm that finds its root in a question posed by Kearns: "can a set of weak learners create a single strong learner?" [14]. A weak learner in this context refers to a classifier that is able to label examples better than random guessing, but is not capable of true classification. A strong learner, on the other hand, is capable of correlating well with the true classification. The answer to this question was given by Schapire, and it has proved to carry significant ramifications in machine learning and statistics, and consequently led to the development of boosting. The answer to Kearns' question is yes. While boosting is not algorithmically constrained, the majority of boosting algorithms are constructed by iteratively adding weak learners into a single composition, so that each of them is weighted according to its classification accuracy. With each added learner, the weights of the classified samples are re-evaluated, so that misclassified samples gain more importance for the learners that are yet to be added to the composition.

5.2 Bayesian Learning and Naive Bayesian Classifier

Bayesian learning [3] is a method of statistical inference that utilises Bayes' theorem to calculate how the degree of belief in a proposition changes according to evidence. The model uses the Bayesian interpretation of probability, in which probabilities represent degrees of belief.
Bayesian learning works as follows: before any data have been observed, the expectations as to the true relationship between those data can be expressed as a probability distribution over the assumptions that define this relationship. This distribution is referred to as the "prior". Once the data have been revealed to the algorithm, the revised findings are captured as a posterior probability

distribution. Assumptions that appeared plausible before, yet were revealed to be incorrect, will experience a decrease in their probability. On the other hand, assumptions that have managed to meet the expectations will have their probability increased.

Bayesian learning has found its application in the field of artificial intelligence, and therefore machine learning. This learning method has played a fundamental part in pattern recognition since the early 1950s. The classification algorithm based on this method is called the naive Bayesian classifier.

Input Data

Naive Bayesian classifiers [3] can handle any number of variables, regardless of whether they are qualitative or quantitative. The algorithm works on the assumption that the variables provided to the classifier are independent. Even though this might not always be the case, it greatly simplifies the classification task. Instead of being presented with one multi-dimensional task, the algorithm has to compute only a set of one-dimensional tasks. Furthermore, the regions near decision boundaries do not seem to be greatly affected by this, thus leaving the classification task unaffected.

Algorithm description

The classification task in the naive Bayesian classifier is performed by evaluating the posterior probability. Given a set of variables X = {x_1, x_2, ..., x_n}, the algorithm attempts to determine the posterior probability of an event C_i among a set of possible outcomes C = {c_1, c_2, ..., c_j}. A simplified formula for calculating the posterior is:

p(C_i | X) ∝ p(C_i) · ∏_{k=1}^{n} p(x_k | C_i)

Using this modified Bayes' rule formula, a new case X is labeled with the class label C_i that achieves the highest posterior probability.
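A sketch of the scoring step implied by the formula above; the priors and conditional probability tables below are hand-made for illustration, not learned from data:

```python
# Naive Bayes scoring: a class's score is its prior p(C_i) times the product
# of the per-attribute likelihoods p(x_k | C_i); the highest score wins.

def naive_bayes_score(prior, likelihoods, x):
    score = prior
    for k, value in enumerate(x):
        score *= likelihoods[k][value]
    return score

# Two classes with hypothetical conditional probability tables for two
# qualitative attributes: outlook and windy.
classes = {
    "play": (0.6, [{"sunny": 0.2, "rain": 0.8}, {"yes": 0.3, "no": 0.7}]),
    "stay": (0.4, [{"sunny": 0.7, "rain": 0.3}, {"yes": 0.6, "no": 0.4}]),
}

x = ("rain", "no")
scores = {c: naive_bayes_score(p, l, x) for c, (p, l) in classes.items()}
prediction = max(scores, key=scores.get)
print(prediction)  # 'play' -- 0.6*0.8*0.7 = 0.336 beats 0.4*0.3*0.4 = 0.048
```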

Application Domains

The most notable area of usage that employs the naive Bayesian classifier is spam detection in e-mail; however, thanks to its excellent performance and low hardware requirements, this algorithm is spreading into other areas such as text classification, medical diagnosis and system performance management. An example of this is the Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy.

Pros and Cons

The usage of the naive Bayesian classifier provides an incredibly short training time and fast evaluation. It should also be noted that this algorithm has proven to be rather well suited to real-world problems. Despite its limitations, naive Bayes was shown to be optimal for some important classes of concepts that have a high degree of feature dependencies, such as disjunctive and conjunctive concepts [15]. However, because of its simplistic nature, solving more complex classification problems is not possible with the naive Bayesian classifier.

Computational and Memory Requirements

The naive Bayesian classifier is well known for the modest requirements it puts on memory and for fast computation times that are often better than those of other algorithms.

5.3 Lazy Learning and k-nearest Neighbor Algorithm

The principle behind lazy learning is that generalization beyond the training data is deferred until a query has been made to the system. The main advantage of this method is that the target function is only approximated locally, and therefore all computation is left until classification takes place. Another significant advantage of lazy learning systems is that they can simultaneously solve multiple problems and deal successfully with changes in

the problem domain. However, there are two major disadvantages that this method suffers from: it has rather high memory requirements, since the whole data set needs to be stored, and its evaluation is slower compared to other algorithms.

Input Data

Since the k-nearest Neighbor algorithm (KNN) [16][2] is a non-parametric lazy learning algorithm, it does not make any assumptions about the underlying data distribution. This is especially important for applications that deal with observational data retrieved from real-world interactions, because such data rarely follow theoretical assumptions. KNN assumes its data points to be in a metric space. They are stored in the form of vectors, where each training example consists of a vector and a class label. The algorithm is also supplied with the number k, according to which the classification is performed. The vectors can carry both qualitative and quantitative attributes, as well as continuous and discrete ones, provided minor adjustments are made to the algorithm.

Algorithm

Arguably the simplest of all classification methods is the k-Nearest Neighbor classifier [16]. For a given test point, its k nearest neighbors among the training data are selected. The test point is then given a label by majority vote among these k neighbors, i.e. it is assigned the most common class amongst them.

Application Domains

The main usage of the KNN algorithm is classification, yet there are several other purposes for which it can be used. One of them is regression. That is achieved by altering the algorithm so that the property value of the test point is

the average of the distance-weighted values of its k nearest neighbors. Density estimation is done by placing a hypercube at the test point x and increasing its size until the required number k of neighbors is enclosed within the cube's space. The density can then be estimated using the following formula:

p(x) = k / (n V)

where n is the number of samples and V is the volume of the cube. Another possible application of KNN is the estimation of continuous variables. There are several different implementations of this functionality, yet the most well-known of them uses the inverse distance weighted average of the k nearest multivariate neighbors.

Pros and Cons

Since KNN is one of the simplest algorithms in machine learning, it is easy to understand and program. If there is no majority agreement, an explicit reject option exists. It also handles missing values easily by restricting the distance calculation to a subspace, and its asymptotic misclassification rate is bounded above by twice the Bayes error rate. As for the downsides, it is easily affected by the local structure of the data and is sensitive to noise and irrelevant features. For data sets that contain a large number of examples belonging to the same class, KNN tends to have that class dominate the prediction because of the majority vote mechanism. This can be overcome by weighting the classification according to the distance from the test point to each of its k neighbors. KNN also suffers from the curse of dimensionality: given large data sets, the nearest neighbor might not be near at all.

Computational and Memory Requirements

Since all the training examples are kept, KNN has a rather large memory consumption. Compared to algorithms such as decision trees and linear classifiers, KNN is rather inefficient in its memory usage. Time

efficiency is an issue as well, since classification of a single point is O(nd): to classify a specific point x, the algorithm has to go through all n training examples of dimension d. KNN becomes a viable choice once the data set contains many classes, for it computes predictions for all of them.

5.4 Decision Tree Algorithm

A decision tree [2][3] is a learning model designed to predict the value of a target variable based on a number of input variables. There are several notable usages of decision trees, such as random forest classifiers, boosted trees or the C4.5 algorithm [17].

Input Data

The algorithm accepts data in the form of records, where each entry has the form (x, Y) = (x1, x2, x3, ..., xk, Y), where Y is the target variable that is the subject of classification or regression. The vector x is a collection of input variables used for the task.

Algorithm

A decision tree is constructed by repeatedly partitioning the input space, so that the partitions form a tree structure. The algorithms used for constructing decision trees usually work top-down, choosing at each step the variable that best splits the current set of items [18]. Once the tree has been constructed, leaves correspond to class labels and branches to conjunctions of features that lead to those labels. Classification or regression is performed by advancing through the tree from top to bottom along the branches until a terminal node has been reached. Each leaf in the decision tree corresponds to a value of the target variable given the values of the input variables visited on the way from the root node to the leaf.
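The top-to-bottom traversal described above can be sketched in a few lines of Python. This is purely an illustration, not code from any of the tools analyzed later; the tree is represented as nested dictionaries, where an internal node names the attribute to test and a leaf carries the class label:

```python
# A decision tree as nested dicts: internal nodes are
# {"attribute": {value: subtree, ...}}, leaves are plain class labels.
def predict(tree, example):
    # Advance from the root towards a leaf, following the branch
    # that matches the example's value for the tested attribute.
    while isinstance(tree, dict):
        attribute = next(iter(tree))          # attribute tested at this node
        tree = tree[attribute][example[attribute]]
    return tree                               # a leaf: the class label

# A toy tree: whether to play tennis, depending on outlook and wind.
tennis_tree = {"outlook": {
    "sunny": {"wind": {"strong": "no", "weak": "yes"}},
    "rain": "no",
}}

print(predict(tennis_tree, {"outlook": "sunny", "wind": "weak"}))  # yes
```

Each call walks exactly one root-to-leaf path, which is why prediction with decision trees is so cheap compared to instance-based methods such as KNN.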

Application Domains

The basic usage of decision trees is to solve classification and regression problems. Regression tree analysis is used when the predicted outcome can be considered a real number; classification tree analysis is performed when the predicted outcome is the class to which the data belong. Extended usage of decision trees can be seen in a number of algorithms such as random forests, boosted trees, ID3 and its extension C4.5, the CHi-squared Automatic Interaction Detector or MARS.

Pros and Cons

Decision trees are simple to understand and interpret. They are used not only for machine learning purposes, but also for general business decision making. Even though the results of data mining via decision trees may not offer such functionality directly, classification trees built from such decision trees can be used to make decisions. This stems from the white box model, which allows us to view and understand the inner workings of the algorithm. Another strong point of decision trees is that little to no data preparation is needed. On the other hand, the problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality, even for simple concepts [19]. Decision trees also require mechanisms such as pruning, because decision tree learners can create trees that are too complex and do not generalise the data well. Another limitation is their inability to express some concepts well, such as the XOR, parity or multiplexer problems. Lastly, for categorical data with different numbers of levels, the information gain in decision trees is biased in favor of attributes with more levels.

Computational and Memory Requirements

Decision trees are very efficient in regards to resources. They use less memory than some other classification and regression algorithms and require little time to analyse large data sets.
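Algorithms such as ID3 and C4.5, mentioned above, choose the splitting attribute by information gain, a notion treated in detail in the next section. A minimal sketch of that computation, in illustrative Python with a made-up weather data set:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, labels):
    """Entropy reduction achieved by splitting `rows` on `attribute`."""
    total = len(rows)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in split.values())
    return entropy(labels) - remainder

rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, "outlook", labels))  # 1.0: the split is perfect
```

A tree learner would evaluate this gain for every candidate attribute and split on the one with the highest value, then recurse on each partition.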

5.5 Information Entropy and C4.5 Algorithm

C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan [17]. It is an extended version of Quinlan's earlier algorithm, ID3. C4.5 uses the concept of information entropy to build decision trees. Information entropy, in the context of information theory usually referring to Shannon entropy, is a measure of the uncertainty associated with a random variable. It may be quantified as the expected value of the information contained within a message, or equivalently as the average information content that is missing when the value of the random variable is not known. Shannon's entropy represents an absolute limit on the lossless compression of any communication: for messages encoded as a sequence of independent and identically distributed random variables, the average length of the shortest possible representation of the message in a given alphabet is their entropy divided by the logarithm of the number of symbols in that alphabet. The entropy rate of a data source is the average number of bits per symbol needed to encode it.

Information Gain

Information gain, also referred to as the Kullback-Leibler divergence, is a non-symmetric measure of the difference between two probability distributions [20]. It measures the extra number of bits needed to code samples from one distribution while using a code based on the other. Usually the first distribution represents the "true" distribution of the data, while the other is a theory or a model.

Algorithm

C4.5 uses two sets of vectors as its building blocks for decision trees. The first set of vectors, the training data, is supplied as a set of examples that have already been classified. Each sample is in the form of a vector whose values are its attributes or features. The second vector contains the classes to which each sample

belongs. It is used to augment the training data. At each partitioning step, C4.5 chooses an attribute based on the criterion of normalized information gain that results from choosing that attribute for splitting the data. The attribute with the highest normalized information gain is chosen [17].

5.6 Neural Networks

Neural networks are a computational model inspired by the connectivity of neurons in animate nervous systems [21]. As an alternative to the von Neumann computer architecture, which is based on sequential processing, neural networks are derived from observations of how animal brains work. The principle of a neural network was first suggested by Turing.

Input Data

Artificial neural networks are restricted to numeric data, and are only capable of processing values in a fairly limited range. If the data are in an unusual range, or if some data are missing, problems may arise. Fortunately, there are methods that can be used to address each of these issues: numeric data can be scaled to fit within the required range of the network, and missing data can be substituted using the mean value or some other statistic.

Algorithm

A neural network consists of computational elements called neurons. A neuron is a device that has many inputs and only one output, and it can function in two modes. In training mode, the neuron is presented with training input patterns and taught for which of them to fire and for which to withhold firing a signal. In using mode, the neuron evaluates the input patterns looking for a learned pattern. If such a pattern is detected, the associated output from the training data becomes the current output. If the input pattern is not recognised, the firing rule is used to determine whether to fire or not.
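The firing behaviour of a single neuron can be sketched as a weighted sum of inputs compared against a threshold. This is an illustrative perceptron-style sketch with made-up weights, not a fragment of any particular network:

```python
def fire(inputs, weights, threshold):
    """A simple threshold neuron: output 1 (fire) when the
    weighted sum of the inputs reaches the threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0

# A neuron wired to compute the logical AND of two binary inputs:
# only both inputs firing together push the sum over the threshold.
and_weights, and_threshold = [0.6, 0.6], 1.0
print(fire([1, 1], and_weights, and_threshold))  # 1
print(fire([1, 0], and_weights, and_threshold))  # 0
```

Training a network amounts to adjusting such weights until the neurons fire on the desired patterns; in practice the hard threshold is usually replaced by a smooth non-linear activation function.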

Each neuron computes a weighted sum of its inputs and can perform a non-linear function on this sum. If a sufficient number of examples is provided, and the required number of neurons is present in the network, the function computed by the network can approximate any function. The architecture of neural networks comes in two basic types. Feedforward networks allow signals between neurons to travel in one direction only, from input to output; such networks are used primarily for pattern recognition. Feedback networks allow signals to travel in both directions, which makes them computationally very powerful, but also tends to make them extremely complicated. Such a network functions dynamically, where the state of the network changes continuously until an equilibrium is reached for the given input pattern. Neural networks consist of three types of layers. The input layer represents the raw information that is fed to the network. It is connected to the hidden layer, where the activity of each hidden unit is determined by the input units and the weights on the connections. Lastly, the activity of the output units is determined by the evaluation of the hidden units and the weights on the connections that lead to the output units. There are neural networks designed for supervised as well as unsupervised learning.

Application Domains

Neural networks are best suited for pattern recognition tasks, and therefore excel at solving real-world problems related to prediction and forecasting. Many businesses already employ neural networks for customer research, data validation, sales forecasting, risk management or target marketing, because these areas usually have high amounts of training data available. The usage of neural networks is not restricted to marketing needs; there are many applications of this paradigm in the medical field and in scientific research.

Possible applications include function approximation, regression analysis, classification, data processing such as filtering, clustering or compression, and robotics.

Pros and Cons

The primary advantage of neural networks is that they can be deployed on problems for which no algorithmic solution can be constructed. This sets them apart from any other tool available, for there is a number of cases where no other solution will suffice. It is important to note that models using neural networks need a high tolerance for errors, which stems from the non-algorithmic way they work. Another concern regarding this algorithm is the black box principle. Unlike white box algorithms such as decision trees, where it is possible to see and understand the structure and inner workings of the algorithm, neural networks, and black box algorithms in general, do not offer such insight.

Computational and Memory Requirements

The requirements that neural networks place on hardware vary depending on the size of the network and its implementation. Artificial neural networks that solve difficult problems such as speech recognition may consist of thousands of neurons with hundreds of inputs, and such networks require enormous resources, or even custom hardware solutions. Also, the input data supplied during the training phase tend to be considerably more complex than those for classification, so different hardware requirements need to be met for each of the two modes.

5.7 Support Vector Machines

A support vector machine [22] is a non-probabilistic binary linear classifier. It provides supervised learning methods for classification and regression analysis. An SVM takes a set of training data and predicts, for each given input, to which of the two possible classes it belongs. To achieve this, a kernel function to map the

training data into a feature space is used. The data are then separated using a large-margin hyperplane, which establishes the two separate classes.

Input Data

The input data for SVMs are sets of vectors, often given in a finite dimensional space. However, some data given in such a way might not be linearly separable; mapping such data onto a high-dimensional space makes their separation easier.

Algorithm

In the first part of the learning process, the SVM learns the division of classes by separating the training data into two sectors within the specified high-dimensional space. The training data are mapped onto a high-dimensional plane using a kernel function. The algorithm then constructs a hyperplane within the specified space. The hyperplane is chosen in such a way that its distance from the nearest training data is the largest possible, because the larger the margin, the lower the generalisation error of the classifier. The most commonly used kernel functions for the mapping between the training data and the high-dimensional space are RBF kernels,

k(x, x') = exp(−||x − x'||² / σ²),

and polynomial kernels,

k(x, x') = (x · x')^d.

Application Domains

The most common usage of SVMs is for classification and regression. The motivation for using SVMs is that the solution is always global and unique. The algorithm also has a simple geometric interpretation and gives a sparse solution. Real-world applications that use SVMs range from research, through education, to commerce. Some examples of such applications are Protein

Structure Prediction, Support Vector Classifiers for Land Cover Classification, or even Intrusion Detection Systems.

Pros and Cons

One of the most notable limitations connected to the usage of SVMs is that classification is only directed at two-class tasks. Therefore, in order to solve multi-class tasks, algorithms that reduce the problem to several binary problems have to be applied. Also, the parameters of a solved model are hard to interpret, and class membership probabilities are uncalibrated.

Computational and Memory Requirements

Since SVMs work in a high-dimensional space, the curse of dimensionality takes its toll on hardware requirements, and memory consumption is rather high.
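The two kernel functions named in the algorithm description above can be sketched directly. This is illustrative Python; σ and d are free parameters chosen by the user:

```python
from math import exp

def rbf_kernel(x, y, sigma):
    """RBF kernel: exp(-||x - y||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-sq_dist / sigma ** 2)

def polynomial_kernel(x, y, d):
    """Polynomial kernel: (x . y)^d."""
    return sum(a * b for a, b in zip(x, y)) ** d

print(rbf_kernel([0, 0], [0, 0], sigma=1.0))   # 1.0: identical points
print(polynomial_kernel([1, 2], [3, 4], d=2))  # 121: (1*3 + 2*4)^2
```

In a full SVM, such pairwise kernel values populate the Gram matrix from which the maximum-margin hyperplane is computed, without ever constructing the high-dimensional feature vectors explicitly.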

Chapter 6

Freely Available Software Tools

6.1 WEKA

Weka is a fully functional data mining software package that provides a high level of functionality to its users. For example, the software provides API support, which allows the various components of the software to communicate with each other.

API

It has been noted that the API functionality of Weka provides users with the ability to achieve increased functionality thanks to the many freely available pieces of programming code available online [23]. Moreover, the software is able to perform over 100 types of data mining methods, including Bayesian methods, rule-based methods and statistical analysis. The inclusion of so many different types of data mining methods makes Weka useful for a wide variety of data mining tasks and industries; users in different industries are unlikely to face any inability to use a desired data mining method with Weka.

Database System Support

Another strength of Weka is that the software natively supports reading files from a variety of database formats [24]. For users who obtain data from the internet, a specific strength of Weka is the ability to acquire data both from SQL databases and from actual webpages, by entering the URL of the webpage containing the information [25]. This makes it possible for users to easily input information into the software that might not be in a format easily read by other data mining packages.
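Weka's native file format is ARFF, a plain-text format with @attribute declarations followed by comma-separated data rows. As a sketch of what such a file looks like, here is a deliberately simplified reader in plain Python; a real ARFF file supports more features (quoting, sparse data, comments in more positions), and Weka itself of course handles the format natively:

```python
def parse_arff(text):
    """Very small ARFF reader: returns (attribute names, data rows).
    Handles only the basic @attribute/@data layout, no quoting."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            attributes.append(line.split()[1])    # second token is the name
        elif lower.startswith("@data"):
            in_data = True                        # everything after is data
        elif in_data:
            rows.append(line.split(","))
    return attributes, rows

sample = """@relation weather
@attribute outlook {sunny, rain}
@attribute play {yes, no}
@data
sunny,no
rain,yes"""
print(parse_arff(sample))
```

This prints the attribute names `['outlook', 'play']` together with the two data rows, which mirrors how Weka presents a loaded data set as named attributes plus instances.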

Visualization Capabilities

While Weka has strong support for APIs, a variety of data mining methods and supported database systems, one of the weaknesses of the software is its visualization support. It is important to note that the software does provide visualization of data, results and processes, but the support is somewhat limited: the visualization is not as colorful or as detailed as that of other data mining software packages [26]. However, the visualization provided is certainly sufficient for viewing the data on which the analyses are being performed and the results of the analysis efforts. In addition, add-ons are available that can increase the visualization functionality of the software [24]. As part of this visualization support and the available add-ons, Weka is able to interface with the R statistical package, not only to increase its statistical analysis functions, but also to allow for increased visualization of statistical analyses and results [26].

PMML Support

Weka has support for PMML. This allows users to import PMML files created in both proprietary and open-source data mining and statistical software packages. However, the software does not currently support exporting data files in the PMML format for use in other applications; this functionality is planned for future releases of the software [24].

Statistical Analysis Capabilities

Weka can perform just about any type of statistical analysis. In addition to the most basic descriptive and inferential statistical analyses, the software also allows cluster analysis to be performed. Also, as has already been noted, Weka has the ability to interact directly with the R statistical package.
This makes it possible to increase the statistical functionality of the software, and also allows users who are more comfortable or familiar with R to perform their analyses in that environment.


More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Learning outcomes. Knowledge and understanding. Competence and skills

Learning outcomes. Knowledge and understanding. Competence and skills Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

More information

Microsoft Azure Machine learning Algorithms

Microsoft Azure Machine learning Algorithms Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql Tomaz.kastrun@gmail.com http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

More information

Neural Networks and Back Propagation Algorithm

Neural Networks and Back Propagation Algorithm Neural Networks and Back Propagation Algorithm Mirza Cilimkovic Institute of Technology Blanchardstown Blanchardstown Road North Dublin 15 Ireland mirzac@gmail.com Abstract Neural Networks (NN) are important

More information

Learning is a very general term denoting the way in which agents:

Learning is a very general term denoting the way in which agents: What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Data Mining Techniques for Prognosis in Pancreatic Cancer

Data Mining Techniques for Prognosis in Pancreatic Cancer Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier Data Mining: Concepts and Techniques Jiawei Han Micheline Kamber Simon Fräser University К MORGAN KAUFMANN PUBLISHERS AN IMPRINT OF Elsevier Contents Foreword Preface xix vii Chapter I Introduction I I.

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

8. Machine Learning Applied Artificial Intelligence

8. Machine Learning Applied Artificial Intelligence 8. Machine Learning Applied Artificial Intelligence Prof. Dr. Bernhard Humm Faculty of Computer Science Hochschule Darmstadt University of Applied Sciences 1 Retrospective Natural Language Processing Name

More information

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut. Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,

More information

Intrusion Detection via Machine Learning for SCADA System Protection

Intrusion Detection via Machine Learning for SCADA System Protection Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

Nine Common Types of Data Mining Techniques Used in Predictive Analytics 1 Nine Common Types of Data Mining Techniques Used in Predictive Analytics By Laura Patterson, President, VisionEdge Marketing Predictive analytics enable you to develop mathematical models to help better

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

A New Approach for Evaluation of Data Mining Techniques

A New Approach for Evaluation of Data Mining Techniques 181 A New Approach for Evaluation of Data Mining s Moawia Elfaki Yahia 1, Murtada El-mukashfi El-taher 2 1 College of Computer Science and IT King Faisal University Saudi Arabia, Alhasa 31982 2 Faculty

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Neural Networks in Data Mining

Neural Networks in Data Mining IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V6 PP 01-06 www.iosrjen.org Neural Networks in Data Mining Ripundeep Singh Gill, Ashima Department

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Problem: HP s numerous systems unable to deliver the information needed for a complete picture of business operations, lack of

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

More information

Data Warehousing and Data Mining in Business Applications

Data Warehousing and Data Mining in Business Applications 133 Data Warehousing and Data Mining in Business Applications Eesha Goel CSE Deptt. GZS-PTU Campus, Bathinda. Abstract Information technology is now required in all aspect of our lives that helps in business

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Machine Learning and Data Mining. Fundamentals, robotics, recognition Machine Learning and Data Mining Fundamentals, robotics, recognition Machine Learning, Data Mining, Knowledge Discovery in Data Bases Their mutual relations Data Mining, Knowledge Discovery in Databases,

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

Statistics for BIG data

Statistics for BIG data Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association

More information

A Survey on Pre-processing and Post-processing Techniques in Data Mining

A Survey on Pre-processing and Post-processing Techniques in Data Mining , pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,

More information

A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining

A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining Sakshi Department Of Computer Science And Engineering United College of Engineering & Research Naini Allahabad sakshikashyap09@gmail.com

More information

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

Specific Usage of Visual Data Analysis Techniques

Specific Usage of Visual Data Analysis Techniques Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia

More information

Data Mining and Neural Networks in Stata

Data Mining and Neural Networks in Stata Data Mining and Neural Networks in Stata 2 nd Italian Stata Users Group Meeting Milano, 10 October 2005 Mario Lucchini e Maurizo Pisati Università di Milano-Bicocca mario.lucchini@unimib.it maurizio.pisati@unimib.it

More information

Chapter 12 Discovering New Knowledge Data Mining

Chapter 12 Discovering New Knowledge Data Mining Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

More information

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d. EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models

More information

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors Classification k-nearest neighbors Data Mining Dr. Engin YILDIZTEPE Reference Books Han, J., Kamber, M., Pei, J., (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

Monday Morning Data Mining

Monday Morning Data Mining Monday Morning Data Mining Tim Ruhe Statistische Methoden der Datenanalyse Outline: - data mining - IceCube - Data mining in IceCube Computer Scientists are different... Fakultät Physik Fakultät Physik

More information

A Property & Casualty Insurance Predictive Modeling Process in SAS

A Property & Casualty Insurance Predictive Modeling Process in SAS Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

DATA PREPARATION FOR DATA MINING

DATA PREPARATION FOR DATA MINING Applied Artificial Intelligence, 17:375 381, 2003 Copyright # 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 u DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI

More information

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information