Suitability analysis of data mining tools and methods


MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Suitability analysis of data mining tools and methods

BACHELOR'S THESIS

Samuel Kováč

Brno, fall 2012


Declaration

I hereby declare that this paper is an original work of my own, to which I hold all authorship rights. All sources, references and literature that I have used are properly cited and listed with complete reference to their source.

Samuel Kováč

Abstract

The goal of this thesis is to provide a theoretical introduction to data mining and a suitability analysis of freely available software tools. The first part of the thesis describes the data types, methods and algorithms used in data mining and machine learning. The second part analyzes the WEKA, KNIME, Rapid Miner, Orange and jhepwork tools and compares their suitability for different purposes.

Keywords: data mining, machine learning, analysis of software tools, WEKA, KNIME, Rapid Miner, Orange, jhepwork

Contents

1 Introduction
2 Data
   2.1 Definition of Data
   2.2 Experimental and Observational Data
   2.3 Qualitative and Quantitative Data
   2.4 Discrete and Continuous Data
3 Data Mining
   3.1 Approaches to Data Mining
      3.1.1 Predictive Approach
      3.1.2 Descriptive Approach
   3.2 Data Mining Steps
   3.3 Data Preprocessing
4 Data Mining Methods
   4.1 Classification and Regression
   4.2 Clustering
   4.3 Anomaly Detection
   4.4 Association Rule Learning
   4.5 Summarization
5 Machine Learning
   5.1 Meta Learning, Bagging and Boosting
   5.2 Bayesian Learning and Naive Bayesian Classifier
   5.3 Lazy Learning and k-nearest Neighbor Algorithm
   5.4 Decision Tree Algorithm
      Information Entropy and C4.5 Algorithm
   Neural Networks
   Support Vector Machines
6 Freely Available Software Tools
   WEKA
   KNIME
   Rapid Miner
   Orange
   jhepwork
   Freely Available Libraries and Add-Ons
   Comparison of Software Usage
7 Conclusion
Bibliography

Chapter 1

Introduction

Recent advances in the field of information technology have made the usage of data mining increasingly simple and affordable. This has also had an impact on the availability of tools to work with, and the number of freely available tools has grown rapidly. As a result, it has become rather difficult for an inexperienced user to choose the best possible software solution for their work. My motivation for writing this thesis was to create an introduction to data mining that would not only present the basic principles and algorithms, but also provide the reader with knowledge of the pros and cons of the most widespread freely available software tools.

The work begins with a light introduction to data types and their differences in chapter 2. The following chapter 3 offers the reader insight into the different approaches that can be taken towards the analysis of data in data mining, and how data mining functions as a process. This chapter also deals with the issue of preprocessing. The methods used in data mining are the topic of chapter 4, where all standard data mining tasks are described. Directly connected to this is chapter 5, machine learning, which offers insight into the inner workings of the individual algorithms that power the data mining tools. Chapter 6 marks the analytical part of the thesis; here, the five most popular freely available data mining tools are analyzed and compared to each other. The thesis closes with chapter 7, the conclusion.

Chapter 2

Data

2.1 Definition of Data

The term data refers to qualitative or quantitative attributes of a variable or set of variables [4]. In the case of data used for data mining, such variables are stored inside objects within the given data set. A single object can simply be understood as an instance of a given kind of entity or event, where the differentiation between individual objects is derived from the difference in the attributes of their corresponding variables. In the standard table form used in most current database systems, rows represent objects, while columns represent variable types.

2.2 Experimental and Observational Data

Experimental data

Experimental data [6] encompass data that were collected while exercising strict control over all variables. The standard way of obtaining them is by conducting an experiment, in which the change in one or more variables is induced and observed by purposefully altering the attributes of another variable. Such data make it possible to draw definite conclusions, i.e. to establish causality, based on the knowledge obtained by mining them. That is because the cause of the effect being observed is produced by the experimenter and can be replicated with ease. On the other hand, the downside of experimental data is the cost at which they are obtained, which limits both their quantity and availability.

Observational data

This data type describes data that were collected with no control over their variables [6]. It is the prevalent data type in data mining. The collection process

usually consists of nothing more than the implementation of means to gather information that is already being produced, at low to no cost. Such data may not offer insight into the reasoning behind the results that data mining brings, yet they still come with valuable information. For example, data gathered from collecting information about customer purchases at a supermarket can be used to determine purchase patterns, which in turn offer statistics about commonly bought products. Commonly bought items can be placed closer to each other in order to increase customer satisfaction, or prepared as sale packages.

2.3 Qualitative and Quantitative Data

Qualitative data

Qualitative data carry attributes that represent distinct categories rather than numbers [5]. Values of such attributes exhibit a certain form of quality, rather than an interval or ratio. For example, an attribute of the type eye color can carry the quality of blue, green, brown or black. There are also data such as zip postal codes or IP addresses that consist of numbers, yet mathematical operations such as addition or division make no sense on them. Therefore, rather than being treated like numbers, they are placed under the qualitative data type. This data type further splits into nominal and ordinal categories.

Nominal data

Nominal data cannot be ordered in a meaningful manner [6]. An arbitrary order might be imposed on them, but they carry no innate characteristic trait that would provide information about their object's placement inside an ordered sequence.

Ordinal data

A meaningful order can be established for these data, yet mathematical operations produce no usable result [6]. Examples of this data type are rankings of commercial products, test results, and letter shirt sizes.

Quantitative data

Data numeric in origin are called quantitative [1]. Attributes of quantitative data are numbers and should be treated as such. Examples are the dimensions of an object, counts, temperatures or occurrences of an event per hour. Quantitative data further split into interval and ratio categories.

Interval data

According to [6], this type carries two distinctive traits: there exists no true zero for data of the interval type, and division cannot be applied to such data. If a temperature is taken in both the Fahrenheit and Celsius measurement systems, the value describing the same amount of heat would be expressed by two different numbers. That is because neither of the two measurement systems has a true zero; the zero of each is merely a point inside an interval that has been arbitrarily appointed the role of a zero. Furthermore, division makes no sense in this case, for dividing the temperature equivalents in both systems by the same number produces different physical results. Dividing 50 degrees Celsius by two produces 25 degrees Celsius, while dividing its equivalent of 122 degrees Fahrenheit by two produces 61 degrees Fahrenheit, which is not the same temperature.

Ratio data

For ratio data there exists a true zero, and division makes sense. The concept of a true zero comes from understanding the meaning behind the number 0: it is generally accepted that zero refers to nothing in terms of measurement. For this very reason, the Kelvin temperature scale is of the ratio data type rather than interval. Zero in Kelvin is placed at the very beginning of the scale, so that it is impossible to go any lower. The existence of a true zero allows the division operation to make sense within the number domain, for it always produces the same result regardless of the measurement system.
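The interval/ratio distinction can be made concrete with a short sketch. The conversion formulas are standard; the temperature values are arbitrary:

```python
# Demonstrates why division is meaningless for interval data (Celsius,
# Fahrenheit) but well defined for ratio data (Kelvin, which has a true zero).

def c_to_f(c):
    return c * 9 / 5 + 32

def c_to_k(c):
    return c + 273.15

celsius = 50.0
fahrenheit = c_to_f(celsius)        # 122.0 °F -- the same physical temperature

# Halving each interval-scale value yields temperatures that no longer agree:
half_c_in_f = c_to_f(celsius / 2)   # 25 °C  -> 77.0 °F
half_f = fahrenheit / 2             # 61.0 °F
print(half_c_in_f, half_f)          # 77.0 vs 61.0 -- division is not meaningful

# On the ratio-scale Kelvin, halving is consistent regardless of starting point:
kelvin = c_to_k(celsius)            # 323.15 K
print(kelvin / 2)                   # a well-defined physical quantity
```

Halving the "same" temperature gives two different physical temperatures on the two interval scales, while the Kelvin result is unambiguous.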

2.4 Discrete and Continuous Data

The difference between the discrete and continuous data types comes from the precision of measurement that each of them allows.

Discrete data

Measurements that require only a finite or countably infinite set of values belong to this type of data. There exists no midpoint between two discrete values, so they are often seen as having gaps between each other. Discrete data are usually represented in the form of integer variables. Examples of discrete data are counts, syllables in a word, and zip codes. Binary data are a special case of discrete data with only two values.

Continuous data

Continuous data can be measured as accurately as instruments allow. Their attributes contain real numbers as values, and the precision of measurement is only limited by the number of digits that the variable is provided with. They are often represented by floating-point variables. Examples of continuous data are height, weight or velocity.

Qualitative (categorical) data are always discrete. Quantitative (numeric) data can be either discrete or continuous.

Chapter 3

Data Mining

In this day and age, information is being produced in the form of data by the most trivial everyday human actions and interactions. Such data are more often than not stored and kept for further processing. Each data set holds the possibility of containing hidden knowledge that may not be explicitly stored inside the data structures, but can be derived from real-world relations and processes. Data mining is used for the extraction of such hidden information. The abundance of data, coupled with the need for powerful data analysis tools, has been described as a "data rich but information poor" situation [2].

3.1 Approaches to Data Mining

3.1.1 Predictive Approach

The aim of the predictive approach to data mining is to acquire the ability to predict values of attributes without those being explicitly observed prior to evaluation [2]. This is achieved by employing machine learning (further referred to as ML) algorithms, which are explained in detail in chapter 5. A set of training data is provided to these algorithms. It contains examples and labels associated with them. If a sufficient number of examples has been provided, the algorithm is expected to acquire the ability to predict the label.

3.1.2 Descriptive Approach

The descriptive approach to data mining strives to find characteristic attributes of data and discover relationships between them [2]. These are often returned in the form of data sets that contain the attributes deemed most characteristic of the mined data set, visual representations of the data set, or association rules, which are explained in further detail in section 4.4 of this work.

3.2 Data Mining Steps

Data mining is a process that involves a number of steps:

Data Cleaning

In the first step, data that contain corrupted or empty records are removed.

Data Integration

In order to proceed with data mining, data need to be collected and integrated into a single formatted structure. Different sources of data usually do not provide uniform structures and interpretations of data, therefore integration into a single format needs to take place.

Data Selection

For data mining to provide results of high quality, data of high quality need to be supplied. Not all of the collected data are needed, though. Data selection allows for choosing only such data that are relevant to the task to be performed.

Data Transformation

The data that have passed the cleaning step are still not ready for data mining purposes, for they need to be transformed into a format accepted by the data mining algorithm. This is achieved via the application of techniques such as smoothing, aggregation or normalization.

Data Mining

In this step, various algorithms may be applied to the data in order to discover potential knowledge hidden within them. Some of the algorithms applied might be classification, clustering or association analysis (chapter 4).
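The cleaning and selection steps above can be sketched in a few lines; the records and the choice of relevant attributes here are hypothetical:

```python
# Hypothetical raw records; None marks a corrupted/empty field.
records = [
    {"age": 25, "income": 48000, "favourite_colour": "blue"},
    {"age": None, "income": 52000, "favourite_colour": "green"},  # corrupted
    {"age": 40, "income": 91000, "favourite_colour": "red"},
]

# Data cleaning: remove records that contain empty values.
cleaned = [r for r in records if all(v is not None for v in r.values())]

# Data selection: keep only the attributes relevant to the task at hand
# (here we assume favourite_colour is irrelevant to an income analysis).
relevant = ("age", "income")
selected = [{k: r[k] for k in relevant} for r in cleaned]

print(selected)  # [{'age': 25, 'income': 48000}, {'age': 40, 'income': 91000}]
```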

Pattern Evaluation

The importance of the results provided by data mining needs to be evaluated, for not all of the findings may be of interest to the inquiry. Some results may display lower levels of prediction precision, or may be hard for a human being to interpret. Redundant patterns are therefore removed.

Knowledge Presentation

The results that appear to be the most important undergo transformation and visualization in order to be presented in the most understandable form.

3.3 Data Preprocessing

Data preprocessing is a key element in improving the accuracy of data mining algorithms [7]. Not all data that are collected for the purposes of data mining are necessarily suited for the task. A wrong selection of data may heavily influence the results of data mining, and therefore certain treatments need to be applied to the data before they are supplied to the algorithms.

Data cleaning

This method of data preprocessing deals with the removal of imperfect data, such as noise in continuous attributes. Another of this method's responsibilities is replacing missing data records [8].

Feature selection and feature reduction

Feature selection and feature reduction are data preprocessing methods that aim to improve the performance of the learning model by eliminating redundant features and keeping the important ones. Reducing the number of features in the model helps alleviate the curse of dimensionality [8], which is one of the major problems with regard to data used for machine learning and data mining. An increase in the dimensionality of an object results in a major increase in its volume, which causes the available data to become sparse. This

sparsity is rather problematic for any method that requires statistical significance, for the amount of data that needs to be supplied in order for that significance to be achieved grows exponentially with each dimension of the data. By removing irrelevant features, dimensionality is decreased and the performance of the model is further improved: its generalization capability is enhanced, the learning process is sped up, the model's interpretability increases, and the user acquires a better understanding of their data through the highlighting of the important features and their relations.

Data Transformation

Some data may need normalization or discretization [8] to be usable by data mining algorithms. Normalization is a process applied to quantitative attributes in order to eliminate the effect of different scales of measurement. Discretization is used to transform quantitative attributes into qualitative ones. Additionally, data transformation might assign weights to attributes in order to highlight significant ones.
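The two transformations above can be sketched with min-max scaling and simple binning; the attribute values and bin boundaries are hypothetical:

```python
# Min-max normalization: rescale a quantitative attribute to [0, 1] so that
# attributes measured on different scales contribute comparably.
def min_max_normalise(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Discretization: map a quantitative attribute onto qualitative bins.
def discretise(value, bins):
    # bins: list of (upper_bound, label) pairs, checked in ascending order
    for upper, label in bins:
        if value <= upper:
            return label

heights_cm = [150, 165, 180, 195]
print(min_max_normalise(heights_cm))   # [0.0, 0.333..., 0.666..., 1.0]

bins = [(160, "short"), (185, "medium"), (float("inf"), "tall")]
print([discretise(h, bins) for h in heights_cm])
# ['short', 'medium', 'medium', 'tall']
```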

16 Chapter 4 Data Mining Methods 4.1 Classification and Regression Classification [2] is one of the supervised learning methods for data mining, that uses predictive approach. The purpose of classificaton is to aid in construction of models that describe different classes of data. Different examples are separated into their respective classes according to their differentiating patterns. Watanabe described a pattern as the opposite of chaos; it is an entity, vaguely defined, that could be given a name [9]. Therefore, entities carrying a common trait usually form patterns. Voice recordings, human fingerprints, or even DNA sequences that carry a specific trait are examples of this. In order to classify an exemple, training data is supplied to the algorithm in order to establish a pattern. This pattern is then used to find functional mapping of the input data to their specific class label, so that the class label is the functional value of the input data. Each of the examples presented to the algorithm consists of a set of attributes, and membership of a class is determined according to one of these attributes, the target attribute. Data, for which the target attribute is the same are placed within the same class, and according to this placement a pattern is established. Since this pattern does not change once the training phase is over, the number of classes for classification tasks is known ahead of time. A method that deals with uncertain number of classes is called clustering, and it is the topic of section 4.2. Regression Regression is a form of prediction. Prediction is a process that is very similar to classification, with the single difference that target attribute is continuous or ordinal in value rather than discrete. As noted in sections 4.3 and 4.4, many algorithms used for classification can be easily adjusted to perform regression tasks as well. 10

Application Domains

Classification and regression analysis have many possible applications, from hard-science research fields such as speech recognition, biological classification and robotics to document classification tasks. The only limitation on their area of usage appears to be the need to provide training data in order to achieve a certain degree of statistical significance, which effectively limits their usage to scientific fields rather than the analysis of constantly changing real-world data.

4.2 Clustering

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering [2]. A cluster is a collection of objects that carry similarity to each other, and at the same time dissimilarity to objects contained within other clusters. Classification is capable of discovering these similarities; however, to do so it requires the costly preparation of a large number of training data examples. The approach that clustering takes is therefore often more desirable. Since clustering is an unsupervised learning model, it does not require such extensive preparation. The data are first divided into a smaller number of partitions, to which labels are assigned. This approach helps to single out useful features that can be used to differentiate between individual clusters.

This learning method works on the basis of measuring distance within a feature space. Each attribute of an object corresponds to one dimension within the space. Objects that are close to each other within this space are therefore treated as similar, and form clusters. On the other hand, objects that are distant from each other carry little similarity in their attributes, and therefore belong to separate clusters.

Clustering has found its usage in a vast area of applications. Among the most prominent ones are business marketing, image processing and medical science.
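The feature-space intuition above can be sketched as follows. The cluster centres and objects are hypothetical, and a real clustering algorithm would also have to discover the centres rather than being given them:

```python
import math

# Each object is a point in feature space: one coordinate per attribute.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Assign each object to the nearest of two hypothetical cluster centres.
centres = {"A": (0.0, 0.0), "B": (10.0, 10.0)}
objects = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.0)]

labels = [min(centres, key=lambda c: euclidean(centres[c], o)) for o in objects]
print(labels)  # ['A', 'B', 'A'] -- close points share a cluster
```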

Due to its nature, clustering is especially well suited for the initial analysis of a data set in data mining. Once the clusters have been found within the set of data being analyzed, more sophisticated methods such as anomaly detection or association rule learning usually take place.

Application Domains

Clustering is a method used in virtually every possible application domain. It has found its uses in fields ranging from medicine, through marketing, the world wide web, computer science and social science, to mathematical and ecological domains. Since clustering does not have any learning phase, it achieves outstanding results when applied to real-world data. It is capable of adapting to changes very well and therefore fits both static and dynamic models.

4.3 Anomaly Detection

Anomaly detection [12] can be described as the opposite of clustering. The aim of this method is to find data that do not fit within any established pattern. There are three general categories of anomaly detection: supervised, semi-supervised and unsupervised anomaly detection. The difference between supervised and unsupervised learning techniques is a problem addressed in chapter 5 of this work. In semi-supervised anomaly detection, some training data are supplied to the algorithm: these are examples of normal data, which lack the labeled abnormal examples that would be present in the case of supervised anomaly detection.

Application Domains

The most important of this method's applications is fraud detection. Although this may appear to be a task for prediction models, the speed with which criminals develop new types of fraud poses a considerable problem for models that rely on learned patterns. Those quickly lose their accuracy and become

obsolete. Anomaly detection instead focuses on modeling normal data in order to identify data that differ from them. Anomaly detection was also proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986 [10].

4.4 Association Rule Learning

Association analysis [12] is the process of discovering association rules, which describe relationships and dependencies between attributes and their values. Different association rules express different regularities that underlie the dataset, and they generally predict different things [2]. For the purposes of this work, the definition of an association rule from [11] is used: "By an association rule, we mean an implication of the form X => I_j, where X is a set of some items in I, and I_j is a single item in I that is not present in X."

The idea of association rules comes from market basket analysis, where the point of the analysis was to discover which products are bought often, and which products are bought together. Information acquired in this way has proved to serve well for enhancing the placement of products in a market, or for preparing sale packages.

Association rules consist of two parts. The first part is the antecedent, which is the conditional part of the rule and can be found in the data. The second part is the consequent, which is described as the item found in combination with the antecedent. The form these rules usually take is an implication, where the part on the left side of the implication corresponds to the antecedent and the part on the right side corresponds to the consequent.

Application Domains

Although Agrawal et al. [11] introduced the concept of association rules in the context of large-scale transaction data from supermarkets, it has found its usage in the areas of web usage mining, intrusion detection and bioinformatics.
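The antecedent/consequent structure can be illustrated on a toy market-basket data set; the transactions and the rule below are hypothetical:

```python
# Toy market-basket transactions; each is a set of purchased items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]

# A hypothetical association rule {bread, butter} => milk:
antecedent = {"bread", "butter"}   # conditional part, found in the data
consequent = "milk"                # item found in combination with it

# Transactions in which the conditional part of the rule holds:
matching = [t for t in transactions if antecedent <= t]

# Fraction of those that also contain the consequent
# (commonly called the confidence of the rule):
confidence = sum(consequent in t for t in matching) / len(matching)
print(len(matching), confidence)  # 2 0.5
```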

4.5 Summarization

Summarization [12] is one of the descriptive methods of data mining. One of the characteristic traits of data is that they usually belong to some cluster, group or class. The aim of the summarization method is to provide a short yet precise description of a specified class of data. There are several approaches to solving this problem, but the most widespread of them is generalization and summarization-based characterization. The attributes of the data are generalized onto a higher level of abstraction, and attributes that are not relevant to the class being characterized are left out. Numeric attributes associated with an attribute that has been generalized are aggregated. The output of this method consists of various graphs or other kinds of visualizations of the data that help represent the characteristics of the class.

Application Domains

Regardless of what data are supplied to this data mining method, it will be able to offer an acceptable result. Although its potential usage is not limited to any single application domain, education and research are where it thrives.

Chapter 5

Machine Learning

Machine learning (further referred to as ML) is a field that has evolved from the study of artificial intelligence [3]. In simplified terms, the goal of ML is to mimic the principles of human cognitive processes by machines. The human brain determines the identity of an object based on a set of its characteristic attributes. This principle applies to ML as well, as cluster analysis works on a similar principle (section 4.2). However, the most important question is how to make machines able to learn. In the context of ML, the process of learning can be understood as inductive inference, where the machine is presented with examples that contain incomplete information about some statistical phenomenon. From here on, either supervised or unsupervised learning takes place. Supervised learning usually takes the predictive approach (explained in sub-section 3.1.1 and further expanded on in section 4.1), while unsupervised learning tends to focus on the descriptive approach (sections 4.2, 4.4, 4.5).

5.1 Meta Learning, Bagging and Boosting

Meta learning [13] is a field of study in ML where automatic learning algorithms are applied to meta-data about machine learning experiments. In the context of machine learning algorithms, the term meta-data refers to data that contain information about other data. The main goal of meta learning is to increase the understanding of how to improve the performance of existing learning algorithms.

Bootstrap aggregating, or so-called bagging, is a meta learning algorithm that strives to improve classification and regression models in terms of stability and classification accuracy. This algorithm is usually used with decision trees; however, its area of application covers all ML models. The algorithm works by taking a given set of training samples and creating a set of new ones that are filled with bootstrap samples. A bootstrap sample is a statistical sample taken

uniformly and with replacement, which means that the resulting sample set may contain duplicates. For m models, bagging generates m new training data sets. Each model is then fitted on one of the m bootstrap samples, and their outputs are combined: by averaging for regression, and by voting for classification.

Boosting is a meta learning algorithm that finds its root in a question posed by Kearns: "can a set of weak learners create a single strong learner?" [14]. A weak learner in this context refers to a classifier that is able to label examples better than random guessing, but is not capable of true classification. A strong learner, on the other hand, is capable of correlating well with the true classification. The answer to this question was given by Schapire, and it has proved to carry significant ramifications in machine learning and statistics, and consequently led to the development of boosting. The answer to Kearns' question is yes. While boosting is not algorithmically constrained, the majority of boosting algorithms are constructed by iteratively adding weak learners into a single composition, so that each of them is weighted according to its classification accuracy. With each added learner, the weights of the classified samples are re-evaluated, so that misclassified samples gain more importance for the learners that are yet to be added to the composition.

5.2 Bayesian Learning and Naive Bayesian Classifier

Bayesian learning [3] is a method of statistical inference that utilises Bayes' theorem to calculate how the degree of belief in a proposition changes according to evidence. The model uses the Bayesian interpretation of probability, in which probabilities represent degrees of belief.
Bayesian learning works as follows: before any data have been observed, the expectations as to the true relationship between those data can be expressed as a probability distribution over the assumptions that define this relationship. This distribution is referred to as the "prior". Once the data have been revealed to the algorithm, the revised findings are captured as a posterior probability

distribution. Assumptions that appeared plausible before, yet were revealed to be incorrect, will experience a decrease in their probability. On the other hand, assumptions that have managed to meet the expectations will have their probability increased.

Bayesian learning has found its application in the field of artificial intelligence, and therefore machine learning. This learning method has played a fundamental part in pattern recognition since the early 1950s. The classification algorithm based on this method is called the naive Bayesian classifier.

Input Data

Naive Bayesian classifiers [3] can handle any number of variables, regardless of whether they are qualitative or quantitative. The algorithm works on the assumption that the variables provided to the classifier are independent. Even though this might not always be the case, it greatly simplifies the classification task. Instead of being presented with one multi-dimensional task, the algorithm has to compute only a set of one-dimensional tasks. Furthermore, the regions near decision boundaries do not seem to be greatly affected by this, thus leaving the classification task unaffected.

Algorithm description

The classification task in the naive Bayesian classifier is performed by evaluating the posterior probability. Given a set of variables X = {x_1, x_2, ..., x_n}, the algorithm attempts to determine the posterior probability of an event C_i among a set of possible outcomes C = {c_1, c_2, ..., c_j}. A simplified formula for calculating the posterior is:

p(C_i | X) ∝ p(C_i) · ∏_{k=1}^{n} p(x_k | C_i)

Using this modified Bayes' rule formula, a new case X is labeled with the class label C_i that achieves the highest posterior probability.
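A sketch of the scoring step implied by the formula above; the priors and conditional probability tables below are hand-made for illustration, not learned from data:

```python
# Naive Bayes scoring: a class's score is its prior p(C_i) times the product
# of the per-attribute likelihoods p(x_k | C_i); the highest score wins.

def naive_bayes_score(prior, likelihoods, x):
    score = prior
    for k, value in enumerate(x):
        score *= likelihoods[k][value]
    return score

# Two classes with hypothetical conditional probability tables for two
# qualitative attributes: outlook and windy.
classes = {
    "play": (0.6, [{"sunny": 0.2, "rain": 0.8}, {"yes": 0.3, "no": 0.7}]),
    "stay": (0.4, [{"sunny": 0.7, "rain": 0.3}, {"yes": 0.6, "no": 0.4}]),
}

x = ("rain", "no")
scores = {c: naive_bayes_score(p, l, x) for c, (p, l) in classes.items()}
prediction = max(scores, key=scores.get)
print(prediction)  # 'play' -- 0.6*0.8*0.7 = 0.336 beats 0.4*0.3*0.4 = 0.048
```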

Application Domains

The most notable area of usage that employs the naive Bayesian classifier is spam detection in e-mail; however, thanks to its excellent performance and low hardware requirements, this algorithm is spreading into other areas such as text classification, medical diagnosis and system performance management. An example of this is the Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy.

Pros and Cons

The usage of the naive Bayesian classifier provides an incredibly short training time and fast evaluation. It should also be noted that this algorithm has proven to be rather well suited to real-world problems. Despite its limitations, naive Bayes was shown to be optimal for some important classes of concepts that have a high degree of feature dependencies, such as disjunctive and conjunctive concepts [15]. However, because of its simplistic nature, solving more complex classification problems is not possible with the naive Bayesian classifier.

Computational and Memory Requirements

The naive Bayesian classifier is well known for the modest requirements it puts on memory and for fast computation times that are often better than those of other algorithms.

5.3 Lazy Learning and k-nearest Neighbor Algorithm

The principle behind lazy learning is that generalization beyond the training data is deferred until a query has been made to the system. The main advantage of this method is that the target function is only approximated locally, and therefore all computation is left until classification takes place. Another significant advantage of lazy learning systems is that they can simultaneously solve multiple problems and deal successfully with changes in

the problem domain. However, there are two major disadvantages that this method suffers from: it has rather high memory requirements, since the whole data set needs to be stored, and its evaluation is slower compared to other algorithms.

Input Data

Since the k-nearest Neighbor algorithm (KNN) [16][2] is a non-parametric lazy learning algorithm, it does not make any assumptions about the underlying data distribution. This is especially important for applications that deal with observational data retrieved from real-world interactions, because such data rarely follow theoretical assumptions. KNN assumes its data points to be in a metric space. They are stored in the form of vectors, where each training example consists of a vector and a class label. The algorithm is also supplied with the number k, according to which the classification is performed. The vectors can carry both qualitative and quantitative attributes, as well as continuous and discrete ones, provided minor adjustments are made to the algorithm.

Algorithm

Arguably the simplest of all classification methods is the k-Nearest Neighbor classifier [16]. For a given test point, its k nearest neighbors among the training data are selected. The test point is then given a label by majority vote among these k neighbors, i.e. it is assigned the most common class amongst them.

Application Domains

The main usage of the KNN algorithm is classification, yet there are several other purposes for which it can be used. One of them is regression. That is achieved by altering the algorithm so that the property value of the test point is

the average of the distance-weighted values of its k nearest neighbors. Density estimation is done by placing a hypercube at the test point x and increasing its size until the required number k of neighbors is enclosed within the cube's space. The density can then be estimated using the following formula:

p(x) = k / (n V)

where n is the number of samples and V is the volume of the cube. Another possible application of KNN is the estimation of continuous variables. There are several different implementations of this functionality, yet the most well-known of them uses the inverse distance weighted average of the k nearest multivariate neighbors.

Pros and Cons

Since KNN is one of the simplest algorithms in machine learning, it is easy to understand and program. If there is no majority agreement, an explicit reject option exists. It also handles missing values easily by restricting the distance calculation to a subspace, and its asymptotic misclassification rate is bounded above by twice the Bayes error rate. As for the downsides, it is easily affected by the local structure of the data and is sensitive to noise and irrelevant features. For data sets that contain a large number of examples belonging to the same class, KNN tends to have that class dominate the prediction because of the majority vote mechanism. This can be overcome by weighting the classification according to the distance from the test point to each of its k neighbors. KNN also suffers from the curse of dimensionality: given large data sets, the nearest neighbor might not be near at all.

Computational and Memory Requirements

Since all the training examples are kept, KNN has a rather large memory consumption. Compared to algorithms such as decision trees and linear classifiers, KNN is rather inefficient in its memory usage. Time

efficiency is an issue as well, since classification of a single point is O(nd): to classify a specific point x, the algorithm has to go through all n training examples of dimension d. KNN becomes a viable choice once the data set contains many classes, for it computes predictions for all of them.

5.4 Decision Tree Algorithm

A decision tree [2][3] is a learning model designed to predict the value of a target variable based on a number of input variables. There are several notable usages of decision trees, such as random forest classifiers, boosted trees or the C4.5 algorithm [17].

Input Data

The algorithm accepts data in the form of records, where each entry has the form (x, Y) = (x1, x2, x3, ..., xk, Y), where Y is the target variable that is the subject of classification or regression. The vector x is a collection of input variables used for the task.

Algorithm

A decision tree is constructed by repeatedly partitioning the input space, so that the partitions form a tree structure. The algorithms used for constructing decision trees usually work top-down, choosing at each step the variable that best splits the current set of items [18]. Once the tree has been constructed, leaves correspond to class labels and branches to conjunctions of features that lead to those labels. Classification or regression is performed by advancing through the tree from top to bottom along the branches until a terminal node has been reached. Each leaf in the decision tree corresponds to a value of the target variable given the values of the input variables visited on the way from the root node to the leaf.
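The top-to-bottom traversal described above can be sketched in a few lines of Python. This is purely an illustration, not code from any of the tools analyzed later; the tree is represented as nested dictionaries, where an internal node names the attribute to test and a leaf carries the class label:

```python
# A decision tree as nested dicts: internal nodes are
# {"attribute": {value: subtree, ...}}, leaves are plain class labels.
def predict(tree, example):
    # Advance from the root towards a leaf, following the branch
    # that matches the example's value for the tested attribute.
    while isinstance(tree, dict):
        attribute = next(iter(tree))          # attribute tested at this node
        tree = tree[attribute][example[attribute]]
    return tree                               # a leaf: the class label

# A toy tree: whether to play tennis, depending on outlook and wind.
tennis_tree = {"outlook": {
    "sunny": {"wind": {"strong": "no", "weak": "yes"}},
    "rain": "no",
}}

print(predict(tennis_tree, {"outlook": "sunny", "wind": "weak"}))  # yes
```

Each call walks exactly one root-to-leaf path, which is why prediction with decision trees is so cheap compared to instance-based methods such as KNN.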

Application Domains

The basic usage of decision trees is to solve classification and regression problems. Regression tree analysis is used when the predicted outcome can be considered a real number; classification tree analysis is performed when the predicted outcome is the class to which the data belong. Extended usage of decision trees can be seen in a number of algorithms such as random forests, boosted trees, ID3 and its extension C4.5, the CHi-squared Automatic Interaction Detector or MARS.

Pros and Cons

Decision trees are simple to understand and interpret. They are used not only for machine learning purposes, but also for general business decision making. Even though the results of data mining via decision trees may not offer such functionality directly, classification trees built from such decision trees can be used to make decisions. This stems from the white box model, which allows us to view and understand the inner workings of the algorithm. Another strong point of decision trees is that little to no data preparation is needed. On the other hand, the problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality, even for simple concepts [19]. Decision trees also require mechanisms such as pruning, because decision tree learners can create trees that are too complex and do not generalise the data well. Another limitation is their inability to express some concepts well, such as the XOR, parity or multiplexer problems. Lastly, for categorical data with different numbers of levels, the information gain in decision trees is biased in favor of attributes with more levels.

Computational and Memory Requirements

Decision trees are very efficient in regards to resources. They use less memory than some other classification and regression algorithms and require little time to analyse large data sets.
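Algorithms such as ID3 and C4.5, mentioned above, choose the splitting attribute by information gain, a notion treated in detail in the next section. A minimal sketch of that computation, in illustrative Python with a made-up weather data set:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, labels):
    """Entropy reduction achieved by splitting `rows` on `attribute`."""
    total = len(rows)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in split.values())
    return entropy(labels) - remainder

rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, "outlook", labels))  # 1.0: the split is perfect
```

A tree learner would evaluate this gain for every candidate attribute and split on the one with the highest value, then recurse on each partition.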

5.5 Information Entropy and C4.5 Algorithm

C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan [17]. It is an extended version of Quinlan's earlier algorithm, ID3. C4.5 uses the concept of information entropy to build decision trees. Information entropy, in the context of information theory usually referring to Shannon entropy, is a measure of the uncertainty associated with a random variable. It may be quantified as the expected value of the information contained within a message, or equivalently as the average information content that is missing when the value of the random variable is not known. Shannon's entropy represents an absolute limit on the lossless compression of any communication: for messages encoded as a sequence of independent and identically distributed random variables, the average length of the shortest possible representation of the message in a given alphabet is their entropy divided by the logarithm of the number of symbols in that alphabet. The entropy rate of a data source is the average number of bits per symbol needed to encode it.

Information Gain

Information gain, also referred to as the Kullback-Leibler divergence, is a non-symmetric measure of the difference between two probability distributions [20]. It measures the extra number of bits needed to code samples from one distribution while using a code based on the other. Usually the first distribution represents the "true" distribution of the data, while the other is a theory or a model.

Algorithm

C4.5 uses two sets of vectors as its building blocks for decision trees. The first set of vectors, the training data, is supplied as a set of examples that have already been classified. Each sample is in the form of a vector whose values are its attributes or features. The second vector contains the classes to which each sample

belongs. It is used to augment the training data. At each partitioning step, C4.5 chooses an attribute based on the criterion of normalized information gain that results from choosing that attribute for splitting the data. The attribute with the highest normalized information gain is chosen [17].

5.6 Neural Networks

Neural networks are a computational model inspired by the connectivity of neurons in animate nervous systems [21]. As an alternative to the von Neumann computer architecture, which is based on sequential processing, neural networks are derived from observations of how animal brains work. The principle of a neural network was first suggested by Turing.

Input Data

Artificial neural networks are restricted to numeric data, and are only capable of processing values in a fairly limited range. If the data are in an unusual range, or if some data are missing, problems may arise. Fortunately, there are methods that can be used to address each of these issues: numeric data can be scaled to fit within the required range of the network, and missing data can be substituted using the mean value or some other statistic.

Algorithm

A neural network consists of computational elements called neurons. A neuron is a device that has many inputs and only one output, and it can function in two modes. In training mode, the neuron is presented with training input patterns and taught for which of them to fire and for which to withhold firing a signal. In using mode, the neuron evaluates the input patterns looking for a learned pattern. If such a pattern is detected, the associated output from the training data becomes the current output. If the input pattern is not recognised, the firing rule is used to determine whether to fire or not.
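The firing behaviour of a single neuron can be sketched as a weighted sum of inputs compared against a threshold. This is an illustrative perceptron-style sketch with made-up weights, not a fragment of any particular network:

```python
def fire(inputs, weights, threshold):
    """A simple threshold neuron: output 1 (fire) when the
    weighted sum of the inputs reaches the threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0

# A neuron wired to compute the logical AND of two binary inputs:
# only both inputs firing together push the sum over the threshold.
and_weights, and_threshold = [0.6, 0.6], 1.0
print(fire([1, 1], and_weights, and_threshold))  # 1
print(fire([1, 0], and_weights, and_threshold))  # 0
```

Training a network amounts to adjusting such weights until the neurons fire on the desired patterns; in practice the hard threshold is usually replaced by a smooth non-linear activation function.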

Each neuron computes a weighted sum of its inputs and can perform a non-linear function on this sum. If a sufficient number of examples is provided, and the required number of neurons is present in the network, the function computed by the network can approximate any function. The architecture of neural networks comes in two basic types. Feedforward networks allow signals between neurons to travel in one direction only, from input to output; such networks are used primarily for pattern recognition. Feedback networks allow signals to travel in both directions, which makes them computationally very powerful, but also tends to make them extremely complicated. Such a network functions dynamically, where the state of the network changes continuously until an equilibrium is reached for the given input pattern. Neural networks consist of three types of layers. The input layer represents the raw information that is fed to the network. It is connected to the hidden layer, where the activity of each hidden unit is determined by the input units and the weights on the connections. Lastly, the activity of the output units is determined by the evaluation of the hidden units and the weights on the connections that lead to the output units. There are neural networks designed for supervised as well as unsupervised learning.

Application Domains

Neural networks are best suited for pattern recognition tasks, and therefore excel at solving real-world problems related to prediction and forecasting. Many businesses already employ neural networks for customer research, data validation, sales forecasting, risk management or target marketing, because these areas usually have high amounts of training data available. The usage of neural networks is not restricted to marketing needs; there are many applications of this paradigm in the medical field and in scientific research.

Possible applications include function approximation, regression analysis, classification, data processing such as filtering, clustering or compression, and robotics.

Pros and Cons

The primary advantage of neural networks is that they can be deployed on problems for which no algorithmic solution can be constructed. This sets them apart from any other tool available, for there is a number of cases where no other solution will suffice. It is important to note that models using neural networks need a high tolerance for errors, which stems from the non-algorithmic way they work. Another concern regarding this algorithm is the black box principle. Unlike white box algorithms such as decision trees, where it is possible to see and understand the structure and inner workings of the algorithm, neural networks, and black box algorithms in general, do not offer such insight.

Computational and Memory Requirements

The requirements that neural networks place on hardware vary depending on the size of the network and its implementation. Artificial neural networks that solve difficult problems such as speech recognition may consist of thousands of neurons with hundreds of inputs, and such networks require enormous resources, or even custom hardware solutions. Also, the input data supplied during the training phase tend to be considerably more complex than those for classification, so different hardware requirements need to be met for each of the two modes.

5.7 Support Vector Machines

A support vector machine [22] is a non-probabilistic binary linear classifier. It provides supervised learning methods for classification and regression analysis. An SVM takes a set of training data and predicts, for each given input, to which of the two possible classes it belongs. To achieve this, a kernel function to map the

training data into a feature space is used. The data are then separated using a large-margin hyperplane, which establishes the two separate classes.

Input Data

The input data for SVMs are sets of vectors, often given in a finite dimensional space. However, some data given in such a way might not be linearly separable; mapping such data onto a high-dimensional space makes their separation easier.

Algorithm

In the first part of the learning process, the SVM learns the division of classes by separating the training data into two sectors within the specified high-dimensional space. The training data are mapped onto a high-dimensional plane using a kernel function. The algorithm then constructs a hyperplane within the specified space. The hyperplane is chosen in such a way that its distance from the nearest training data is the largest possible, because the larger the margin, the lower the generalisation error of the classifier. The most commonly used kernel functions for the mapping between the training data and the high-dimensional space are RBF kernels,

k(x, x') = exp(−||x − x'||² / σ²),

and polynomial kernels,

k(x, x') = (x · x')^d.

Application Domains

The most common usage of SVMs is for classification and regression. The motivation for using SVMs is that the solution is always global and unique. The algorithm also has a simple geometric interpretation and gives a sparse solution. Real-world applications that use SVMs range from research, through education, to commerce. Some examples of such applications are Protein

Structure Prediction, Support Vector Classifiers for Land Cover Classification, or even Intrusion Detection Systems.

Pros and Cons

One of the most notable limitations connected to the usage of SVMs is that classification is only directed at two-class tasks. Therefore, in order to solve multi-class tasks, algorithms that reduce the problem to several binary problems have to be applied. Also, the parameters of a solved model are hard to interpret, and class membership probabilities are uncalibrated.

Computational and Memory Requirements

Since SVMs work in a high-dimensional space, the curse of dimensionality takes its toll on hardware requirements, and memory consumption is rather high.
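The two kernel functions named in the algorithm description above can be sketched directly. This is illustrative Python; σ and d are free parameters chosen by the user:

```python
from math import exp

def rbf_kernel(x, y, sigma):
    """RBF kernel: exp(-||x - y||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-sq_dist / sigma ** 2)

def polynomial_kernel(x, y, d):
    """Polynomial kernel: (x . y)^d."""
    return sum(a * b for a, b in zip(x, y)) ** d

print(rbf_kernel([0, 0], [0, 0], sigma=1.0))   # 1.0: identical points
print(polynomial_kernel([1, 2], [3, 4], d=2))  # 121: (1*3 + 2*4)^2
```

In a full SVM, such pairwise kernel values populate the Gram matrix from which the maximum-margin hyperplane is computed, without ever constructing the high-dimensional feature vectors explicitly.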

Chapter 6

Freely Available Software Tools

6.1 WEKA

Weka is a fully functional data mining software package that provides a high level of functionality to its users. For example, the software provides API support, which allows the various components of the software to communicate with each other.

API

It has been noted that the API functionality of Weka provides users with the ability to achieve increased functionality thanks to the many freely available pieces of programming code available online [23]. Moreover, the software is able to perform over 100 types of data mining methods, including Bayesian methods, rule-based methods and statistical analysis. The inclusion of so many different types of data mining methods makes Weka useful for a wide variety of data mining tasks and industries; users in different industries are unlikely to face any inability to use a desired data mining method with Weka.

Database System Support

Another strength of Weka is that the software natively supports reading files from a variety of database formats [24]. For users who obtain data from the internet, a specific strength of Weka is the ability to acquire data both from SQL databases and from actual webpages, by entering the URL of the webpage containing the information [25]. This makes it possible for users to easily input information into the software that might not be in a format easily read by other data mining packages.
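Weka's native file format is ARFF, a plain-text format with @attribute declarations followed by comma-separated data rows. As a sketch of what such a file looks like, here is a deliberately simplified reader in plain Python; a real ARFF file supports more features (quoting, sparse data, comments in more positions), and Weka itself of course handles the format natively:

```python
def parse_arff(text):
    """Very small ARFF reader: returns (attribute names, data rows).
    Handles only the basic @attribute/@data layout, no quoting."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            attributes.append(line.split()[1])    # second token is the name
        elif lower.startswith("@data"):
            in_data = True                        # everything after is data
        elif in_data:
            rows.append(line.split(","))
    return attributes, rows

sample = """@relation weather
@attribute outlook {sunny, rain}
@attribute play {yes, no}
@data
sunny,no
rain,yes"""
print(parse_arff(sample))
```

This prints the attribute names `['outlook', 'play']` together with the two data rows, which mirrors how Weka presents a loaded data set as named attributes plus instances.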

Visualization Capabilities

While Weka has strong support for APIs, a variety of data mining methods and supported database systems, one of the weaknesses of the software is its visualization support. It is important to note that the software does provide visualization of data, results and processes, but the support is somewhat limited: the visualization is not as colorful or as detailed as that of other data mining software packages [26]. However, the visualization provided is certainly sufficient for viewing the data on which the analyses are being performed and the results of the analysis efforts. In addition, add-ons are available that can increase the visualization functionality of the software [24]. As part of this visualization support and the available add-ons, Weka is able to interface with the R statistical package, not only to increase its statistical analysis functions, but also to allow for increased visualization of statistical analyses and results [26].

PMML Support

Weka has support for PMML. This allows users to import PMML files created in both proprietary and open-source data mining and statistical software packages. However, the software does not currently support exporting data files in the PMML format for use in other applications; this functionality is planned for future releases of the software [24].

Statistical Analysis Capabilities

Weka can perform just about any type of statistical analysis. In addition to the most basic descriptive and inferential statistical analyses, the software also allows cluster analysis to be performed. Also, as has already been noted, Weka has the ability to interact directly with the R statistical package.
This makes it possible to increase the statistical functionality of the software, and also allows users who are more comfortable or familiar with R to perform their analyses in that environment.


More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Learning outcomes. Knowledge and understanding. Competence and skills

Learning outcomes. Knowledge and understanding. Competence and skills Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

More information

Microsoft Azure Machine learning Algorithms

Microsoft Azure Machine learning Algorithms Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql Tomaz.kastrun@gmail.com http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

More information

Neural Networks and Back Propagation Algorithm

Neural Networks and Back Propagation Algorithm Neural Networks and Back Propagation Algorithm Mirza Cilimkovic Institute of Technology Blanchardstown Blanchardstown Road North Dublin 15 Ireland mirzac@gmail.com Abstract Neural Networks (NN) are important

More information

Learning is a very general term denoting the way in which agents:

Learning is a very general term denoting the way in which agents: What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Data Mining Techniques for Prognosis in Pancreatic Cancer

Data Mining Techniques for Prognosis in Pancreatic Cancer Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier Data Mining: Concepts and Techniques Jiawei Han Micheline Kamber Simon Fräser University К MORGAN KAUFMANN PUBLISHERS AN IMPRINT OF Elsevier Contents Foreword Preface xix vii Chapter I Introduction I I.

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

8. Machine Learning Applied Artificial Intelligence

8. Machine Learning Applied Artificial Intelligence 8. Machine Learning Applied Artificial Intelligence Prof. Dr. Bernhard Humm Faculty of Computer Science Hochschule Darmstadt University of Applied Sciences 1 Retrospective Natural Language Processing Name

More information

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut. Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,

More information

Intrusion Detection via Machine Learning for SCADA System Protection

Intrusion Detection via Machine Learning for SCADA System Protection Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

Nine Common Types of Data Mining Techniques Used in Predictive Analytics 1 Nine Common Types of Data Mining Techniques Used in Predictive Analytics By Laura Patterson, President, VisionEdge Marketing Predictive analytics enable you to develop mathematical models to help better

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

A New Approach for Evaluation of Data Mining Techniques

A New Approach for Evaluation of Data Mining Techniques 181 A New Approach for Evaluation of Data Mining s Moawia Elfaki Yahia 1, Murtada El-mukashfi El-taher 2 1 College of Computer Science and IT King Faisal University Saudi Arabia, Alhasa 31982 2 Faculty

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Neural Networks in Data Mining

Neural Networks in Data Mining IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V6 PP 01-06 www.iosrjen.org Neural Networks in Data Mining Ripundeep Singh Gill, Ashima Department

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Problem: HP s numerous systems unable to deliver the information needed for a complete picture of business operations, lack of

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

More information

Data Warehousing and Data Mining in Business Applications

Data Warehousing and Data Mining in Business Applications 133 Data Warehousing and Data Mining in Business Applications Eesha Goel CSE Deptt. GZS-PTU Campus, Bathinda. Abstract Information technology is now required in all aspect of our lives that helps in business

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Machine Learning and Data Mining. Fundamentals, robotics, recognition Machine Learning and Data Mining Fundamentals, robotics, recognition Machine Learning, Data Mining, Knowledge Discovery in Data Bases Their mutual relations Data Mining, Knowledge Discovery in Databases,

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

Statistics for BIG data

Statistics for BIG data Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association

More information

A Survey on Pre-processing and Post-processing Techniques in Data Mining

A Survey on Pre-processing and Post-processing Techniques in Data Mining , pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,

More information

A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining

A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining Sakshi Department Of Computer Science And Engineering United College of Engineering & Research Naini Allahabad sakshikashyap09@gmail.com

More information

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

Specific Usage of Visual Data Analysis Techniques

Specific Usage of Visual Data Analysis Techniques Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia

More information

Data Mining and Neural Networks in Stata

Data Mining and Neural Networks in Stata Data Mining and Neural Networks in Stata 2 nd Italian Stata Users Group Meeting Milano, 10 October 2005 Mario Lucchini e Maurizo Pisati Università di Milano-Bicocca mario.lucchini@unimib.it maurizio.pisati@unimib.it

More information

Chapter 12 Discovering New Knowledge Data Mining

Chapter 12 Discovering New Knowledge Data Mining Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

More information

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d. EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models

More information

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors Classification k-nearest neighbors Data Mining Dr. Engin YILDIZTEPE Reference Books Han, J., Kamber, M., Pei, J., (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

Monday Morning Data Mining

Monday Morning Data Mining Monday Morning Data Mining Tim Ruhe Statistische Methoden der Datenanalyse Outline: - data mining - IceCube - Data mining in IceCube Computer Scientists are different... Fakultät Physik Fakultät Physik

More information

A Property & Casualty Insurance Predictive Modeling Process in SAS

A Property & Casualty Insurance Predictive Modeling Process in SAS Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

DATA PREPARATION FOR DATA MINING

DATA PREPARATION FOR DATA MINING Applied Artificial Intelligence, 17:375 381, 2003 Copyright # 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 u DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI

More information

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information