Data Mining and Automatic Quality Assurance of Survey Data


Autumn 2011
Bachelor Project for: Ledian Selimaj (081088)

Data Mining and Automatic Quality Assurance of Survey Data

Abstract

Data mining is an emerging field of computer science that sits between statistics, machine learning and pattern recognition. Classification is one of its main tasks: its goal is to assign previously unseen data to predefined categories, based on knowledge acquired beforehand from data of the same type. Programmatic quality checking of field survey data is difficult because it requires manual supervision to identify and differentiate data trends. This paper explores the possibility of running various classifiers on field survey data and then using different techniques to build classification models that mimic manual supervision, so that the quality of field survey data can be checked programmatically. The choice of the right classifiers and model testing techniques is of crucial importance for the whole process, since the main purpose is to ensure high reliability of the predictive classification models that will be used to predict the category to which new and previously unknown data belongs. Emphasis is placed on picking and using the best classifiers, on testing them in several ways, and on combining them for a classification task in order to provide more accurate results than a single algorithm, which might suffer from structural problems.

I accept that the report is available at the library of the department.

Student: Ledian Selimaj
Supervisor: Chengzi Chew
Company: DHI
Coordinator: Hans Christian Pedersen
Ext. Examiner: Jens Damgaard Andersen


This is to verify that:

- The entire material in this thesis report contains only my original work.
- All acknowledgments for the material referenced have been explicitly made.

Ledian Selimaj
January 2012


To my beloved Mother...


Contents

1 Problem Formulation
2 Introduction
    Business understanding
    The problem
    Why data mining?
    Data mining as a process
    Thesis Objectives
    Thesis Outline
3 Examining and Understanding the Data
    Implications of data sets
    Attributes
    Statistics
    Numerical attributes
    Class attribute
    Analysis of the data
    Data quality
    Correlation analysis and attribute selection
    Outliers
4 Model Creation
    Classification
    Model creation process
    Testing techniques
    Classifiers
    Trees
    Rules
    Meta
    Alternative approaches
5 Model Assessment and Selection
    Model assessment process
    Model selection
    Measures of performance
    Tests and experiments
    Classifying unlabeled data using selected model
6 Customized AutoQA tool
    Tool Development
    Integrating Weka library
    Design issues
    Functionalities
7 Conclusions
    Conclusions
Bibliography
A Test and experiments full results
    A.1 MS
    A.2 MS
    A.3 MS
B AutoQA Monitor Requirement Specifications
    B.1 Requirements
    B.2 Use cases
List of Figures
List of Tables

Acknowledgements

Not everything that counts can be counted, and not everything that can be counted counts (A. Einstein)

This bachelor thesis was done at DHI. All the material used for the study was provided by the DHI Data Center.

Special thanks to:

- Chengzi Chew for his supervision and technical support
- Hans C. Pedersen for his supervision and encouragement


Chapter 1 Problem Formulation

This thesis explores the possibility of using data mining learning algorithms and techniques to perform quality assurance of field survey data by labeling the data as correct or incorrect. The data is gathered from numerous sensors at stations located at particular sites in the Baltic Sea. Incorrect data is caused by errors in the sensors that gather the data, as well as by external factors during data transmission from the stations to the data center.

The central task is to draw conclusions about which data mining techniques and algorithms are best suited for the quality assurance of the data. To achieve this, many tests must be performed and their results analyzed. Implementing an application that provides the data mining apparatus needed to perform the tests is also a significant part of this thesis. The main issues to be solved for a successful approach to the problem are:

Constructing data sets

Special attention is given to constructing proper data sets as inputs for the classifiers, ensuring that they contain enough information for the classifiers to find the patterns underlying the data. A proper analysis of the physical meaning of the data must be done in advance. Classifying wrong and right data is not trivial, and a number of parameters should be included in the data sets to make the classification easier and more accurate. The decision on what data to include depends on the correlation of the various physical parameters, and it has to be discussed in light of the performance of the classifiers that run on these data.
Using the best data mining practices

The data mining process of building and testing a classification model follows a specific protocol. Several such protocols are available, so a number of tests are needed to pick the most reliable and safe technique both for creating and for testing a classifier. The reliability of the performance measured during testing is very important, because it is what will later be relied upon when classifying unlabeled data, where the result of the classification should be as accurate as possible.

Selecting the best classifier

Judging the performance of a classifier for a given problem and comparing it with others is always done with the aim of picking the best performing classifier. This becomes complicated when the results of many classifiers are close to each other. Several measures of the performance of each classifier should therefore be considered before deciding on the best one. Measures that show how well the individual classes are classified receive special attention.

There are also techniques used to combine several classifiers into a single classification. How much such a combination improves on the classification done by a single classifier needs to be verified. Ensemble methods have shown promising results in data mining classification tasks, so the degree of improvement that ensemble classifiers offer over the primitive ones has to be verified through tests.

Chapter 2 Introduction

This chapter presents the problem of this thesis from the business point of view. It is a real-life problem for which data mining might offer a solution. A definition of data mining is given, followed by a brief explanation of why and how data mining can provide a reasonable solution to the problem. Afterwards a data mining problem-solving methodology is briefly explained, followed by the objectives of the thesis and an outline of the remaining chapters.

2.1 Business understanding

A large amount of hydrographical data is transmitted from three large monitoring stations placed at different water depths in the Baltic Sea. These stations transmit data directly to a data handling center located in Denmark. It is the data center's task to consolidate the transmitted data, monitor it continuously, and ensure its quality and reliability. Quality assurance in particular is difficult to automate, since judging whether the data coming from the sensors is correct involves evaluating many parameters, their historical values, the time of the year when the data sample was taken, and many other factors. Data mining classifiers could be useful in this situation. They run on the data and construct classification models that can perform the required classification, that is, label each instance of the data set as either correct or incorrect. Data mining classification models might therefore provide a convenient way to automate the quality assurance of the field survey data.

2.1.1 The problem

The data is transmitted from the sensors placed in the sea to the data handling center. The raw data is stored in a database after some reallocation and calibration processes. The automatic quality assurance (autoqa) then checks the quality of the received data. After this post-processing step has finished successfully, the data is updated in the database. The autoqa goes through the data, makes some checks and assigns a flag to each instance depending on the result of the checks. For example, if the data is bad it is marked with 4, otherwise it is marked with 5 [17]. The value of the flag depends on calculations and a number of checks performed on the values of each attribute of the data instances. The tests needed to check the correctness of a data instance are:

- check if the value exceeds a predetermined threshold set as its maximum allowed value
- check if the value goes below a predetermined threshold set as its minimum allowed value
- check if the value has increased within a predetermined gradient in 1 measurement time step
- check if the value has decreased within a predetermined gradient in 1 measurement time step
- check if the value is repeated for a predetermined number of time steps

The automatic quality assurance process is followed by a manual quality assurance of the data. Because the autoqa filter lacks intelligence and its threshold values are hardcoded, it misclassifies good and bad data instances. A reviewer therefore needs to go through the data manually and change the automatic quality assurance flags based on many standards and relations among the data. This process is very time consuming as well as expensive: a person needs to check every instance in the data sets and decide whether the automatic quality assurance classified it correctly. This raised the idea of automating the process, which would make it both faster and cheaper. Learning algorithms that perform checks similar to those the reviewer does manually can be used in this situation. Since the task at hand is a classification task, data mining classifiers offer a good chance to automate the autoqa filtering.

2.1.2 Why data mining?
There are many different definitions of data mining, but at the core of all of them is that data mining is the process of automatically discovering useful information in large data repositories [1]. This useful information (or knowledge) is found as patterns in the data, expressed in different structures. Structure here means that the patterns found are represented in an explicit form, for example a tree, a set of rules, decision tables etc. The power of these patterns is that the meaningful information found in some data can be applied to previously unknown data and used for classification predictions, which leads to advantages, usually economic ones [2]. A classification done by a computer program has obvious benefits compared with, for instance, manual classification, which is impractical for large data sets.

Classification is also one of the main tasks of data mining. Predictive classification models can be seen as functions that have the class value as their output and some explanatory variables as inputs. Practice has shown that applying the learned knowledge structures to new data can generate satisfactory predictions [2].

Quality assurance of field survey data is a typical classification task. Data that is meaningful and conveys the right information can be labeled as true, and as false otherwise. Patterns can therefore be retrieved from data sets that have already been manually classified as true and false, and afterwards used to classify new data with empty class values. The level of accuracy depends on the classifier as well as on the information contained in the data sets it runs on. Data mining offers various techniques for obtaining more accurate predictions. The process of constructing and testing the models should be carried out in a manner that ensures the reliability of the predictive abilities of a particular classifier. Once this reliability is obtained, the chances of success in future predictions on new data are high.

2.2 Data mining as a process

Data mining consists of a set of tools and algorithms that can be used to solve different business problems. What really matters, however, is the analysis and the knowledge gained by using these tools in the right way and in a structured manner. Data mining should therefore be viewed as a process. The standard process used here is the CRISP-DM framework: the Cross-Industry Standard Process for Data Mining. CRISP-DM demands that data mining be seen as an entire process [3]. According to the CRISP-DM methodology the lifecycle of a data mining project consists of six phases:

Business understanding

This phase describes the problem from the business point of view and tries to understand its objectives and requirements. Afterwards the problem is converted into a data mining problem.

Data understanding

It is crucial for a data mining project to understand the nature of the data that is going to be processed. The attributes should be analyzed and all the necessary information should be included in the data set. It is also useful to go through the data and check its quality, for example whether there are missing values, redundant instances etc.
Data preparation

After all relevant analysis and cleaning of the data have been done in the preceding phase, the final data set needs to be constructed. This data set will be the input for all the models that will be constructed and used.

Model building

There are many different techniques for building models in data mining. Choosing the best one is always a priority, so several parameters and techniques should be tried.

Model evaluation

After several models have been constructed in various ways, a proper analysis should be done to find the best performing model and the most successful techniques used for constructing it. At the end of this analysis only one model is chosen for the task.

Model deployment

The chosen model is applied to previously unknown data. Even though the purpose of the chosen model is to increase the knowledge gained from the data, it must be applied properly to that data, otherwise the result may not be satisfactory.

Figure 2.1: CRISP-Data Mining project methodology [16]

Figure 2.1 depicts the entire process as explained above. As the figure shows, going back through the steps is sometimes necessary. For example, in the Data understanding, Modeling and Model evaluation steps it may be appropriate to go one step backward in order to make the necessary changes in case something went wrong or unexpected results were generated. This is important to ensure that the whole process is compact and coherent and yields the best possible results.

2.3 Thesis Objectives

The main subject of the thesis is to explore the opportunity of applying data mining classification models for the quality assurance of field survey data. It examines which procedures should be followed in order to get satisfactory classification results and reliability for future usage on new data. It also concludes on which classifiers should be used to achieve the same goal. Other ways to combine and use these classifiers are tested for their ability to increase the overall performance. Special attention is given to constructing the data sets so that no useful information is missing that would affect the performance of the classification models. In this context some effort is put into seeing how changes to the data sets, such as including additional auxiliary information, affect the performance of the models.

This thesis introduces a customized application that offers a set of tools and classifiers for experimenting and drawing conclusions on all the topics mentioned above. The classifiers and the data mining techniques are taken from the Weka system [4], then customized and used. All of these serve the goal of making good use of data mining in solving the problem at hand. The application is called AutoQA Monitor. Its use cases and requirement specifications are shown in Appendix B, and a summary of its functionalities is given in Chapter 6.

2.4 Thesis Outline

Following the CRISP-DM data mining project methodology introduced in Section 2.2, the rest of this thesis report is structured as follows:

Chapter 3

This chapter explains in detail all the steps followed to build the data sets used in this thesis, as well as their physical meaning. This is done by looking closely at the attributes of the data sets. Focus is also put on the class attribute, that is, the attribute used for the classification of each data instance. The chapter also analyzes the data from the data mining (and statistics) point of view, in order to gain a good understanding of the data being processed and to explore the possibility of applying preprocessing techniques before using it as input for the different classifiers.
Chapter 4

This chapter explains in depth the process of constructing data mining classification models. First a general view of the whole process is given, and then each step of the process is explained more specifically. Several model construction and testing techniques are presented and contrasted with each other, based on the tests done to find the best ones to use. The chapter also explains the classifiers chosen for the task and presents some other existing classifiers which are not included or used in the AutoQA Monitor.

Chapter 5

This chapter explains which factors were considered decisive in picking the best classifiers, contrasts them with the others and justifies the decision. Several statistical metrics are used to compare different classifier models. Another discussion concerns whether to use one classification model or to combine several with the goal of increasing the classification accuracy. A visualization tool included in AutoQA Monitor for comparing the class predictions of a particular classifier is shown. It can be used to see how well two different class values have been predicted by a classifier. Finally, the prediction of data with an unlabeled class attribute is discussed and the tools available in AutoQA Monitor are shown.

Chapter 6

This chapter explains the development of the customized AutoQA Monitor data mining tool and how the Weka system libraries were integrated into it. It explains its functionalities in accordance with the requirement specifications and use cases built for it. A verification of the correctness of the software is also presented.

Chapter 7

This chapter presents the overall conclusions of the problem solution, stating each conclusion after the necessary tests and verifications were performed. It also tries to map each of the problems presented in Chapter 1 to the corresponding conclusions derived after all the defined tests were completed. It then mentions some future improvements to the goal of this thesis and to the AutoQA Monitor.
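As a reference point for the chapters that follow, the five rule-based checks of the existing autoqa filter listed in Section 2.1.1 might be sketched as below. This is only an illustration in Python, not the filter used at the data center. The names `Limits` and `autoqa_flags` and every threshold value are hypothetical; only the flag values 4 (bad) and 5 (good) come from the text.

```python
from dataclasses import dataclass

BAD, GOOD = 4, 5  # flag values mentioned in Section 2.1.1


@dataclass
class Limits:
    # All thresholds below are hypothetical placeholders, not real settings.
    max_value: float = 125.0  # maximum allowed value
    min_value: float = 0.0    # minimum allowed value
    max_rise: float = 20.0    # largest allowed increase in one time step
    max_fall: float = 20.0    # largest allowed decrease in one time step
    max_repeats: int = 5      # identical consecutive readings tolerated


def autoqa_flags(values, limits=Limits()):
    """Return one flag (BAD or GOOD) per reading, applying the five checks."""
    flags = []
    run_length = 1  # length of the current run of identical values
    for i, value in enumerate(values):
        # Checks 1 and 2: value within the allowed minimum and maximum.
        ok = limits.min_value <= value <= limits.max_value
        if i > 0:
            step = value - values[i - 1]
            # Checks 3 and 4: change within one measurement time step.
            if step > limits.max_rise or -step > limits.max_fall:
                ok = False
            # Check 5: value repeated for too many time steps.
            run_length = run_length + 1 if step == 0 else 1
            if run_length > limits.max_repeats:
                ok = False
        flags.append(GOOD if ok else BAD)
    return flags
```

Because every limit is a fixed number, a filter of this kind cannot adapt to the season, the weather or the history of a station, which is the weakness that motivates the classification approach of this thesis.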

Chapter 3 Examining and Understanding the Data

As stated in Section 2.2, one of the steps of CRISP-DM is understanding the data in which patterns will be discovered. A data set should be carefully analyzed before being used as input to a machine learning algorithm (in this thesis those algorithms are the classifiers). This chapter describes in turn the physical meaning, the construction and the analysis of the data sets. In order to understand the data well, various statistical measures are calculated both on the continuous (numerical) attributes and on the nominal attribute, which is the attribute to be classified by the different classifiers. This chapter also describes how the data sets were generated using a computer program. Two different types of data sets are introduced and analyzed.

3.1 Implications of data sets

It is very important to explain the physical meaning of the data being used. The machine learning algorithms find patterns in data that contain some internal information. This section explains the attributes of the data sets and tries to define what they mean in a context other than data mining. As already mentioned, two types of data sets are used, but they contain the same data. The second type has some modifications compared with the first one, which are explained later in this section. The data set types consist of the following attributes respectively:

DateTime | Turbidity | WindSpeed | WindDirection | Depth | Gradient | ManualQA

Table 3.1: Data set type 1 attributes

DateTime | Turbidity | WindSpeed | WindDirection | Depth | Gradient | NTU(t-1) | NTU(t-2) | NTU(t+1) | NTU(t+2) | ManualQA

Table 3.2: Data set type 2 attributes

The data sets used for predictions on unlabeled (new) data have the same attributes, but with PredictedQA instead of ManualQA.

Before explaining all the attributes, some general concepts are defined. Each data set consists of data for a particular station. Data from three stations overall is used for the tests in this thesis. The stations differ in that they are placed in different geographical locations. A station is a buoy carrying a number of sensors that collect various data, for example salinity, turbidity, temperature etc. The data is then transmitted to the data center via a network. The focus of this thesis is to provide convenient and accurate ways to ensure the quality of the turbidity data received from the stations.

A program was used to build the data sets. Since the data needed is spread over different files, and the time stamps of the values also differ (even though they are close to each other), a merge is necessary to bring everything into one file. As already mentioned there are three stations, so three main data sets are constructed. However, in order to test and verify how the classifiers react to changes in the data sets, another three data sets were created. The changes were made, with some justification, to improve the accuracy of the classifiers, as will be shown in the next chapter. Each of these six data sets is then split into two parts, one containing one third of the instances and the other two thirds. The first is used for testing during model testing and the second for training during model construction. The total number of data sets constructed by this program is therefore 12.

3.1.1 Attributes

DateTime is a timestamp of when each data instance is received. Instances are usually received at predefined time intervals that range from every 10 minutes to every hour.
Continuous monitoring is the reason for receiving data at the defined intervals. This is done to avoid losing necessary data in case the sensors encounter an error of some type.

Turbidity is the amount of particulate matter that is suspended in water. Turbidity is measured by shining a light through the water and is reported in nephelometric turbidity units (NTU) [5]. Turbidity makes the water look cloudy or opaque depending on the climate activity. During periods of normal wind speed and normal water flow, turbidities are low, usually less than 10 NTU. During a rainstorm or a strong wind, however, particles from the surrounding land (rocks or just land) are washed into the water, giving it a muddy brown color that indicates higher turbidity values. Very strong wind also creates large waves in the sea, so the water flow is high and a large volume of water is moving. This causes particles from the sea bed to move upward and mix with other existing particles, which results in higher turbidities. Negative values of turbidity do not exist, and according to the calibration of the sensors the value should not exceed 125. The above description of turbidity explains why two other attributes, wind speed and wind direction, are included: when the wind speed is high, a high NTU value is expected, and vice versa.

Wind Speed is the wind speed in m/s. The reason for including it in the data set is its close relation to the turbidity values, as explained above.

Wind Direction is used to check for objects in different directions from the position of the buoy. If, for example, the buoy is close to the shore or there are rocks in the surroundings, they might cause an increase in the NTU value. If nothing is in the surroundings, then something else might have caused the increase in the NTU values, or an error has occurred. This value is measured in degrees and ranges from 0 to 360.

Depth is the water level at which the sensors are placed in the field. Different NTU values are measured at different levels, again depending on the weather conditions. For example, the surface might have lower values than the deeper levels when there are high waves.

Gradient is the slope between the NTU values of two consecutive data instances over time. It is included because the NTU value is expected to decrease gradually after reaching a high value, so the gradient should have small, zero or even negative values, but not very large ones.

ManualQA is the attribute holding the classification made by a person who manually went through the data, checking for incorrect NTU values received from the stations. The labels True and False stand for correct and incorrect data instances.

In this context, the second data set type has four additional attributes holding the two previous and the two following NTU values. This gives more explicit information on the change of the NTU values and makes sudden changes easier to detect.

3.2 Statistics

This section provides some analysis of the data sets. In order to get to know the data it is important to perform some calculations over the data sets to examine the types of the attributes, their value ranges, variance and so on. First the numerical attributes are explained and then the class attribute. Afterwards a correlation analysis is performed, as well as some data quality checks on the data sets.
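The four extra attributes of data set type 2 described at the end of Section 3.1 can be illustrated with a short Python sketch. The function name `add_neighbour_values` is hypothetical, and dropping the rows at the edges of the series, where a full neighbourhood is not available, is an assumption; the text does not say how the original program treated them.

```python
def add_neighbour_values(ntu, lags=(-1, -2, 1, 2)):
    """Build rows holding the current NTU value plus its neighbours.

    The lags produce the attributes NTU(t-1), NTU(t-2), NTU(t+1), NTU(t+2).
    Rows without a full neighbourhood are dropped (an assumption).
    """
    reach = max(abs(k) for k in lags)
    rows = []
    for t in range(reach, len(ntu) - reach):
        row = {"NTU": ntu[t]}
        for k in lags:
            # "+d" formatting yields names such as NTU(t-1) and NTU(t+2).
            row[f"NTU(t{k:+d})"] = ntu[t + k]
        rows.append(row)
    return rows
```

With the neighbouring values placed on the same row, a classifier can compare a reading with its immediate past and future without having to reconstruct the time order of the instances.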

Figure 3.1: An overview of data sets used in this thesis

3.2.1 Numerical attributes

As mentioned previously, two types of data sets are examined and used for the data mining process. The first data set type has the following numerical attributes:

- Turbidity
- Wind Speed
- Wind Direction
- Depth
- Gradient
- Date Time: this attribute contains time stamps in the date format explained in [7]. These values are converted into doubles once uploaded to AutoQA Monitor (the same way Weka handles this type of value).

The statistical summary includes the following group of metrics for the data sets:

- MaxValue: the maximum value of the attribute.
- MinValue: the minimum value of the attribute.
- Range: the difference between the Max and Min values. It is a measure of variability and gives the maximum spread of the data.
- Mean: the mean of the values of an attribute. It is a measure of the location of the data. If the data values are distributed symmetrically, the mean is the middle value of the range of values the data contains.
- Median: the value that separates the higher half of the values from the lower half. It is also a measure of the location of the values of an attribute.
- Standard deviation: a measure of the spread of the data.

[Table 3.3 lists DataSet, Attribute, Max, Min, Range, Mean, Median and StDev for the attributes Turbidity, WindSpeed, WindDirection, Depth and Gradient of TrainingSetMS01-1, TrainingSetMS02-1 and TrainingSetMS03-1.]

Table 3.3: Summary statistics of the training data sets

3.2.2 Class attribute

There is only one class attribute in all the data sets used. It is called ManualQA and it can take only two categorical (binary) values: True and False. Table 3.4 shows the frequency of each of the two values of the ManualQA attribute for each of the data sets, in order to give a better understanding of the ratio of data instances classified as True and False.

[Table 3.4 lists Station, DataSetName, FrequencyClassTrue and FrequencyClassFalse for the training and test sets of stations MS01, MS02 and MS03.]

Table 3.4: Frequencies of class labels for data sets of type 1

As the table shows, instances labeled False are fewer than those labeled True, so True is the mode in all the data sets. It might therefore be a challenge to classify the False data instances correctly. The training and test sets for each station also have the same distribution of class values. The reason for this is explained in Section 4.2. The data sets of type 2 have the same distribution of class values, so they are not shown in the table. Histograms of the distribution of the True and False class labels are shown in Figure 3.2.

Figure 3.2: Histograms showing the distribution of TRUE and FALSE class values for the data sets
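One way to obtain training and test sets with the same class distribution, as described above, is to split each class separately. The sketch below is a minimal Python illustration that assumes a one third / two thirds split per class. The function `stratified_split` and its parameters are hypothetical and do not reproduce the program used in the thesis.

```python
import random
from collections import Counter


def stratified_split(instances, label_of, test_fraction=1 / 3, seed=0):
    """Split instances into (train, test), keeping class ratios in both."""
    rng = random.Random(seed)
    # Group the instances by their class label.
    by_class = {}
    for inst in instances:
        by_class.setdefault(label_of(inst), []).append(inst)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's data is not shuffled
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def class_frequencies(instances, label_of):
    """Count how many instances carry each class label."""
    return Counter(label_of(inst) for inst in instances)
```

Splitting per class matters here because False is the minority label: a purely random split could leave the test set with very few False instances, which would make the per-class performance measures unreliable.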

25 3.3. ANALYSIS OF THE DATA Analysis of the data Before performing data mining model creation processes it is necessary to create a general idea of the data that is being used as well as apply some preprocessing techniques on it. This section shows some inspections done on the data sets to conclude on the quality of the data. Also a correlation analysis is done to find possible correlations between the attributes. Attribute selection and outlier analyzes are also included in this section Data quality Data quality has a great impact on the success rate of the data mining algorithms. If wrong data is being input to the classifiers, the models generated won t be so robust and will have poor performance on the classification tasks. The issues that will be considered regarding the data quality are: Missing values Since the data sets were created using a program this issue is already handled. The data sets created have no missing value for each and every attribute they consist of. Duplicate values For the same reason mentioned above, there are no duplicate values in the data set. A quick check for this would be to see if there is any time stamp which is identical. If so, the calculated gradient would be infinite, because the gradient is calculated using the following formula: (turbiduty i turbidity i 1 ) (time i time i 1 ) (3.1) The gradient is calculated before data instances are randomized, thus it is done for every succeeding value. So, in case there will be any duplicated values of gradient there will be a division by 0 (infinity). Wrong values As explained in the previous section, the attributes in the data sets have values that fall in a range of acceptable values. For instance turbidity has values ranging from 0 to 125 NTU. Also, as seen from the statistics provided before for each attribute in every data set there are some negative values. Moreover the wind direction attribute must contain values from 0 to 360. 
Since every record in the data sets represents a real value retrieved from the stations in the field, removing such records before training and building classification models would make the models too idealistic, and their performance on future classification tasks may suffer when they encounter this kind of data. However, a test can be run to verify how these records affect the performance of the classification models.
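The checks described in this section (duplicate time stamps, the gradient of formula 3.1, and range checks) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample records and the choice of seconds as the time unit are hypothetical and are not taken from the program that generated the data sets.

```python
from datetime import datetime

def gradients(times, turbidity):
    """Gradient of formula 3.1 between successive records.

    Assumption: time differences are measured in seconds. The first record
    has no predecessor, and a duplicated time stamp would divide by zero,
    so both cases are reported as None instead of a number.
    """
    result = [None]
    for i in range(1, len(times)):
        dt = (times[i] - times[i - 1]).total_seconds()
        if dt == 0:
            result.append(None)
        else:
            result.append((turbidity[i] - turbidity[i - 1]) / dt)
    return result

def duplicate_timestamps(times):
    """Indices i whose time stamp equals that of record i - 1."""
    return [i for i in range(1, len(times)) if times[i] == times[i - 1]]

def out_of_range(values, low, high):
    """Indices of values outside the closed interval [low, high]."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]

# Hypothetical records, invented for this example only.
times = [
    datetime(2011, 5, 1, 10, 0),
    datetime(2011, 5, 1, 10, 10),
    datetime(2011, 5, 1, 10, 10),  # duplicated time stamp
    datetime(2011, 5, 1, 10, 20),
]
turbidity = [4.0, 10.0, 10.0, -1.0]        # NTU; the last value is negative
wind_direction = [90.0, 180.0, 365.0, 0.0]  # degrees; the third is above 360

grads = gradients(times, turbidity)
dups = duplicate_timestamps(times)
bad_turbidity = out_of_range(turbidity, 0.0, 125.0)
bad_wind = out_of_range(wind_direction, 0.0, 360.0)
```

In line with the argument above, the sketch only reports suspicious records by index; it does not remove them from the data set.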

Correlation analysis and attribute selection

The correlation between two attributes with continuous values is a measure of the linear relationship between them. Its value is always in the range -1 to 1 [1]. A correlation of 1 means that two attributes have a perfect positive linear relationship, that is, one can be written as a linear function of the other, y = kx + b with k > 0, where x and y are the two attributes (each represented, for example, as a vector of values). The correlation matrices of 6 data sets are shown below: the training sets of the first and second types of data sets (the test sets have the same attributes and therefore the same data trends).

Figure 3.3: Correlation matrices for all training sets

The Gradient attribute was removed from the calculations since it was generating many divisions by 0. As can be seen from the matrices of the training sets of type 1, the correlation coefficients are very close to 0, indicating that there is no linear correlation between the attributes. This means that the data

set does not suffer from redundancy and all the attributes provided are necessary for model construction. It also implies that none of the attributes can be constructed as a linear function of another. A principal component analysis could be used to build new attributes as linear combinations of all the existing ones, but this is outside the scope of this thesis. On the other hand, in the training sets of type 2 the 4 attributes that contain the historical turbidity values are highly correlated with the turbidity attribute, with coefficients such as 0.9, 0.81 and 0.83. If an attribute selection is performed on the different data sets (using CfsSubsetEval in Weka), the results are:

Data Set          Attributes
MS01-1, MS01-2    DateTime, Depth
MS02-1            DateTime, Turbidity
MS02-2            DateTime, Turbidity, afterntu2
MS03-1, MS03-2    DateTime, Depth, Gradient

Table 3.5: Attribute selection for the training data sets

Attribute selection keeps a subset of attributes that carries most of the information in the data set and leaves out the rest. The output above is thus a subset of all the attributes of the data set, chosen through a correlation analysis so that the selected attributes (features) are not dependent on each other and do not carry redundant data. It should be noted, however, that this is done mostly to reduce the processing time of model construction: the more attributes there are, the more time is required to build classification models. This approach is therefore convenient, since it reduces the running time of the algorithms while keeping most of the information from the original data set. On the other hand, some necessary information may be lost by taking out the other attributes.
Also, even if the attributes other than those presented in Table 3.5 are omitted and the models are constructed without them, no significant change in performance is expected. Attribute selection nevertheless gives some indication of which attributes are central to the model construction phase.

Outliers

Outliers are data instances that are completely different from the rest of the data set. As seen in the previous section, the statistics on the continuous attributes show that a number of data instances have values that fall in a specific range around some value. Outliers are data instances whose values do not belong to any of these groups of values. Thus they

will impact the performance of the classifiers and confuse them during the classification process. Finding outliers is not trivial. Moreover, in the context of this thesis the data being processed should be used as a whole, since an instance that looks like an outlier from the data mining point of view may still be valid given the physical meaning of the value it represents. There are many techniques for detecting outliers, as explained in [1], but they are not applied in this thesis.
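To make the correlation analysis of this chapter concrete, the following sketch computes a Pearson correlation matrix in plain Python. It is not the tool used to produce Figure 3.3. The attribute names are borrowed from Table 3.5, but the values are invented for the example, and the helper names `pearson` and `correlation_matrix` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists.

    Assumes both lists have non-zero variance; a constant attribute would
    lead to a division by zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation of attributes a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Invented values: afterntu2 is exactly half of Turbidity (perfect positive
# linear relationship), while Depth is only weakly related to Turbidity.
columns = {
    "Turbidity": [2.0, 4.0, 6.0, 8.0, 10.0],
    "afterntu2": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Depth": [3.0, 1.0, 4.0, 1.0, 5.0],
}
matrix = correlation_matrix(columns)
```

With these values the coefficient between Turbidity and afterntu2 is 1 (up to rounding), the kind of redundancy that attribute selection removes, while the coefficient between Turbidity and Depth is about 0.35.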

Chapter 4 Model Creation

The next topic to explore according to CRISP-DM is model building. This chapter describes the process of model construction that was followed for the tests performed during this thesis and explains the choice of methods used to obtain good models. The goal is to get good and reliable results from these models. A brief introduction to the nature of the classifiers used is also presented. Finally, some alternative classifiers are mentioned.

4.1 Classification

The data mining classification task is a two-step process. During the first step, model creation, a model or classifier is constructed to predict the class label of each instance in the input data set. In the second step, performing classification, this model is applied to the same task on previously unknown (new) data. Classification is the task of assigning each data instance in a data set to one of a set of predefined categories or labels. The models are constructed after a classifier has found patterns in a data set in which every data instance is already labeled. This is called the training set. As shown in [8], the concept is the same as a mathematical function $Y = f(X)$, where $X = (x_1, x_2, \ldots, x_n)$ is a set of attributes and $Y = (y_1, y_2, \ldots, y_n)$ is a vector of labels (in this case True and False) for each instance in the training set.

Figure 4.1: Data mining classification concept
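The two-step process described above can be illustrated with a deliberately tiny classifier. The sketch below is an assumption-laden stand-in rather than one of the classifiers used in this thesis: it learns a single threshold on one attribute (a "decision stump"), and the names `train_stump` and `classify` as well as the training values are hypothetical.

```python
def train_stump(X, y, attribute):
    """Step 1 (model creation): find the threshold on one attribute that
    misclassifies the fewest labeled training instances."""
    best = None  # (errors, threshold, label_at_or_below)
    for threshold in sorted({row[attribute] for row in X}):
        for label_at_or_below in (True, False):
            errors = sum(
                (label_at_or_below if row[attribute] <= threshold
                 else not label_at_or_below) != label
                for row, label in zip(X, y)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, label_at_or_below)
    return {
        "attribute": attribute,
        "threshold": best[1],
        "label_at_or_below": best[2],
    }

def classify(model, row):
    """Step 2 (performing classification): apply the model to a new instance."""
    at_or_below = row[model["attribute"]] <= model["threshold"]
    label = model["label_at_or_below"]
    return label if at_or_below else not label

# Hypothetical labeled training set: X holds the attributes, y the labels.
X = [{"Turbidity": 3.0}, {"Turbidity": 5.0}, {"Turbidity": 40.0}, {"Turbidity": 60.0}]
y = [True, True, False, False]

model = train_stump(X, y, "Turbidity")
new_label = classify(model, {"Turbidity": 4.0})
```

Here `train_stump` plays the role of f being learned from (X, Y), and `classify` applies the learned f to an instance that was not part of the training set.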

Model creation process

The process of constructing models includes several steps. First, the training set must be ready for use (after all the necessary analysis described previously), and a classifier must be selected for building the model. A testing technique must also be chosen to show how well the model classifies the data in the training set. After the model has been built, it can be used for future classification tasks. This is explained in detail in the next chapter. The AutoQA Monitor makes it possible to select multiple classifiers to run on the training set. It also provides some of the most reliable model testing techniques used during model construction. Chapter 6 presents these functionalities in detail. The classifiers can have different internal structures, which means that there are several ways to extract patterns from the data. Decision trees and rules are two types of structures that are explored further in this chapter.

Figure 4.2: Data mining model creation process

4.2 Testing techniques

The way to judge whether a data mining model will reach the desired success in classification tasks is to evaluate its performance. There are different ways to do this, but the protocol followed when building the models should be one that provides trustworthy results. The goal during model construction is to create a model that correctly classifies as many data instances as possible when applied to test sets that contain previously unseen instances. Another concern is that the accuracy and error rate measured during the model training phase are generally too optimistic. This is because the same data is involved in training as well as in testing, so the classifier can easily classify correctly an instance it used to build the model. To get a more realistic result, the model should be tested on a test set that contains unknown data.
These results are a better measure of the performance of each classifier. This is explained in detail in the next chapter in the context of model evaluation and selection.
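The optimism of training-set accuracy can be shown with a small, artificial example. The sketch below uses a 1-nearest-neighbour classifier purely as a stand-in (it is not one of the classifiers discussed in this thesis), and all values and labels are invented. Because such a model memorizes its training instances, it classifies all of them correctly, including the two deliberately mislabeled ones, yet performs much worse on unseen data.

```python
def nearest_neighbour(train_X, train_y):
    """Return a model that predicts the label of the closest training value."""
    def predict(x):
        distances = [abs(x - t) for t in train_X]
        return train_y[distances.index(min(distances))]
    return predict

def accuracy(model, X, y):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(model(x) == label for x, label in zip(X, y)) / len(y)

# Invented data. The underlying rule is "True when the value is below 3.5";
# the training labels of 3.0 and 4.0 are deliberately wrong (noise).
train_X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [True, True, False, True, False, False]

# Previously unseen instances, labeled with the underlying rule.
test_X = [1.4, 2.6, 3.4, 4.4, 5.6]
test_y = [True, True, True, False, False]

model = nearest_neighbour(train_X, train_y)
train_accuracy = accuracy(model, train_X, train_y)  # every training instance is its own neighbour
test_accuracy = accuracy(model, test_X, test_y)
```

On this toy data the accuracy on the training set is 1.0 while the accuracy on the unseen instances is 0.4, which is why an estimate on separate test data is the more realistic one.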

This section, on the other hand, explains some of the best protocols used during the model construction phase and concludes which one is best to use, as supported by tests done using the AutoQA Monitor. The three techniques available are:

Hold out
In this approach the training set is partitioned into two disjoint sets, usually one third for testing and the remaining two thirds for training. The model is then constructed using the portion set aside for training and is tested on the remaining portion. The accuracy of the model is thus estimated on the test portion, which contains instances previously unseen by the model. This method has several drawbacks, however. First, fewer data instances are available for training, so the generated model is not as good as it might have been if it had been built on all the data. Second, the partition of the original training set may place most of the data instances with one of the class labels in only one of the two sets. That class label is then overrepresented in one set and underrepresented in the other, and the results may not be trustworthy. In the implementation of the hold out method in the AutoQA Monitor, a further feature was added to the way the original data set is partitioned: after separation, the class labels in the training and test sets have the same ratio as in the data set before separation. This is called stratification. It ensures that both sets have the same distribution of class representatives as the original data set.
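A stratified hold out split of the kind described above can be sketched as follows. This is not the AutoQA Monitor implementation: the function name `stratified_holdout`, the shuffling within each class and the fixed seed are assumptions made for the example.

```python
import random

def stratified_holdout(labels, train_fraction=2 / 3, seed=0):
    """Split instance indices into (train, test) while keeping the class
    ratio of the original set in both parts.

    Assumption: instances are shuffled within each class before the cut.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Hypothetical class labels: 9 True and 3 False instances (ratio 3:1).
labels = [True] * 9 + [False] * 3
train_idx, test_idx = stratified_holdout(labels)

train_true = sum(labels[i] for i in train_idx)
test_true = sum(labels[i] for i in test_idx)
```

With these labels the training part receives 6 True and 2 False instances and the test part 3 True and 1 False, so both keep the original 3:1 ratio.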
Figure 4.3: Hold out accuracy assessment method

Randomization
Since the data sets being trained on are time series data, there must be a way to make sure that the training and test sets include data from different years, months, days, and hours. Therefore another test method, called randomization, is implemented in the AutoQA Monitor. A randomization algorithm is first applied to the

training set, which is then divided by percentage into a training and a test set. Usually (as in the hold out method) 66% of the data instances go to training and the rest to testing. This technique suffers from the same issues as the hold out method. No stratification is done here, so as not to interfere with the randomization done according to dates. The data is partitioned directly after randomization.

Figure 4.4: A view of one of the data sets' time stamp attribute

Cross-validation
This is probably the simplest and most widely used method for estimating prediction error [6]. The technique is often used when the number of available data instances is not large enough, but it is also used to build a model that makes good use of all the instances in the training set. The training set T is partitioned into k different parts (folds), and each of them is used exactly once for testing and k - 1 times for training. So $T = (T_1, T_2, \ldots, T_k)$, where k is the number of folds into which the training set is partitioned and N is the number of instances in it. Each fold contains a fraction 1/k of the instances (about N/k) and is stratified, so it contains the same ratio of class values as the original set. In the first run $(T_2, T_3, \ldots, T_k)$ are used as the training set and $T_1$ as the test set. This process is repeated for every fold. The total error is found by summing the errors from all runs, or the accuracy is found as the average accuracy of all runs. The choice of k is a topic for discussion, but usually 10-fold cross-validation is used, as well as 5-fold [6]. A special case of this technique is leave-one-out cross-validation, where k = N. In this case each test set contains only one instance and the rest form the training set. This is beneficial in the sense that almost the entire data set is used in each model construction, which yields a better model.
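The stratified k-fold procedure described above can be sketched as follows. This is a minimal illustration under assumptions, not the AutoQA Monitor code: the function names are hypothetical, a trivial majority-class model stands in for a real classifier, and k is assumed to be no larger than the number of instances.

```python
def stratified_folds(labels, k):
    """Deal the indices of each class round-robin into k folds, so every
    fold keeps roughly the class ratio of the whole set."""
    folds = [[] for _ in range(k)]
    position = 0
    for value in sorted(set(labels)):
        for index, label in enumerate(labels):
            if label == value:
                folds[position % k].append(index)
                position += 1
    return folds

def train_majority(X, y):
    """Stand-in classifier (an assumption): always predicts the majority class."""
    majority = max(set(y), key=y.count)
    return lambda x: majority

def accuracy(model, X, y):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(model(x) == label for x, label in zip(X, y)) / len(y)

def cross_validate(X, y, k, train):
    """Average accuracy over k runs; each fold is the test set exactly once."""
    scores = []
    for test_fold in stratified_folds(y, k):
        held_out = set(test_fold)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(
            accuracy(model, [X[i] for i in test_fold], [y[i] for i in test_fold])
        )
    return sum(scores) / k

# Hypothetical data: 6 True and 4 False instances.
X = list(range(10))
y = [True] * 6 + [False] * 4
folds = stratified_folds(y, 2)
mean_accuracy = cross_validate(X, y, 2, train_majority)
```

Calling `cross_validate` with k equal to the number of instances gives the leave-one-out special case, at the cost of building one model per instance.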
However, leave-one-out cross-validation is computationally very expensive and is not practical for relatively big data sets. An implementation

of leave-one-out cross-validation is also available in the AutoQA Monitor. A schematic view of how the data is partitioned and how the stratification is performed is shown in Figure 4.5.

Figure 4.5: Partition of the training set into k folds in the cross-validation method for a binary class problem

4.3 Classifiers

Choosing the right machine learning algorithms is one of the important decisions that need to be taken when performing a data mining process. There are several alternatives that offer different ways to solve classification problems. They differ in the particular manner in which each of them finds patterns in the data. This section presents some of the algorithms that were included in the AutoQA Monitor, with a brief description of how they build patterns from the data they are run on, as well as some concrete examples of the patterns found in some of the data sets studied in this thesis. At the end, some alternative classification algorithms, different from the ones used for this thesis, are mentioned. All these algorithms are called classifiers, since their task is classification only.

Trees

Classification using tree structures is done by so-called decision tree induction. In such a tree the internal nodes are attributes and each outgoing edge is the outcome of a test on the values of that attribute. The leaf nodes contain class values. There is, of course, a root node that also

contains an attribute. Decision trees are convenient to use in the sense that they do not require domain knowledge about the data [8]. Moreover, such a structure is quite intuitive and easy to follow when applying new data to the tree. There are many machine learning algorithms (such as ID3) that use decision tree structures, but in this thesis only J48 and RandomForest are used. An example of a decision tree constructed using J48 is shown in Figure 4.6.

J48
This is the Java implementation of C4.5 in Weka [2]. The algorithm can classify data whose attributes have continuous values. It does so by choosing a threshold and splitting the set of data into two parts, one containing the values less than or equal to the threshold and the other the values above it. This operation is performed recursively until the whole tree, with pure leaves, is constructed. The attribute to be tested at each node is decided after calculating the information gain of the candidate attributes (C4.5 normalizes this measure into the gain ratio), and the attribute with the highest value is chosen. The algorithm also handles missing values: instances with missing values are distributed proportionally according to the rest of the data. It also offers the option of pruning and validating the model, in order to reduce the size of the tree by removing branches that are redundant or unnecessary. This makes the resulting models smaller and faster to apply, but it can also have a negative effect on their performance, since some of the tests in the tree are lost.

Figure 4.6: A classification model of tree structure (J48)

It can be seen that all the leaves hold class values (True or False) and that every continuous attribute is split in two parts, being either less than or equal to, or higher than, some number. This is a pruned

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

More information

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

More information

Evaluation & Validation: Credibility: Evaluating what has been learned

Evaluation & Validation: Credibility: Evaluating what has been learned Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS Kalpesh Adhatrao, Aditya Gaykar, Amiraj Dhawan, Rohit Jha and Vipul Honrao ABSTRACT Department of Computer Engineering, Fr.

More information

Data Mining Applications in Higher Education

Data Mining Applications in Higher Education Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d. EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set

W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer

More information

Pentaho Data Mining Last Modified on January 22, 2007

Pentaho Data Mining Last Modified on January 22, 2007 Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

More information

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

More information

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next

More information

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection

More information
