Data Mining and Automatic Quality Assurance of Survey Data


Bachelor Project, Autumn 2011
Ledian Selimaj (081088)

Abstract

Data mining is an emerging field in computer science that sits between statistics, machine learning and pattern recognition. Classification is one of its main tasks: the goal is to assign previously unknown data to predefined categories, based on knowledge acquired beforehand from data of the same type. Programmatic quality checking of field survey data is difficult because identifying and differentiating data trends normally requires manual supervision. This paper explores the possibility of running various classifiers on field survey data and of using different techniques to build classification models that mimic manual supervision, so that the quality of field survey data can be checked programmatically. The choice of the right classifiers and model testing techniques is of crucial importance for the whole process, since the main purpose is to ensure high reliability of the predictive classification models that will be used to predict the category of new, previously unknown data. Emphasis is placed on selecting the best classifiers, testing them in several ways, and combining them for a classification task in order to obtain more accurate results than a single algorithm, which might suffer from structural problems, would provide.

I accept that the report is available at the library of the department.

Student: Ledian Selimaj
Supervisor: Chengzi Chew
Company: DHI
Coordinator: Hans Christian Pedersen
Ext. Examiner: Jens Damgaard Andersen


This is to verify that:
- The entire material in this thesis report contains only my original work.
- All acknowledgments for the referenced material have been explicitly made.

Ledian Selimaj
January 2012


To my beloved Mother...


Contents

1 Problem Formulation
2 Introduction
   Business understanding
   The problem
   Why data mining?
   Data mining as a process
   Thesis Objectives
   Thesis Outline
3 Examining and Understanding the Data
   Implications of data sets
   Attributes
   Statistics
   Numerical attributes
   Class attribute
   Analysis of the data
   Data quality
   Correlation analysis and attribute selection
   Outliers
4 Model Creation
   Classification
   Model creation process
   Testing techniques
   Classifiers
   Trees
   Rules
   Meta
   Alternative approaches
5 Model Assessment and Selection
   Model assessment process
   Model selection
   Measures of performance
   Tests and experiments
   Classifying unlabeled data using selected model
6 Customized AutoQA tool
   Tool Development
   Integrating Weka library
   Design issues
   Functionalities
7 Conclusions
   Conclusions
Bibliography
A Test and experiments full results
   A.1 MS
   A.2 MS
   A.3 MS
B AutoQA Monitor Requirement Specifications
   B.1 Requirements
   B.2 Use cases
List of Figures
List of Tables

Acknowledgements

Not everything that counts can be counted, and not everything that can be counted counts. (A. Einstein)

This bachelor thesis was done at DHI. All the material used for the study was provided by the DHI Data Center.

Special thanks to:
- Chengzi Chew for his supervision and technical support
- Hans C. Pedersen for his supervision and encouragement


Chapter 1 Problem Formulation

This thesis explores the possibility of using data mining learning algorithms and techniques to perform quality assurance of field survey data by labeling the data as correct or incorrect. The data is gathered from numerous sensors on stations located at particular sites in the Baltic Sea. Incorrect data arises from errors that may occur in the sensors that gather the data, as well as from external factors during data transmission from the stations to the data center. The central task is to draw conclusions about which data mining techniques and algorithms are best suited to the quality assurance of the data. To achieve this, many tests have to be performed and their results analyzed. Implementing an application that provides the data mining tools needed to perform the tests is also a significant part of this thesis. The main issues to be solved for a successful approach to the problem are:

Constructing data sets
Special attention is dedicated to constructing proper data sets as inputs for the classifiers, ensuring that they contain enough information for the classifiers to find the patterns underlying the data. This requires a prior analysis of the physical meaning of the data. Classifying data as wrong or right is not trivial, and a number of parameters should be included in the data sets to make the classification easier and more accurate. The decision on which data to include depends on the correlation between the various physical parameters, and it has to be discussed in the light of the performance of the classifiers that run on these data.

Using the best data mining practices
The data mining process of building and testing a classification model follows a specific protocol. There are several ways of carrying it out, so a number of tests should be done to pick the most reliable and safe technique for both creating and testing a classifier. The reliability of the classifier model's performance during testing is very important, since it will be relied upon later when classifying unlabeled data, to ensure that the result of the classification is as accurate as possible.

Selecting the best classifier
The performance of a classifier on a given problem is judged and compared with that of others with the aim of picking the best performing classifier. This becomes complicated when the results of many classifiers are close to each other. Several measurements of the performance of each classifier should therefore be considered before deciding on the best one. Measurements that show how well each particular class is classified receive special attention. There are also techniques used to combine several classifiers into a single classification. How much such a combination improves on the classification done by a single classifier needs to be verified. Some ensemble methods have shown promising results in data mining classification tasks, so the degree of advantage that ensemble classifiers might have over the primitive ones has to be verified through tests.

Chapter 2 Introduction

This chapter presents the problem of this thesis from the business point of view. It is a real-life problem for which data mining might offer a solution. A definition of data mining is given, followed by a brief explanation of why and how data mining can be used to provide a reasonable solution to the problem introduced. Afterwards a data mining problem-solving methodology is briefly explained, followed by the objectives of the thesis and an outline of the remaining chapters.

2.1 Business understanding

A large amount of hydrographical data is transmitted from three large monitoring stations placed at different water depths in the Baltic Sea. These stations transmit data directly to a data handling center located in Denmark. It is the data center's task to consolidate and continuously monitor the transmitted data and to assure its quality and reliability. Quality assurance is difficult to do automatically, since judging whether the data coming from the sensors is correct involves evaluating and analyzing many parameters, their historical values, the time of year when the data sample was taken, and many other factors. Data mining classifiers could be useful in this situation. They run on the data and construct classification models that can accomplish the required classification, that is, label each instance of the data set as either correct or incorrect. Data mining classification models might therefore provide a convenient way to automate the quality assurance process of the field survey data.

The problem

The data comes from the sensors placed in the sea and is transmitted to the data handling center. The raw data is stored in a database after some reallocation and calibration processes. The automatic quality assurance (autoqa) then performs some checks on the quality of the received data. After this post-process has finished successfully, the data is updated in the database. The autoqa goes through the data, performs some checks and assigns a flag to each instance depending on the result of the checks. For example, if the data is bad it is marked with 4, otherwise it is marked with 5 [17]. The value of the flag depends on calculations and a number of checks performed on the values of each attribute of the data instances. The tests needed to check the correctness of a data instance are:

- check whether the value exceeds a predetermined threshold, its maximum allowed value
- check whether the value falls below a predetermined threshold, its minimum allowed value
- check whether the value increases within a predetermined gradient in one measurement time step
- check whether the value decreases within a predetermined gradient in one measurement time step
- check whether the value is repeated for a predetermined number of time steps

The automatic quality assurance process is followed by a manual quality assurance of the data. Because the autoqa filter lacks intelligence and its threshold values are hard-coded, it misclassifies some good and bad data instances. A reviewer therefore needs to go through the data manually and change the automatic quality assurance flags based on many standards and relations among the data. This process is very time consuming as well as expensive: a person needs to check every instance in the data sets and decide whether the automatic quality assurance classified it correctly. This raised the idea of automating the process, which would make it both faster and cheaper. Learning algorithms that perform checks similar to those the reviewer does manually can be used in this situation. Since the task at hand is a classification task, data mining classifiers offer a good chance to automate the autoqa filtering.

Why data mining?

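As a point of comparison for the discussion that follows, the five hard-coded autoqa checks listed under "The problem" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Limits` and `autoqa_flags` are hypothetical, and every numeric limit below is a placeholder rather than a threshold used at the data center. Only the flag values 4 (bad) and 5 (otherwise) come from the text.

```python
from dataclasses import dataclass

BAD_FLAG = 4   # flag values as described in the text
GOOD_FLAG = 5


@dataclass(frozen=True)
class Limits:
    # Hypothetical placeholder thresholds, not the data center's values.
    max_value: float = 125.0
    min_value: float = 0.0
    max_rise: float = 20.0    # largest allowed increase per time step
    max_fall: float = 20.0    # largest allowed decrease per time step
    max_repeats: int = 5      # identical consecutive values tolerated


def autoqa_flags(values, limits=Limits()):
    """Flag each value with BAD_FLAG or GOOD_FLAG using fixed thresholds."""
    flags = []
    run_length = 0
    for i, value in enumerate(values):
        # Checks 1 and 2: maximum and minimum thresholds.
        bad = value > limits.max_value or value < limits.min_value
        # Track how many times in a row the same value has been seen.
        if i > 0 and value == values[i - 1]:
            run_length += 1
        else:
            run_length = 1
        # Checks 3 and 4: change relative to the previous time step.
        if i > 0:
            step = value - values[i - 1]
            bad = bad or step > limits.max_rise or -step > limits.max_fall
        # Check 5: value repeated for too many time steps.
        bad = bad or run_length > limits.max_repeats
        flags.append(BAD_FLAG if bad else GOOD_FLAG)
    return flags
```

In this sketch a plausible reading that directly follows a spike is also flagged as bad, because the drop back to a normal level exceeds the fixed gradient limit. Behaviour of this kind is one way in which fixed thresholds can misclassify instances and leave work for the manual reviewer.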
There are many different definitions of data mining, but at the core of all of them is that data mining is the process of automatically discovering useful information in large data repositories [1]. This useful information (or knowledge) is found as patterns in the data, expressed in different structures. Structure here means that the patterns found are represented in an explicit form, for example a tree, a set of rules or a decision table. The power of these patterns is that the meaningful information found in some data can be applied to previously unknown data and used for classification predictions, which leads to advantages, usually economic ones [2]. The benefits of having a classification done by a computer program are clearly greater than those of, for instance, manual classification, which is quite impractical for large data sets.

Classification is also one of the main tasks of data mining. Predictive classification models can be seen as functions that have the class value as their output and some explanatory variables as inputs. Practice has shown that applying the learned knowledge structures to new data can generate satisfactory predictions [2]. Quality assurance of field survey data is a typical classification task. Data that is meaningful and conveys correct information can be labeled as true, and as false otherwise. Patterns can therefore be retrieved from data sets that have already been manually classified as true and false, and can afterwards be used to classify new data with empty class values. The level of accuracy depends on the classifier as well as on the information contained in the data sets it runs on. Data mining offers various techniques for obtaining increasingly accurate predictions. The models should be constructed and tested in a manner that ensures the reliability of the predictive abilities of a particular classifier. Once this reliability is obtained, the chances of success in future predictions on new data are high.

2.2 Data mining as a process

Data mining consists of a set of tools and algorithms that can be used to solve different business problems. What really matters, however, is the analysis and the knowledge gained by using it and all its facilities and tools in the right way and in a structured manner. Data mining should therefore be viewed as a process. The standard process used here is the CRISP-DM framework: the Cross-Industry Standard Process for Data Mining. CRISP-DM demands that data mining be seen as an entire process [3]. According to the CRISP-DM methodology, the lifecycle of a data mining project consists of six phases:

Business understanding
This phase describes the problem from the business point of view and tries to understand the objectives and requirements of the problem. Afterwards the problem is converted into a data mining problem.

Data understanding
It is crucial for a data mining project to understand the nature of the data that is going to be processed. The attributes should be analyzed and all the necessary information should be included in the data set. It is also advisable to go through the data and check its quality, for example whether there are missing values, redundant instances and so on.

Data preparation
After all relevant analysis and cleaning of the data have been done in the preceding phase, the final data set needs to be constructed. This data set will be the input for all the models that will be constructed and used.

Model building
There are many different techniques for building models in data mining. Choosing the best one is always a priority, so several parameters and techniques should be tried.

Model evaluation
After several models have been constructed in various ways, a proper analysis should be done to find the best performing model and the most successful techniques used for constructing it. At the end of this analysis only one model is chosen for the task.

Model deployment
The chosen model is applied to previously unknown data. Even if the purpose of the chosen model is to increase the knowledge gained from the data, it has to be applied properly to that data, otherwise the result may not be satisfactory.

Figure 2.1: CRISP-Data Mining project methodology [16]

Figure 2.1 depicts the entire process as explained above. As the figure shows, going back through the steps is sometimes necessary. For example, in the Data understanding, Modeling and Model evaluation steps it may be appropriate to go one step backward in order to make the necessary changes if something went wrong or unexpected results were generated. This is important to ensure that the whole process is compact and coherent and gives the best possible results.

2.3 Thesis Objectives

The main subject of the thesis is to explore the opportunity of applying data mining classification models to the quality assurance of field survey data. It examines which procedures should be followed in order to obtain satisfactory classification results and reliability for future usage on new data. It also concludes which classifiers should be used to achieve the same goal. Other ways to combine and utilize these classifiers are tested for their ability to increase the overall performance. Special attention is dedicated to constructing the data sets so that no useful information is missing that would affect the performance of the classification models. In this context some effort is put into seeing how changes to the data sets, such as including additional auxiliary information, affect the performance of the models. This thesis introduces a customized application that offers a set of tools and classifiers that make it possible to experiment and draw conclusions on all the topics mentioned above. The classifiers and the data mining techniques are retrieved from the Weka system [4], then customized and utilized. All of these are used with the goal of making good use of data mining in solving the problem at hand. The application is called AutoQA Monitor. Its use cases and requirement specifications are shown in Appendix B, and a summary of its functionalities is given in Chapter 6.

2.4 Thesis Outline

Following the CRISP-DM data mining project methodology introduced in Section 2.2, the rest of this thesis report is structured as follows:

Chapter 3
This chapter explains in detail all the steps followed to build the data sets used in this thesis, as well as their physical meaning. This is done by looking closely at the attributes of the data sets. Focus is also put on the class attribute, that is, the attribute used for the classification of each data instance. The chapter also analyzes the data from a data mining (and statistics) perspective, in order to understand the data being processed and to explore the possibility of applying some preprocessing techniques before the data is used as input for different classifiers.

Chapter 4
This chapter explains in depth the process of constructing data mining classification models. First a general view of the whole process is given, and then each step of the process is explained more specifically. Several model construction and testing techniques are presented and contrasted with each other, based on the tests done to find the best ones to use. The chapter also explains the classifiers chosen for the task and presents some other existing classifiers that are not included or used in the AutoQA Monitor.

Chapter 5
This chapter explains which factors were considered decisive in picking the best classifiers, contrasts them with the others and discusses the decision. Several statistical metrics are used to compare different classifier models. Another discussion concerns whether to use one classification model or to combine several with the goal of increasing the classification accuracy. A visualization tool included in AutoQA Monitor for comparing the class predictions of a particular classifier is shown. It can be used to see how well two different class values have been predicted by a classifier. Finally, the prediction of data with an unlabeled class attribute is discussed briefly and the tools available for it in AutoQA Monitor are shown.

Chapter 6
This chapter explains the development of the customized AutoQA Monitor data mining tool and how the Weka system libraries were integrated into it. It explains its functionalities in accordance with the requirement specifications and use cases built for it. A verification of the correctness of the software is also presented.

Chapter 7
This chapter presents the overall conclusions of the problem solution, stating each conclusion after the necessary tests and verifications were performed. It also tries to map each of the problems presented in Chapter 1 to the corresponding conclusions derived after all the defined tests were completed. It then mentions some possible future improvements towards the goal of this thesis and to the AutoQA Monitor.

Chapter 3 Examining and Understanding the Data

As already stated in Section 2.2, one of the steps of CRISP-DM is understanding the data in which patterns are to be discovered. A data set should be carefully analyzed before being used as input to a machine learning algorithm (in this thesis those algorithms are the classifiers). This chapter describes in turn the physical meaning, the construction and the analysis of the data sets. To gain a good understanding of the data, various statistical measures are calculated both on the continuous (numerical) data and on the nominal values of the attribute that the classifiers will predict. This chapter also shows how the data sets were generated using a computer program. Two different types of data sets are introduced and analyzed.

3.1 Implications of data sets

It is very important to explain the physical meaning of the data being used. Machine learning algorithms find patterns in data that contains some internal information. This section explains the attributes of the data sets and tries to define what they mean in a context other than data mining. As already mentioned, two types of data sets are used, although they contain the same data. The second type has some modifications compared to the first one, which are explained later in this section. The two data set types consist of the following attributes respectively:

Table 3.1: Data set type 1 attributes
DateTime | Turbidity | WindSpeed | WindDirection | Depth | Gradient | ManualQA

Table 3.2: Data set type 2 attributes
DateTime | Turbidity | WindSpeed | WindDirection | Depth | Gradient | NTU(t-1) | NTU(t-2) | NTU(t+1) | NTU(t+2) | ManualQA

The data sets that will be used for predictions on unlabeled (new) data have the same attributes, but with PredictedQA instead of ManualQA. Before all the attributes are explained, some general concepts are defined. Each data set consists of data for a particular station. Overall, data from three stations is used for the tests in this thesis. These stations differ in that they are placed at different geographical locations. A station is a buoy carrying a number of sensors that collect various data, for example salinity, turbidity and temperature. The data is then transmitted to the data center via a network. The focus of this thesis is to provide convenient and accurate ways to ensure the quality of the turbidity data received from the stations.

A program was used to build the data sets. Since the data needed is in different files and the time stamps of the values also differ (even though they are close to each other), a merge is necessary to combine everything in one file. As already mentioned there are three stations, so three main data sets were constructed. For the sake of testing how the classifiers react to changes in the data sets, another three data sets were created. The changes were motivated by the aim of improving the accuracy of the classifiers, as is shown again in the next chapter. Each of these six data sets is then split into two parts, one holding one third of the instances and the other two thirds. The first part is used for testing during model testing and the second for training during model construction. The total number of data sets constructed by this program is therefore 12.

Attributes

DateTime is a timestamp of when each data instance is received. Instances are usually received at predefined time intervals that range from every 10 minutes to every hour.
Continuous monitoring is the reason for receiving data at the defined intervals. This is done to avoid losing necessary data in case the sensors encounter an error of some type.

Turbidity is the amount of particulate matter that is suspended in water. It is measured by shining a light through the water and is reported in nephelometric turbidity units (NTU) [5]. Turbidity makes the water look cloudy or opaque, depending on the weather conditions. During periods of normal wind speed and normal water flow, turbidity is low, usually less than 10 NTU. During a rainstorm or a strong wind, however, particles from the surrounding land (rocks or just land) are washed into the water, turning it a muddy brown color that indicates higher turbidity values. A very strong wind also creates large waves in the sea, so the water flow is high and a large volume of water is moving. This causes particles from the sea bed to move upward and mix with other existing particles, leading to higher turbidity. Negative values of turbidity do not exist, and according to the calibration of the sensors its value should not exceed 125. The above description of turbidity explains why the values of two other attributes, wind speed and wind direction, are included. When the wind speed is high, a high value of NTU is expected, and vice versa.

Wind Speed is the wind speed in m/s. The reason for including it in this data set is its close relation to the turbidity values, as explained above.

Wind Direction is used to account for objects lying in different directions from the position of the buoy. If, for example, the buoy is close to the shore or there are rocks in the surroundings, they might cause an increase in the value of NTU. If there is nothing in the surroundings, then something else might have caused the increase in the NTU values, or an error has occurred. This value is measured in degrees and ranges over the interval from 0 to 360.

Depth is the water level at which the sensors are placed in the field. Different NTU values are measured at different levels, again depending on the weather conditions. For example, the surface might have lower values than the deeper levels when there are high waves.

Gradient is the slope between the NTU values of two adjacent data instances over time. It is included because the value of NTU is expected to decrease gradually after reaching a high value, so the gradient should have small, zero or even negative values, but not very large ones.

ManualQA is the attribute holding the classification made by a person who manually goes through the data checking for incorrect NTU values received from the stations. The labels True and False stand for correct and incorrect data instances.

In this context, the second data set type has four additional attributes holding the two preceding and the two succeeding NTU values.
This is done to provide more explicit information on how the NTU values change and to make sudden changes easier to detect.

3.2 Statistics

This section provides some analysis of the data sets. To get to know the data it is important to perform some calculations over the data sets and examine the types of the attributes, their value ranges, variance and so on. First the numerical attributes are explained and then the class attribute. Afterwards a correlation analysis is performed, as well as some data quality checks on the data sets.
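Before the statistics are presented, two of the construction steps described in Section 3.1 can be sketched in code: deriving the four neighbouring NTU attributes of data set type 2, and splitting a data set into one third for testing and two thirds for training while keeping the class distribution equal in both parts. This is a hedged illustration, not the program used for the thesis. The function names, the dropping of the first and last two instances, and the shuffling with a fixed seed are all assumptions.

```python
import random


def add_neighbour_values(ntu):
    """Build rows holding each turbidity value with its two preceding
    and two succeeding values. Instances without two neighbours on
    each side are dropped (an assumption about edge handling)."""
    rows = []
    for i in range(2, len(ntu) - 2):
        rows.append({
            "NTU(t-2)": ntu[i - 2],
            "NTU(t-1)": ntu[i - 1],
            "Turbidity": ntu[i],
            "NTU(t+1)": ntu[i + 1],
            "NTU(t+2)": ntu[i + 2],
        })
    return rows


def stratified_split(rows, label_key, test_fraction=1 / 3, seed=0):
    """Split rows into (train, test), preserving the ratio of class
    labels in both parts."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    train, test = [], []
    for group in by_label.values():
        group = group[:]          # copy so the caller's list is untouched
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Splitting per class label is what keeps the proportion of True and False instances the same in the training and test parts.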

Figure 3.1: An overview of data sets used in this thesis

Numerical attributes

As mentioned previously, there are two types of data sets that are examined and used for the data mining process. The first data set type has the following numerical attributes:

- Turbidity
- Wind Speed
- Wind Direction
- Depth
- Gradient
- Date Time: This attribute contains time stamps in the date format explained in [7]. These values are converted into doubles once uploaded to AutoQA Monitor (the same way Weka handles this type of value).

The statistical summary includes the following group of metrics for the data sets:

MaxValue finds the maximum value of the attribute.

MinValue finds the minimum value of the attribute.
Range finds the difference between the Max and Min values. It is a measure of variability and gives the maximum spread of the data.
Mean finds the mean of the values of an attribute. It is a measure of the location of the data. If the data values are distributed symmetrically, the mean is the middle value of the range of values the data contains.
Median finds the median value of an attribute, that is, the value that separates the higher half of the values from the lower half. It is also a measure of the location of the values of an attribute.
Standard deviation is a measure of the spread of the data.

Table 3.3: Summary statistics of the training data sets
DataSet | Attribute | Max | Min | Range | Mean | Median | StDev
TrainingSetMS01-1 | Turbidity, WindSpeed, WindDirection, Depth, Gradient
TrainingSetMS02-1 | Turbidity, WindSpeed, WindDirection, Depth, Gradient
TrainingSetMS03-1 | Turbidity, WindSpeed, WindDirection, Depth, Gradient
[numeric values missing in the source]

Class attribute

There is only one class attribute in all the data sets used. It is called ManualQA and it can take only two categorical (binary) values: True and False. The following table shows the frequency of each of the binary values of the ManualQA attribute in each of the data sets, to give a better understanding of the ratio of data instances classified as True and False.

Table 3.4: Frequencies of class labels for data sets of type 1
Station | DataSetName | FrequencyClassTrue | FrequencyClassFalse
MS01 | TrainingSetMS
MS01 | TestSetMS
MS02 | TrainingSetMS
MS02 | TestSetMS
MS03 | TrainingSetMS
MS03 | TestSetMS
[frequency values missing in the source]

As the table shows, instances with the class label False are fewer than those labeled True, so True is the mode in all the data sets. It might therefore be a challenge to classify the False data instances correctly. It can also be seen that the training and test sets for each station have an equal distribution of class values. The reason for this is explained in Section 4.2. The data sets of type 2 have the same distribution of class values, so they are not shown in the table. Histograms of the distribution of the True and False class labels are shown in Figure 3.2.

Figure 3.2: Histograms showing the distribution of TRUE and FALSE class values for the data sets
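The metrics of Table 3.3 and the class frequencies of Table 3.4 can be reproduced for any attribute with a short sketch like the one below. It uses only the Python standard library. The function names are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which variant was reported.

```python
import statistics
from collections import Counter


def summarize(values):
    """Return the summary metrics described in Section 3.2 for one
    numerical attribute."""
    return {
        "Max": max(values),
        "Min": min(values),
        "Range": max(values) - min(values),
        "Mean": statistics.mean(values),
        "Median": statistics.median(values),
        # Sample standard deviation; an assumption, see the lead-in.
        "StDev": statistics.stdev(values),
    }


def class_frequencies(labels):
    """Count how often each class label (for example True or False)
    occurs in a data set."""
    return Counter(labels)
```

Comparing the counts returned by `class_frequencies` for a training set and its test set is a quick way to confirm that both have the same proportion of True and False labels.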

3.3 Analysis of the data

Before performing the data mining model creation processes it is necessary to form a general idea of the data being used and to apply some preprocessing techniques to it. This section presents some inspections of the data sets that assess the quality of the data. A correlation analysis is also done to find possible correlations between the attributes. Attribute selection and outlier analysis are included in this section as well.

Data quality

Data quality has a great impact on the success rate of the data mining algorithms. If wrong data is input to the classifiers, the models generated will not be robust and will perform poorly on classification tasks. The issues considered regarding data quality are:

Missing values
Since the data sets were created using a program, this issue is already handled. The data sets created have no missing values in any of their attributes.

Duplicate values
For the same reason, there are no duplicate values in the data sets. A quick check for this is to see whether any two time stamps are identical. If so, the calculated gradient would be infinite, because the gradient is calculated using the following formula:

\[ \mathrm{gradient}_i = \frac{\mathrm{turbidity}_i - \mathrm{turbidity}_{i-1}}{\mathrm{time}_i - \mathrm{time}_{i-1}} \tag{3.1} \]

The gradient is calculated before the data instances are randomized, so it is computed between each instance and the one that precedes it in time. If any time stamp is duplicated, the denominator is 0 and the division yields infinity.

Wrong values
As explained in the previous section, the attributes in the data sets have values that fall within a range of acceptable values. For instance, turbidity has values ranging from 0 to 125 NTU. However, the statistics provided earlier show that every data set contains some negative values. Moreover, the wind direction attribute must contain values from 0 to 360.
Since every value in the data sets represents a real measurement retrieved from the stations in the field, removing such values before training and building the classification models would make the models too idealistic, and their performance on future classification tasks may suffer when they encounter this kind of data. However, a test can be done to verify how these values affect the performance of the classification models.
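The duplicate time stamp check and the range check described above can be sketched as follows. This is a minimal illustration in Python, not the program that generated the data sets: the function names `gradients` and `out_of_range` and the sample readings are hypothetical, while the 0 to 125 NTU range is the one quoted above.

```python
from datetime import datetime


def gradients(times, turbidity):
    """Apply formula (3.1) to every pair of succeeding readings.

    Returns one gradient per pair (NTU per second). A pair with identical
    time stamps would divide by zero, so it is reported as None instead.
    """
    result = []
    for i in range(1, len(times)):
        dt = (times[i] - times[i - 1]).total_seconds()
        if dt == 0:
            result.append(None)  # duplicate time stamp
        else:
            result.append((turbidity[i] - turbidity[i - 1]) / dt)
    return result


def out_of_range(values, low, high):
    """Return the indices of the values outside the closed range [low, high]."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


# Hypothetical sample: the third reading repeats the second time stamp,
# and the last two turbidity values fall outside 0 to 125 NTU.
times = [
    datetime(2011, 5, 1, 12, 0),
    datetime(2011, 5, 1, 12, 10),
    datetime(2011, 5, 1, 12, 10),
    datetime(2011, 5, 1, 12, 20),
]
turbidity = [10.0, 16.0, -2.0, 130.0]

grads = gradients(times, turbidity)
bad_turbidity = out_of_range(turbidity, 0.0, 125.0)
```

A `None` entry in `grads` marks a duplicated time stamp, and `bad_turbidity` lists the readings that a later test could include or exclude to measure their effect on the models.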

Correlation analysis and attribute selection

Correlation between two attributes with continuous values is a measure of the linear relationship between them. Its value is always in the range -1 to 1 [1]. A correlation of 1 means two attributes have a perfect positive linear relationship. That means that one can be built from the other through a linear equation of the form $y = kx + b$ with $k > 0$, where x and y are the two attributes (for example, each represented as a vector of values). The correlation matrices of 6 data sets are shown below: the training sets of the first and second types of data sets (the test sets have the same attributes and consequently the same data trends).

Figure 3.3: Correlation matrices for all training sets

The Gradient attribute was removed from the calculations since it was generating many divisions by 0. As can be noticed from the matrices of the training sets of type 1, the correlation coefficients are very close to 0, indicating that there is no linear correlation between the attributes. This means that the data

set does not suffer from redundancy and all the attributes provided are necessary for model construction. This also implies that it is not possible to construct any of the attributes as a linear function of another. A principal component analysis could be used to construct new attributes as linear combinations of all the original ones, but this is outside the scope of this thesis. On the other hand, it can also be seen that in the training sets of type 2 the 4 attributes that contain the historical turbidity values are highly correlated with the turbidity attribute, since the coefficients have values like 0.9, 0.81, 0.83 and so on. If we perform an attribute selection on the different data sets (using CfsSubsetEval in Weka) the results are:

Data Set          Attributes
MS01-1, MS01-2    DateTime, Depth
MS02-1            DateTime, Turbidity
MS02-2            DateTime, Turbidity, and afterntu2
MS03-1, MS03-2    DateTime, Depth, Gradient

Table 3.5: Attribute selection for the training data sets

The attribute selection keeps only a subset of attributes that carries most of the information in the data set and leaves out the rest. So, the output presented above is a subset of all the attributes of the data set, chosen by a correlation analysis so that the individual attributes (features) kept are not dependent on each other and do not hold redundant data. However, it should be noted that this is done mostly to reduce the processing time of model construction. The more attributes there are, the more time is required to build classification models. This approach is then very convenient, since it reduces the running time of the algorithms while keeping most of the information from the original data set. On the other hand, some necessary information may be lost by taking out the other attributes.
Also, even if the rest of the attributes (other than the ones presented in Table 3.5) are omitted and the models are constructed without them, no significant change in performance is achieved. The attribute selection does, however, give some indication of which attributes are central and most important in the model construction phase.

Outliers

Outliers are data instances that are completely different from the rest of the data set. As seen in the previous section, based on the statistics computed on the continuous attributes, a number of data instances have values that fall in a specific range around some value. Outliers are data instances whose values do not belong to any of these groups of values. Thus they

will impact the performance of the classifiers and confuse them during the classification process. Finding outliers is not trivial. Also, in the context of this thesis, the data being processed should be used as a whole, since there is no clear way to decide whether an instance that looks like an outlier from the data mining point of view is also one in terms of the physical quantity that the value represents. There are many techniques that can be used to detect outliers, as explained in [1], but they are not applied in this thesis.
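Returning to the correlation analysis of this chapter, the coefficient between two continuous attributes and the resulting matrix can be computed as in the sketch below. It is only an illustration of the standard Pearson coefficient in plain Python; the helper names and the sample columns are hypothetical and do not come from the data sets of this thesis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists.

    Assumes neither list is constant; a constant attribute has zero
    variance and would cause a division by zero here.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict with the coefficient for every attribute pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical attributes: "History" is an exact linear function of
# "Turbidity" (y = 2x + 1), so their coefficient is 1.
turbidity = [3.0, 5.0, 4.0, 8.0, 6.0]
columns = {
    "Turbidity": turbidity,
    "History": [2 * v + 1 for v in turbidity],
    "Depth": [1.2, 0.7, 1.9, 1.1, 1.5],
}
matrix = correlation_matrix(columns)
```

Coefficients near 0, as in the type 1 training sets, indicate no linear redundancy, while values near 1, as for the historical turbidity attributes, indicate that one attribute largely repeats another.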

Chapter 4

Model Creation

The next topic to explore according to CRISP-DM is model building. This chapter describes the process of model construction that was followed for the tests performed during this thesis and attempts to explain the choice of methods used to get good models. The goal is to get good and reliable results from these models. A brief introduction to the nature of the classifiers used is also presented. Finally, some alternative classifiers are mentioned.

4.1 Classification

The data mining classification task is a two-step process. During the first step, model creation, a model or classifier is constructed to predict the class labels of each instance in the input data set. In the second step, performing classification, this model is applied for the same task to previously unknown data (new data). Classification is the task of assigning each data instance in a data set to predefined categories or labels. The models are constructed after a classifier has found patterns in a data set in which each data instance is already labeled. This is called the training set. As shown in [8], the concept is the same as a mathematical function $Y = f(X)$, where $X = (x_1, x_2, \ldots, x_n)$ is a set of attributes and $Y = (y_1, y_2, \ldots, y_n)$ is a vector of labels (in this case True and False), one for each instance in the training set.

Figure 4.1: Data mining classification concept
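The two steps can be illustrated with a deliberately small sketch. It assumes a single continuous attribute and a model that is just one threshold, predicting True below it and False otherwise; this is a simplification chosen for the example, not one of the classifiers used in this thesis, and the function names and sample values are hypothetical.

```python
def train_threshold(values, labels):
    """Step 1, model creation: choose the threshold that misclassifies the
    fewest labeled training instances. The model predicts True for values
    below the threshold and False otherwise (an assumption of this sketch).
    """
    best_threshold, best_errors = None, len(values) + 1
    for candidate in sorted(set(values)):
        errors = sum((v < candidate) != label for v, label in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


def classify(threshold, value):
    """Step 2, performing classification: apply the model to a new value."""
    return value < threshold


# Hypothetical labeled training set with one attribute.
train_values = [5.0, 8.0, 12.0, 90.0, 110.0, 120.0]
train_labels = [True, True, True, False, False, False]

model = train_threshold(train_values, train_labels)
new_predictions = [classify(model, v) for v in (20.0, 100.0)]
```

Here `train_threshold` plays the role of f being learned from labeled instances, and `classify` applies the learned f to previously unknown data.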

Model creation process

The process of constructing models includes several steps. First the training set should be ready for use (after all the necessary analysis has been done) and then a classifier should be selected for building the model. A testing technique should also be chosen to see how well the model classifies the data in the training set. After the model has been built, it can be used for future classification tasks. This is explained in detail in the next chapter. The AutoQA Monitor makes it possible to select multiple classifiers to run on the training set. It also provides some of the most reliable model testing techniques used during model construction. Chapter 6 presents these functionalities in detail. The classifiers can have different internal structures, which means that there are several ways to extract patterns from the data. Decision trees and rules are two types of structures that will be explored further in this chapter.

Figure 4.2: Data mining model creation process

4.2 Testing techniques

The way to judge whether a data mining model will reach the desired success in classification tasks is to evaluate and examine its performance. There are different ways to achieve this, but the protocol followed when building the models should be one that provides trustworthy results. The goal during model construction is to create a model that correctly classifies as many data instances as possible when applied to the test sets, which contain previously unseen instances. Another concern is that the accuracy and error rate measured during the model training phase are generally too optimistic. This is because the same data is involved in training as well as testing, so the classifier can easily classify correctly an instance it used to build the model. To get a more realistic result the model should be tested on the test set, which contains unknown data.
These results would be a better measure of the performance of each classifier. This will be explained in detail in the next chapter in the context of model evaluation and selection.
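The optimism of accuracy measured on the training data can be made concrete with a small sketch. It uses a 1-nearest-neighbour rule, swapped in here only because it memorizes its training instances and therefore shows the effect in its purest form; it is not presented as one of the classifiers of this thesis, and all names and values are hypothetical.

```python
def nearest_label(training, x):
    """Predict the label of the training instance whose value is closest to x."""
    return min(training, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(training, data):
    """Fraction of (value, label) pairs in data that the model labels correctly."""
    hits = sum(nearest_label(training, x) == y for x, y in data)
    return hits / len(data)


# Hypothetical training set and a test set of previously unseen instances.
training = [(1.0, True), (2.0, True), (3.0, False), (8.0, False), (9.0, False)]
unseen = [(2.8, True), (7.5, False), (1.5, True), (3.2, True)]

training_accuracy = accuracy(training, training)  # every instance finds itself
test_accuracy = accuracy(training, unseen)
```

On its own training data the model is perfect, because each instance is its own nearest neighbour, while on the unseen instances only half are classified correctly. The second figure is the one that reflects how the model would behave on new data.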

On the other hand, this section explains some of the best protocols used during the model construction phase and concludes which one is the best to use, as supported by tests done using the AutoQA Monitor. The three techniques available are:

Hold out. In this approach the training set is partitioned into two disjoint sets, usually one third for testing and the other two thirds for training. The model is then constructed using the portion set aside for training and is tested on the remaining data, the test set. So, the accuracy of the model is estimated on the test set, which contains instances the model has not seen before. This method has, however, several drawbacks. First of all, when this accuracy assessment technique is used during model construction, fewer data instances are available for training, so the generated model is not as good as it might have been if built from all the data. Also, the partition of the original training set may place most of the data instances with one of the class labels in only one of the sets. That class label will then be overrepresented in one set and underrepresented in the other, and consequently the results may not be trustworthy. In the implementation of the hold out method in AutoQA Monitor another feature was added to the way the original data set is partitioned: the class labels in the training and test set after separation have the same ratio as in the data set before separation. This method is called stratification. It ensures that both sets have the same distribution of class representatives as the original data set, by including the right number of instances of each class during model construction and testing.
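A stratified hold out split of this kind can be sketched as below. This is an assumption-laden illustration, not the AutoQA Monitor implementation: the function name is hypothetical, and the choice to send the last third of each class (in original order, without shuffling) to the test set is made only to keep the example deterministic.

```python
def stratified_holdout(labels, test_fraction=1 / 3):
    """Split instance indices into training and test sets so that each
    class keeps (as closely as rounding allows) its original ratio.

    Assumption of this sketch: within each class, the last instances in
    the original order go to the test set.
    """
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        n_test = round(len(indices) * test_fraction)
        cut = len(indices) - n_test
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical data set with a 2:1 ratio of True to False labels.
labels = [True] * 12 + [False] * 6
train_idx, test_idx = stratified_holdout(labels)
train_true = sum(labels[i] for i in train_idx)
test_true = sum(labels[i] for i in test_idx)
```

With these labels the training set receives 8 True and 4 False instances and the test set 4 True and 2 False, so both keep the 2:1 ratio of the original data set.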
Figure 4.3: Hold out accuracy assessment method

Randomization. Since the data sets being used for training are time series data, there must be a way to make sure that both the training and the test set include data from different years, months, days, and hours. Therefore another test method called randomization is implemented in AutoQA Monitor. A randomization algorithm is first applied to the

training set, which is then divided by percentage into a training and a test set. Usually (as in the hold out method) 66% of the data instances go to training and the rest to testing. This technique suffers from the same issues as the hold out method. No stratification is done here, in order not to interfere with the randomization done according to dates. The data is partitioned directly after randomization.

Figure 4.4: A view of the time stamp attribute of one of the data sets

Cross-validation. Probably the simplest and most widely used method for estimating prediction error [6]. This technique is often used when the number of available data instances is not big enough. However, it is also used to build a model that makes good use of all the instances in the training set. The training set T is partitioned into k different parts, and each of them is used exactly once for testing and k - 1 times for training. So $T = (T_1, T_2, \ldots, T_k)$, where k is the number of folds into which the training set is partitioned and N is the number of instances in it. Each fold contains $1/k$ of the instances and each of them is stratified, so each contains the same ratio of class values as the original set. First $(T_2, T_3, \ldots, T_k)$ are used as the training set and $T_1$ as the test set. This process is repeated for every fold. The total error is found by summing up the errors from each run, or the accuracy is found as the average accuracy of all runs. The value of k is a topic for discussion, but usually 10-fold cross-validation is used, as well as 5-fold [6]. A special case of this technique is leave-one-out cross-validation, when $k = N$. In this case each test set contains only one instance and the rest is the training set. This is beneficial in the sense that it uses almost the entire data set in each model construction, thereby providing a better model.
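The fold construction and the averaging of accuracies can be sketched as follows. The sketch assumes k is no larger than the number of instances, deals the instances of each class round-robin into the folds as one possible way to stratify, and uses a majority-class learner only as a placeholder model; none of the names are taken from AutoQA Monitor.

```python
from collections import Counter


def stratified_folds(labels, k):
    """Deal the indices of each class round-robin into k folds, so every
    fold gets roughly the class ratio of the whole set."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def train_majority(values, labels):
    """Placeholder learner: the model is simply the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, value):
    return model


def cross_validate(values, labels, k, train_fn, predict_fn):
    """Use each fold once for testing and the other k - 1 folds for
    training; return the average accuracy over the k runs."""
    accuracies = []
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = train_fn([values[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        hits = sum(predict_fn(model, values[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


# Hypothetical data: 6 True and 4 False instances.
values = list(range(10))
labels = [True] * 6 + [False] * 4
two_fold = cross_validate(values, labels, 2, train_majority, predict_majority)
leave_one_out = cross_validate(values, labels, len(labels),
                               train_majority, predict_majority)
```

Passing the number of instances as k, as in the last call, gives leave-one-out cross-validation: every fold then holds exactly one instance and the model is rebuilt once per instance.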
However, leave-one-out is computationally very expensive and is not practical for relatively big data sets. An implementation

of leave-one-out cross-validation is also available in the AutoQA Monitor. A schematic view of how the data is partitioned and how the stratification is performed is shown in Figure 4.5.

Figure 4.5: Partition of the training set in k-folds in the cross-validation method in a binary class problem

4.3 Classifiers

Choosing the right machine learning algorithms is one of the important decisions that need to be taken when performing a data mining process. There are several alternatives that offer different ways to solve classification problems. They differ in the particular manner in which each of them finds the patterns in the data. This section presents some of the algorithms that were included in the AutoQA Monitor, giving a brief description of how they build patterns from the data they are run on, as well as some concrete examples of the patterns found in some of the data sets studied in this thesis. At the end, some alternative classification algorithms, different from the ones used for this thesis, are mentioned. All these algorithms will be called classifiers, since their task is only classification.

Trees

Classification using tree structures is done by so-called decision tree induction. In this tree the internal nodes are attributes and each outgoing edge is the outcome of a test done on the values of these attributes. The leaf nodes contain class values. There is, of course, a root node that also

contains an attribute. Decision trees are convenient to use in the sense that they do not require domain knowledge of the data [8]. Moreover, such a structure is quite intuitive and easy to follow when applying new data to the tree. There are many machine learning algorithms (such as ID3) that use decision tree structures, but in this thesis only J48 and RandomForest are used. An example of a decision tree constructed using J48 is shown in Figure 4.6.

J48. It is the implementation of C4.5 in Java (Weka) [2]. This algorithm can classify data with continuous-valued attributes. It does this by choosing a threshold and splitting the set of data into two parts, one containing the values less than or equal to the threshold and the other the values above it. This operation is performed recursively until the whole tree, with pure leaves, is constructed. The attribute to be tested in each node is decided after calculating the information gain of the candidates (C4.5 uses a normalized variant, the gain ratio). The attribute with the highest score is chosen. This algorithm also handles missing values: instances with missing values are divided proportionally according to the rest of the data. It also offers the option of pruning and validating the model in order to reduce the size of the tree by removing branches that are redundant and unnecessary. This helps make the process of building the models faster, but it can also have a negative effect on their performance, since some of the tests in the tree are lost.

Figure 4.6: A classification model of tree structure (J48)

It can be noticed that all the leaves have class values (True or False) and all the continuous attributes are split in two parts, being less than or equal to, or higher than, some number. This is a pruned