Functional Analysis of Real World Truck Fuel Consumption Data


Technical Report, IDE0806, January 2008
Functional Analysis of Real World Truck Fuel Consumption Data
Master's Thesis in Computer Systems Engineering
Georg Vogetseder
School of Information Science, Computer and Electrical Engineering, Halmstad University


Functional Analysis of Real World Truck Fuel Consumption Data
School of Information Science, Computer and Electrical Engineering, Halmstad University, Box 823, SE-301 18 Halmstad, Sweden
January 2008

Acknowledgement

"If it looks like a duck, and quacks like a duck, we have at least to consider the possibility that we have a small aquatic bird of the family Anatidae on our hands."
Douglas Adams (1952-2001)

Thanks to my family, especially my mother Eva, and friends.

Abstract

This thesis covers the analysis of sparse and irregular fuel consumption data of long distance haulage articulated trucks. It is shown that this kind of data is hard to analyse with multivariate as well as with functional methods. To be able to analyse the data, Principal Components Analysis through Conditional Expectation (PACE) is used, which enables the use of observations from many trucks to compensate for the sparsity of observations in order to get continuous results. The principal component scores generated by PACE can then be used to get rough estimates of the trajectories of single trucks as well as to detect outliers. The data centric approach of PACE is very useful for enabling functional analysis of sparse and irregular data. Functional analysis is desirable for this data because it sidesteps feature extraction and enables a more natural view of the data.

Contents

Acknowledgement
Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Motivation and Novelty
  1.3 Related Work
  1.4 Limitations
  1.5 Outline
2 Methods
  2.1 General Statistical Methods
    2.1.1 Principal Component Analysis
    2.1.2 Hierarchical Clustering
    2.1.3 Validation Methods
    2.1.4 Diagrams
  2.2 Functional Data Analysis
  2.3 Principal Components Analysis through Conditional Expectation
3 The Vehicle Application and Data Description
  3.1 Volvo Truck Data
    3.1.1 Impurities in the Truck Data
    3.1.2 Data structure
  3.2 Approach

4 Results
  4.1 Basic Data Analysis
    4.1.1 Data Binning
    4.1.2 Feature Extraction
    4.1.3 Function Fitting
  4.2 Application of PACE
    4.2.1 Baseline PACE Results
    4.2.2 Number of Principal Components
    4.2.3 Error Assumptions in PACE
    4.2.4 Different Kernel Functions
    4.2.5 Variances
      Model Variance
      Data Variance
    4.2.6 Prediction
    4.2.7 Outlier Detection
    4.2.8 Expansion
5 Discussion
6 Conclusion
Bibliography
List of Abbreviations

List of Figures

3.1 Fuel Consumption between Observations
3.2 Fuel consumption plot generated from the raw data
3.3 Histograms of the original and the cleaned data
3.4 Fuel consumption plot generated from the clean data
3.5 Scatter plot and histograms
3.6 Histogram of the distance between observations
4.1 Distribution and mean/variance of binned data
4.2 Boxplots of binned data
4.3 Outlier detection based on feature extraction
4.4 Straight line fitting
4.5 Plot of mean function and principal components
4.6 Scree Plot
4.7 Smoothed covariance matrix
4.8 Reconstructed curves versus mean function and raw observations of selected trucks
4.9 Reconstructed curves and raw measurements for all trucks
4.10 Reconstructed traces of misfitted trucks
4.11 Comparison of reconstructed trajectories with differing number of PCs
4.12 Reconstructed trajectories without measurement error assumed
4.13 A comparison of µ with different smoothing kernels
4.14 A comparison of PCs with different smoothing kernels
4.15 Distribution of all mean curves
4.16 Graph of all mean curves
4.17 Trucks with a high influence on the results of PACE
4.18 Data variance

4.19 Normal Distribution Plots of the PC scores
4.20 Histograms of the probability of trucks
4.21 Samples of truck probability
4.22 PACE Results of Speed Data
4.23 PACE Results on Seasonal Fuel Consumption
4.24 Selected trucks from the Seasonal Fuel Consumption Data

List of Tables

4.1 MSE of PACE with 8 principal components
4.2 MSE of PACE with PCs
4.3 MSE of PACE with 4 PCs
4.4 MSE of PACE with 29 PCs
4.5 MSE of PACE with 8 PCs and error cut-off

1 Introduction

1.1 Background

The original idea for analyzing this data came from Volvo Parts AB, one of the main business units of Volvo Group AB. The role of Volvo Parts is to provide solutions and tools to the after-market, which includes vehicle electronics diagnostic tools. When a truck is in the workshop, the vehicle electronics data is read out from the truck using diagnostics tools from Volvo Parts and transmitted to a central database. This data, which is collected from sensors within the truck's electronics systems, is called logged vehicle data (LVD). Several electronic subsystems supply information for LVD, which can include data from the electronic suspension, the transmission, and most importantly from the Engine Electronic Control Unit. The current main use of LVD is seemingly just basic analysis, e.g. remote diagnostics of faulty components and simple statistics. One of the problems with analysing LVD is the relative lack of observations. The source of this lack of information is the data retrieval process: the procedure is time consuming, making it a cost factor for the workshops. The time consumption affects the adoption rate of this procedure in the field negatively, which leads to the data composition detailed in Section 3.1. The basic idea behind the problems detailed in this thesis is to expand the usefulness of the data for Volvo Parts, retrieving additional new information from it and providing means to access this information. This is done by using recent advanced statistical

techniques. As a starting point for the application of these techniques, the analysis of the fuel consumption data contained in LVD was suggested. Fuel consumption data is very interesting from a statistical point of view. This interest stems from fuel being a major cost factor, as well as being influenced by a high number of other factors, such as:

- Usage patterns of the operator, i.e. the driving style and habits
- Maintenance of the truck
- Gross Combination Weight usage, i.e. the cargo of the truck
- Environment, i.e. hilliness, road condition, etc.

The influence of these and more factors makes this data a good indicator. But the mass of influences also makes exact determination of the underlying cause impossible. Additionally, some of these influences might cancel each other out, thus removing information. If it is possible to extract information from fuel consumption data, then it should work for the rest of the data too.

1.2 Motivation and Novelty

From LVD, it should be possible to extract information on hidden trends, i.e. the principal components (see Section 2.1.1) that are common to all similar trucks. Based on these components, it should be possible to determine if a truck is unrelated to other trucks, i.e. an outlier, and to predict future developments in fuel consumption, when the truck's behavior is similar to that of other vehicles. It is very easy to take the last observation of each truck in a group of similar trucks to determine abnormal fuel consumption, but it is hardly possible to calculate underlying trends or other information from these facts. To discover information like trends or outliers from LVD, the data of a truck has to include not only the last observation available, but also past ones. These requirements, multiple observations of a truck and a set of similar trucks, lead to the irregular and sparse structure of the data used in this thesis. The data is described in more detail in Section 3.1.

The analysis of this data can be done in at least two ways. The most obvious choice in methodology would be the use of multivariate statistics, but for several reasons detailed below, the central methodology for this thesis is functional statistics. Functional statistics focuses on analysing the data as functions, rather than as a set of discrete values.¹ Multivariate statistics are a set of methods which work on more than one variable at a time. Some examples of these methods are regression analysis, principal components analysis and artificial neural networks. In principle, functional statistics are also part of this set, as both have multiple variables as input. However, the focus on handling the input variables as continuous functions rather than arbitrary variables separates the two fields. As the observation of trucks in the workshop does not happen regularly, i.e. the observations cannot be fitted to a grid, it is difficult to incorporate all information from the input into variables for use in multivariate statistics. Therefore, features like mean, variance, duration of all observations, date of first observation, odometer count at the last observation, etc. have to be extracted from the data to be able to do analysis. Inevitably, the extraction of this knowledge leads to information loss, which is problematic for this already sparse data. The process of discovery and selection of important features for multivariate analysis is very difficult and time consuming. It is crucial to extract and select the best and most important features from the data to minimize the data loss and maximize the information content of the features for the success of all further steps in analysis. Feature extraction creates an additional layer of data processing and introduces a large number of tunable knobs. Functional Data Analysis (FDA), on the other hand, preserves the information present in the data and does not need feature extraction at all. Furthermore, it facilitates a more natural handling of the data, describing not only more or less abstract features of the data, but a function which resembles the data. The choice of using functional over multivariate data analysis is also motivated by the ability to analyze the functional properties of the data, e.g. derivatives of the data. Additionally, FDA does not introduce a high number of additional parameters, unlike multivariate analysis.

¹ A more detailed description of this collection of methods can be found in Section 2.2.

However, multivariate analysis has an advantage over FDA when a high number of different functions have to be analysed at the same time. FDA has problems in visualizing this higher dimensional data, as well as the necessity of having a high amount of data for each dimension (curse of dimensionality). The most important step in FDA is the transformation of the discrete data to a functional basis. Again, the irregular and sparse nature of the data makes this transformation difficult. To be able to perform FDA on this data, a method called Principal Components Analysis through Conditional Expectation (PACE) is applied. The foundation of PACE is the assumption that a smooth function underlies the sparse data. Under this assumption, it is possible to use even irregular data for the discovery of principal components. The main novel aspect of this thesis is the application of FDA and PACE to automotive data. Previously it has successfully been applied to biological data, economic processes, and bidding in online auction houses, but not to automotive data. PACE itself is highly interesting to apply to the data at hand, because it is able to work on it without the need for feature extraction or regular observations. The methods used in this work can be used to describe the actual fuel consumption of the observed trucks in customer hands. This means the methods applied to LVD are driven by data and not by a model.

1.3 Related Work

General sources of information on data analysis related to this work are The Elements of Statistical Learning [1], Functional Data Analysis [2] and Nonparametric Functional Data Analysis [3]. The single most important paper related to this work is Functional Data Analysis for Sparse Longitudinal Data [4], which proposed the method PACE and applied it to yeast cell cycle gene expression data and to longitudinal CD4 cell percentages. The percentage is used as a marker for the progress of AIDS in adults.

Functional Data Analysis for Sparse Auction Data [5] combines the PACE approach with linear regression to predict closing prices of online auctions. The most related of the few public papers on fuel consumption in heavy trucks is Heavy Truck Modeling for Fuel Consumption Simulations and Measurements [6]. This work deals with building a simulation model of fuel consumption. Another paper, which discusses methods to reduce idle fuel consumption in North American long distance trucks and highlights typical driver behavior, is Analysis of Technology Options to Reduce the Fuel Consumption of Idling Trucks [7]. Additional information on doing PCA on sparse and irregular data can be found in Principal component models for sparse functional data [8] and Sparse Principal Component Analysis [9]. More related to PACE is Properties of principal component methods for functional and longitudinal data analysis [10]. Another paper related to the estimation of functional principal component scores is [11]. Knowledge relating to linear regression analysis for longitudinal data can be found in [12].

1.4 Limitations

The scope of this thesis is to research the possibilities for the application of FDA methods to the sparse and irregular automotive data from LVD. It is outside the scope of this thesis to establish a conclusive theory about a true long term fuel consumption model of all truck engines. A conclusive, globally valid model is impossible because of the relatively low number of individuals in the data, as well as the limited observation duration and possible differences in usage patterns of the trucks, i.e. vehicles with a high mileage in a limited time span do not necessarily exhibit a fuel consumption similar to low mileage trucks in the same time span.

1.5 Outline

The next chapter, Methods, describes the crucial methods used. This includes underlying basic methods as well as the foundations of FDA and PACE. The chapter The Vehicle Application and Data Description provides a description of the data used in this thesis and includes information on the interplay of the proposed methods and the data. Chapter 4 provides comprehensive information on the results. The last two chapters, Discussion and Conclusion, wrap up the results from this thesis and provide an outlook on possible continuations of the research.

2 Methods

This chapter is divided into three parts. General Statistical Methods describes non-functional methods which are fundamental to this work. Functional Data Analysis provides an introduction to this field. The final part, Principal Components Analysis through Conditional Expectation, gives an overview of this crucial method.

2.1 General Statistical Methods

This section introduces general statistical concepts used in this thesis and a number of tools to visualize data and test results.

2.1.1 Principal Component Analysis

One of the constitutional methods for analysing LVD is the Karhunen-Loève transformation, universally known as Principal Component Analysis (PCA). PCA is also the foundation of Functional Principal Component Analysis (FPCA) [1, 1]. Basically, PCA is a method to explore data by finding the most important ways the variables in the data differ from one another. It can compress the data by discovering a low number of linear combinations of input variables which contribute most to the variability of the input. These linear combinations are found by constructing a linear basis for the data where the retained variability is maximal.

Mathematically speaking, the goal is to reduce or compress high dimensional data $X$ to lower dimensional data $Y$. A number of algorithms are available for this reduction; here, a method involving the calculation of the covariance is described. The first step is to calculate the mean $\mu_i$ for each variable:

$$\mu_i = \frac{1}{K_i} \sum_{j=1}^{K_i} x_{ij}, \qquad i = 1, \dots, N$$

where $N$ denotes the number of variables and $K_i$ the number of observations in one variable. Subsequently, $\mu$ is removed from every observation in $X$; the centered data is denoted as $X - \bar{X}$. In the next step the covariance matrix $\mathrm{cov}(X - \bar{X})$ has to be calculated. Covariance is a measure of how two variables vary together. If the two variables vary in the same way (i.e. with the same sign), the covariance will be positive. If, on the other hand, the two variables vary with opposite signs, the covariance will be negative. A covariance matrix is the result of calculating the covariance for all members of two vectors. The resulting matrix gives the degree of correlation between the input vectors. To find a mapping $M$ that is able to transform the high dimensional data into low dimensional data, the $M$ that maximizes $M^T \mathrm{cov}(X - \bar{X}) M$ has to be found. It can be shown that the best (variance maximizing) mapping is formed by the eigenvectors of the covariance matrix. Hence, PCA has to solve the eigenproblem

$$\mathrm{cov}(X - \bar{X}) M = \lambda M$$

to get the transformation matrix. The eigenproblem has to be solved $d$ times with different principal eigenvalues $\lambda$ to get the $d$ principal eigenvectors (or principal components). The low dimensional representation $Y$ can then be computed by simple multiplication:

$$Y = (X - \bar{X}) M$$
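To make the covariance-based algorithm above concrete, here is a minimal NumPy sketch; the toy data and all names are invented for illustration and are not from the thesis:

```python
import numpy as np

def pca(X, d):
    """Reduce data X (K observations x N variables) to d dimensions
    via eigendecomposition of the covariance matrix, as described above."""
    mu = X.mean(axis=0)                   # mean vector, one entry per variable
    Xc = X - mu                           # remove the mean from every observation
    C = np.cov(Xc, rowvar=False)          # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh, since C is symmetric
    order = np.argsort(eigvals)[::-1]     # sort by decreasing eigenvalue
    M = eigvecs[:, order[:d]]             # principal components as columns of M
    Y = Xc @ M                            # low dimensional representation
    return Y, M, mu

# Example: compress 5-dimensional toy data to 2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y, M, mu = pca(X, 2)
```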

2.1.2 Hierarchical Clustering

Hierarchical clustering is a relatively simple method [1] to segment data into related groups. Clustering is used within this thesis to test whether differing clusters of trucks can be found from extracted features. Hierarchical clustering needs a dissimilarity measure between the elements. The standard for measuring the dissimilarity is the Euclidean distance, which is also used in this thesis. When the distance between all possible pairs of elements has been calculated, the clusters can be built. For building these clusters, there are two different approaches: the agglomerative approach, which starts with as many clusters as there are individuals, and the divisive method, which starts with one big cluster that is then split into smaller clusters. Agglomerative methods are guaranteed to have a monotonically increasing level of dissimilarity between merged clusters, growing with the level of merging. This property is not guaranteed for divisive approaches. The second choice in building the clusters is the measurement of the distance between two clusters:

Single Linkage: The link between the clusters is defined by the smallest distance between elements in the two clusters.

Complete Linkage: The link is defined by the largest distance between elements in the two clusters, the opposite of the first method.

Average Linkage: Uses the average distance between all pairs of elements in both clusters.

2.1.3 Validation Methods

A number of methods to validate the results and to estimate variation were used in the scope of this thesis. These include brief usage of the bootstrap, the jackknife and various cross validation methods, such as k-fold and leave-one-out [1]. Bootstrapping is the process of randomly picking samples from the given observations, where a single observation can be chosen multiple times. The goal of a bootstrap is to approximate the distribution of a statistic from these samples.
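A minimal sketch of the bootstrap just described, using a small vector of hypothetical fuel mileage values:

```python
import numpy as np

def bootstrap_mean(sample, n_boot=10_000, rng=None):
    """Approximate the distribution of the sample mean by resampling
    with replacement, as described above."""
    rng = rng or np.random.default_rng(0)
    n = len(sample)
    idx = rng.integers(0, n, size=(n_boot, n))   # draw n observations, n_boot times
    return sample[idx].mean(axis=1)              # one mean per bootstrap sample

fuel = np.array([2.1, 2.4, 2.2, 2.8, 2.5, 2.3])  # hypothetical km/l values
means = bootstrap_mean(fuel)
print(means.mean(), means.std())  # bootstrap estimate of the mean and its spread
```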

Jackknifing can be used to estimate the bias and standard error. The jackknife is very similar to k-fold and leave-one-out cross validation, as it systematically removes one or more observations from a sample and then recalculates the results as often as there are possible readouts.

2.1.4 Diagrams

A number of special diagrams were used to illustrate some results of this thesis. Those diagrams are dendrograms, boxplots and scree plots [1, 2]. Dendrograms are tree diagrams which are used to illustrate the result of a clustering algorithm. An example of such a diagram is Figure 4.3. On the vertical axis the distance between clusters is plotted. A horizontal line denotes a split between classes at this specific distance measure. This implies that a split at a higher distance value has a higher dissimilarity between the split classes than a split at a lower distance value. Boxplots describe groups of data, such as binned data, through five statistical properties. A boxplot example can be seen in Figure 4.2. The box represents the lower and the upper quartile, showing where half of the data is contained. The line in this box illustrates the median of the data in this group. The whiskers attached to this box extend to the furthest data point, up to a maximum of 1.5 times the distance between the quartiles. Data points outside of this boundary are usually marked with a cross, indicating a possible outlier. Scree plots give an indication of the relevance of a principal component (eigenfunction) by indicating the accumulated eigenvalue up to the n-th principal component. This plot can be used to select a suitable number of eigenfunctions. An example of a scree plot is Figure 4.6.

2.2 Functional Data Analysis

Functional data analysis (FDA) [2, 3] is a collection of methods which enable the investigation of data in a functional form. Functional data is the idea of looking at a

set of observations not as a vector in discrete time, but as a continuous function. The analysis of functions rather than discrete samples has advantages over multivariate analysis. One such advantage is that the rate of change, or derivatives, of these functions can easily be calculated and analysed. FDA also includes variants of multivariate methods like PCA. Functional PCA, like normal PCA, not only provides a method for dimensionality reduction, but also characterizes the main modes of variation from a mean function. To perform FDA on discretely sampled data, the data has to be converted to a continuous, functional format. This means a function has to be fitted to the sampled data points. It is not feasible to convert every dataset to a functional form. Especially in the case of sparse and irregular observations, this task is very difficult, but central to the success of functional data analysis. Usually, the methods used to convert data into a functional format are interpolation and smoothing, or more generally function fitting. A very simple method to do this conversion would be a least squares fit of a first order polynomial (a straight line). Usually, a more flexible method is used for this step, namely spline interpolation. Depending on the underlying data, other fits like Fourier functions are possible. FDA is easily applicable if the measurements were done with a regular spacing and the data is complete over the observation duration. In the opposite case, it is very difficult to estimate the complete trajectory when only a single subject is taken into the calculation.

2.3 Principal Components Analysis through Conditional Expectation

Principal Components Analysis through Conditional Expectation (PACE) is a derivative of functional principal components analysis for sparse longitudinal data, proposed in the paper Functional Data Analysis for Sparse Longitudinal Data by Yao, Müller and Wang [4].

PACE is an algorithm for extracting the principal components from irregular and sparse data. It also provides an estimation of individual smooth trajectories of the data. PACE assumes that the data is randomly located with a random number of observations per subject. Furthermore, it assumes that the data is determined by an underlying smooth trajectory. The first step in PACE is the estimation of the smooth mean function µ, by using a local linear line smoother on all measurements combined into one pool of data. The choice of the smoothing parameter, or bandwidth, is done automatically [14] or by hand in this step. The covariance surface can then be calculated like a regular covariance matrix. This raw covariance surface is stripped of the variance (the first diagonal). The raw matrix is then smoothed utilizing a local linear surface smoother. The bandwidth is chosen by leave-one-curve-out cross-validation. The smoothing step is necessary to fill in for missing observations. The estimation of these two model components shares the same smoothing kernel. The choice of a smoothing kernel is discussed in Chapter 4. From these model components, it is possible to calculate the estimates of the eigenvalues and eigenfunctions, i.e. the functional principal components of the sparse and irregular data. The last step is the calculation of the functional principal component scores. Those scores describe how much of a principal component is retained in a single subject. However, the conventional method of using numerical integration to recover the Principal Component (PC) scores leads to biased results because of the sparse and irregular data. In this step, the conditional expectation comes into play. It provides the best prediction of the PC scores if the measurement error is Gaussian, or the best linear prediction otherwise. PACE is discussed in detail by Yao, Müller and Wang [4].
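In symbols, the conditional expectation step and the resulting trajectory reconstruction can be compactly restated as follows; the notation loosely follows [4], and this is a compressed summary rather than the full estimator derivation:

$$\hat{\xi}_{ik} = E[\xi_{ik} \mid \tilde{Y}_i] = \hat{\lambda}_k \, \hat{\phi}_{ik}^{T} \, \hat{\Sigma}_{Y_i}^{-1} \, (\tilde{Y}_i - \hat{\mu}_i)$$

where $\tilde{Y}_i$ collects the sparse observations of subject $i$, $\hat{\mu}_i$ is the smoothed mean evaluated at the observation points, $\hat{\phi}_{ik}$ is the $k$-th eigenfunction evaluated at those points, and $\hat{\Sigma}_{Y_i}$ is the fitted covariance of the observations, including the measurement error variance on its diagonal. The estimated trajectory of subject $i$ is then assembled from the first $K$ components:

$$\hat{X}_i(t) = \hat{\mu}(t) + \sum_{k=1}^{K} \hat{\xi}_{ik} \, \hat{\phi}_k(t)$$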

3 The Vehicle Application and Data Description

The purpose of this chapter is to outline the connection between the methods proposed in Chapter 2 and the application of those methods to the Volvo data.

3.1 Volvo Truck Data

The original data received from Volvo Parts AB consists of 2027 observations of 267 trucks. It was collected between June 2004 and May 2007 in North America. All trucks have the same engine and are configured as articulated trucks for long distance transports on smooth roads. The gross combination weight (GCW), which includes the weight of the towed trailer and the truck itself, is 36 tons, the US federal GCW limit. Data is retrieved when a truck is in a workshop that is equipped to read out the onboard electronics and performs this procedure. It is then sent to the Volvo headquarters in Gothenburg for storage and analysis. The data from each observation contains only information from one of the truck's onboard electronic systems, the Engine Control Unit (ECU). From these data, two variables are mainly relevant for this thesis:

- Total distance driven
- Total amount of fuel consumed
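To illustrate how the two accumulated totals above translate into the fuel mileage used throughout this chapter, a small sketch with invented readout values:

```python
import numpy as np

# Hypothetical readouts for one truck: accumulated totals at each workshop visit.
distance_km = np.array([18000., 31000., 52000., 90000.])  # total distance driven
fuel_l      = np.array([ 7500., 13200., 21500., 37000.])  # total fuel consumed

# Accumulative fuel mileage used in the thesis: totals up to each observation,
# which averages over the truck's whole history so far.
mileage_acc = distance_km / fuel_l                          # km/l

# Incremental alternative, prone to idling outliers (cf. Figure 3.1):
mileage_inc = np.diff(distance_km) / np.diff(fuel_l)        # km/l between visits
```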

Figure 3.1: The distribution of the fuel consumption when the fuel mileage is calculated only between two observations (incremental fuel mileage in km/l over distance driven in km). The outliers visible in this figure can be explained by a high amount of idling between two close observations. When the fuel mileage is calculated accumulatively, those outliers do not occur.

These variables are not reset when the ECU is read out in the workshop and therefore behave accumulatively. Using these variables as a basis to calculate the fuel consumption per distance or time has an inherent averaging effect, as it includes all former mileage data. This is necessary because of the unevenly distributed data. If a truck was read out twice within a very short span of time, the fuel consumption in this interval is possibly vastly different from the normal fuel consumption behavior of the truck, possibly because the truck was not moved very far within this time span, but was idling for some time. The outliers caused by this effect can be seen in Figure 3.1. These outliers are the reason for not using the difference in fuel amounts between two observations as a calculation basis in this thesis. The accumulative approach allows those outliers to remain in the dataset.

3.1.1 Impurities in the Truck Data

The raw data retrieved from the trucks contains irregular observations or changes in the truck data which in some cases result in the removal of specific observations or of the whole truck from the data set. See Figure 3.2 for a plot of the raw fuel consumption

data.

Figure 3.2: Fuel consumption plot generated from the raw data (fuel mileage in km/l over distance driven in km). The lines are linear interpolations between the observations.

Incomplete Observations: A truck is missing one or more variables that would be required for analysis. The observations from this individual cannot be used for the calculations.

Physically impossible changes in accumulative variables: Between two observations of a single truck, accumulative variables changed to a smaller value. This means, for example, that a later observation in time has a smaller total driving distance than an earlier measurement. This is physically impossible, but observable if the ECU has been replaced or the contents of the ECU were erased during a software update. This criterion applies to 44 trucks. Although it would be possible to use a subset of the observations from each of these trucks, this was not done, because the quality of the measurements might have been compromised and manually cleaning the data is a time consuming task for very few usable measurements.

Empty and Duplicated Observations: Some observations do not contain any new information, but only seem to be resubmits of earlier or empty observations with a different time stamp. These particular observations are removed from the

final data, but the remaining observations of the truck are used. Phenomena like these might occur when the data acquisition process in the workshop was interrupted, or a transmission error occurred.

Early Observations: These observations are too early in the life of the truck to give meaningful information. The removal of these observations is motivated by the unusual fuel consumption of a truck in this state, caused by the high number of short trips the truck has to travel before it can be put into regular service. Examples are drives to paint shops or truck customizers as well as transfers to the customer. The number of observations purged when this criterion is set to remove all measurements below km is 150; when all measurements before 1000 km are deleted, the number of observations drops by 100. See Figure 3.3.

From the 269 initial individual trucks, 56 trucks are removed. In terms of observations, from the originally observations, 120 remained in the data set when the lower border for observations is set to 1000 km. See Figure 3.4 for a plot of the cleaned fuel consumption data. The most visible change relative to Figure 3.2 is the lower number of outliers at roughly 0 kilometers, which is mostly an effect of the removal of very early observations.

3.1.2 Data structure

Some properties of the data make the task of analyzing it inherently difficult. Most of these properties stem from the sparsity of the data. Sparseness in this case means that every truck has been observed on average just times with a standard deviation of observations.¹ The sparseness of the data is visualized in Figure 3.5. The data is not fully observed. The observations of a single truck often are not scattered over a very long distance in time or driven distance, but measured only within a short span. The average duration between the first observation of a truck and the last one, where measurements are taken, is kilometers with a standard deviation of kilometers.

¹ Excluding incomplete observations, as they are not usable at all.

Figure 3.3: This comparison shows the number of observations in the raw data versus the cleaned data. The overall reduction in the number of observations, as well as the lower number of observations at the beginning, is noticeable.

Figure 3.4: Fuel consumption plot generated from the clean data (fuel mileage in km/l over driven distance in km). Note the lack of outliers at the beginning of the data.

Figure 3.5: The scatter plot in this figure highlights the sparse and irregular distribution of the data (fuel mileage in km/l over driven distance in km). The histograms describe the distribution of the observations along the axes.

The mean focus of the observations is at 022 kilometers, deviating by 1609 kilometers, which means that most of the trucks are not observed from the beginning, but later on in their life-cycle. The density of measurements varies. This implies that the placement of measurements is irregular throughout the duration of their observation. As the trucks are independent of each other, the times when observations happen are not correlated with each other. For a visual representation of the irregular duration between the measurements, see Figure 3.6. This figure indicates a non-normal distribution. The average distance between observations is kilometers with a standard deviation of kilometers.

Unsupported curvature: The irregular placement and the sparsity of variables cause this property to occur. If a part of a curve has a high curvature, which can be approximated by $\frac{d^2y}{dx^2}$ or $\left(\frac{d^2y}{dx^2}\right)^2$, then the relative resolution of the data at the point of high curvature should also be high to enable a good estimation of the underlying function [2].

Figure 3.6: This figure shows the distribution of distances between two observations of the same truck.

3.2 Approach

The first part of analyzing the truck data, which is described in Section 4.1, is to establish results with basic multivariate analysis as a baseline to which the results of functional analysis can be compared. This part shows pitfalls and difficulties when applying standard multivariate methods to the data. The first possible way of doing multivariate analysis is feature extraction. It is a difficult task to find relevant features to extract. A simple statistical feature will be extracted from the data to give an idea of how feature extraction works. The second possibility for multivariate analysis is to put the observations into bins. This is done in order to align the data onto a vertical grid. The second way is necessary because it is very hard to visualize the extracted features or to convert them back to the original data format. However, binning cannot easily be used for outlier detection. Usually, some of the bins are likely to have only a low number of observations, which makes outlier determination in such a bin very difficult. If the bins are made larger, multiple or even all observations of a single truck might be put into a single bin. This leads to increased difficulty in differentiating between normal and outlying observations.

These steps should lead to two results: a simple outlier detection, based on a clustering of the extracted features, and a variance and mean estimation for the data, based on the binned data. The task of estimating the fuel consumption behavior of a single truck outside of its observation duration using the extracted features is very hard. This is because a mapping between the values of the features and a function is not available. Additionally, information from other, similar trucks is not taken into consideration. The last step in Basic Analysis (Sect. 4.1) is a demonstration of the main problem of applying FDA to the data at hand: the difficulty of fitting a function to a single truck. The main task of this thesis is to apply the PACE algorithm to the data (Sect. 4.2), and to try out the various options within the PACE algorithm. In that section, the results of PACE in general will be assessed, as well as the differences between PACE runs with different options, with regard to both the PACE generated functions and general statistical properties, such as the mean function. The first advantage of using the PACE algorithm in comparison to the basic methods is that there is no need to pre-process the data, i.e. to extract features or otherwise process the data. This non-parametric input of the data is complemented by a number of options to tune the algorithm itself for various needs (amount of information retained, whether the input data has measurement errors, etc.). The next step is to try out a number of methods which can be applied to the results of PACE, for example calculating the probability of the fuel consumption of a particular truck, given all the other trucks. PACE enables the user to analyse the sparse and irregular data at hand, enabling the use of additional techniques from FDA, whereas using only multivariate data analysis or normal FDA on the same data is very difficult and does not incorporate the information gathered from the other trucks. PACE makes outlier detection, estimation of the function outside the observation duration, and the gathering of common statistical properties from sparse and irregular data, like the mean and variance in functional form, a lot easier or even possible at all.

4 Results

4.1 Basic Data Analysis

The aim of this section is to provide an overview of basic multivariate analysis possibilities with the available data. Functional methods are applied from Section 4.2 onwards.

4.1.1 Data Binning

One approach, as described in the previous chapter, is the creation of a vertical grid for the data domain, followed by binning the data into a limited number of buckets along the time or distance axis, similar to creating a histogram. If there is more than one observation of a truck in one of these bins, the average of these measurements is put into the bin. This has to be done to avoid biasing in the case of dense observations of a truck within a short timespan. The size and the quantity of the bins are crucial for binning. With the data at hand, 25 bins were used, which results in a size of 6087 kilometers per bin. Figure 4.1 shows the number of observations per bin, as well as an estimation of the mean function and the variance of the data. Figure 4.2 shows a boxplot of the binned data and a boxplot of the results of bootstrapping [1] the mean value per bin (10000 bootstrap samples).
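A minimal sketch of this binning step, with hypothetical array names for the pooled observations:

```python
import numpy as np

def bin_observations(distance_km, mileage, truck_id, n_bins=25):
    """Bin observations along the distance axis. Multiple observations of the
    same truck in one bin are first averaged, to avoid the bias discussed above."""
    edges = np.linspace(distance_km.min(), distance_km.max(), n_bins + 1)
    bins = [[] for _ in range(n_bins)]
    for truck in np.unique(truck_id):
        mask = truck_id == truck
        which = np.clip(np.digitize(distance_km[mask], edges) - 1, 0, n_bins - 1)
        for b in np.unique(which):           # one averaged value per truck per bin
            bins[b].append(mileage[mask][which == b].mean())
    means = np.array([np.mean(b) if b else np.nan for b in bins])
    stds  = np.array([np.std(b)  if b else np.nan for b in bins])
    return means, stds, bins
```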

Figure 4.1: The histogram depicts the number of observations per bin. Especially the first and the last few bins have a very small number of observations, which leads to the abnormal results for these bins in the mean and standard deviation figure on the right. That figure shows the mean as well as the standard deviation estimated from the binned data.

Figure 4.2: The figures show boxplots for the binned data (left) and bootstrapped mean values (right). The left boxplot is a simple plot of the raw binned data, providing an easy visualization. The right boxplot is generated by bootstrapping the mean of each bin 10000 times. Bootstrapping should give an idea of how much the mean can vary if new data has the same distribution as the data at hand.

4.1.2 Feature Extraction

The features which are retrieved from all observations of a single truck are used to construct a simple outlier detector with hierarchical clustering. The goal of this simple outlier detector is to find trucks whose mean deviates significantly from the mean of the entire data. A single extracted feature was used in this case:

$$d_{Truck} = (\mu_{Truck} - \mu_{All})^2$$

The data was then clustered with a hierarchical algorithm, using average distance linking. The outlying classes were subjectively selected by looking at the resulting dendrogram. For the results, see Figure 4.3.
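A minimal sketch of this detector using SciPy's hierarchical clustering; the feature values are simulated, and cutting the dendrogram into seven classes mirrors the class count reported in Figure 4.3 rather than the thesis' by-eye selection:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical per-truck mean fuel mileage (one value per truck).
rng = np.random.default_rng(0)
truck_means = rng.normal(2.4, 0.15, size=60)

# The single extracted feature: squared deviation from the overall mean.
feature = (truck_means - truck_means.mean()) ** 2

# Agglomerative clustering with average linkage on the 1-D feature.
Z = linkage(feature.reshape(-1, 1), method='average', metric='euclidean')
classes = fcluster(Z, t=7, criterion='maxclust')  # cut the tree into 7 classes

# Small classes far from the bulk are candidate outliers; class sizes serve
# as a crude proxy for the subjective dendrogram inspection.
sizes = np.bincount(classes)
print({c: s for c, s in enumerate(sizes) if s})
```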

Figure 4.3: Results of outlier detection based on feature extraction. The left figure shows the dendrogram of the clustering algorithm, with the class distance on the vertical axis. It shows that class 6 is an extreme outlier, whereas classes 3 and 7 are also quite different from the main part of the data. The basis for these classes being outliers is a vastly different mean from the rest of the data. In the right figure, the outlying clusters are highlighted: the extreme outlier is marked red, the normal outliers are marked green and the normal data is colored blue. One outlier class has 5 members, whereas the other outlier classes have just 1 member each. Classes 1 and 5 have 114 and 52 members, respectively. Class 2 has 27 members, whereas class 4 has 13 members.

4.1.3 Function Fitting

Finding a plausible function that fits the data of the trucks well is difficult because of the open-ended nature of the measurements. If a set of observations has a defined start and end of its measurements, i.e. the data is fully observed, it is easy to interpolate the data in between, even if the data within this span is sparse. This property of the data at hand is also discussed in Section 3.1. If the set of data is not fully observed, it is almost impossible to get a reliable fit outside the observation span of a single entity. This reliable fit outside of the span is necessary for performing FDA on this data, as FDA needs the same set of basis functions, or in the case of spline interpolation the same knots, for all functions to work. It was not possible to get a good fit on this data with splines where all of the knots are distributed the same for all truck entities. Polynomial fits, i.e. the approximation of the data with low order (< 5) polynomials, also did not result in a stable fit for the available data. The most reliable fits under these conditions were generated by fitting a linear function to the fuel consumption observations. These results in fitting the sparse and irregular data motivate the idea of combining the observations by means of PACE, to be able to get better fits from the reconstructed trajectories. The results of fitting a straight line to the data can be seen in Figure 4.4.
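A minimal sketch of the per-truck straight line fit described above; the array names are hypothetical:

```python
import numpy as np

def fit_line_per_truck(distance_km, mileage, truck_id):
    """Least-squares straight line (slope, offset) per truck, the only fit
    found stable enough for this sparse data."""
    fits = {}
    for truck in np.unique(truck_id):
        mask = truck_id == truck
        if mask.sum() < 2:
            continue  # a line needs at least two observation points
        slope, offset = np.polyfit(distance_km[mask], mileage[mask], deg=1)
        fits[truck] = (slope, offset)
    return fits
```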

Figure 4.4: On the left, all fitted straight lines are shown. The right figure shows the mean straight line along with the standard deviation of the slope and the offset (blue) and the standard deviation of just the offset (dashed). The main problem with this straight line fit is a number of fits with high gradients, which are not valid outside their observation span. However, the mean line shows a slight increase in fuel economy, just like the mean curve from PACE (Figure 4.5).

4.2 Application of PACE

The goal of this section is to elaborate on the application of the PACE method to the truck data, focusing only on fuel consumption per kilometer over the distance axis. Along with the results of this first application, some options available for fine-tuning the method will be presented and a general estimate of the variability will be given.

4.2.1 Baseline PACE Results

The data used for this initial run of the PACE method is the cleaned set, with all trucks removed which have fewer than 2 observations. Additionally, every observation that happened before the distance threshold (cf. Section 3.1.1) has been removed. The PACE method has some interchangeable sub-methods. For the baseline results, mostly the same parts as in the original method described in [4] were used. Thus, the kernel used for smoothing the mean function is the Epanechnikov kernel [4] and the input data is assumed to contain measurement errors. A small discrepancy from the original method is the choice of using the Fraction of Variance Explained¹ (FVE) instead of the Akaike Information Criterion [1] (AIC) to select the number of PCs. The FVE threshold is set at 95 % of variance explained. Regarding Figure 4.5, the smoothed mean curve should be taken with a grain of salt; especially the variance and measurement density plots of the binning analysis (Figure 4.1) should be considered. The number of PCs selected by FVE is 8, the smallest number for which the explained share of the total variation exceeds the 95 % threshold. The scree plot (Section 2.1.4) of the principal components from this analysis can be seen in Figure 4.6. The first, strong principal component is almost a straight line, which basically shifts the mean from its starting point closer to the position of the measurements. The second and the fourth principal components seem to serve partially as correctives for trucks with a higher initial fuel economy than the average truck. The smoothed covariance matrix generated and used by PACE is visualized in Figure 4.7 as a color-matrix.

¹ The sum of the eigenvalues of a certain number of eigenfunctions, divided by the sum of all eigenvalues, has to exceed a certain threshold. The smallest number of PCs which exceeds this threshold is subsequently used.
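The FVE selection rule from the footnote can be sketched as follows; the eigenvalue spectrum below is invented for illustration:

```python
import numpy as np

def select_by_fve(eigenvalues, threshold=0.95):
    """Smallest number of principal components whose fraction of variance
    explained (FVE) reaches the threshold, per the footnote above."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]   # eigenvalues, largest first
    fve = np.cumsum(lam) / lam.sum()               # cumulative explained fraction
    return int(np.searchsorted(fve, threshold) + 1), fve

# Hypothetical eigenvalue spectrum; with a 95 % threshold this selects a small K.
k, fve = select_by_fve([5.0, 1.2, 0.6, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01])
print(k, fve[k - 1])
```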

Figure 4.5: The smooth mean function generated by PACE (left) is the basis for all other results. The four most significant PCs (right) are the strongest ways in which the individual trucks vary. The legend quantifies the strength of the PCs.

Figure 4.6: The scree plot, which highlights the trade-off between the number of PCs used versus the variance retained. The use of more than 10 PCs makes little sense, as the Fraction of Variance Explained (FVE) does not improve much beyond that point.

Figure 4.7: The smoothed covariance matrix generated by PACE, plotted over distance driven on both axes. (The diagonal, which is the variance, has been removed prior to smoothing.) The main part of the matrix shows a small positive covariance (green).

Figure 4.8: These plots exhibit the mean curve (red), the corresponding original observations (green) and the reconstructed curve (blue) for vehicles 14, 106, 92, 72 and 4. Vehicles 14 and 106 have high values on all major PC scores, with opposite signs. Number 92 has the lowest PC scores overall; trucks 72 and 4 have average PC scores. High PC scores lead to extreme values, especially on the strong first PC.

From the estimated PC scores, the mean function µ and the principal component functions, the individual traces of the trucks can be reconstructed, which should give a rough estimate of the behavior of each truck. A number of selected reconstructions can be viewed in Figure 4.8, and a collection of all traces and the original measurements can be seen in Figure 4.9. As a next step, for an analysis of the results, the goodness-of-fit of the original measurements versus the reconstructed traces is assessed. To estimate the goodness-of-fit, the mean squared error [1] between the discrete observations and the estimated reconstruction is considered. However, the irregular measurement intervals make assessment of the results difficult. In Figure 4.10 some examples of bad fits are explained. Just taking the mean of the mean square error (MSE) of all observations of one truck is prone to skewing, as is just summing up the MSE for each single truck. A more sensible approach is to consider the median MSE per truck.

43 4. Results.2 Fuel Mileage [km/l] Distance Driven [km] Figure 4.9: This graph shows all reconstructed traces (gray) and original measurements (blue). Note how the traces tend to follow the observations, especially when the relative occurrence of observations is low..2 Vehicle # 7 Vehicle # 102 Vehicle # 106 Vehicle # 202 Fuel Consumption [km/l] Figure 4.10: As described in the text, these figures depict misfitted trucks. Vehicle #7 and #106 show trucks which provide bad fits, whereas #102 is a truck which is only identifiable as misfit when median mean square error (MSE) is applied. Truck #202 is a counter-example, where the misfit is more noticeable when the mean MSE is used.


More information

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7,

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis ERS70D George Fernandez INTRODUCTION Analysis of multivariate data plays a key role in data analysis. Multivariate data consists of many different attributes or variables recorded

More information

FUNCTIONAL EXPLORATORY DATA ANALYSIS OF UNEMPLOYMENT RATE FOR VARIOUS COUNTRIES

FUNCTIONAL EXPLORATORY DATA ANALYSIS OF UNEMPLOYMENT RATE FOR VARIOUS COUNTRIES QUANTITATIVE METHODS IN ECONOMICS Vol. XIV, No. 1, 2013, pp. 180 189 FUNCTIONAL EXPLORATORY DATA ANALYSIS OF UNEMPLOYMENT RATE FOR VARIOUS COUNTRIES Stanisław Jaworski Department of Econometrics and Statistics

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Appendix 1: Time series analysis of peak-rate years and synchrony testing.

Appendix 1: Time series analysis of peak-rate years and synchrony testing. Appendix 1: Time series analysis of peak-rate years and synchrony testing. Overview The raw data are accessible at Figshare ( Time series of global resources, DOI 10.6084/m9.figshare.929619), sources are

More information

Volvo Parts Corporation 04-02-03 Volvo Trip Manager. Contents

Volvo Parts Corporation 04-02-03 Volvo Trip Manager. Contents User s Manual Contents Functional Overview...4 Welcome to Volvo Trip Manager...4 About the documentation...4 Information Downloading...5 Vehicle Groups...6 Warning Messages...6 Reports...7 Trend Report...7

More information

Module 4: Data Exploration

Module 4: Data Exploration Module 4: Data Exploration Now that you have your data downloaded from the Streams Project database, the detective work can begin! Before computing any advanced statistics, we will first use descriptive

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Demographics of Atlanta, Georgia:

Demographics of Atlanta, Georgia: Demographics of Atlanta, Georgia: A Visual Analysis of the 2000 and 2010 Census Data 36-315 Final Project Rachel Cohen, Kathryn McKeough, Minnar Xie & David Zimmerman Ethnicities of Atlanta Figure 1: From

More information

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

Geostatistics Exploratory Analysis

Geostatistics Exploratory Analysis Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

More information

Interpreting Data in Normal Distributions

Interpreting Data in Normal Distributions Interpreting Data in Normal Distributions This curve is kind of a big deal. It shows the distribution of a set of test scores, the results of rolling a die a million times, the heights of people on Earth,

More information

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS Mathematics Revision Guides Histograms, Cumulative Frequency and Box Plots Page 1 of 25 M.K. HOME TUITION Mathematics Revision Guides Level: GCSE Higher Tier HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

More information

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI STATS8: Introduction to Biostatistics Data Exploration Babak Shahbaba Department of Statistics, UCI Introduction After clearly defining the scientific problem, selecting a set of representative members

More information

Interactive Logging with FlukeView Forms

Interactive Logging with FlukeView Forms FlukeView Forms Technical Note Fluke developed an Event Logging function allowing the Fluke 89-IV and the Fluke 189 models to profile the behavior of a signal over time without requiring a great deal of

More information

Fitting Subject-specific Curves to Grouped Longitudinal Data

Fitting Subject-specific Curves to Grouped Longitudinal Data Fitting Subject-specific Curves to Grouped Longitudinal Data Djeundje, Viani Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh, EH14 4AS, UK E-mail: vad5@hw.ac.uk Currie,

More information

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Getting to know the data An important first step before performing any kind of statistical analysis is to familiarize

More information

Dimensionality Reduction: Principal Components Analysis

Dimensionality Reduction: Principal Components Analysis Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely

More information

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

A Demonstration of Hierarchical Clustering

A Demonstration of Hierarchical Clustering Recitation Supplement: Hierarchical Clustering and Principal Component Analysis in SAS November 18, 2002 The Methods In addition to K-means clustering, SAS provides several other types of unsupervised

More information

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

Normality Testing in Excel

Normality Testing in Excel Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics. Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

More information

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:

More information

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) Data Exploration and Preprocessing Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Server Load Prediction

Server Load Prediction Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

More information

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3 COMP 5318 Data Exploration and Analysis Chapter 3 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping

More information

Going Big in Data Dimensionality:

Going Big in Data Dimensionality: LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

Standardization and Its Effects on K-Means Clustering Algorithm

Standardization and Its Effects on K-Means Clustering Algorithm Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

More information

the points are called control points approximating curve

the points are called control points approximating curve Chapter 4 Spline Curves A spline curve is a mathematical representation for which it is easy to build an interface that will allow a user to design and control the shape of complex curves and surfaces.

More information

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

More information

Overview... 2. Accounting for Business (MCD1010)... 3. Introductory Mathematics for Business (MCD1550)... 4. Introductory Economics (MCD1690)...

Overview... 2. Accounting for Business (MCD1010)... 3. Introductory Mathematics for Business (MCD1550)... 4. Introductory Economics (MCD1690)... Unit Guide Diploma of Business Contents Overview... 2 Accounting for Business (MCD1010)... 3 Introductory Mathematics for Business (MCD1550)... 4 Introductory Economics (MCD1690)... 5 Introduction to Management

More information

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows: Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are

More information

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques

More information

VISUALIZING HIERARCHICAL DATA. Graham Wills SPSS Inc., http://willsfamily.org/gwills

VISUALIZING HIERARCHICAL DATA. Graham Wills SPSS Inc., http://willsfamily.org/gwills VISUALIZING HIERARCHICAL DATA Graham Wills SPSS Inc., http://willsfamily.org/gwills SYNONYMS Hierarchical Graph Layout, Visualizing Trees, Tree Drawing, Information Visualization on Hierarchies; Hierarchical

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

Exploratory data analysis for microarray data

Exploratory data analysis for microarray data Eploratory data analysis for microarray data Anja von Heydebreck Ma Planck Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany heydebre@molgen.mpg.de Visualization

More information

Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

More information

Data Preparation and Statistical Displays

Data Preparation and Statistical Displays Reservoir Modeling with GSLIB Data Preparation and Statistical Displays Data Cleaning / Quality Control Statistics as Parameters for Random Function Models Univariate Statistics Histograms and Probability

More information

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Data Mining and Visualization

Data Mining and Visualization Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research

More information

Descriptive Statistics

Descriptive Statistics Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

College Readiness LINKING STUDY

College Readiness LINKING STUDY College Readiness LINKING STUDY A Study of the Alignment of the RIT Scales of NWEA s MAP Assessments with the College Readiness Benchmarks of EXPLORE, PLAN, and ACT December 2011 (updated January 17, 2012)

More information

2002 IEEE. Reprinted with permission.

2002 IEEE. Reprinted with permission. Laiho J., Kylväjä M. and Höglund A., 2002, Utilization of Advanced Analysis Methods in UMTS Networks, Proceedings of the 55th IEEE Vehicular Technology Conference ( Spring), vol. 2, pp. 726-730. 2002 IEEE.

More information

Manifold Learning Examples PCA, LLE and ISOMAP

Manifold Learning Examples PCA, LLE and ISOMAP Manifold Learning Examples PCA, LLE and ISOMAP Dan Ventura October 14, 28 Abstract We try to give a helpful concrete example that demonstrates how to use PCA, LLE and Isomap, attempts to provide some intuition

More information

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel

More information

Distances, Clustering, and Classification. Heatmaps

Distances, Clustering, and Classification. Heatmaps Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be

More information