Functional Analysis of Real World Truck Fuel Consumption Data


Technical Report, IDE0806, January 2008
Functional Analysis of Real World Truck Fuel Consumption Data
Master's Thesis in Computer Systems Engineering
Georg Vogetseder
School of Information Science, Computer and Electrical Engineering, Halmstad University


Functional Analysis of Real World Truck Fuel Consumption Data
School of Information Science, Computer and Electrical Engineering, Halmstad University, Box 823, SE-301 18 Halmstad, Sweden
January 2008

Acknowledgement

"If it looks like a duck, and quacks like a duck, we have at least to consider the possibility that we have a small aquatic bird of the family Anatidae on our hands."
Douglas Adams (1952-2001)

Thanks to my family, especially my mother Eva, and friends.

Abstract

This thesis covers the analysis of sparse and irregular fuel consumption data of long distance haulage articulated trucks. It is shown that this kind of data is hard to analyse with multivariate as well as with functional methods. To be able to analyse the data, Principal Components Analysis through Conditional Expectation (PACE) is used, which enables the use of observations from many trucks to compensate for the sparsity of observations in order to get continuous results. The principal component scores generated by PACE can then be used to get rough estimates of the trajectories of single trucks as well as to detect outliers. The data centric approach of PACE is very useful for enabling functional analysis of sparse and irregular data. Functional analysis is desirable for this data because it sidesteps feature extraction and enables a more natural view of the data.

Contents

Acknowledgement
Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Motivation and Novelty
  1.3 Related Work
  1.4 Limitations
  1.5 Outline
2 Methods
  2.1 General Statistical Methods
    2.1.1 Principal Component Analysis
    2.1.2 Hierarchical Clustering
    2.1.3 Validation Methods
    2.1.4 Diagrams
  2.2 Functional Data Analysis
  2.3 Principal Components Analysis through Conditional Expectation
3 The Vehicle Application and Data Description
  3.1 Volvo Truck Data
    3.1.1 Impurities in the Truck Data
    3.1.2 Data structure
  3.2 Approach

4 Results
  4.1 Basic Data Analysis
    4.1.1 Data Binning
    4.1.2 Feature Extraction
    4.1.3 Function Fitting
  4.2 Application of PACE
    4.2.1 Baseline PACE Results
    4.2.2 Number of Principal Components
    4.2.3 Error Assumptions in PACE
    4.2.4 Different Kernel Functions
    4.2.5 Variances
      Model Variance
      Data Variance
    4.2.6 Prediction
    4.2.7 Outlier Detection
    4.2.8 Expansion
5 Discussion
6 Conclusion
Bibliography
List of Abbreviations

List of Figures

3.1 Fuel Consumption between Observations
3.2 Fuel consumption plot generated from the raw data
3.3 Histograms of the original and the cleaned data
3.4 Fuel consumption plot generated from the clean data
3.5 Scatter plot and histograms
3.6 Histogram of the distance between observations
4.1 Distribution and mean/variance of binned data
4.2 Boxplots of binned data
4.3 Outlier detection based on feature extraction
4.4 Straight line fitting
4.5 Plot of mean function and principal components
4.6 Scree Plot
4.7 Smoothed covariance matrix
4.8 Reconstructed curves versus mean function and raw observations of selected trucks
4.9 Reconstructed curves and raw measurements for all trucks
4.10 Reconstructed traces of misfitted trucks
4.11 Comparison of reconstructed trajectories with differing number of PCs
4.12 Reconstructed trajectories without measurement error assumed
4.13 A comparison of µ with different smoothing kernels
4.14 A comparison of PCs with different smoothing kernels
4.15 Distribution of all mean curves
4.16 Graph of all mean curves
4.17 Trucks with a high influence on the results of PACE
4.18 Data variance

4.19 Normal Distribution Plots of the PC scores
4.20 Histograms of the probability of trucks
4.21 Samples of truck probability
4.22 PACE Results of Speed Data
4.23 PACE Results on Seasonal Fuel Consumption
4.24 Selected trucks from the Seasonal Fuel Consumption Data

List of Tables

4.1 MSE of PACE with 8 principal components
4.2 MSE of PACE with PCs
4.3 MSE of PACE with 4 PCs
4.4 MSE of PACE with 29 PCs
4.5 MSE of PACE with 8 PCs and error cut-off

1 Introduction

1.1 Background

The original idea for analyzing this data came from Volvo Parts AB, one of the main business units of Volvo Group AB. The role of Volvo Parts is to provide solutions and tools to the after-market, which includes vehicle electronics diagnostic tools. When a truck is in the workshop, the vehicle electronics data is read out from the truck using diagnostics tools from Volvo Parts and transmitted to a central database. This data, which is collected from sensors within the truck's electronics systems, is called logged vehicle data (LVD). Several electronic subsystems supply information for LVD, which can include data from the electronic suspension, the transmission, and most importantly from the Engine Electronic Control Unit. The current main use of LVD is seemingly just basic analysis, e.g. remote diagnostics of faulty components and simple statistics. One of the problems with analysing LVD is the relative lack of observations. The source of this lack of information is the data retrieval process: the procedure is time consuming, making it a cost factor for the workshops. The time consumption affects the adoption rate of this procedure in the field negatively, which leads to the data composition detailed in Section 3.1. The basic idea behind the problems detailed in this thesis is to expand the usefulness of the data for Volvo Parts, retrieving additional new information from it and providing means to access this information. This is done by using recent advanced statistical

techniques. As a starting point for the application of these techniques, the analysis of the fuel consumption data contained in LVD was suggested. Fuel consumption data is very interesting from a statistical point of view. This interest stems from fuel being a major cost factor, as well as being influenced by a high number of other factors, such as:

- Usage patterns of the operator, i.e. the driving style and habits
- Maintenance of the truck
- Gross Combination Weight usage, i.e. the cargo of the truck
- Environment, i.e. hilliness, road condition, etc.

The influence of these and more factors makes this data a good indicator. But the mass of influences also makes exact determination of the underlying cause impossible. Additionally, some of these influences might cancel each other out, thus removing information. If it is possible to extract information from fuel consumption data, then it should work for the rest of the data too.

1.2 Motivation and Novelty

From LVD, it should be possible to extract information on hidden trends, i.e. the principal components (see Section 2.1.1) that are common to all similar trucks. Based on these components, it should be possible to determine if a truck is unrelated to other trucks, i.e. an outlier, and to predict future developments in fuel consumption, when the truck's behavior is similar to that of other vehicles. It is very easy to take the last observation of each truck in a group of similar trucks to determine abnormal fuel consumption, but it is hardly possible to calculate underlying trends or other information from these facts. To discover information like trends or outliers from LVD, the data of a truck has to include not only the last observation available, but also past ones. These requirements, multiple observations of a truck and a set of similar trucks, lead to the irregular and sparse structure of the data used in this thesis. The data is described in more detail in Section 3.1.

The analysis of this data can be done in at least two ways. The most obvious choice in methodology would be the use of multivariate statistics, but for several reasons detailed below, the central methodology for this thesis is functional statistics. Functional statistics focuses on analysing the data as functions, rather than as a set of discrete values.¹ Multivariate statistics are a set of methods which work on more than one variable at a time. Some examples of these methods are regression analysis, principal components analysis and artificial neural networks. In principle, functional statistics are also part of this set, as both have multiple variables as input. However, the focus on handling the input variables as continuous functions rather than arbitrary variables separates the two fields. As the observation of trucks in the workshop does not happen regularly, i.e. the observations cannot be fitted to a grid, it is difficult to incorporate all information from the input into variables for use in multivariate statistics. Therefore, features like mean, variance, duration of all observations, date of first observation, odometer count at the last observation, etc. have to be extracted from the data to be able to do analysis. Inevitably, the extraction of this knowledge leads to information loss, which is problematic for this already sparse data. The process of discovery and selection of important features for multivariate analysis is very difficult and time consuming. It is crucial to extract and select the best and most important features from the data to minimize the data loss and maximize the information content of the features for the success of all further steps in analysis. Feature extraction creates an additional layer of data processing and introduces a large number of tunable knobs. Functional Data Analysis (FDA), on the other hand, preserves the information present in the data and does not need feature extraction at all. Furthermore, it facilitates a more natural handling of the data, describing not only more or less abstract features of the data, but a function which resembles the data. The choice of using functional over multivariate data analysis is also motivated by the ability to analyze the functional properties of the data, e.g. derivatives of the data. Additionally, FDA does not introduce a high number of additional parameters, unlike multivariate analysis.

¹ A more detailed description of this collection of methods can be found in Section 2.2.

However, multivariate analysis has an advantage over FDA when a high number of different functions have to be analysed at the same time. FDA has problems in visualizing this higher dimensional data, as well as the necessity of having a high amount of data for each dimension (curse of dimensionality). The most important step in FDA is the transformation of the discrete data to a functional basis. Again, the irregular and sparse nature of the data makes this transformation difficult. To be able to perform FDA on this data, a method called Principal Components Analysis through Conditional Expectation (PACE) is applied. The foundation of PACE is the assumption that a smooth function underlies the sparse data. Under this assumption, it is possible to use even irregular data for the discovery of principal components. The main novel aspect of this thesis is the application of FDA and PACE to automotive data. Previously it has successfully been applied to biological data, economic processes, and bidding in online auction houses, but not to automotive data. PACE itself is highly interesting to apply to the data at hand, because it is able to work on it without the need for feature extraction or regular observations. The methods used in this work can be used to describe the actual fuel consumption of the observed trucks in customer hands. This means the methods applied to LVD are driven by data and not by a model.

1.3 Related Work

General sources of information on data analysis related to this work are The Elements of Statistical Learning [1], Functional Data Analysis [2] and Nonparametric Functional Data Analysis [3]. The single most important paper related to this work is Functional Data Analysis for Sparse Longitudinal Data [4], which proposed the method PACE and applied it to yeast cell cycle gene expression data and to longitudinal CD4 cell percentages. The percentage is used as a marker for the progress of AIDS in adults.

Functional Data Analysis for Sparse Auction Data [5] combines the PACE approach with linear regression to predict closing prices of online auctions. The most related of the few public papers on fuel consumption in heavy trucks is Heavy Truck Modeling for Fuel Consumption Simulations and Measurements [6]. This work deals with building a simulation model of fuel consumption. Another paper, which discusses methods to reduce idle fuel consumption in North American long distance trucks and highlights typical driver behavior, is Analysis of Technology Options to Reduce the Fuel Consumption of Idling Trucks [7]. Additional information on doing PCA on sparse and irregular data can be found in Principal component models for sparse functional data [8] and Sparse Principal Component Analysis [9]. More related to PACE is Properties of principal component methods for functional and longitudinal data analysis [10]. Another paper related to the estimation of functional principal component scores is [11]. Knowledge relating to linear regression analysis for longitudinal data can be found in [12].

1.4 Limitations

The scope of this thesis is to research the possibilities for the application of FDA methods to the sparse and irregular automotive data from LVD. It is outside the scope of this thesis to establish a conclusive theory about a true long term fuel consumption model of all truck engines. A conclusive, globally valid model is impossible because of the relatively low number of individuals in the data, as well as the limited observation duration and possible differences in usage patterns of the trucks, i.e. vehicles with a high mileage in a limited time span do not necessarily exhibit a fuel consumption similar to low mileage trucks in the same time span.

1.5 Outline

The next chapter, Methods, describes the crucial methods used. This includes underlying basic methods as well as the foundations of FDA and PACE. The chapter The Vehicle Application and Data Description provides a description of the data used in this thesis and includes information on the interplay of the proposed methods and the data. Chapter 4 provides comprehensive information on the results. The last two chapters, Discussion and Conclusion, wrap up the results from this thesis and provide an outlook on possible continuations of the research.

2 Methods

This chapter is divided into three parts. General Statistical Methods describes non-functional methods which are fundamental to this work. Functional Data Analysis provides an introduction to this field. The final part, Principal Components Analysis through Conditional Expectation, gives an overview of this crucial method.

2.1 General Statistical Methods

This section introduces general statistical concepts used in this thesis and a number of tools to visualize data and test results.

2.1.1 Principal Component Analysis

One of the constitutional methods for analysing LVD is the Karhunen-Loève transformation, universally known as Principal Component Analysis (PCA). PCA is also the foundation of Functional Principal Component Analysis (FPCA) [1, 1]. Basically, PCA is a method to explore data by finding the most important ways the variables in the data differ from one another. It can compress the data by discovering a low number of linear combinations of input variables which contribute most to the variability of the input. These linear combinations are found by constructing a linear basis for the data where the retained variability is maximal.

Mathematically speaking, the goal is to reduce or compress high dimensional data $X$ to lower dimensional data $Y$. A number of algorithms are available for this reduction; here, a method involving the calculation of the covariance is described. The first step is to calculate the mean $\mu_i$ for each variable:

$$\mu_i = \frac{1}{K_i} \sum_{j=1}^{K_i} x_{ij}, \qquad i = 1, \dots, N$$

where $N$ denotes the number of variables and $K_i$ the number of observations in one variable. Subsequently, $\mu$ is removed from every observation in $X$; the centered data is denoted as $X - \bar{X}$. In the next step the covariance matrix $\mathrm{cov}(X - \bar{X})$ has to be calculated. Covariance is a measure of how two variables vary together. If the two variables vary in the same way (i.e. with the same sign), the covariance will be positive. If, on the other hand, the two variables vary with opposite signs, the covariance will be negative. A covariance matrix is the result of calculating the covariance for all members of two vectors. The resulting matrix gives the degree of correlation between the input vectors. To find a mapping $M$ that is able to transform the high dimensional data into low dimensional data, the $M$ that maximizes $M^T \mathrm{cov}(X - \bar{X}) M$ has to be found. It can be shown that the best (variance maximizing) mapping is formed by the eigenvectors of the covariance matrix. Hence, PCA has to solve the eigenproblem

$$\mathrm{cov}(X - \bar{X}) M = \lambda M$$

to get the transformation matrix. The eigenproblem has to be solved $d$ times with different principal eigenvalues $\lambda$ to get the $d$ principal eigenvectors (or principal components). The low dimensional representation $Y$ can then be computed by simple multiplication:

$$Y = (X - \bar{X}) M$$
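To make the covariance-based algorithm above concrete, here is a minimal NumPy sketch; the toy data and all names are invented for illustration and are not from the thesis:

```python
import numpy as np

def pca(X, d):
    """Reduce data X (K observations x N variables) to d dimensions
    via eigendecomposition of the covariance matrix, as described above."""
    mu = X.mean(axis=0)                   # mean vector, one entry per variable
    Xc = X - mu                           # remove the mean from every observation
    C = np.cov(Xc, rowvar=False)          # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh, since C is symmetric
    order = np.argsort(eigvals)[::-1]     # sort by decreasing eigenvalue
    M = eigvecs[:, order[:d]]             # principal components as columns of M
    Y = Xc @ M                            # low dimensional representation
    return Y, M, mu

# Example: compress 5-dimensional toy data to 2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y, M, mu = pca(X, 2)
```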

2.1.2 Hierarchical Clustering

Hierarchical clustering is a relatively simple method [1] to segment data into related groups. Clustering is used within this thesis to test whether differing clusters of trucks can be found from extracted features. Hierarchical clustering needs a dissimilarity measure between the elements. The standard for measuring the dissimilarity is the Euclidean distance, which is also used in this thesis. When the distance between all possible pairs of elements has been calculated, the clusters can be built. For building these clusters, there are two different approaches: the agglomerative approach, which starts with as many clusters as there are individuals, and the divisive method, which starts with one big cluster that is then split into smaller clusters. Agglomerative methods are guaranteed to have a monotonically increasing level of dissimilarity between merged clusters, growing with the level of merging. This property is not guaranteed for divisive approaches. The second choice in building the clusters is the measurement of the distance between two clusters:

Single Linkage: The link between the clusters is defined by the smallest distance between elements in the two clusters.

Complete Linkage: The link is defined by the largest distance between elements in the two clusters, the opposite of the first method.

Average Linkage: Uses the average distance between all pairs of elements in both clusters.

2.1.3 Validation Methods

A number of methods to validate the results and to estimate variation were used in the scope of this thesis. These include brief usage of the bootstrap, the jackknife and various cross validation methods, such as k-fold and leave-one-out [1]. Bootstrapping is the process of randomly picking samples from the given observations, where a single observation can be chosen multiple times. The goal of a bootstrap is to approximate the distribution of a statistic from these samples.
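A minimal sketch of the bootstrap just described, using a small vector of hypothetical fuel mileage values:

```python
import numpy as np

def bootstrap_mean(sample, n_boot=10_000, rng=None):
    """Approximate the distribution of the sample mean by resampling
    with replacement, as described above."""
    rng = rng or np.random.default_rng(0)
    n = len(sample)
    idx = rng.integers(0, n, size=(n_boot, n))   # draw n observations, n_boot times
    return sample[idx].mean(axis=1)              # one mean per bootstrap sample

fuel = np.array([2.1, 2.4, 2.2, 2.8, 2.5, 2.3])  # hypothetical km/l values
means = bootstrap_mean(fuel)
print(means.mean(), means.std())  # bootstrap estimate of the mean and its spread
```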

Jackknifing can be used to estimate the bias and standard error. The jackknife is very similar to k-fold and leave-one-out cross validation, as it systematically removes one or more observations from a sample and then recalculates the results as often as there are possible readouts.

2.1.4 Diagrams

A number of special diagrams were used to illustrate some results of this thesis. Those diagrams are dendrograms, boxplots and scree plots [1, 2]. Dendrograms are tree diagrams which are used to illustrate the result of a clustering algorithm. An example of such a diagram is Figure 4.3. On the vertical axis the distance between clusters is plotted. A horizontal line denotes a split between classes at this specific distance measure. This implies that a split at a higher distance value has a higher dissimilarity between the split classes than a split at a lower distance value. Boxplots describe groups of data, such as binned data, through five statistical properties. A boxplot example can be seen in Figure 4.2. The box represents the lower and the upper quartile, showing where half of the data is contained. The line in this box illustrates the median of the data in this group. The whiskers attached to this box extend to the furthest data point, up to a maximum of 1.5 times the distance between the quartiles. Data points outside of this boundary are usually marked with a cross, indicating a possible outlier. Scree plots give an indication of the relevance of a principal component (eigenfunction) by indicating the accumulated eigenvalue up to the n-th principal component. This plot can be used to select a suitable number of eigenfunctions. An example of a scree plot is Figure 4.6.

2.2 Functional Data Analysis

Functional data analysis (FDA) [2, 3] is a collection of methods which enable the investigation of data in a functional form. Functional data is the idea of looking at a

set of observations not as a vector in discrete time, but as a continuous function. The analysis of functions rather than discrete samples has advantages over multivariate analysis. One such advantage is that the rate of change, or derivatives, of these functions can easily be calculated and analysed. FDA also includes variants of multivariate methods like PCA. Functional PCA, like normal PCA, not only provides a method for dimensionality reduction, but also characterizes the main modes of variation from a mean function. To perform FDA on discretely sampled data, the data has to be converted to a continuous, functional format. This means a function has to be fitted to the sampled data points. It is not feasible to convert every dataset to a functional form. Especially in the case of sparse and irregular observations, this task is very difficult, but central to the success of functional data analysis. Usually, the methods used to convert data into a functional format are interpolation and smoothing, or more generally function fitting. A very simple method to do this conversion would be a least squares fit of a first order polynomial (a straight line). Usually, a more flexible method is used for this step, namely spline interpolation. Depending on the underlying data, other fits like Fourier functions are possible. FDA is easily applicable if the measurements were done with a regular spacing and the data is complete over the observation duration. In the opposite case, it is very difficult to estimate the complete trajectory when only a single subject is taken into the calculation.

2.3 Principal Components Analysis through Conditional Expectation

Principal Components Analysis through Conditional Expectation (PACE) is a derivative of functional principal components analysis for sparse longitudinal data, proposed in the paper Functional Data Analysis for Sparse Longitudinal Data by Yao, Müller and Wang [4].

PACE is an algorithm for extracting the principal components from irregular and sparse data. It also provides an estimation of individual smooth trajectories of the data. PACE assumes that the data is randomly located with a random number of observations per subject. Furthermore, it assumes that the data is determined by an underlying smooth trajectory. The first step in PACE is the estimation of the smooth mean function µ, by using a local linear line smoother on all measurements combined into one pool of data. The choice of the smoothing parameter, or bandwidth, is done automatically [14] or by hand in this step. The covariance surface can then be calculated like a regular covariance matrix. This raw covariance surface is stripped of the variance (the first diagonal). The raw matrix is then smoothed utilizing a local linear surface smoother. The bandwidth is chosen by leave-one-curve-out cross-validation. The smoothing step is necessary to fill in for missing observations. The estimation of these two model components shares the same smoothing kernel. The choice of a smoothing kernel is discussed in Chapter 4. From these model components, it is possible to calculate the estimates of the eigenvalues and eigenfunctions, i.e. the functional principal components of the sparse and irregular data. The last step is the calculation of the functional principal component scores. Those scores describe how much of a principal component is retained in a single subject. However, the conventional method of using numerical integration to recover the Principal Component (PC) scores leads to biased results because of the sparse and irregular data. In this step, the conditional expectation comes into play. It provides the best prediction of the PC scores if the measurement error is Gaussian, or the best linear prediction otherwise. PACE is discussed in detail by Yao, Müller and Wang [4].
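In symbols, the conditional expectation step and the resulting trajectory reconstruction can be compactly restated as follows; the notation loosely follows [4], and this is a compressed summary rather than the full estimator derivation:

$$\hat{\xi}_{ik} = E[\xi_{ik} \mid \tilde{Y}_i] = \hat{\lambda}_k \, \hat{\phi}_{ik}^{T} \, \hat{\Sigma}_{Y_i}^{-1} \, (\tilde{Y}_i - \hat{\mu}_i)$$

where $\tilde{Y}_i$ collects the sparse observations of subject $i$, $\hat{\mu}_i$ is the smoothed mean evaluated at the observation points, $\hat{\phi}_{ik}$ is the $k$-th eigenfunction evaluated at those points, and $\hat{\Sigma}_{Y_i}$ is the fitted covariance of the observations, including the measurement error variance on its diagonal. The estimated trajectory of subject $i$ is then assembled from the first $K$ components:

$$\hat{X}_i(t) = \hat{\mu}(t) + \sum_{k=1}^{K} \hat{\xi}_{ik} \, \hat{\phi}_k(t)$$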

3 The Vehicle Application and Data Description

The purpose of this chapter is to outline the connection between the methods proposed in Chapter 2 and the application of those methods to the Volvo data.

3.1 Volvo Truck Data

The original data received from Volvo Parts AB consists of 2027 observations of 267 trucks. It was collected between June 2004 and May 2007 in North America. All trucks have the same engine and are configured as articulated trucks for long distance transports on smooth roads. The gross combination weight (GCW), which includes the weight of the towed trailer and the truck itself, is 36 tons, the US federal GCW limit. Data is retrieved when a truck is in a workshop that is equipped to read out the onboard electronics and performs this procedure. It is then sent to the Volvo headquarters in Gothenburg for storage and analysis. The data from each observation contains only information from one of the truck's onboard electronic systems, the Engine Control Unit (ECU). From these data, two variables are mainly relevant for this thesis:

- Total distance driven
- Total amount of fuel consumed
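To illustrate how the two accumulated totals above translate into the fuel mileage used throughout this chapter, a small sketch with invented readout values:

```python
import numpy as np

# Hypothetical readouts for one truck: accumulated totals at each workshop visit.
distance_km = np.array([18000., 31000., 52000., 90000.])  # total distance driven
fuel_l      = np.array([ 7500., 13200., 21500., 37000.])  # total fuel consumed

# Accumulative fuel mileage used in the thesis: totals up to each observation,
# which averages over the truck's whole history so far.
mileage_acc = distance_km / fuel_l                          # km/l

# Incremental alternative, prone to idling outliers (cf. Figure 3.1):
mileage_inc = np.diff(distance_km) / np.diff(fuel_l)        # km/l between visits
```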

Figure 3.1: The distribution of the fuel consumption when the fuel mileage is calculated only between two observations (incremental fuel mileage in km/l over distance driven in km). The outliers visible in this figure can be explained by a high amount of idling between two close observations. When the fuel mileage is calculated accumulatively, those outliers do not occur.

These variables are not reset when the ECU is read out in the workshop and therefore behave accumulatively. Using these variables as a basis to calculate the fuel consumption per distance or time has an inherent averaging effect, as it includes all former mileage data. This is necessary because of the unevenly distributed data. If a truck was read out twice within a very short span of time, the fuel consumption in this interval is possibly vastly different from the normal fuel consumption behavior of the truck, possibly because the truck was not moved very far within this time span, but was idling for some time. The outliers caused by this effect can be seen in Figure 3.1. These outliers are the reason for not using the difference in fuel amounts between two observations as a calculation basis in this thesis. The accumulative approach allows those outliers to remain in the dataset.

3.1.1 Impurities in the Truck Data

The raw data retrieved from the trucks contains irregular observations or changes in the truck data which in some cases result in the removal of specific observations or of the whole truck from the data set. See Figure 3.2 for a plot of the raw fuel consumption

data.

Figure 3.2: Fuel consumption plot generated from the raw data (fuel mileage in km/l over distance driven in km). The lines are linear interpolations between the observations.

Incomplete Observations: A truck is missing one or more variables that would be required for analysis. The observations from this individual cannot be used for the calculations.

Physically impossible changes in accumulative variables: Between two observations of a single truck, accumulative variables changed to a smaller value. This means, for example, that a later observation in time has a smaller total driving distance than an earlier measurement. This is physically impossible, but observable if the ECU has been replaced or the contents of the ECU were erased during a software update. This criterion applies to 44 trucks. Although it would be possible to use a subset of the observations from each of these trucks, this was not done, because the quality of the measurements might have been compromised and manually cleaning the data is a time consuming task for very few usable measurements.

Empty and Duplicated Observations: Some observations do not contain any new information, but only seem to be resubmits of earlier or empty observations with a different time stamp. These particular observations are removed from the

final data, but the remaining observations of the truck are used. Phenomena like these might occur when the data acquisition process in the workshop was interrupted, or a transmission error occurred.

Early Observations: These observations are too early in the life of the truck to give meaningful information. The removal of these observations is motivated by the unusual fuel consumption of a truck in this state, caused by the high number of short trips the truck has to travel before it can be put into regular service. Examples are drives to paint shops or truck customizers as well as transfers to the customer. The number of observations purged when this criterion is set to remove all measurements below km is 150; when all measurements before 1000 km are deleted, the number of observations drops by 100. See Figure 3.3.

From the 269 initial individual trucks, 56 trucks are removed. In terms of observations, from the originally observations, 120 remained in the data set when the lower border for observations is set to 1000 km. See Figure 3.4 for a plot of the cleaned fuel consumption data. The most visible change relative to Figure 3.2 is the lower number of outliers at roughly 0 kilometers, which is mostly an effect of the removal of very early observations.

3.1.2 Data structure

Some properties of the data make the task of analyzing it inherently difficult. Most of these properties stem from the sparsity of the data. Sparseness in this case means that every truck has been observed on average just times with a standard deviation of observations.¹ The sparseness of the data is visualized in Figure 3.5. The data is not fully observed. The observations of a single truck often are not scattered over a very long distance in time or driven distance, but measured only within a short span. The average duration between the first observation of a truck and the last one, where measurements are taken, is kilometers with a standard deviation of kilometers.

¹ Excluding incomplete observations, as they are not usable at all.

Figure 3.3: This comparison shows the number of observations in the raw data versus the cleaned data. The overall reduction in the number of observations, as well as the lower number of observations at the beginning, is noticeable.

Figure 3.4: Fuel consumption plot generated from the clean data (fuel mileage in km/l over driven distance in km). Note the lack of outliers at the beginning of the data.

Figure 3.5: The scatter plot in this figure highlights the sparse and irregular distribution of the data (fuel mileage in km/l over driven distance in km). The histograms describe the distribution of the observations along the axes.

The mean focus of the observations is at 022 kilometers, deviating by 1609 kilometers, which means that most of the trucks are not observed from the beginning, but later on in their life-cycle. The density of measurements varies. This implies that the placement of measurements is irregular throughout the duration of their observation. As the trucks are independent of each other, the times when observations happen are not correlated with each other. For a visual representation of the irregular duration between the measurements, see Figure 3.6. This figure indicates a non-normal distribution. The average distance between observations is kilometers with a standard deviation of kilometers.

Unsupported curvature: The irregular placement and the sparsity of variables cause this property to occur. If a part of a curve has a high curvature, which can be approximated by $\frac{d^2y}{dx^2}$ or $\left(\frac{d^2y}{dx^2}\right)^2$, then the relative resolution of the data at the point of high curvature should also be high to enable a good estimation of the underlying function [2].

Figure 3.6: This figure shows the distribution of distances between two observations of the same truck.

3.2 Approach

The first part of analyzing the truck data, which is described in Section 4.1, is to establish results with basic multivariate analysis as a baseline to which the results of functional analysis can be compared. This part shows pitfalls and difficulties when applying standard multivariate methods to the data. The first possible way of doing multivariate analysis is feature extraction. It is a difficult task to find relevant features to extract. A simple statistical feature will be extracted from the data to give an idea of how feature extraction works. The second possibility for multivariate analysis is to put the observations into bins. This is done in order to align the data onto a vertical grid. The second way is necessary because it is very hard to visualize the extracted features or to convert them back to the original data format. However, binning cannot easily be used for outlier detection. Usually, some of the bins are likely to have only a low number of observations, which makes outlier determination in such a bin very difficult. If the bins are made larger, multiple or even all observations of a single truck might be put into a single bin. This leads to increased difficulty in differentiating between normal and outlying observations.

These steps should lead to two results: a simple outlier detection, based on a clustering of the extracted features, and a variance and mean estimation for the data, based on the binned data. The task of estimating the fuel consumption behavior of a single truck outside of its observation duration using the extracted features is very hard. This is because a mapping between the values of the features and a function is not available. Additionally, information from other, similar trucks is not taken into consideration. The last step in Basic Analysis (Sect. 4.1) is a demonstration of the main problem of applying FDA to the data at hand: the difficulty of fitting a function to a single truck. The main task of this thesis is to apply the PACE algorithm to the data (Sect. 4.2), and to try out the various options within the PACE algorithm. In that section, the results of PACE in general will be assessed, as well as the differences between PACE runs with different options, with regard to both the PACE generated functions and general statistical properties, such as the mean function. The first advantage of using the PACE algorithm in comparison to the basic methods is that there is no need to pre-process the data, i.e. to extract features or otherwise process the data. This non-parametric input of the data is complemented by a number of options to tune the algorithm itself for various needs (amount of information retained, whether the input data has measurement errors, etc.). The next step is to try out a number of methods which can be applied to the results of PACE, for example calculating the probability of the fuel consumption of a particular truck, given all the other trucks. PACE enables the user to analyse the sparse and irregular data at hand, enabling the use of additional techniques from FDA, whereas using only multivariate data analysis or normal FDA on the same data is very difficult and does not incorporate the information gathered from the other trucks. PACE makes outlier detection, estimation of the function outside the observation duration, and the gathering of common statistical properties from sparse and irregular data, like the mean and variance in functional form, a lot easier or even possible at all.

4 Results

4.1 Basic Data Analysis

The aim of this section is to provide an overview of basic multivariate analysis possibilities with the available data. Functional methods are applied from Section 4.2 onwards.

4.1.1 Data Binning

One approach, as described in the previous chapter, is the creation of a vertical grid for the data domain, followed by binning the data into a limited number of buckets along the time or distance axis, similar to creating a histogram. If there is more than one observation of a truck in one of these bins, the average of these measurements is put into the bin. This has to be done to avoid biasing in the case of dense observations of a truck within a short timespan. The size and the quantity of the bins are crucial for binning. With the data at hand, 25 bins were used, which results in a size of 6087 kilometers per bin. Figure 4.1 shows the number of observations per bin, as well as an estimation of the mean function and the variance of the data. Figure 4.2 shows a boxplot of the binned data and a boxplot of the results of bootstrapping [1] the mean value per bin (10000 bootstrap samples).
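A minimal sketch of this binning step, with hypothetical array names for the pooled observations:

```python
import numpy as np

def bin_observations(distance_km, mileage, truck_id, n_bins=25):
    """Bin observations along the distance axis. Multiple observations of the
    same truck in one bin are first averaged, to avoid the bias discussed above."""
    edges = np.linspace(distance_km.min(), distance_km.max(), n_bins + 1)
    bins = [[] for _ in range(n_bins)]
    for truck in np.unique(truck_id):
        mask = truck_id == truck
        which = np.clip(np.digitize(distance_km[mask], edges) - 1, 0, n_bins - 1)
        for b in np.unique(which):           # one averaged value per truck per bin
            bins[b].append(mileage[mask][which == b].mean())
    means = np.array([np.mean(b) if b else np.nan for b in bins])
    stds  = np.array([np.std(b)  if b else np.nan for b in bins])
    return means, stds, bins
```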

Figure 4.1: The histogram depicts the number of observations per bin. Especially the first and the last few bins have a very small number of observations, which leads to the abnormal results for these bins in the mean and standard deviation figure on the right. That figure shows the mean as well as the standard deviation estimated from the binned data.

Figure 4.2: The figures show boxplots for the binned data (left) and bootstrapped mean values (right). The left boxplot is a simple plot of the raw binned data, providing an easy visualization. The right boxplot is generated by bootstrapping the mean of each bin 10000 times. Bootstrapping should give an idea of how much the mean can vary if new data has the same distribution as the data at hand.

4.1.2 Feature Extraction

The features which are retrieved from all observations of a single truck are used to construct a simple outlier detector with hierarchical clustering. The goal of this simple outlier detector is to find trucks whose mean deviates significantly from the mean of the entire data. A single extracted feature was used in this case:

$$d_{Truck} = (\mu_{Truck} - \mu_{All})^2$$

The data was then clustered with a hierarchical algorithm, using average distance linking. The outlying classes were subjectively selected by looking at the resulting dendrogram. For the results, see Figure 4.3.
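A minimal sketch of this detector using SciPy's hierarchical clustering; the feature values are simulated, and cutting the dendrogram into seven classes mirrors the class count reported in Figure 4.3 rather than the thesis' by-eye selection:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical per-truck mean fuel mileage (one value per truck).
rng = np.random.default_rng(0)
truck_means = rng.normal(2.4, 0.15, size=60)

# The single extracted feature: squared deviation from the overall mean.
feature = (truck_means - truck_means.mean()) ** 2

# Agglomerative clustering with average linkage on the 1-D feature.
Z = linkage(feature.reshape(-1, 1), method='average', metric='euclidean')
classes = fcluster(Z, t=7, criterion='maxclust')  # cut the tree into 7 classes

# Small classes far from the bulk are candidate outliers; class sizes serve
# as a crude proxy for the subjective dendrogram inspection.
sizes = np.bincount(classes)
print({c: s for c, s in enumerate(sizes) if s})
```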

Figure 4.3: Results of outlier detection based on feature extraction. The left figure shows the dendrogram of the clustering algorithm, with the class distance on the vertical axis. It shows that class 6 is an extreme outlier, whereas classes 3 and 7 are also quite different from the main part of the data. The basis for these classes being outliers is a vastly different mean from the rest of the data. In the right figure, the outlying clusters are highlighted: the extreme outlier is marked red, the normal outliers are marked green and the normal data is colored blue. One outlier class has 5 members, whereas the other outlier classes have just 1 member each. Classes 1 and 5 have 114 and 52 members, respectively. Class 2 has 27 members, whereas class 4 has 13 members.

4.1.3 Function Fitting

Finding a plausible function that fits the data of the trucks well is difficult because of the open-ended nature of the measurements. If a set of observations has a defined start and end of its measurements, i.e. the data is fully observed, it is easy to interpolate the data in between, even if the data within this span is sparse. This property of the data at hand is also discussed in Section 3.1. If the set of data is not fully observed, it is almost impossible to get a reliable fit outside the observation span of a single entity. This reliable fit outside of the span is necessary for performing FDA on this data, as FDA needs the same set of basis functions, or in the case of spline interpolation the same knots, for all functions to work. It was not possible to get a good fit on this data with splines where all of the knots are distributed the same for all truck entities. Polynomial fits, i.e. the approximation of the data with low order (< 5) polynomials, also did not result in a stable fit for the available data. The most reliable fits under these conditions were generated by fitting a linear function to the fuel consumption observations. These results in fitting the sparse and irregular data motivate the idea of combining the observations by means of PACE, to be able to get better fits from the reconstructed trajectories. The results of fitting a straight line to the data can be seen in Figure 4.4.
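A minimal sketch of the per-truck straight line fit described above; the array names are hypothetical:

```python
import numpy as np

def fit_line_per_truck(distance_km, mileage, truck_id):
    """Least-squares straight line (slope, offset) per truck, the only fit
    found stable enough for this sparse data."""
    fits = {}
    for truck in np.unique(truck_id):
        mask = truck_id == truck
        if mask.sum() < 2:
            continue  # a line needs at least two observation points
        slope, offset = np.polyfit(distance_km[mask], mileage[mask], deg=1)
        fits[truck] = (slope, offset)
    return fits
```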

Figure 4.4: On the left, all fitted straight lines are shown. The right figure shows the mean straight line along with the standard deviation of the slope and the offset (blue) and the standard deviation of just the offset (dashed). The main problem with this straight line fit is a number of fits with high gradients, which are not valid outside their observation span. However, the mean line shows a slight increase in fuel economy, just like the mean curve from PACE (Figure 4.5).

4.2 Application of PACE

The goal of this section is to elaborate on the application of the PACE method to the truck data, focusing only on fuel consumption per kilometer over the distance axis. Along with the results of this first application, some options available for fine-tuning the method will be presented and a general estimate of the variability will be given.

4.2.1 Baseline PACE Results

The data used for this initial run of the PACE method is the cleaned set, with all trucks removed which have fewer than 2 observations. Additionally, every observation that happened before the distance threshold (cf. Section 3.1.1) has been removed. The PACE method has some interchangeable sub-methods. For the baseline results, mostly the same parts as in the original method described in [4] were used. Thus, the kernel used for smoothing the mean function is the Epanechnikov kernel [4] and the input data is assumed to contain measurement errors. A small discrepancy from the original method is the choice of using the Fraction of Variance Explained¹ (FVE) instead of the Akaike Information Criterion [1] (AIC) to select the number of PCs. The FVE threshold is set at 95 % of variance explained. Regarding Figure 4.5, the smoothed mean curve should be taken with a grain of salt; especially the variance and measurement density plots of the binning analysis (Figure 4.1) should be considered. The number of PCs selected by FVE is 8, the smallest number for which the explained share of the total variation exceeds the 95 % threshold. The scree plot (Section 2.1.4) of the principal components from this analysis can be seen in Figure 4.6. The first, strong principal component is almost a straight line, which basically shifts the mean from its starting point closer to the position of the measurements. The second and the fourth principal components seem to serve partially as correctives for trucks with a higher initial fuel economy than the average truck. The smoothed covariance matrix generated and used by PACE is visualized in Figure 4.7 as a color-matrix.

¹ The sum of the eigenvalues of a certain number of eigenfunctions, divided by the sum of all eigenvalues, has to exceed a certain threshold. The smallest number of PCs which exceeds this threshold is subsequently used.
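The FVE selection rule from the footnote can be sketched as follows; the eigenvalue spectrum below is invented for illustration:

```python
import numpy as np

def select_by_fve(eigenvalues, threshold=0.95):
    """Smallest number of principal components whose fraction of variance
    explained (FVE) reaches the threshold, per the footnote above."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]   # eigenvalues, largest first
    fve = np.cumsum(lam) / lam.sum()               # cumulative explained fraction
    return int(np.searchsorted(fve, threshold) + 1), fve

# Hypothetical eigenvalue spectrum; with a 95 % threshold this selects a small K.
k, fve = select_by_fve([5.0, 1.2, 0.6, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01])
print(k, fve[k - 1])
```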

Figure 4.5: The smooth mean function generated by PACE (left) is the basis for all other results. The four most significant PCs (right) are the strongest ways in which the individual trucks vary. The legend quantifies the strength of the PCs.

Figure 4.6: The scree plot, which highlights the trade-off between the number of PCs used versus the variance retained. The use of more than 10 PCs makes little sense, as the Fraction of Variance Explained (FVE) does not improve much beyond that point.

Figure 4.7: The smoothed covariance matrix generated by PACE, plotted over distance driven on both axes. (The diagonal, which is the variance, has been removed prior to smoothing.) The main part of the matrix shows a small positive covariance (green).

Figure 4.8: These plots exhibit the mean curve (red), the corresponding original observations (green) and the reconstructed curve (blue) for vehicles 14, 106, 92, 72 and 4. Vehicles 14 and 106 have high values on all major PC scores, with opposite signs. Number 92 has the lowest PC scores overall; trucks 72 and 4 have average PC scores. High PC scores lead to extreme values, especially on the strong first PC.

From the estimated PC scores, the mean function µ and the principal component functions, the individual traces of the trucks can be reconstructed, which should give a rough estimate of the behavior of each truck. A number of selected reconstructions can be viewed in Figure 4.8, and a collection of all traces and the original measurements can be seen in Figure 4.9. As a next step, for an analysis of the results, the goodness-of-fit of the original measurements versus the reconstructed traces is assessed. To estimate the goodness-of-fit, the mean squared error [1] between the discrete observations and the estimated reconstruction is considered. However, the irregular measurement intervals make assessment of the results difficult. In Figure 4.10 some examples of bad fits are explained. Just taking the mean of the mean square error (MSE) of all observations of one truck is prone to skewing, as is just summing up the MSE for each single truck. A more sensible approach is to consider the median MSE per truck.

43 4. Results.2 Fuel Mileage [km/l] Distance Driven [km] Figure 4.9: This graph shows all reconstructed traces (gray) and original measurements (blue). Note how the traces tend to follow the observations, especially when the relative occurrence of observations is low..2 Vehicle # 7 Vehicle # 102 Vehicle # 106 Vehicle # 202 Fuel Consumption [km/l] Figure 4.10: As described in the text, these figures depict misfitted trucks. Vehicle #7 and #106 show trucks which provide bad fits, whereas #102 is a truck which is only identifiable as misfit when median mean square error (MSE) is applied. Truck #202 is a counter-example, where the misfit is more noticeable when the mean MSE is used.


More information

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7,

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis ERS70D George Fernandez INTRODUCTION Analysis of multivariate data plays a key role in data analysis. Multivariate data consists of many different attributes or variables recorded

More information

FUNCTIONAL EXPLORATORY DATA ANALYSIS OF UNEMPLOYMENT RATE FOR VARIOUS COUNTRIES

FUNCTIONAL EXPLORATORY DATA ANALYSIS OF UNEMPLOYMENT RATE FOR VARIOUS COUNTRIES QUANTITATIVE METHODS IN ECONOMICS Vol. XIV, No. 1, 2013, pp. 180 189 FUNCTIONAL EXPLORATORY DATA ANALYSIS OF UNEMPLOYMENT RATE FOR VARIOUS COUNTRIES Stanisław Jaworski Department of Econometrics and Statistics

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Appendix 1: Time series analysis of peak-rate years and synchrony testing.

Appendix 1: Time series analysis of peak-rate years and synchrony testing. Appendix 1: Time series analysis of peak-rate years and synchrony testing. Overview The raw data are accessible at Figshare ( Time series of global resources, DOI 10.6084/m9.figshare.929619), sources are

More information

Volvo Parts Corporation 04-02-03 Volvo Trip Manager. Contents

Volvo Parts Corporation 04-02-03 Volvo Trip Manager. Contents User s Manual Contents Functional Overview...4 Welcome to Volvo Trip Manager...4 About the documentation...4 Information Downloading...5 Vehicle Groups...6 Warning Messages...6 Reports...7 Trend Report...7

More information

Module 4: Data Exploration

Module 4: Data Exploration Module 4: Data Exploration Now that you have your data downloaded from the Streams Project database, the detective work can begin! Before computing any advanced statistics, we will first use descriptive

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Demographics of Atlanta, Georgia:

Demographics of Atlanta, Georgia: Demographics of Atlanta, Georgia: A Visual Analysis of the 2000 and 2010 Census Data 36-315 Final Project Rachel Cohen, Kathryn McKeough, Minnar Xie & David Zimmerman Ethnicities of Atlanta Figure 1: From

More information

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

Geostatistics Exploratory Analysis

Geostatistics Exploratory Analysis Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

More information

Interpreting Data in Normal Distributions

Interpreting Data in Normal Distributions Interpreting Data in Normal Distributions This curve is kind of a big deal. It shows the distribution of a set of test scores, the results of rolling a die a million times, the heights of people on Earth,

More information

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS Mathematics Revision Guides Histograms, Cumulative Frequency and Box Plots Page 1 of 25 M.K. HOME TUITION Mathematics Revision Guides Level: GCSE Higher Tier HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

More information

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI STATS8: Introduction to Biostatistics Data Exploration Babak Shahbaba Department of Statistics, UCI Introduction After clearly defining the scientific problem, selecting a set of representative members

More information

Interactive Logging with FlukeView Forms

Interactive Logging with FlukeView Forms FlukeView Forms Technical Note Fluke developed an Event Logging function allowing the Fluke 89-IV and the Fluke 189 models to profile the behavior of a signal over time without requiring a great deal of

More information

Fitting Subject-specific Curves to Grouped Longitudinal Data

Fitting Subject-specific Curves to Grouped Longitudinal Data Fitting Subject-specific Curves to Grouped Longitudinal Data Djeundje, Viani Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh, EH14 4AS, UK E-mail: vad5@hw.ac.uk Currie,

More information

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Getting to know the data An important first step before performing any kind of statistical analysis is to familiarize

More information

Dimensionality Reduction: Principal Components Analysis

Dimensionality Reduction: Principal Components Analysis Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely

More information

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

A Demonstration of Hierarchical Clustering

A Demonstration of Hierarchical Clustering Recitation Supplement: Hierarchical Clustering and Principal Component Analysis in SAS November 18, 2002 The Methods In addition to K-means clustering, SAS provides several other types of unsupervised

More information

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

Normality Testing in Excel

Normality Testing in Excel Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics. Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

More information

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:

More information

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) Data Exploration and Preprocessing Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Server Load Prediction

Server Load Prediction Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

More information

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3 COMP 5318 Data Exploration and Analysis Chapter 3 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping

More information

Going Big in Data Dimensionality:

Going Big in Data Dimensionality: LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

Standardization and Its Effects on K-Means Clustering Algorithm

Standardization and Its Effects on K-Means Clustering Algorithm Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

More information

the points are called control points approximating curve

the points are called control points approximating curve Chapter 4 Spline Curves A spline curve is a mathematical representation for which it is easy to build an interface that will allow a user to design and control the shape of complex curves and surfaces.

More information

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

More information

Overview... 2. Accounting for Business (MCD1010)... 3. Introductory Mathematics for Business (MCD1550)... 4. Introductory Economics (MCD1690)...

Overview... 2. Accounting for Business (MCD1010)... 3. Introductory Mathematics for Business (MCD1550)... 4. Introductory Economics (MCD1690)... Unit Guide Diploma of Business Contents Overview... 2 Accounting for Business (MCD1010)... 3 Introductory Mathematics for Business (MCD1550)... 4 Introductory Economics (MCD1690)... 5 Introduction to Management

More information

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows: Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are

More information

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques

More information

VISUALIZING HIERARCHICAL DATA. Graham Wills SPSS Inc., http://willsfamily.org/gwills

VISUALIZING HIERARCHICAL DATA. Graham Wills SPSS Inc., http://willsfamily.org/gwills VISUALIZING HIERARCHICAL DATA Graham Wills SPSS Inc., http://willsfamily.org/gwills SYNONYMS Hierarchical Graph Layout, Visualizing Trees, Tree Drawing, Information Visualization on Hierarchies; Hierarchical

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

Exploratory data analysis for microarray data

Exploratory data analysis for microarray data Eploratory data analysis for microarray data Anja von Heydebreck Ma Planck Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany heydebre@molgen.mpg.de Visualization

More information

Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

More information

Data Preparation and Statistical Displays

Data Preparation and Statistical Displays Reservoir Modeling with GSLIB Data Preparation and Statistical Displays Data Cleaning / Quality Control Statistics as Parameters for Random Function Models Univariate Statistics Histograms and Probability

More information

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Data Mining and Visualization

Data Mining and Visualization Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research

More information

Descriptive Statistics

Descriptive Statistics Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

College Readiness LINKING STUDY

College Readiness LINKING STUDY College Readiness LINKING STUDY A Study of the Alignment of the RIT Scales of NWEA s MAP Assessments with the College Readiness Benchmarks of EXPLORE, PLAN, and ACT December 2011 (updated January 17, 2012)

More information

2002 IEEE. Reprinted with permission.

2002 IEEE. Reprinted with permission. Laiho J., Kylväjä M. and Höglund A., 2002, Utilization of Advanced Analysis Methods in UMTS Networks, Proceedings of the 55th IEEE Vehicular Technology Conference ( Spring), vol. 2, pp. 726-730. 2002 IEEE.

More information

Manifold Learning Examples PCA, LLE and ISOMAP

Manifold Learning Examples PCA, LLE and ISOMAP Manifold Learning Examples PCA, LLE and ISOMAP Dan Ventura October 14, 28 Abstract We try to give a helpful concrete example that demonstrates how to use PCA, LLE and Isomap, attempts to provide some intuition

More information

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel

More information

Distances, Clustering, and Classification. Heatmaps

Distances, Clustering, and Classification. Heatmaps Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be

More information